Data flow¶

This page follows the data through the pipeline: what each stage consumes, what it produces, and how the products feed the next stage. It is the conceptual companion to the hands-on walkthrough.

The pipeline at a glance¶

flowchart TD
    NANO[(CMS NanoAOD<br/>DAS / WLCG)]
    LIST[Input file lists]
    ANA[anaTuples<br/>analysis ntuples]
    HTUP[histTuples<br/>+ analysis observables]
    HIST[Histograms<br/>per process, per systematic]
    PLOT[Plots]
    STAT[Limits, scans,<br/>pulls & impacts]

    NANO -->|InputFileTask| LIST
    LIST -->|AnaTupleFileTask| ANA
    ANA -->|AnaTupleMergeTask| ANA
    ANA -->|HistTupleProducerTask| HTUP
    HTUP -->|HistFromNtupleProducerTask| HIST
    HIST -->|HistMergerTask| HIST
    HIST -->|HistPlotTask| PLOT
    HIST -->|StatInference + inference| STAT

Stage by stage¶

Stage (task)	Consumes	Produces
InputFileTask	A DAS query for the requested datasets and era.	The concrete list of NanoAOD files to process. Runs first and cheaply; everything else keys off it.
AnaTupleFileTask	One NanoAOD file (one branch per file).	One anaTuple: a slimmed/skimmed analysis ntuple with the objects, weights and flags the analysis needs. Runs inside CMSSW via `AnaProd/anaTupleProducer.py`.
AnaTupleMergeTask	The per-file anaTuples for a dataset.	One merged anaTuple per dataset (data merged across runs).
HistTupleProducerTask	Merged anaTuples.	histTuples: ntuples with the heavier analysis observables computed (the "payload producers").
HistFromNtupleProducerTask	histTuples.	Histograms of the requested variables, including systematic variations. Branches over variables.
HistMergerTask	Per-piece histograms.	Merged histograms per process, ready for plotting and fitting.
HistPlotTask	Merged histograms.	Plots (one branch per variable).
Statistical inference	Merged histograms / shapes.	Datacards, exclusion limits, likelihood scans, pulls & impacts (via `StatInference` and the `inference`/`dhi` combine tooling).

Two helper tasks you will also see

Some analyses (notably HH→bb̄WW) insert AnalysisCacheTask and AnalysisCacheAggregationTask to pre-compute and aggregate per-event payloads (e.g. the b-tag shape weights) before histogramming. They are part of the same graph and run automatically when required. See the Task reference.

Where the outputs live¶

Each output type is written to a named filesystem (fs_*) that you configure — typically grid/EOS storage for the big ntuples and histograms, and a local data/ area for small artifacts. The mapping and how to set it is covered in Storage & filesystems and the user_custom.yaml guide. The practical consequence:

Large products (anaTuples, histTuples, histograms) persist on shared storage, so collaborators — and the next stage — can reuse them without recomputing.
Because LAW skips tasks whose output already exists, the pipeline is incremental: re-running a late stage only computes what is genuinely missing.

Versions keep productions apart¶

Every output path includes the --version you chose. Two runs with different versions never collide, which is how parallel productions, personal tests and official productions coexist on the same storage. The per-task --<TaskName>-version overrides let one run read an existing upstream production while writing its own downstream outputs under a new version — see Command arguments.