Data flow

This page follows the data through the pipeline: what each stage consumes, what it produces, and how the products feed the next stage. It is the conceptual companion to the hands-on walkthrough.

The pipeline at a glance

flowchart TD
    NANO[(CMS NanoAOD<br/>DAS / WLCG)]
    LIST[Input file lists]
    ANA[anaTuples<br/>analysis ntuples]
    HTUP[histTuples<br/>+ analysis observables]
    HIST[Histograms<br/>per process, per systematic]
    PLOT[Plots]
    STAT[Limits, scans,<br/>pulls & impacts]

    NANO -->|InputFileTask| LIST
    LIST -->|AnaTupleFileTask| ANA
    ANA -->|AnaTupleMergeTask| ANA
    ANA -->|HistTupleProducerTask| HTUP
    HTUP -->|HistFromNtupleProducerTask| HIST
    HIST -->|HistMergerTask| HIST
    HIST -->|HistPlotTask| PLOT
    HIST -->|StatInference + inference| STAT

Stage by stage

| Stage (task) | Consumes | Produces |
| --- | --- | --- |
| InputFileTask | A DAS query for the requested datasets and era. | The concrete list of NanoAOD files to process. Runs first and cheaply; everything else keys off it. |
| AnaTupleFileTask | One NanoAOD file (one branch per file). | One anaTuple: a slimmed/skimmed analysis ntuple with the objects, weights and flags the analysis needs. Runs inside CMSSW via AnaProd/anaTupleProducer.py. |
| AnaTupleMergeTask | The per-file anaTuples for a dataset. | One merged anaTuple per dataset (data merged across runs). |
| HistTupleProducerTask | Merged anaTuples. | histTuples: ntuples with the heavier analysis observables computed (the "payload producers"). |
| HistFromNtupleProducerTask | histTuples. | Histograms of the requested variables, including systematic variations. Branches over variables. |
| HistMergerTask | Per-piece histograms. | Merged histograms per process, ready for plotting and fitting. |
| HistPlotTask | Merged histograms. | Plots (one branch per variable). |
| Statistical inference | Merged histograms / shapes. | Datacards, exclusion limits, likelihood scans, pulls & impacts (via StatInference and the inference/dhi combine tooling). |
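
The chain in the table can be read as a dependency map: asking for a late stage implies every stage upstream of it. The following is a minimal sketch of that idea in plain Python. The task names are taken from the table, but the `REQUIRES` dictionary and the `run_order` helper are illustrative assumptions, not the framework's actual task graph or API; the statistical-inference step is left out because it is not a single named task here.

```python
# Illustrative only: stage names come from the table above, but this mapping
# and helper are hypothetical and not part of the framework.
REQUIRES = {
    "InputFileTask": [],
    "AnaTupleFileTask": ["InputFileTask"],
    "AnaTupleMergeTask": ["AnaTupleFileTask"],
    "HistTupleProducerTask": ["AnaTupleMergeTask"],
    "HistFromNtupleProducerTask": ["HistTupleProducerTask"],
    "HistMergerTask": ["HistFromNtupleProducerTask"],
    "HistPlotTask": ["HistMergerTask"],
}


def run_order(target, requires=REQUIRES):
    """Return the stages needed to produce `target`, upstream stages first."""
    order = []

    def visit(stage):
        if stage in order:
            return  # already scheduled
        for upstream in requires[stage]:
            visit(upstream)
        order.append(stage)

    visit(target)
    return order


print(run_order("HistTupleProducerTask"))
# ['InputFileTask', 'AnaTupleFileTask', 'AnaTupleMergeTask', 'HistTupleProducerTask']
```

Requesting `HistPlotTask` in the same way would list all seven stages, which mirrors how a single late-stage command pulls in the whole chain.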

Two helper tasks you will also see

Some analyses (notably HH→bb̄WW) insert AnalysisCacheTask and AnalysisCacheAggregationTask to pre-compute and aggregate per-event payloads (e.g. the b-tag shape weights) before histogramming. They are part of the same graph and run automatically when required. See the Task reference.

Where the outputs live

Each output type is written to a named filesystem (fs_*) that you configure: typically grid/EOS storage for the big ntuples and histograms, and a local data/ area for small artifacts. The mapping, and how to set it, is covered in Storage & filesystems and the user_custom.yaml guide. The practical consequences:

  • Large products (anaTuples, histTuples, histograms) persist on shared storage, so collaborators — and the next stage — can reuse them without recomputing.
  • Because LAW skips tasks whose output already exists, the pipeline is incremental: re-running a late stage only computes what is genuinely missing.
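
The skip-if-output-exists behaviour can be illustrated with a toy model. This is a sketch under stated assumptions, not LAW's implementation: `SketchTask`, `build` and the file names are hypothetical, and a local temporary directory stands in for shared storage.

```python
import tempfile
from pathlib import Path


class SketchTask:
    """Toy stand-in for a workflow task; not LAW's real Task class."""

    def __init__(self, name, output, requires=()):
        self.name = name
        self.output = Path(output)
        self.requires = list(requires)

    def complete(self):
        # A task counts as done as soon as its output file exists.
        return self.output.exists()

    def run(self):
        self.output.parent.mkdir(parents=True, exist_ok=True)
        self.output.write_text(f"produced by {self.name}\n")


def build(task, executed):
    """Run `task` and any missing upstream tasks, recording what actually ran."""
    if task.complete():
        return  # output already there: skip this task and its upstream chain
    for upstream in task.requires:
        build(upstream, executed)
    task.run()
    executed.append(task.name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ana = SketchTask("anaTuple", root / "anaTuple.root")
        hist = SketchTask("histogram", root / "hist.root", requires=[ana])
        plot = SketchTask("plot", root / "plot.pdf", requires=[hist])

        first = []
        build(plot, first)    # nothing exists yet, so all three stages run
        plot.output.unlink()  # pretend only the plot went missing
        second = []
        build(plot, second)   # upstream outputs are reused
        return first, second


first, second = demo()
print(first)   # ['anaTuple', 'histogram', 'plot']
print(second)  # ['plot']
```

The second call recomputes only the plot because the histogram file is still present, which is the incremental behaviour described above.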

Versions keep productions apart

Every output path includes the --version you chose. Two runs with different versions never collide, which is how parallel productions, personal tests and official productions coexist on the same storage. The per-task --<TaskName>-version overrides let one run read an existing upstream production while writing its own downstream outputs under a new version — see Command arguments.
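
As a rough illustration of how a global version and per-task overrides could combine, consider the sketch below. The directory layout `<fs_root>/<version>/<task_name>/<filename>`, the `output_path` helper, the storage root and the version strings are all assumptions made for this example; the framework's real path scheme may differ.

```python
from pathlib import PurePosixPath


def output_path(fs_root, task_name, filename, version, task_versions=None):
    """Build an output path that embeds the version in effect for `task_name`.

    Hypothetical layout: <fs_root>/<version>/<task_name>/<filename>.
    `task_versions` plays the role of the per-task version overrides.
    """
    effective = (task_versions or {}).get(task_name, version)
    return PurePosixPath(fs_root) / effective / task_name / filename


# Read an existing upstream production, write downstream under a new version.
overrides = {"AnaTupleMergeTask": "prod_v1"}
upstream = output_path(
    "/store/example", "AnaTupleMergeTask", "data.root", "test_v2", overrides
)
downstream = output_path(
    "/store/example", "HistMergerTask", "hists.root", "test_v2", overrides
)

print(upstream)    # /store/example/prod_v1/AnaTupleMergeTask/data.root
print(downstream)  # /store/example/test_v2/HistMergerTask/hists.root
```

Because the version is part of every path, the two productions occupy disjoint directories and cannot overwrite each other.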