Running on HTCondor
Producing ntuples and histograms for a full era means processing thousands of files, far too many
for one machine. FLAF tasks are workflows (see Tasks & LAW), so
their branches can be submitted to CERN's HTCondor batch system. The recommended pattern is to
develop and test with --workflow local, then switch to --workflow htcondor for production;
the command is otherwise the same.
Submit a task to the batch system
law run FLAF.AnaProd.tasks.AnaTupleFileTask \
--period Run3_2022 --version prod \
--workflow htcondor \
--transfer-logs \
--parallel-jobs 100
| Option | Why you want it |
|---|---|
| --workflow htcondor | Submit branches as batch jobs instead of running locally. |
| --transfer-logs | Bring each job's stdout/stderr back to your data/ area. Highly recommended: without it, debugging a failed job is painful. |
| --parallel-jobs 100 | Cap how many jobs are in flight at once. Be a good citizen on the shared pool; very large uncapped submissions are discouraged. |
| --branches 0-99 | Submit only a subset (e.g. to retry a range). |
Other HTCondor parameters available on every workflow task: --max-runtime, --n-cpus,
--priority, --htcondor-spool. See Command arguments.
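Because the local and batch invocations differ only in the workflow backend and its options, it can help to keep the shared part in one place. The sketch below is a dry run: the helper name law_cmd is hypothetical and it only prints the command, reusing the task, period and version from the example above, so the variants can be compared before anything is submitted.

```shell
# Hypothetical helper: print the law command for one workflow backend.
# $1 is the backend ("local" or "htcondor"); remaining arguments are appended.
law_cmd() {
    workflow="$1"
    shift
    echo law run FLAF.AnaProd.tasks.AnaTupleFileTask \
        --period Run3_2022 --version prod \
        --workflow "$workflow" "$@"
}

# Development: a single branch on the local machine.
law_cmd local --branches 0

# Production: the same task on HTCondor, capped at 100 jobs in flight.
law_cmd htcondor --transfer-logs --parallel-jobs 100

# Retry: resubmit only branches 0-99.
law_cmd htcondor --transfer-logs --branches 0-99
```

Once a printed command looks right, it can be run directly or the echo can be dropped from the helper.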
Monitor and resume
LAW tracks which branches have finished (by checking their outputs), so a re-run only resubmits the missing ones — batch jobs fail and time out, and resuming is normal. Check progress with:
Standard condor_q / condor_status work for the underlying jobs.
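For a quick overview of the underlying jobs, the numeric JobStatus attribute that HTCondor keeps for each job can be tallied. This is a minimal sketch: summarize_jobs is a hypothetical helper that reads one status code per line, and the sample input at the end stands in for real condor_q output. In HTCondor, status 1 means idle, 2 running and 5 held.

```shell
# Hypothetical helper: count idle, running and held jobs from a stream of
# HTCondor JobStatus codes (one integer per line).
summarize_jobs() {
    awk '
        $1 == 1 { idle++ }
        $1 == 2 { running++ }
        $1 == 5 { held++ }
        END { printf "idle=%d running=%d held=%d\n", idle, running, held }
    '
}

# On a submit node the input would come from:
#   condor_q -af JobStatus | summarize_jobs
# Stand-in input so the sketch runs anywhere:
printf '1\n2\n2\n5\n' | summarize_jobs
```

A non-zero held count is usually worth following up with condor_q -hold, which lists the held jobs together with their hold reasons.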
Bundles: shipping the code to workers
A batch worker needs your code and environment. FLAF supports two modes:
- Non-bundle jobs rely on the shared AFS area being mounted on the worker: the job receives
  FLAF_PATH/CORRECTIONS_PATH and runs the code straight from AFS (including any edits you made via the dev overlay).
- Bundle jobs ship a tarball of the code/environment to the worker (the
  --bundle flag and the BundleTask machinery). The worker runs from the tarball and never reaches back to AFS, so it is deliberately not given FLAF_PATH/CORRECTIONS_PATH. Bundles also set FLAF_NO_INSTALL=1 so the worker never tries to build the environment.
For most work the defaults are correct; you only think about bundles when a stage explicitly needs one (e.g. it declares a CMSSW bundle flavour) or when AFS is not available on the target pool.
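When reading a job log it is sometimes unclear which mode the job ran in. Going by the description above, a non-bundle job has FLAF_PATH set and a bundle job does not, so a small diagnostic can tell them apart. The function name job_mode is hypothetical and this is not FLAF's own logic, only a sketch based on the environment variables named in this section.

```shell
# Hypothetical diagnostic: guess the job mode from the worker environment.
job_mode() {
    if [ -n "${FLAF_PATH:-}" ]; then
        # Non-bundle jobs receive FLAF_PATH and run the code from AFS.
        echo "non-bundle (code from ${FLAF_PATH})"
    else
        # Bundle jobs are deliberately not given FLAF_PATH.
        echo "bundle (FLAF_NO_INSTALL=${FLAF_NO_INSTALL:-unset})"
    fi
}

job_mode
```

Printing this line at the start of a job makes the mode visible in the logs that --transfer-logs brings back.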
Your edits to FLAF do reach the workers
Thanks to the dev overlay, non-bundle jobs run your edited FLAF/Corrections, and bundle
jobs include them in the tarball — so testing framework changes on HTCondor works without
committing first. See Contributing.
Caveats
Keep your proxy valid for the whole run
Jobs that outlive your VOMS proxy lose grid access mid-flight. Create a long-lived proxy
(-valid 192:00) before a big submission, and refresh it for long campaigns.
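A pre-submission check can catch a proxy that is about to expire. The sketch below assumes the standard VOMS client tools, where voms-proxy-info -timeleft prints the remaining lifetime in seconds and voms-proxy-init creates the proxy. The helper proxy_covers, the 48-hour campaign length and the cms VO name in the comments are assumptions for illustration; substitute your own values.

```shell
# Hypothetical helper: succeed if the remaining proxy lifetime (seconds, $1)
# covers the expected campaign length (hours, $2).
proxy_covers() {
    seconds_left="$1"
    hours_needed="$2"
    [ "$seconds_left" -ge $((hours_needed * 3600)) ]
}

# On a submit node the first argument would come from the VOMS tools:
#   proxy_covers "$(voms-proxy-info -timeleft)" 48 \
#       || voms-proxy-init -voms cms -valid 192:00   # "cms" is an assumed VO
# Stand-in values so the sketch runs anywhere: 10 h left, 48 h needed.
if proxy_covers 36000 48; then
    echo "proxy is long enough"
else
    echo "renew the proxy before submitting"
fi
```

Running such a check just before law run keeps the decision to renew explicit instead of discovering an expired proxy from failed jobs.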
Killing a background law leaves its jobs/children
Pressing Ctrl-C or kill-ing a backgrounded law process does not necessarily stop the
branches it spawned. To stop everything for a run, match the processes by pattern, e.g.
pkill -f "version=prod", and condor_rm the submitted jobs if needed.
Test small, then scale
Validate a task with --workflow local --branches 0 --test 1000 before submitting the full
workflow to HTCondor. A bug found on one local branch is far cheaper than one found across a
thousand batch jobs.