.. _batch_pipeline_label:

**************
Batch Pipeline
**************
After all jobs in a batch complete you may want to run additional code to
process the results. You can use the ``jade pipeline`` command for this
purpose. JADE will run ``submit-jobs`` on a series of config files
sequentially. Each stage has the option of reading the outputs of previous
stages.

To create the pipeline you must provide a list of scripts, one per stage. Each
script creates the JADE config file for its stage.

Create the pipeline
===================
Before invoking each script JADE sets the following environment variables in
order to provide information about completed stages and their outputs:

- JADE_PIPELINE_OUTPUT_DIR: main output directory containing all stage outputs
- JADE_PIPELINE_STATUS_FILE: path to the file containing stage-specific
  information, including output directories
- JADE_PIPELINE_STAGE_ID: current stage ID

.. code-block:: bash

    $ jade pipeline create -a batch1-auto-config.sh -a batch2-auto-config.sh -c pipeline.toml

Alternatively, if your config files are known beforehand, you can specify them
directly.

.. code-block:: bash

    $ jade pipeline create -f config_stage1.json -f config_stage2.json -c pipeline.toml

Customize the config
====================
``pipeline.toml`` will have default values for each ``jade submit-jobs``
command. You may want to override the max-nodes or per-node-batch-size
parameters for each stage.

Submit the pipeline
===================

.. code-block:: bash

    $ jade pipeline submit pipeline.toml -o pipeline-output

Check status
============

.. code-block:: bash

    $ jade pipeline status -o pipeline-output

Example
=======
Let's use the extension ``demo`` as an example. This extension performs
auto-regression analysis for the ``gdp`` values for several countries.
In each job (one per country), it reads a CSV file containing ``gdp`` values
and generates a new CSV file, ``result.csv``, containing ``pred_gdp`` values.

Suppose that we want to merge each job's output file into one file once all
jobs are complete. The first step is to write a script to produce the summary
file.

Here's how to run the demo extension on test data.

.. code-block:: bash

    $ jade auto-config demo tests/data/demo -c config.json
    $ jade submit-jobs config.json
    $ tree output
    output
    ├── config.json
    ├── diff.patch
    ├── job-outputs
    │   ├── australia
    │   │   ├── events.log
    │   │   ├── result.csv
    │   │   ├── result.png
    │   │   ├── run.log
    │   │   └── summary.toml
    │   ├── brazil
    │   │   ├── events.log
    │   │   ├── result.csv
    │   │   ├── result.png
    │   │   ├── run.log
    │   │   └── summary.toml
    │   └── united_states
    │       ├── events.log
    │       ├── result.csv
    │       ├── result.png
    │       ├── run.log
    │       └── summary.toml
    ├── results.json
    └── submit_jobs.log

.. note::

    The test dataset in ``tests/data/demo`` contains only 3 countries.

The content of ``result.csv`` looks similar to this:

.. code-block:: text

    year,gdp,pred_gdp
    1960,543300000000,
    1961,563300000000,
    1962,605100000000,
    ...
    2016,18707188235000,19406250376876.492
    2017,19485393853000,20519007253667.656
    2018,20494100000000,20672861935684.523

Our post-processing task is to collect the ``result.csv`` files from all jobs,
extract the ``pred_gdp`` column from each one, and aggregate the columns in one
CSV file. The script ``jade/extensions/demo/merge_pred_gdp.py`` writes this
result to ``pred_gdp.csv``.

Now let's automate this workflow in a JADE pipeline using two stages.

The first stage will use the ``demo`` extension. The script
``jade/extensions/demo/create_demo_config.sh`` creates its config file.

.. code-block:: bash

    $ cat jade/extensions/demo/create_demo_config.sh
    #!/bin/bash
    jade auto-config demo tests/data/demo -c config-stage1.json

The second stage will use the ``generic_command`` extension.
We will create a config that runs one ``generic_command`` job: the script
above, which post-processes the results. The script to create the stage 2
configuration is :mod:`jade.extensions.demo.create_merge_pred_gdp`. Note that
this script reads the environment variable JADE_PIPELINE_STATUS_FILE to find
the output directory of the first stage as well as its own output directory.

Let's create the pipeline and submit it for execution.

.. code-block:: bash

    $ jade pipeline create -a ./jade/extensions/demo/create_demo_config.sh -a ./jade/extensions/demo/create_merge_pred_gdp.py
    Created pipeline config file pipeline.toml

    $ jade pipeline submit pipeline.toml

Let's take a look at the ``output`` directory. You'll notice that the
per-country results are in ``output-stage1`` and the summary file
``pred_gdp.csv`` is in ``output-stage2``.

.. code-block:: bash

    $ tree output
    output
    ├── config-stage1.json
    ├── config-stage2.json
    ├── output-stage1
    │   ├── config.json
    │   ├── diff.patch
    │   ├── job-outputs
    │   │   ├── australia
    │   │   │   ├── events.log
    │   │   │   ├── result.csv
    │   │   │   ├── result.png
    │   │   │   ├── run.log
    │   │   │   └── summary.toml
    │   │   ├── brazil
    │   │   │   ├── events.log
    │   │   │   ├── result.csv
    │   │   │   ├── result.png
    │   │   │   ├── run.log
    │   │   │   └── summary.toml
    │   │   └── united_states
    │   │       ├── events.log
    │   │       ├── result.csv
    │   │       ├── result.png
    │   │       ├── run.log
    │   │       └── summary.toml
    │   ├── results.json
    │   └── submit_jobs.log
    ├── output-stage2
    │   ├── config.json
    │   ├── diff.patch
    │   ├── job-outputs
    │   ├── pred_gdp.csv
    │   ├── results.json
    │   └── submit_jobs.log
    ├── pipeline_status.toml
    ├── pipeline_submit.log
    └── pipeline.toml

In ``pred_gdp.csv``, you'll see the content:

.. code-block::

    year,brazil,australia,united_states
    1960,,,
    1961,,,
    1962,,,
    ...
    2016,2080587377798.5112,1258003336600.582,19406250376876.49
    2017,1827457759144.0063,1438897367269.8796,20519007253667.656
    2018,1995335978627.933,2154574393156.4248,20672861935684.523

Done!
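
To make the merge step in the second stage more concrete, here is a minimal
sketch of the kind of aggregation it performs. It is a hypothetical stand-in
written with only the Python standard library, not the code in
``jade/extensions/demo/merge_pred_gdp.py``. The function name, its arguments,
and the alphabetical column order are assumptions made for illustration. It
assumes that each job directory holds a ``result.csv`` with ``year`` and
``pred_gdp`` columns, as in the example output.

.. code-block:: python

    import csv
    from pathlib import Path


    def merge_pred_gdp(job_outputs_dir, output_file):
        """Hypothetical helper: merge pred_gdp from every <job>/result.csv."""
        job_outputs_dir = Path(job_outputs_dir)
        columns = {}  # job (country) name -> {year: pred_gdp as text}
        years = set()
        # Each job writes its results to <job_outputs_dir>/<job name>/result.csv.
        for result_file in sorted(job_outputs_dir.glob("*/result.csv")):
            country = result_file.parent.name
            with open(result_file, newline="") as f:
                rows = {row["year"]: row["pred_gdp"] for row in csv.DictReader(f)}
            columns[country] = rows
            years.update(rows)

        countries = list(columns)
        with open(output_file, "w", newline="") as f:
            writer = csv.writer(f, lineterminator="\n")
            writer.writerow(["year"] + countries)
            # Assumes the year column holds integers; sort rows chronologically.
            for year in sorted(years, key=int):
                # Years without a prediction for a country stay empty.
                writer.writerow([year] + [columns[c].get(year, "") for c in countries])
        return countries

Under these assumptions, calling
``merge_pred_gdp("output/output-stage1/job-outputs", "output/output-stage2/pred_gdp.csv")``
would write one row per year and one column per country. In a real pipeline the
two directories would come from the file named by JADE_PIPELINE_STATUS_FILE
rather than being hard-coded.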