Batch Pipeline

After all jobs in a batch complete, you may want to run additional code to process the results. You can use the jade pipeline command for this purpose.

JADE will run jade submit-jobs on a series of config files sequentially. Each stage can read the outputs of previous stages.

To create the pipeline, the user must provide a list of scripts that will be used to create the JADE config file for each stage.

Create the pipeline

The user must provide a script that will create the JADE configuration for each stage in the pipeline.

Before invoking each script JADE sets the following environment variables in order to provide information about completed stages and their outputs:

  • JADE_PIPELINE_OUTPUT_DIR: main output directory containing all stage outputs

  • JADE_PIPELINE_STATUS_FILE: path to a file containing stage-specific information, including output directories

  • JADE_PIPELINE_STAGE_ID: current stage ID
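As a minimal sketch, a stage script might read these variables like this (the defaults shown are only for running the script outside a pipeline):

```python
import os

def pipeline_context() -> dict:
    """Read the stage context that JADE exports before invoking each
    config script."""
    return {
        "output_dir": os.environ.get("JADE_PIPELINE_OUTPUT_DIR", "."),
        "status_file": os.environ.get("JADE_PIPELINE_STATUS_FILE", ""),
        "stage_id": int(os.environ.get("JADE_PIPELINE_STAGE_ID", "1")),
    }
```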

$ jade pipeline create -a batch1-auto-config.sh -a batch2-auto-config.sh -c pipeline.toml

Alternatively, if your config files are known beforehand, you can specify them directly.

$ jade pipeline create -f config_stage1.json -f config_stage2.json -c pipeline.toml

Customize the config

pipeline.toml will have default values for each jade submit-jobs command. You may want to override the max-nodes or per-node-batch-size parameters for each stage.
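If you prefer to script the override rather than edit the file by hand, a minimal sketch follows. The parameter names mirror the jade submit-jobs CLI options, but they are assumptions here; confirm the exact keys against the pipeline.toml that jade pipeline create actually wrote.

```python
from pathlib import Path

def override_param(config_path: str, param: str, old: str, new: str) -> None:
    # Swap one generated default for a custom value using a plain text
    # substitution, which avoids re-serializing the whole TOML file.
    # The key names (e.g. "max-nodes") are illustrative; check your file.
    path = Path(config_path)
    path.write_text(path.read_text().replace(f"{param} = {old}", f"{param} = {new}"))
```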

Submit the pipeline

$ jade pipeline submit pipeline.toml -o pipeline-output

Check status

$ jade pipeline status -o pipeline-output

Example

Let’s use the demo extension as an example. This extension performs autoregression analysis on GDP values for several countries. Each job handles one country: it reads a CSV file containing gdp values and generates a new CSV file, result.csv, containing pred_gdp values.

Suppose that we want to merge each job’s output file into one file once all jobs are complete.

The first step is to write a script to produce the summary file. Here’s how to run the demo extension on test data.

$ jade auto-config demo tests/data/demo -c config.json
$ jade submit-jobs config.json
$ tree output
output
├── config.json
├── diff.patch
├── job-outputs
│   ├── australia
│   │   ├── events.log
│   │   ├── result.csv
│   │   ├── result.png
│   │   ├── run.log
│   │   └── summary.toml
│   ├── brazil
│   │   ├── events.log
│   │   ├── result.csv
│   │   ├── result.png
│   │   ├── run.log
│   │   └── summary.toml
│   └── united_states
│       ├── events.log
│       ├── result.csv
│       ├── result.png
│       ├── run.log
│       └── summary.toml
├── results.json
└── submit_jobs.log

Note

Note that the dataset tests/data/gdp contains only three countries.

The content of result.csv looks similar to this:

year,gdp,pred_gdp
1960,543300000000,
1961,563300000000,
1962,605100000000,
...
2016,18707188235000,19406250376876.492
2017,19485393853000,20519007253667.656
2018,20494100000000,20672861935684.523

Our post-processing task is to collect the result.csv files from all jobs, extract the pred_gdp column from each file, and aggregate the columns into one CSV file. The script jade/extensions/demo/merge_pred_gdp.py writes this result to pred_gdp.csv.
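The merge step can be sketched as follows. This is a simplified illustration using only the standard library, not the actual jade/extensions/demo/merge_pred_gdp.py, which may differ in details such as column ordering.

```python
import csv
from pathlib import Path

def merge_pred_gdp(job_outputs: Path, out_file: Path) -> None:
    """Collect result.csv from each job directory under job_outputs and
    merge the pred_gdp columns into one CSV keyed by year."""
    merged: dict[str, dict[str, str]] = {}  # year -> {country: pred_gdp}
    countries: list[str] = []
    for job_dir in sorted(job_outputs.iterdir()):
        result = job_dir / "result.csv"
        if not result.is_file():
            continue
        countries.append(job_dir.name)
        with result.open() as f:
            for row in csv.DictReader(f):
                merged.setdefault(row["year"], {})[job_dir.name] = row["pred_gdp"]
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["year"] + countries)
        for year in sorted(merged):
            writer.writerow([year] + [merged[year].get(c, "") for c in countries])
```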

Now let’s automate this workflow in a JADE pipeline using two stages.

The first stage will use the demo extension. The script jade/extensions/demo/create_demo_config.sh creates its config file.

$ cat jade/extensions/demo/create_demo_config.sh
#!/bin/bash
jade auto-config demo tests/data/demo -c config-stage1.json

The second stage will use the generic_command extension. We will create a config that runs one generic_command: the merge_pred_gdp.py script described above, which post-processes the results.

The script to create the stage 2 configuration is jade.extensions.demo.create_merge_pred_gdp.

Note that this script reads the environment variable JADE_PIPELINE_STATUS_FILE to find out the output directory name of the first stage as well as its own output directory.

Let’s create the pipeline and submit it for execution.

$ jade pipeline create -a ./jade/extensions/demo/create_demo_config.sh -a ./jade/extensions/demo/create_merge_pred_gdp.py
Created pipeline config file pipeline.toml

$ jade pipeline submit pipeline.toml

Let’s take a look at the output directory. You’ll notice that per-country results are in output-stage1 and the summary file pred_gdp.csv is in output-stage2.

$ tree output
output
├── config-stage1.json
├── config-stage2.json
├── output-stage1
│   ├── config.json
│   ├── diff.patch
│   ├── job-outputs
│   │   ├── australia
│   │   │   ├── events.log
│   │   │   ├── result.csv
│   │   │   ├── result.png
│   │   │   ├── run.log
│   │   │   └── summary.toml
│   │   ├── brazil
│   │   │   ├── events.log
│   │   │   ├── result.csv
│   │   │   ├── result.png
│   │   │   ├── run.log
│   │   │   └── summary.toml
│   │   └── united_states
│   │       ├── events.log
│   │       ├── result.csv
│   │       ├── result.png
│   │       ├── run.log
│   │       └── summary.toml
│   ├── results.json
│   └── submit_jobs.log
├── output-stage2
│   ├── config.json
│   ├── diff.patch
│   ├── job-outputs
│   ├── pred_gdp.csv
│   ├── results.json
│   └── submit_jobs.log
├── pipeline_status.toml
├── pipeline_submit.log
└── pipeline.toml

In pred_gdp.csv, you’ll see content like this:

year,brazil,australia,united_states
1960,,,
1961,,,
1962,,,
...
2016,2080587377798.5112,1258003336600.582,19406250376876.49
2017,1827457759144.0063,1438897367269.8796,20519007253667.656
2018,1995335978627.933,2154574393156.4248,20672861935684.523

Done!