Batch Pipeline¶
After all jobs in a batch complete, you may want to run additional code to
process the results. You can use the jade pipeline command for this purpose.
JADE will run submit-jobs
on a series of config files sequentially. Each
stage has the option of reading the outputs of previous stages.
To create the pipeline, the user must provide a list of scripts that will be used to create the JADE config file for each stage.
Create the pipeline¶
Before invoking each script, JADE sets the following environment variables to provide information about completed stages and their outputs:
JADE_PIPELINE_OUTPUT_DIR: main output directory containing all stage outputs
JADE_PIPELINE_STATUS_FILE: path to a file containing stage-specific information, including output directories
JADE_PIPELINE_STAGE_ID: current stage ID
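For example, a stage script can inspect these variables to locate the outputs of completed stages. Here is a minimal Python sketch; it simply dumps the status file, since that file's exact schema may vary across JADE versions:

# Minimal sketch: inspect JADE's pipeline environment from a stage script.
# Dumping the status file reveals its layout, including output directories.
import os
import toml

print(os.environ["JADE_PIPELINE_OUTPUT_DIR"])  # main pipeline output directory
print(os.environ["JADE_PIPELINE_STAGE_ID"])    # ID of the current stage
status = toml.load(os.environ["JADE_PIPELINE_STATUS_FILE"])
print(status)  # stage-specific information for completed stages

With the stage scripts in hand, create the pipeline: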
$ jade pipeline create -a batch1-auto-config.sh -a batch2-auto-config.sh -c pipeline.toml
Alternatively, if your config files are known beforehand, you can specify them directly:
$ jade pipeline create -f config_stage1.json -f config_stage2.json -c pipeline.toml
Customize the config¶
pipeline.toml will have default values for each jade submit-jobs command.
You may want to override the max-nodes or per-node-batch-size parameters
for each stage.
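For illustration only, a per-stage override might look roughly like this in pipeline.toml. The key names here are assumptions, so inspect the file generated by jade pipeline create for the actual layout in your JADE version:

# Illustrative excerpt of pipeline.toml; key names are assumptions.
[[stages]]
auto_config_cmd = "batch1-auto-config.sh"

[stages.submitter_params]
max_nodes = 4
per_node_batch_size = 100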
Submit the pipeline¶
$ jade pipeline submit pipeline.toml -o pipeline-output
Check status¶
$ jade pipeline status -o pipeline-output
Example¶
Let’s use the extension demo as an example. This extension performs
auto-regression analysis on the gdp values for several countries. In each
job (one per country), it reads a CSV file containing gdp values and
generates a new CSV file, result.csv, containing pred_gdp values.
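Conceptually, each job does something like the sketch below. This is not the demo extension’s actual code; the AutoReg model, lag order, and file names are assumptions for illustration:

# Simplified sketch of one demo job: fit an autoregressive model to a
# country's gdp series and write predictions alongside the observed values.
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

df = pd.read_csv("gdp.csv")  # columns: year, gdp
model = AutoReg(df["gdp"], lags=1).fit()
# Predictions start after the first lag, so early years have no pred_gdp.
df["pred_gdp"] = model.predict(start=1, end=len(df) - 1)
df.to_csv("result.csv", index=False)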
Suppose that we want to merge each job’s output file into one file once all jobs are complete.
The first step is to write a script to produce the summary file. Here’s how to run the demo extension on test data.
$ jade auto-config demo tests/data/demo -c config.json
$ jade submit-jobs config.json
$ tree output
output
├── config.json
├── diff.patch
├── job-outputs
│ ├── australia
│ │ ├── events.log
│ │ ├── result.csv
│ │ ├── result.png
│ │ ├── run.log
│ │ └── summary.toml
│ ├── brazil
│ │ ├── events.log
│ │ ├── result.csv
│ │ ├── result.png
│ │ ├── run.log
│ │ └── summary.toml
│ └── united_states
│ ├── events.log
│ ├── result.csv
│ ├── result.png
│ ├── run.log
│ └── summary.toml
├── results.json
└── submit_jobs.log
Note
The dataset tests/data/gdp used here contains only 3 countries.
The content of result.csv looks similar to this:
year,gdp,pred_gdp
1960,543300000000,
1961,563300000000,
1962,605100000000,
...
2016,18707188235000,19406250376876.492
2017,19485393853000,20519007253667.656
2018,20494100000000,20672861935684.523
Our post-processing task is to collect the result.csv files from all jobs,
extract the pred_gdp column from each one, and aggregate them into one CSV
file. The script jade/extensions/demo/merge_pred_gdp.py writes this result
to pred_gdp.csv.
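A minimal sketch of such a merge is shown below. The real merge_pred_gdp.py may be structured differently; the command-line interface here (a job-outputs directory in, an output file out) is an assumption:

# Hedged sketch of the merge step: read each job's result.csv, keep its
# pred_gdp column renamed to the country, and join everything on year.
import os
import sys

import pandas as pd

def merge_pred_gdp(job_outputs_dir, output_file):
    merged = None
    for country in sorted(os.listdir(job_outputs_dir)):
        path = os.path.join(job_outputs_dir, country, "result.csv")
        if not os.path.exists(path):
            continue  # skip entries without a result.csv
        df = pd.read_csv(path)[["year", "pred_gdp"]]
        df = df.rename(columns={"pred_gdp": country})
        merged = df if merged is None else merged.merge(df, on="year")
    merged.to_csv(output_file, index=False)

if __name__ == "__main__":
    merge_pred_gdp(sys.argv[1], sys.argv[2])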
Now let’s automate this workflow in a JADE pipeline using two stages.
The first stage will use the demo
extension. The script jade/extensions/demo/create_demo_config.sh
creates its config file.
$ cat jade/extensions/demo/create_demo_config.sh
#!/bin/bash
jade auto-config demo tests/data/demo -c config-stage1.json
The second stage will use the generic_command extension. We will create a
config that runs one “generic_command”: the merge script described above,
which post-processes the results.
The script to create the stage 2 configuration is
jade.extensions.demo.create_merge_pred_gdp.
Note that this script reads the environment variable JADE_PIPELINE_STATUS_FILE to find out the output directory name of the first stage as well as its own output directory.
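Putting this together, the stage-2 creation script could look roughly like the sketch below. The status-file schema, the stage ordering, and the use of jade config create to build a generic_command config from a commands file are all assumptions here; the repository’s create_merge_pred_gdp script is the authoritative version:

#!/usr/bin/env python
# Hedged sketch of a stage-2 config-creation script. The status-file
# schema and the `jade config create` invocation are assumptions.
import os
import subprocess

import toml

status = toml.load(os.environ["JADE_PIPELINE_STATUS_FILE"])
# Assumed layout: one entry per stage, in order, with its output directory.
stage1_output = status["stages"][0]["output_directory"]
stage2_output = status["stages"][1]["output_directory"]

command = (
    "python jade/extensions/demo/merge_pred_gdp.py "
    f"{stage1_output}/job-outputs {stage2_output}/pred_gdp.csv"
)
with open("commands.txt", "w") as f:
    f.write(command + "\n")

# Build a generic_command config (one command per line in commands.txt).
subprocess.run(
    ["jade", "config", "create", "commands.txt", "-c", "config-stage2.json"],
    check=True,
)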
Let’s create the pipeline and submit it for execution.
$ jade pipeline create -a ./jade/extensions/demo/create_demo_config.sh -a ./jade/extensions/demo/create_merge_pred_gdp.py
Created pipeline config file pipeline.toml
$ jade pipeline submit pipeline.toml
Let’s take a look at the output directory. You’ll notice that the
per-country results are in output-stage1 and the summary file pred_gdp.csv
is in output-stage2.
$ tree output
output
├── config-stage1.json
├── config-stage2.json
├── output-stage1
│ ├── config.json
│ ├── diff.patch
│ ├── job-outputs
│ │ ├── australia
│ │ │ ├── events.log
│ │ │ ├── result.csv
│ │ │ ├── result.png
│ │ │ ├── run.log
│ │ │ └── summary.toml
│ │ ├── brazil
│ │ │ ├── events.log
│ │ │ ├── result.csv
│ │ │ ├── result.png
│ │ │ ├── run.log
│ │ │ └── summary.toml
│ │ └── united_states
│ │ ├── events.log
│ │ ├── result.csv
│ │ ├── result.png
│ │ ├── run.log
│ │ └── summary.toml
│ ├── results.json
│ └── submit_jobs.log
├── output-stage2
│ ├── config.json
│ ├── diff.patch
│ ├── job-outputs
│ ├── pred_gdp.csv
│ ├── results.json
│ └── submit_jobs.log
├── pipeline_status.toml
├── pipeline_submit.log
└── pipeline.toml
In pred_gdp.csv, you’ll see the content:
year,brazil,australia,united_states
1960,,,
1961,,,
1962,,,
...
2016,2080587377798.5112,1258003336600.582,19406250376876.49
2017,1827457759144.0063,1438897367269.8796,20519007253667.656
2018,1995335978627.933,2154574393156.4248,20672861935684.523
Done!