Resource Utilization Statistics¶
Torc will optionally monitor resource utilization on compute nodes.
Configuration¶
You can define these settings in the config
field of the workflow specification JSON5 file.
config: {
compute_node_resource_stats: {
cpu: true,
disk: false,
memory: true,
network: false,
process: true,
include_child_processes: true,
recurse_child_processes: false,
monitor_type: "aggregation",
make_plots: true,
interval: 10,
}
}
Setting cpu
, disk
, memory
, or network
to true will track those resources on the
compute node overall. Setting process
to true will track CPU and memory usage on a per-job
basis.
You can set monitor_type
to these options:
aggregation
: Track min/max/average stats in memory and record the results in the database.periodic
: Record time-series data on an interval in per-node SQLite database files (<output-dir>/stats/*.sqlite
).
If monitor_type = periodic
and make_plots = true
then torc will generate HTML plots of the
results (<output-dir>/stats/*.html
).
Aggregated Stats¶
The commands below will print summaries of the stats in the terminal. These stats are stored in the database.
$ torc jobs list-process-stats
$ torc compute-nodes list-resource-stats
Time Series Stats¶
If you set monitor_type = periodic
in the config then you will one SQLite file per compute node
in <output-dir>/stats/*.sqlite
. Here are some example commands.
$ sqlite3 -table output/stats/compute_node_98209950.sqlite
SQLite version 3.41.2 2023-03-22 11:56:21
sqlite> .tables
cpu disk memory network process
$ sqlite> select * from cpu;
+------+------+--------+------+-------------+----------------------------+
| user | nice | system | idle | cpu_percent | timestamp |
+------+------+--------+------+-------------+----------------------------+
| 2.0 | 0.0 | 2.0 | 10.0 | 28.6 | 2023-04-28 13:51:42.655853 |
| 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2023-04-28 13:51:46.350560 |
| 16.1 | 0.0 | 7.4 | 76.5 | 23.5 | 2023-04-28 13:51:49.541241 |
| 11.0 | 0.0 | 10.8 | 78.2 | 21.8 | 2023-04-28 13:51:51.702789 |
| 10.3 | 0.0 | 8.1 | 81.6 | 18.4 | 2023-04-28 13:51:52.832309 |
| 11.5 | 0.0 | 1.8 | 86.7 | 13.3 | 2023-04-28 13:51:53.966989 |
| 9.4 | 0.0 | 3.5 | 87.1 | 12.9 | 2023-04-28 13:51:54.175749 |
+------+------+--------+------+-------------+----------------------------
$ sqlite> select timestamp, available / (1024*1024*1024) as available_gb, percent from memory;
+----------------------------+--------------+---------+
| timestamp | available_gb | percent |
+----------------------------+--------------+---------+
| 2023-04-28 13:51:42.655853 | 17 | 45.6 |
| 2023-04-28 13:51:46.350560 | 17 | 45.6 |
| 2023-04-28 13:51:49.541241 | 16 | 47.0 |
| 2023-04-28 13:51:51.702789 | 17 | 46.6 |
| 2023-04-28 13:51:52.832309 | 17 | 45.6 |
| 2023-04-28 13:51:53.966989 | 17 | 45.8 |
| 2023-04-28 13:51:54.071424 | 17 | 45.2 |
+----------------------------+--------------+---------+
$ sqlite> select timestamp, job_key, cpu_percent, rss / (1024*1024*1024) AS rss_gb from process;
+----------------------------+----------+-------------+--------------------+
| timestamp | job_key | cpu_percent | rss_gb |
+----------------------------+----------+-------------+--------------------+
| 2023-04-28 13:51:46.350560 | 98207990 | 82.8 | 0.331188201904297 |
| 2023-04-28 13:51:46.350560 | 98208002 | 0.0 | 0.396995544433594 |
| 2023-04-28 13:51:49.541241 | 98207918 | 0.0 | 0.0418891906738281 |
| 2023-04-28 13:51:49.541241 | 98207930 | 0.0 | 0.0420913696289062 |
| 2023-04-28 13:51:49.541241 | 98207954 | 0.0 | 0.216609954833984 |
| 2023-04-28 13:51:49.541241 | 98207966 | 0.0 | 0.0409011840820312 |
| 2023-04-28 13:51:49.541241 | 98207990 | 0.0 | 0.0354042053222656 |
| 2023-04-28 13:51:49.541241 | 98208002 | 0.0 | 0.0270614624023437 |
| 2023-04-28 13:51:51.702789 | 98207954 | 0.0 | 0.041168212890625 |
| 2023-04-28 13:51:51.702789 | 98207966 | 0.0 | 0.0479011535644531 |
| 2023-04-28 13:51:51.702789 | 98207990 | 0.0 | 0.0424423217773437 |
| 2023-04-28 13:51:51.702789 | 98208002 | 0.0 | 0.0340538024902344 |
| 2023-04-28 13:51:52.832309 | 98207990 | 83.2 | 0.293796539306641 |
| 2023-04-28 13:51:52.832309 | 98208002 | 0.0 | 0.410999298095703 |
| 2023-04-28 13:51:53.966989 | 98207990 | 0.0 | 0.0494346618652344 |
| 2023-04-28 13:51:53.966989 | 98208002 | 0.0 | 0.0381813049316406 |
+----------------------------+----------+-------------+--------------------+
Time Series Job-Process Stats¶
As stated above, torc records time-series stats in one SQLite file per compute node. This is inconvenient for job-process stats. You typically want to look at all process stats together rather than have them separated by compute node. Torc provides a CLI command to concatenate them in one file.
$ torc stats concatenate-process output/stats
2023-04-27 17:01:10,907 - INFO [torc.utils.sql sql.py:103] : Added table process from output/stats/compute_node_98209951.sqlite to output/stats/job_process_stats.sqlite
2023-04-27 17:01:10,909 - INFO [torc.utils.sql sql.py:103] : Added table process from output/stats/compute_node_98209950.sqlite to output/stats/job_process_stats.sqlite