SLURM - monitor resources during job

Hi there!

Our group is interested in monitoring the memory and CPU usage during a SLURM job. A while ago I asked on Stack Overflow how to achieve this (see here).
In short, a good way to collect this data would be via the AcctGatherProfileType/hdf5 plugin (see here).
I quickly checked, and it seems this plugin is currently not installed on Baobab. If you find it useful, could you please install it?

Thank you,
Alex

Hi there,

Sorry for the very late reply.

We had already thought about adding the AcctGatherProfileType/hdf5 plugin, but we have not yet found the time to test it first.

While the forthcoming maintenance (cf. Baobab scheduled maintenance: 27-28 November 2019) would be a good opportunity to compile an HDF5-enabled Slurm, a new minor bugfix release of Slurm has unfortunately just been announced (cf. [slurm-users] Slurm version 19.05.4 is now available, SC19), so I would rather not introduce the new version and the HDF5 support at the same time.

I will get back to you ASAP.

Thx, bye,
Luca


Hi there,

I am sorry it took so long, but during the 2020-08 Baobab maintenance (cf. Baobab scheduled maintenance: 26-27 August 2020) we finally updated Slurm to the latest upstream version and took the opportunity to enable the HDF5 plugin.

Some notes:

  1. please check the upstream documentation (cf. Slurm Workload Manager - Profiling Using HDF5 User Guide) to learn how to profile your jobs.
  2. the HDF5 files will be saved as /opt/cluster/slurm/hdf5/${USERNAME}/${JOBID}.h5.
  3. you can analyze the HDF5 files directly on the login node via HDFView (cf. New software installed: HDFView/2.14-System-Java-centos7).
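To make point 1 concrete, here is a minimal job-script sketch showing how profiling is typically requested with the HDF5 plugin via the standard `--profile` option; the job name, resources, and the executable `./my_program` are placeholders, and the exact partition/resource flags you need on Baobab may differ:

```shell
#!/bin/sh
# Sketch of a profiled job (resource values and program name are assumptions).
#SBATCH --job-name=profile-demo
#SBATCH --time=00:10:00
#SBATCH --mem=1G
# Ask the acct_gather_profile/hdf5 plugin to record task-level samples
# (CPU, memory, I/O) while the job runs:
#SBATCH --profile=task

# Profiling data is gathered for steps launched through srun:
srun ./my_program
```

After the job finishes, the profile should appear under the path given in note 2 above. Depending on the site configuration, you may first need to merge per-node files for a job with Slurm's `sh5util` tool (e.g. `sh5util -j <jobid>`) before opening the resulting `.h5` file in HDFView.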

Thx, bye,
Luca