Monitoring jobs
sacct and sstat
It is important to check whether the resources we request from the queuing system match the resources our jobs actually use. The Slurm command sstat reports status information for any running job/step, whereas sacct reports the accounting data of any job/step that has already finished.
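As a minimal sketch of that comparison, the shell functions below wrap sstat for a job that is still running and sacct for one that has finished. The function names (job_live_usage, job_final_usage) and the selected field lists are illustrative assumptions, not part of the cluster setup; the fields themselves are standard sstat and sacct format fields.

```shell
# Hypothetical helper names; the field lists are just one possible choice.

# Live CPU statistics for a job that is still running.
job_live_usage() {
    sstat -j "$1" --format=JobID,NTasks,AveCPU,MinCPU
}

# Requested versus consumed time and CPU for a job that has finished.
job_final_usage() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,Timelimit,AllocCPUS,TotalCPU
}
```

Call them with a job ID, for example job_final_usage 123456 (a placeholder ID). An Elapsed value far below Timelimit, or a TotalCPU value far below Elapsed multiplied by AllocCPUS, suggests the job requested more than it used.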
There’s an alias (sc) in your ~/.bashrc file that prettifies the output of sacct.
To see older jobs, you can specify a start time and/or an end time with the -S and -E options:
sc -S 2019-04-10 -E 2019-04-29
This lists jobs from the window between April 10th and April 29th, 2019. The man page for sacct describes the remaining options.
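When the window is relative to today, the start date can be computed instead of typed. The sketch below assumes GNU date (for relative dates such as "7 days ago") and calls sacct directly, because aliases like sc are not expanded inside scripts; the function name jobs_last_days and its field list are hypothetical.

```shell
# Hypothetical helper: list your jobs from the last N days (default 7).
# Assumes GNU date, which understands relative dates like "7 days ago".
jobs_last_days() {
    days="${1:-7}"
    start="$(date -d "$days days ago" +%Y-%m-%d)"
    # -S sets the start of the window; -E now ends it at the current time.
    sacct -S "$start" -E now --format=JobID,JobName,State,Start,End,Elapsed
}
```

For example, jobs_last_days 30 shows roughly the last month of jobs.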
NOTE: there is a known problem with how memory usage information is gathered, which we are working on fixing. Do not trust the values reported in MaxVMSize or MaxRSS.
Ganglia
The CPU load and memory pressure on the whole cluster and on every compute node can be monitored with Ganglia. It is intended mainly for system administrators, but it can also help you gauge how your jobs are doing: many of you reserve whole nodes for your jobs, so the node-level metrics map closely onto the job. You can access the Ganglia statistics here when connected to the ICM local network.
Open XDMoD
The Open XDMoD tool allows users to track CPU hours, node usage, job waiting times, and many other statistics, broken down by user, project account, etc. You can access these data here when connected to the ICM local network.