Monitoring jobs

sacct and sstat

It is important to check whether the resources we request from the queuing system match the resources our jobs actually use. The Slurm command sstat reports status information for any running job/step, whereas sacct reports accounting information for any job/step that has already finished.
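One way to compare requested and used resources is to relate the CPU time a finished job consumed (the sacct field TotalCPU) to the core-seconds it was allocated (CPUTimeRAW). The following is only a minimal sketch: the helper names to_seconds and cpu_efficiency and the JOBID variable are made up for this example and are not part of Slurm or of the cluster setup.

```shell
# Hypothetical helpers (not part of Slurm): estimate the CPU efficiency of a finished job.

to_seconds() {
  # Convert a Slurm time string ([DD-][HH:]MM:SS[.mmm]) to whole seconds.
  echo "$1" | awk '{
    days = 0; t = $1
    if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
    n = split(t, p, ":")
    secs = 0
    for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
    printf "%d\n", days * 86400 + secs
  }'
}

cpu_efficiency() {
  # $1 = TotalCPU (time string), $2 = CPUTimeRAW (allocated core-seconds).
  used=$(to_seconds "$1")
  awk -v u="$used" -v a="$2" 'BEGIN {
    if (a > 0) printf "%.1f\n", 100 * u / a; else print "n/a"
  }'
}

# Only query Slurm when sacct is available and JOBID (an assumed variable) is set.
if command -v sacct >/dev/null 2>&1 && [ -n "${JOBID:-}" ]; then
  sacct -j "$JOBID" -X -n -P --format=JobID,TotalCPU,CPUTimeRAW |
    while IFS='|' read -r id total raw; do
      echo "$id: $(cpu_efficiency "$total" "$raw")% CPU efficiency"
    done
fi
```

A value far below 100% suggests the job asked for more cores than it used. For a job that is still running, sstat offers a live counterpart, for example sstat -j <jobid>.batch --format=JobID,AveCPU.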

There is an alias (sc) in your ~/.bashrc file that prettifies the output of sacct. To see older jobs, you can specify a start time and/or an end time with the following options:

sc -S 2019-04-10 -E 2019-04-29

to see the jobs that fall between April 10th 2019 and April 29th 2019. Read the man page for sacct and experiment with its options.
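The same time window can be passed to sacct directly, with an explicit list of columns. The sketch below is only an illustration: the function name jobs_between, the DRY_RUN variable and the chosen field list are assumptions, and the definition of the sc alias itself is not reproduced here.

```shell
# Hypothetical helper: list jobs in a date window with an explicit column set.
jobs_between() {
  # $1 = start date, $2 = end date (e.g. 2019-04-10)
  fields="JobID,JobName,Partition,AllocCPUS,Elapsed,State"
  if [ -n "${DRY_RUN:-}" ] || ! command -v sacct >/dev/null 2>&1; then
    # No Slurm available (or dry run requested): just show the command.
    echo "sacct -S $1 -E $2 --format=$fields"
  else
    sacct -S "$1" -E "$2" --format="$fields"
  fi
}
```

For example, jobs_between 2019-04-10 2019-04-29 covers the same window as the sc command above. The full list of available columns is printed by sacct --helpformat.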

NOTE: there is a known problem with how memory usage information is gathered, which we are working on fixing. Do not trust the values reported in MaxVMSize or MaxRSS.

Ganglia

The CPU load and memory pressure of the whole cluster and of every compute node can be monitored with Ganglia. It is intended mainly for system administrators, but it can also help you gauge how your jobs are doing: since many of you reserve whole nodes for your jobs, the node-level metrics map closely to per-job usage. You can access the Ganglia statistics here when connected to the ICM local network.

Open XDMoD

The Open XDMoD tool allows users to track CPU hours, node usage, job waiting times, and many other statistics, broken down by user, project account, etc. You can access these data here when connected to the ICM local network.