GPU Saturation Scorer¶
Info
The GPU Saturation Scorer (GSSR) will ultimately replace Jobreport but is currently limited to Swiss AI proposals. Please refer to the submission guidelines for which tool to use.
Overview¶
GSSR records how the GPUs on allocated nodes are used during a job and generates a PDF report suitable for project proposals. The PDF helps reviewers better understand the GPU usage of your application.
Using GSSR always follows the same two steps:
- Run your job while recording GPU metrics
- Generate the PDF report
Choosing the right run for your proposal¶
Your profiling run should demonstrate how a typical simulation from your proposal performs.
A good profiling run:
- Captures GPU usage for a few minutes after any initial data loading and setup.
- Represents real training or simulation behaviour.
- Shows steady GPU usage.
Downloading GSSR¶
Clone the GSSR repository onto a GH200 Alps login node and build it:
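The commands below are a sketch: the repository URL is a placeholder and the exact build step may differ, so follow the repository's README.

```bash
# Placeholder URL -- replace with the actual GSSR repository address
git clone https://github.com/<org>/gssr.git
cd gssr

# Build/install following the repository README; for a uv-managed Python
# project this could be, for example:
uv sync
```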
Optionally, install uv if you don't already have it.
Follow the official uv instructions, or simply run:
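For reference, the standalone installer documented by the uv project is:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```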
Tip
Consider putting gssr-record and gssr-analyze in an architecture-specific location in your $PATH.
This approach is described in detail elsewhere in the documentation.
Otherwise, you will need to modify the examples to include the path to the tools in the command.
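As a sketch, assuming a hypothetical per-architecture directory $HOME/bin/$(uname -m):

```bash
# Create a per-architecture bin directory and copy the tools into it
# (adjust the source paths to wherever your build placed the binaries)
mkdir -p "$HOME/bin/$(uname -m)"
cp gssr-record gssr-analyze "$HOME/bin/$(uname -m)/"

# Make it visible on your PATH, e.g. in ~/.bashrc
export PATH="$HOME/bin/$(uname -m):$PATH"
```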
Quick start example, single node¶
This example uses a dummy job so you can verify everything works in under one minute.
Step 1 – record a run¶
Run any command using gssr-record:
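For example, using the harmless test command from this page (gssr-record is simply prefixed to the command you want to run, as in the Slurm example further below):

```bash
gssr-record sleep 30
```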
What happens:
- Your command runs normally (e.g., sleep 30)
- GPU metrics are recorded in the directory gssr_report/
Step 2 – generate the report¶
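A sketch of this step, assuming gssr-analyze is pointed at the recording directory created in step 1:

```bash
gssr-analyze gssr_report/
```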
You now have a GPU utilization report in gssr-report.pdf.
Open the PDF to verify it was created successfully.
Note: sleep will not produce any GPU activity. This example is just to verify the workflow.
Real usage with your application, multi-node¶
Replace the test command with your real workload. Example:
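For instance, launching a training script on the allocated nodes (python train.py stands in for your own application, as in the Slurm script below):

```bash
srun gssr-record python train.py
```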
After the job finishes, generate the report with gssr-analyze as in the quick start example, then upload the generated PDF with your project proposal.
Slurm job script example¶
#!/bin/bash
#SBATCH --job-name=gssr-test
#SBATCH --nodes=4
#SBATCH --gpus=4
#SBATCH --time=00:30:00
srun gssr-record python train.py
After the job completes, generate the report with gssr-analyze in the same way.
Using GSSR inside containers¶
GSSR uses the NVIDIA DCGM library to read GPU metrics. This is available on the Alps host system.
When running inside a container, you must enable the DCGM hook in your EDF file:
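A sketch of the relevant EDF entry is shown below; the annotation key follows the container engine's hook-naming convention, so check the Alps container engine documentation for the exact spelling.

```toml
[annotations]
com.hooks.dcgm.enabled = "true"
```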
Without this setting, GPU metrics cannot be collected and you will receive the error:
error while loading shared libraries: libdcgm.so.3: cannot open shared object file: No such file or directory
Warning
The output directory should be under a path that has been bind-mounted in the
.toml file. If the output directory is inside the container, the data will
be lost when the container ends. GSSR attempts to detect this scenario and warn
the user.
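As an illustration, a host path can be bind-mounted in the EDF so the output directory survives the container (the path below is only an example):

```toml
# Example bind mount: host path on the left, path inside the container on the right
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
```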
Important behaviour to know¶
Overlapping jobs share GPU data¶
If multiple jobs run on the same GPUs at the same time, they will record the same GPU metrics.
This behaviour is normal.
Output files¶
After recording, the default output directory (gssr_report/) contains the raw GPU metric files.
Most users do not need to inspect these files directly.
They are used to generate the report.
Tip
gssr-analyze will generate a single report containing all the jobs in the given directory.
Alternatively, one can list specific directories to include, e.g., gssr_report/alps-daint_1234567 gssr_report/alps-clariden_7654321.
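For example, using the illustrative job directories from the tip above:

```bash
gssr-analyze gssr_report/alps-daint_1234567 gssr_report/alps-clariden_7654321
```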
Troubleshooting¶
The report is empty¶
Your job likely did not see GPUs.
Verify that GPUs are visible inside your job or container.
I forgot to run gssr-record¶
You must rerun the job.
GPU metrics cannot be recreated after a job finishes.
The run was very short¶
Runs shorter than approximately one minute may not produce useful plots.
Command reference¶
gssr-record¶
Run a command while recording GPU metrics.
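The basic invocation, as used in the examples above, is:

```bash
gssr-record <command> [args...]
```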
Options:
gssr-analyze¶
Generate a PDF report from recorded metrics.
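The basic invocation, as used in the examples above, is:

```bash
gssr-analyze <output-directory> [<output-directory> ...]
```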
Options:
Support¶
If you encounter problems, contact support and include:
- Job script
- GSSR output directory
- Error messages