GPU Saturation Scorer

Info

The GPU Saturation Scorer (GSSR) will ultimately replace Jobreport but is currently limited to Swiss AI proposals. Please refer to the submission guidelines for which tool to use.

Overview

GSSR records how the GPUs on allocated nodes are used during a job and generates a PDF report suitable for project proposals. The PDF helps reviewers better understand the GPU usage of your application.

Using GSSR always follows the same two steps:

  1. Run your job while recording GPU metrics
  2. Generate the PDF report

Choosing the right run for your proposal

Your profiling run should demonstrate how a typical simulation from your proposal performs.

A good profiling run:

  • Captures GPU usage for a few minutes after any initial data loading and setup.
  • Represents real training or simulation behaviour.
  • Shows steady GPU usage.

Downloading GSSR

Clone the GSSR repository onto a GH200 Alps login node and build it:

git clone https://github.com/eth-cscs/GPU-Saturation-Scorer.git
cd GPU-Saturation-Scorer
make

If you do not already have uv installed, follow the official uv instructions, or simply run

make install-uv

Tip

Consider placing gssr-record and gssr-analyze in an architecture-specific location on your $PATH, as discussed in detail on this page; a sketch is shown below. Otherwise, you will need to modify the examples to include the full path to the tools in each command.
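
A minimal sketch, assuming make leaves the two binaries in the repository root and that $HOME/bin/$(uname -m) is already on your $PATH (both are assumptions; adjust to your setup):

mkdir -p "$HOME/bin/$(uname -m)"
cp gssr-record gssr-analyze "$HOME/bin/$(uname -m)/"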

Quick start example, single node

This example uses a dummy job so you can verify everything works in under one minute.

Step 1 – record a run

Run any command using gssr-record:

gssr-record sleep 30

What happens:

  • Your command runs normally (e.g., sleep 30)
  • GPU metrics are recorded in the directory gssr_report/ (the location can be changed, as shown below)
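
If you prefer a different output location, pass the -o option of gssr-record (see the command reference below), for example:

gssr-record -o my_metrics sleep 30

Remember to pass the same directory to gssr-analyze in the next step.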

Step 2 – generate the report

gssr-analyze gssr_report -o gssr-report.pdf

You now have a GPU utilization report in gssr-report.pdf.
Open the PDF to verify it was created successfully.

Note: sleep will not produce any GPU activity. This example is just to verify the workflow.

Real usage with your application, multinode

Replace the test command with your real workload. Example:

srun -N4 gssr-record python train.py

After the job finishes:

gssr-analyze gssr_report -o gpu-report.pdf

Upload the generated PDF with your project proposal.

Slurm job script example

#!/bin/bash
#SBATCH --job-name=gssr-test
#SBATCH --nodes=4
#SBATCH --gpus-per-node=4
#SBATCH --time=00:30:00

srun gssr-record python train.py
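
Submit the script with sbatch; the script filename below is illustrative:

sbatch gssr_job.sh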

After the job completes:

gssr-analyze gssr_report -o gpu-report.pdf

Using GSSR inside containers

GSSR reads GPU metrics through the NVIDIA DCGM library, which is available on the Alps host system.

When running inside a container, you must enable the DCGM hook in your EDF file:

[annotations]
com.hooks.dcgm.enabled = "true"

Without this setting, GPU metrics cannot be collected and you will receive the error:

error while loading shared libraries: libdcgm.so.3: cannot open shared object file: No such file or directory

Warning

The output directory should be under a path that is bind-mounted in the EDF (.toml) file; an example follows this warning. If the output directory exists only inside the container, the data will be lost when the container ends. GSSR attempts to detect this scenario and warn the user.
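
As a sketch, assuming an EDF named my-edf.toml whose mounts entry bind-mounts your scratch directory (both the EDF name and the path are assumptions), write the metrics to the mounted path:

srun --environment=my-edf gssr-record -o /capstor/scratch/cscs/$USER/gssr_report python train.py

and, after the job finishes:

gssr-analyze /capstor/scratch/cscs/$USER/gssr_report -o gpu-report.pdf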

Important behaviour to know

Overlapping jobs share GPU data

If multiple jobs run on the same GPUs at the same time, they will record the same GPU metrics.
This behaviour is normal.

Output files

After recording, the default output directory contains raw GPU metrics, e.g.:

gssr_report/<cluster-name>_<jobid>

Most users do not need to inspect these files directly.
They are used to generate the report.

Tip

gssr-analyze generates a single report containing all the jobs in the given directory. Alternatively, you can list specific directories to include, e.g., gssr_report/alps-daint_1234567 gssr_report/alps-clariden_7654321, as in the example below.
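
For example, to combine the two runs above into a single PDF (the directory names are the illustrative ones from this tip):

gssr-analyze gssr_report/alps-daint_1234567 gssr_report/alps-clariden_7654321 -o combined-report.pdf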

Troubleshooting

The report is empty

Your job likely did not see GPUs.
Verify that GPUs are visible inside your job or container.
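
A quick sanity check is to run nvidia-smi in the same setting as your job, for example (the EDF name is an assumption and is only needed when running in a container):

srun -N1 nvidia-smi
srun -N1 --environment=my-edf nvidia-smi

If nvidia-smi reports no GPUs, fix the allocation or container configuration before profiling again.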

I forgot to run gssr-record

You must rerun the job.
GPU metrics cannot be recreated after a job finishes.

The run was very short

Runs shorter than approximately one minute may not produce useful plots.

Command reference

gssr-record

Run a command while recording GPU metrics.

gssr-record -o <directory> <command>

Options:

-o <directory>   Output directory for recorded metrics
-h, --help       Show help
--version        Show version

gssr-analyze

Generate a PDF report from recorded metrics.

gssr-analyze <directory> -o <pdf>

Options:

-o <file>    Output PDF (default: gssr-report.pdf)

Support

If you encounter problems, contact support and include:

  • Job script
  • GSSR output directory
  • Error messages