Gordon Bell 2025¶
This is temporary documentation for the second round of Gordon Bell benchmark runs, scheduled for the week of August 18-22, 2025.
Schedule¶
Times are in CEST (Central European Summer Time, UTC+2): Time conversion table
group | date | time | duration (h) | activity |
---|---|---|---|---|
- | 08-18 | 08:00 | - | Daint is reconfigured and resized for GB runs |
all | 08-18 | ASAP | - | Daint is available to all teams for final testing at scale |
g202 | 08-18 | 21:00 | 2 | GB run |
g199 | 08-18 | 23:00 | 10 | GB run |
g186 | 08-19 | 09:00 | 6 | GB run |
g200 | 08-19 | 15:00 | 3 | GB run |
g183 | 08-19 | 18:00 | 24 | GB run |
cwd01 | 08-20 | 18:00 | 5 | GB run |
- | 08-20 | 23:00 | 9 | free slot |
g188 | 08-21 | 08:00 | 8 | GB run |
g202 | 08-21 | 16:00 | 1 | GB run |
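The slot times in the schedule can be converted to UTC on the command line. The following is a minimal sketch, assuming GNU date (as found on most Linux systems); the helper name cest_to_utc is hypothetical and only relies on CEST being UTC+2.

```bash
#!/bin/bash
# Sketch: convert a schedule time given in CEST (UTC+2) to UTC.
# Assumes GNU date; cest_to_utc is a hypothetical helper name.
cest_to_utc() {
    # "$1" is a timestamp such as '2025-08-18 21:00', interpreted as UTC+2.
    date -u -d "$1 +0200" '+%Y-%m-%d %H:%M UTC'
}

# First g202 slot from the schedule.
cest_to_utc '2025-08-18 21:00'
```

To convert to a local time zone instead, drop the -u flag and set the TZ environment variable to the zone of interest.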
System¶
The system Daint will be expanded to approximately 2350 Grace-Hopper nodes.
Information about the partition, account, and time limits:
#!/bin/bash
#SBATCH --account=<group>
#SBATCH --partition=normal
#SBATCH --reservation=<group>
srun --uenv=prgenv-gnu/24.11:v2 --view=default -n? -N? ....
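The -n and -N values in the template are left open and depend on each team's run. As an illustration only, the sketch below derives the task count from the node count. It assumes one MPI rank per GPU and four GPUs per node, which is not stated on this page; the node count of 2048 and the executable ./app are hypothetical.

```bash
#!/bin/bash
# Sketch: derive the srun task count from the node count.
# ASSUMPTIONS (not from this page): 4 ranks per node, one per GPU.
# The node count and the executable name are hypothetical examples.
nodes=2048
ranks_per_node=4
ntasks=$(( nodes * ranks_per_node ))

srun_args="-N ${nodes} -n ${ntasks} --ntasks-per-node=${ranks_per_node}"

# Print the resulting command line instead of launching anything.
echo "srun --uenv=prgenv-gnu/24.11:v2 --view=default ${srun_args} ./app"
```

Printing the command first makes it easy to check the numbers before submitting a job at full scale.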
Recommendations on run configuration¶
Disabling core-dumps¶
If a large job crashes and thousands of processes try to write core-dump files at the same time, they will overwhelm the filesystem. We therefore strongly recommend disabling core dumps.
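The command itself does not appear on this page. A common way to disable core dumps, assuming a bash job script, is the ulimit shell builtin, placed before the srun line so that the launched processes can inherit the limit:

```bash
# Set the maximum core-file size to 0 for this shell and the processes
# it launches, which prevents core-dump files from being written.
ulimit -c 0

# Optional check: prints the current core-file size limit.
ulimit -c
```

Whether the limit reaches every rank depends on how the scheduler propagates resource limits, so treat this as a sketch and confirm it on a small test job.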
Improving job startup times¶
In the first round of GB runs we identified slow job startup times as a common cause of crashes during job startup.
Together with HPE, we identified the most likely cause as filesystem contention when loading dynamic libraries before main() starts.
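The fix is to change how the SquashFS file for the uenv or container used by your job is stored on the filesystem, by restriping it with lfs migrate. For a uenv image, first look up the path of the SquashFS file, then migrate it: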
$ uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}'
/capstor/scratch/cscs/bcumming/.uenv-images/images/6068794b820fb4dd91019d020d6d98334a2f9fd23035a5e4a2f72f9dda5f1260/store.squashfs
$ lfs migrate --stripe-count 20 --stripe-size 1M $(uenv image inspect prgenv-gnu/24.11:v2 --format='{sqfs}')
If you are using a SquashFS image for your Python environment, you should also set the striping for that file.
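The same striping can be applied to any other SquashFS file, such as a Python environment image. The sketch below wraps the lfs migrate call from above in a small function and then queries the stripe count with lfs getstripe. The function name stripe_image, the DRY_RUN switch, and the image path are hypothetical and not part of any CSCS tooling.

```bash
#!/bin/bash
# Sketch: apply the striping settings shown above to an arbitrary SquashFS file.
# stripe_image, DRY_RUN and the example path are hypothetical.
stripe_image() {
    local image="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be executed.
        echo "lfs migrate --stripe-count 20 --stripe-size 1M ${image}"
        return 0
    fi
    # Restripe the file, then report its stripe count for verification.
    lfs migrate --stripe-count 20 --stripe-size 1M "${image}" &&
        lfs getstripe --stripe-count "${image}"
}

# Dry run with a placeholder path; replace it with your own image.
DRY_RUN=1 stripe_image /path/to/pyenv.squashfs
```

Running without DRY_RUN requires the lfs client and a file on a Lustre filesystem.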
As an additional precaution, we recommend increasing the wait threshold for MPI_Init from the default of 180 seconds to 300 seconds.
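The setting that controls this threshold is not named on this page. On HPE Cray systems the PMI environment variable PMI_MMAP_SYNC_WAIT_TIME has a default of 180 seconds, so it is assumed here to be the relevant knob; please confirm this before relying on it. A minimal sketch for a job script:

```bash
# ASSUMPTION: PMI_MMAP_SYNC_WAIT_TIME is the Cray PMI variable behind the
# MPI_Init wait threshold (default 180 seconds); this page does not name it.
# Set it before the srun line so that all ranks inherit the value.
export PMI_MMAP_SYNC_WAIT_TIME=300
```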