Clariden¶
Clariden is an Alps cluster that provides GPU accelerators and file systems designed to meet the needs of machine learning workloads in the MLP.
Cluster Specification¶
Compute Nodes¶
Clariden consists of around 1200 Grace-Hopper nodes. The number of nodes can change as nodes are moved to or from other clusters on Alps.
| node type | number of nodes | total CPU sockets | total GPUs |
|---|---|---|---|
| gh200 | 1,200 | 4,800 | 4,800 |
Most nodes are in the normal Slurm partition, while a few nodes are in the debug partition.
Storage and file systems¶
Clariden uses the MLP filesystems and storage policies.
Getting started¶
Logging into Clariden¶
To connect to Clariden via SSH, first refer to the ssh guide.
~/.ssh/config
Add the following to your SSH configuration so that you can connect directly to Clariden using ssh clariden.
Software¶
Users are encouraged to use containers on Clariden.
- Jobs using containers can be easily set up and submitted using the container engine.
- To build images, see the guide to building container images on Alps.
- The Nvidia NGC Catalog provides containers with pre-built ML software stacks:
  - Recommended: Alps extended images provided by CSCS are customized versions of NGC images optimized for the Alps network.
  - Alternatively, start with base images from the Nvidia NGC Catalog, for example the HPC and PyTorch images. Note that you will have to use container hooks to get optimal network performance.
Alternatively, uenv are also available on Clariden. Currently deployed on Clariden:
using uenv provided for other clusters
You can run uenv that were built for other Alps clusters using the @ notation.
For example, to use uenv images for daint:
For detailed instructions and best practices with ML frameworks, please refer to the dedicated pages under ML software.
Running Jobs on Clariden¶
Slurm¶
Clariden uses Slurm as its workload manager to launch and monitor distributed workloads, such as training runs.
There are four Slurm partitions on the system:
- the normal partition is for all production workloads.
- the debug partition is intended for short debugging and testing jobs. It is configured on top of the same node pool as normal, with tight per-user limits to keep it focused on its intended use case.
- the low partition is a low-priority partition, which may be enabled for specific projects at specific times.
- the xfer partition is for internal data transfer at CSCS.
| name | nodes | max nodes per job | time limit |
|---|---|---|---|
| normal | most nodes | - | 12 hours |
| debug | most nodes (shared with normal) plus a few dedicated | 4 | 1.5 node-hours |
| low | most nodes (shared with normal) | - | 24 hours |
| xfer | 2 | 1 | 24 hours |
- jobs in the normal, debug, and low partitions get exclusive use of their allocated nodes (one job per node)
- the low partition shares the exact same node pool as normal, while debug shares that pool and adds a small set of nodes dedicated to debugging: short debug jobs therefore always have capacity available, even when normal is full
- because these partitions overlap, a node may belong to more than one of them at the same time
- nodes in the xfer partition can be shared
- jobs in the debug partition are limited to 1.5 node-hours. This means you could, for example, request 2 nodes for 45 minutes, or a single node for the full 90 minutes.
The debug partition has additional per-user limits enforced by its QoS:
- max 1 running job per user
- max 2 submitted jobs per user (1 running + 1 pending)
- max 90 node·minutes per job (e.g. 1 node × 90 min, 2 nodes × 45 min, or 4 nodes × 22 min)
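The node·minute budget reduces to an integer division. The sketch below uses a hypothetical helper, max_debug_minutes (not a tool provided by CSCS), to compute the longest whole-minute time limit that fits within the 90 node·minute budget for a given node count. The limits are hard-coded from the values listed above and would need updating if the QoS changes.

```shell
#!/bin/sh
# Hypothetical helper (not provided by CSCS): longest whole-minute time limit
# that keeps nodes x minutes within the debug budget of 90 node-minutes.
max_debug_minutes() {
    nodes=$1
    budget=90      # node-minutes per job, from the debug QoS limits
    max_nodes=4    # max nodes per job in the debug partition
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt "$max_nodes" ]; then
        echo "node count must be between 1 and $max_nodes" >&2
        return 1
    fi
    # Integer division rounds down, so 4 nodes gives 22 minutes (88 node-minutes).
    echo $(( budget / nodes ))
}

# Print the longest allowed time limit for each permitted node count.
for n in 1 2 3 4; do
    printf '%s node(s): up to %s min\n' "$n" "$(max_debug_minutes "$n")"
done
```

Rounding down keeps the request inside the budget, which is why 4 nodes yields 22 minutes rather than 22.5.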
The debug partition is scheduled at a higher priority than normal, so debug jobs are placed ahead of normal jobs in the queue and typically start sooner. Preemption is disabled, so debug jobs never interrupt running normal jobs: they simply use idle nodes as soon as these become available. The tight per-user limits make the partition unsuitable for production workloads while keeping it responsive for short debug sessions.
The debug partition is for debugging and testing only
The debug partition is reserved for short, interactive debugging and testing sessions, and must not be used to run production workloads or to otherwise circumvent the per-user limits.
Usage of the partition is monitored: workloads that are not genuine debugging or testing will be flagged and reported.
See the Slurm documentation for instructions on how to run jobs on the Grace-Hopper nodes.
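As a rough illustration of a request that stays within the debug limits, the sketch below writes a minimal batch script to a file. The file name debug_job.sh, the job name, and the srun hostname workload are placeholders chosen for this example. The --ntasks-per-node=4 setting is an assumption (one task per GPU, based on the four GPUs per node implied by the table above), so check the Slurm documentation for the recommended settings.

```shell
#!/bin/sh
# Sketch only: write a hypothetical batch script (debug_job.sh) for a short test run.
cat > debug_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=debug-test
# Short debugging and testing jobs only.
#SBATCH --partition=debug
# At most 4 nodes per debug job.
#SBATCH --nodes=2
# 2 nodes x 45 min = 90 node-minutes, the per-job budget.
#SBATCH --time=00:45:00
# Assumption: one task per GPU, 4 GPUs per node.
#SBATCH --ntasks-per-node=4

# Placeholder workload: print the host name from every task.
srun hostname
EOF

# On Clariden, the script would then be submitted with: sbatch debug_job.sh
```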
how to check the number of nodes on the system
You can check the size of the system by running the following command in the terminal:
$ sinfo --format "| %20R | %10D | %10s | %10l | %10A |"
| PARTITION | NODES | JOB_SIZE | TIMELIMIT | NODES(A/I) |
| debug | 1384 | 1-4 | 1:30:00 | 1260/82 |
| normal | 1359 | 1-infinite | 12:00:00 | 1254/65 |
| low | 1359 | 1-infinite | 1-00:00:00 | 1254/65 |
| xfer | 2 | 1 | 1-00:00:00 | 2/0 |
The last column shows the number of nodes that are allocated to running jobs (A) and the number of nodes that are idle (I).
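To pull just the idle node count out of that last column, the output can be post-processed with awk. The sketch below defines a hypothetical helper, idle_nodes, that expects lines of the form "PARTITION ALLOCATED/IDLE". It is fed the sample values from the output above so that it runs anywhere; the sinfo invocation in the comment is how it might be used on the system.

```shell
#!/bin/sh
# Hypothetical helper: read "PARTITION ALLOCATED/IDLE" lines and print the
# number of idle nodes for each partition.
idle_nodes() {
    awk '{ split($2, ai, "/"); printf "%s %d\n", $1, ai[2] }'
}

# Sample input copied from the sinfo output above. On Clariden, use instead:
#   sinfo --noheader --format "%R %A" | idle_nodes
idle_nodes <<'EOF'
debug 1260/82
normal 1254/65
low 1254/65
xfer 2/0
EOF
```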
FirecREST¶
Clariden can also be accessed using FirecREST at the https://api.cscs.ch/ml/firecrest/v1 API endpoint.
Maintenance and status¶
Scheduled Maintenance¶
Wednesday mornings, 8-12 CET, are reserved for periodic updates, and services may be unavailable during this time. If the queues must be drained (redeployment of node images, rebooting of compute nodes, etc.), a Slurm reservation will be in place to prevent jobs from running into the maintenance window.
Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the CSCS status page.
Change log¶
2025-03-05 container engine updated
now supports better containers that go faster. Users do not need to change their workflow to take advantage of these updates.
2024-10-07 old event
this is an old update. Use ??? to automatically fold the update.