
Daint

Daint is the main HPC Platform cluster that provides compute nodes and file systems for GPU-enabled workloads.

Cluster specification

Compute nodes

Daint consists of roughly 1,000 Grace-Hopper (GH200) nodes.

The exact number can vary as nodes are moved between Daint and other clusters on Alps.

There are four login nodes, daint-ln00[1-4]. You will be assigned to one of them when you SSH into the system; from there you can edit files, compile applications, and launch batch jobs.

node type   number of nodes   total CPU sockets   total GPUs
gh200       1,022             4,088               4,088

Storage and file systems

Daint uses the HPC Platform (HPCP) filesystems and storage policies.

Getting started

Logging into Daint

To connect to Daint via SSH, first refer to the ssh guide.

~/.ssh/config

Add the following to your SSH configuration so that you can connect to Daint directly with ssh daint.

Host daint
    HostName daint.alps.cscs.ch
    ProxyJump ela
    User cscsusername
    IdentityFile ~/.ssh/cscs-key
    IdentitiesOnly yes
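
Replace cscsusername with your CSCS username and adjust IdentityFile if your key is stored elsewhere. With this entry in place, a single command opens a session on one of the login nodes (the file transfer target path below is a placeholder):

# log in to one of the Daint login nodes via the ela jump host
ssh daint

# the same alias works for tools that read ~/.ssh/config, e.g. file transfer
scp input.dat daint:/path/to/destination/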

Software

uenv

Daint provides uenv to deliver programming environments and application software. Please refer to the uenv documentation for detailed information on how to use the uenv tools on the system.

  • Scientific Applications

    Provides the latest versions of scientific applications, tuned for Daint, and the tools required to build your own versions of the applications.
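
As an illustrative sketch (the image name, version, and view below are placeholders; use uenv image find to see what is actually deployed on Daint), a typical uenv workflow looks like this:

# list the uenv images available on this cluster
uenv image find

# pull an image into your local repository (name and version are illustrative)
uenv image pull prgenv-gnu/24.11:v1

# start an interactive shell with the environment mounted
uenv start prgenv-gnu/24.11:v1 --view=default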

Containers

Daint supports container workloads using the container engine.

To build images, see the guide to building container images on Alps.
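
As a hedged sketch (the image, mount path, and EDF name are placeholders, and the default EDF search location $HOME/.edf is assumed here), a minimal environment definition file (EDF) can be created and passed to Slurm with --environment:

# create a minimal EDF (TOML) describing the container to run
mkdir -p $HOME/.edf
cat > $HOME/.edf/ubuntu.toml << 'EOF'
image = "ubuntu:24.04"
mounts = ["/capstor/scratch/cscs/${USER}:/scratch"]
workdir = "/scratch"
EOF

# launch a job step inside the container
srun --environment=ubuntu cat /etc/os-release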

Cray Modules

Warning

The Cray Programming Environment (CPE), loaded using module load cray, is no longer supported by CSCS.

CSCS will continue to support and update uenv and the container engine, and users are encouraged to update their workflows to use these methods at the first opportunity.

The CPE is still installed on Daint; however, it will receive no support or updates, and it will be replaced with a container in a future update.

Running jobs on Daint

Slurm

Daint uses Slurm as its workload manager for launching and monitoring compute-intensive workloads.

There are four Slurm partitions on the system:

  • the normal partition is for all production workloads.
  • the debug partition can be used to access a small allocation for up to 30 minutes for debugging and testing purposes.
  • the xfer partition is for internal data transfer.
  • the low partition is a low-priority partition, which may be enabled for specific projects at specific times.

name     nodes   max nodes per job   time limit
normal   unlim   -                   24 hours
debug    24      2                   30 minutes
xfer     2       1                   24 hours
low      unlim   -                   24 hours

  • nodes in the normal and debug (and low) partitions are not shared
  • nodes in the xfer partition can be shared

See the Slurm documentation for instructions on how to run jobs on the Grace-Hopper nodes.
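
A minimal batch script sketch for the normal partition is shown below. The project account, node count, and executable are placeholders; because nodes are not shared, each job has all four GPUs of every allocated node available, and one rank per GPU is a common starting layout.

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>      # replace with your project account
#SBATCH --partition=normal       # use "debug" for short test runs
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4      # one rank per GH200 GPU
#SBATCH --time=01:00:00

srun ./my_app                    # my_app is a placeholder executable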

FirecREST

Daint can also be accessed using FirecREST at the https://api.cscs.ch/ml/firecrest/v2 API endpoint.

The FirecREST v1 API is still available, but it is deprecated.
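
As an illustrative sketch only (the token acquisition step is omitted and the resource path shown is an assumption; consult the FirecREST v2 reference for the actual endpoints), a request against the v2 API looks like:

# obtain an OAuth2 access token first, then pass it as a bearer token
TOKEN="<paste your access token here>"
curl -H "Authorization: Bearer ${TOKEN}" \
     https://api.cscs.ch/ml/firecrest/v2/status/systems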

Maintenance and status

Scheduled maintenance

Todo

move this to HPCP top level docs

Wednesday mornings 8:00-12:00 CET are reserved for periodic updates, and services may be unavailable during this time frame. If the batch queues must be drained (for redeployment of node images, rebooting of compute nodes, etc.), a Slurm reservation will be put in place to prevent jobs from running into the maintenance window.
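
To check whether such a reservation is already in place, and how it affects a pending job, the standard Slurm queries below can be used (output formats depend on the Slurm configuration):

# list active and upcoming reservations, including maintenance windows
scontrol show reservation

# show the expected start time of your pending jobs
squeue --me --start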

Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list, and on the CSCS status page.

Change log

2025-05-21

Minor enhancements to system configuration have been applied. These changes should reduce the frequency of compute nodes being marked as NOT_RESPONDING by the workload manager, while we continue to investigate the issue.

2025-05-14

Performance hotfix

The access-counter-based memory migration feature in the NVIDIA driver for Grace Hopper is disabled to address performance issues affecting NCCL-based workloads (e.g. LLM training).

NVIDIA boost slider

Added an option to enable the NVIDIA boost slider (vboost) via Slurm using the -C nvidia_vboost_enabled flag. This feature, disabled by default, may increase GPU frequency and performance while staying within the power budget.
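
For example (the job script and executable names are placeholders):

# request the vboost feature as a constraint at submission time
sbatch -C nvidia_vboost_enabled job.sh

# or for an individual job step
srun -C nvidia_vboost_enabled ./my_app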

Enroot update

The container runtime is upgraded from version 2.12.0 to 2.13.0. This update includes libfabric version 1.22.0 (previously 1.15.2.0), which has demonstrated improved performance during LLM checkpointing.

2025-04-30

uenv is updated from v7.0.1 to v8.1.0.

Release notes

Pyxis is upgraded from v24.5.0 to v24.5.3:
  • Added image caching for Enroot
  • Added support for environment variable expansion in EDFs
  • Added support for relative paths expansion in EDFs
  • Print a message about the experimental status of the --environment option when used outside of the srun command
  • Merged small features and bug fixes from upstream Pyxis releases v0.16.0 to v0.20.0
  • Internal changes: various bug fixes and refactoring

2025-03-12

  1. The number of compute nodes has been increased to 1018.
  2. The restriction on the number of running jobs per project has been lifted.
  3. A "low" priority partition has been added, which allows some project types to consume up to 130% of the project's quarterly allocation.
  4. We have increased the power cap for the GH module from 624 W to 660 W. You might see increased application performance as a consequence.
  5. Small changes have been made to kernel tuning parameters.

Known issues

Todo

Most of these issues (see original KB docs) should be consolidated in a location where they can be linked to by all clusters.

We have some "known issues" documented under communication libraries; however, these might be too dispersed for centralised linking.