Machine learning platform¶

The Machine Learning Platform (MLP) provides compute, storage and expertise to the machine learning and AI community in Switzerland, with the main user being the Swiss AI Initiative.

Getting started¶

Getting access¶

Project administrators (PIs and deputy PIs) of projects on the MLP can invite users to join their project. Users must be invited before they can use the project's resources on Alps. Invitations are sent using the project management tool.

Once invited to a project, you will receive an email with the information you need to create an account and configure multi-factor authentication (MFA).

Systems¶

The main cluster provided by the MLP is Clariden, a large Grace-Hopper GPU system on Alps.

Clariden

Clariden is the main Grace-Hopper cluster.

Bristen

Bristen is a smaller system with A100 GPU nodes for data processing, development, x86 workloads and inference services.

File Systems and Storage¶

There are three main file systems mounted on the MLP clusters Clariden and Bristen.

Type	Mount	Filesystem
Home	`/users/$USER`	VAST
Scratch	`/iopsstor/scratch/cscs/$USER`	Iopsstor
Scratch	`/capstor/scratch/cscs/$USER`	Capstor
Project	`/capstor/store/cscs/swissai/<project>`	Capstor

Home¶

Every user has a home path ($HOME) mounted at /users/$USER on the VAST filesystem. The home directory has 50 GB of capacity, and is intended for configuration, small software packages and scripts.

Scratch¶

Scratch filesystems provide temporary, high-performance storage for running jobs. Use scratch to store datasets that will be accessed by jobs, and for job output. Scratch is allocated per user: each user gets a separate scratch path and quota.

The environment variable SCRATCH=/iopsstor/scratch/cscs/$USER is set automatically when you log into the system, and can be used as a shortcut to access scratch.
There is an additional scratch path mounted on Capstor at /capstor/scratch/cscs/$USER.
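As a minimal sketch, a Python job script could resolve both scratch locations as shown here. The helper name `scratch_paths` is hypothetical. It assumes the path layout described above and builds the Iopsstor path itself only when `SCRATCH` is not set in the environment.

```python
import getpass
import os
from pathlib import PurePosixPath


def scratch_paths(env=None, user=None):
    """Return the Iopsstor and Capstor scratch directories for a user.

    Hypothetical helper: prefers the SCRATCH variable set at login and
    falls back to the documented path layout when it is missing.
    """
    env = os.environ if env is None else env
    user = user or env.get("USER") or getpass.getuser()
    iopsstor = PurePosixPath(env.get("SCRATCH", f"/iopsstor/scratch/cscs/{user}"))
    capstor = PurePosixPath(f"/capstor/scratch/cscs/{user}")
    return {"iopsstor": iopsstor, "capstor": capstor}


if __name__ == "__main__":
    for name, path in scratch_paths().items():
        print(f"{name}: {path}")
```

Passing `env` and `user` explicitly keeps the helper easy to check outside the cluster. On a login node, calling it without arguments picks up the values from your session.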

scratch cleanup policy

Files that have not been accessed in 30 days are automatically deleted.

Scratch is not intended for permanent storage: transfer files you want to keep to Capstor project storage after your jobs finish.
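To see which files might be affected by the 30-day rule, a sketch along the following lines lists files whose last access time is older than the threshold. The function name `stale_files` is hypothetical. It also assumes that the access time reported by `stat` matches what the cleanup process uses, which may not be the case.

```python
import os
import time
from pathlib import Path

CLEANUP_AGE_DAYS = 30  # threshold taken from the cleanup policy above


def stale_files(root, max_age_days=CLEANUP_AGE_DAYS, now=None):
    """Yield files under root whose last access is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path


if __name__ == "__main__":
    scratch = os.environ.get("SCRATCH")
    if scratch:
        for path in stale_files(scratch):
            print(path)
```

Walking a large scratch tree creates many metadata operations, so it is probably best to run a scan like this on a specific subdirectory rather than on all of scratch.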

file system suitability

The Capstor scratch filesystem is based on HDDs and is optimized for large, sequential read and write operations. We recommend using Capstor for storing checkpoint files and other large, contiguous outputs generated by your training runs. In contrast, Iopsstor uses high-performance NVMe drives, which excel at handling IOPS-intensive workloads involving frequent, random access. This makes it a better choice for storing training datasets, especially when accessed randomly during machine learning training. See the Lustre guide for some hints on how to get the best performance out of the filesystem.

Scratch Usage Recommendations¶

Use Iopsstor scratch ($SCRATCH) for:

Training and validation datasets that are read frequently and non-sequentially.
Workloads that perform many small, random I/O operations.

Use Capstor scratch (/capstor/scratch/cscs/$USER) for:

Storing model checkpoints.
Outputs from simulations or training jobs that involve large, contiguous I/O.

After your job completes, remember to transfer any important results to your permanent project storage.
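The transfer step can be sketched in Python as follows. The helper `archive_results` and its arguments are hypothetical. The project folder is passed in by the caller because it depends on your project name. The sketch assumes the copy is run after the job has finished, not from inside it.

```python
import shutil
from pathlib import Path


def archive_results(results_dir, project_dir, run_name):
    """Copy a finished run's results into project storage.

    Hypothetical helper: project_dir is your project folder, i.e. the
    directory for your project under /capstor/store/cscs/swissai/.
    """
    src = Path(results_dir)
    dst = Path(project_dir) / run_name
    if not src.is_dir():
        raise FileNotFoundError(f"no results directory at {src}")
    # dirs_exist_ok lets a repeated call refresh an existing copy.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

The copy leaves the original files on scratch, so you can verify the result before removing anything. For very large outputs a dedicated transfer tool may be a better fit than `shutil.copytree`.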

Project¶

Project storage is backed up and has no cleanup policy. It provides intermediate storage space for datasets, shared code and configuration scripts that need to be accessed from different vClusters. Project storage is allocated per project: each project gets a project folder with a project-specific quota.

If you need additional storage, ask your PI to contact the CSCS service managers Fawzi or Nicholas.
Hard limits on capacity and inodes prevent users from writing to project storage once the quota is reached. You can check your quota and available space by running the quota command on a login node or on ela.
Writing directly to the project path from jobs is not recommended.
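The quota command remains the authoritative source for usage figures. To find out which directories contribute most to the capacity and inode limits, a rough estimate can be computed with a sketch like this one. The function `usage_summary` is hypothetical. It reports apparent file sizes and an entry count, which only approximates what the filesystem accounts against the quota.

```python
import os


def usage_summary(root):
    """Return (total_bytes, entry_count) for everything under root.

    entry_count counts files and directories, a rough proxy for inodes.
    """
    total_bytes = 0
    entries = 0
    for dirpath, dirnames, filenames in os.walk(root):
        entries += len(dirnames) + len(filenames)
        for name in filenames:
            try:
                # lstat measures a symlink itself, not its target
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # entry disappeared while scanning
    return total_bytes, entries
```

Running it on each subdirectory of a project folder in turn shows where most files and bytes are, which helps decide what to archive or delete when a limit is close.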

Guides and tutorials¶

Tutorials for fine-tuning LLMs, running LLM inference, and training an LLM with Nanotron can be found on the MLP Tutorials page.