Alps Extended Images¶
The Alps infrastructure (specifically the networking stack) requires custom-built libraries and specific environment settings to fully leverage the high-speed network. To reduce the burden on users and ensure best-in-class performance, we provide pre-built Alps Extended Images based on popular container images, starting with those commonly used by the ML/AI community.
Info
See our communication library guide for detailed information about how to build containers with optimised support for the Slingshot network used by Alps.
Note
All extended images are thoroughly tested and validated to ensure correct behavior and optimal performance (see contributing section).
Images¶
The images are hosted on the CSCS internal Artifactory repository and can only be pulled from within the Alps environment.
| Base Image | Alps Extended Image | URL |
|---|---|---|
| nvcr.io/nvidia/pytorch:26.01-py3 | ngc-pytorch:26.01-py3-alps3 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3 |
| nvcr.io/nvidia/pytorch:25.12-py3 | ngc-pytorch:25.12-py3-alps3 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:25.12-py3-alps3 |
| nvcr.io/nvidia/nemo:25.11.01 | ngc-nemo:25.11.01-alps3 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-nemo:25.11.01-alps3 |
Network Stack: libraries and versions

| Library | Version | Notes |
|---|---|---|
| libfabric | 2.5.0a1 | Built from commit 102872c0280ce290d9d663945dad8a36ceb53c50 + patch (removing dependency on the shs-14 API, which is not available on Alps) |
| NCCL | 2.29.3-1* | Patched by applying https://github.com/NVIDIA/nccl/pull/1979 to the 2.29.3 release |
| aws-ofi-plugin | git-394ae7b* | Built from commit 394ae7b20dd0e6b4e5f63652e15e9da100d5fe83 + patched by applying https://github.com/aws/aws-ofi-nccl/pull/1056 |
| nvshmem | 3.4.5-0 | |
| OpenMPI | 5.0.9 | |
The alps2 images are deprecated
| Base Image | Alps Extended Image | URL |
|---|---|---|
| nvcr.io/nvidia/pytorch:26.01-py3 | ngc-pytorch:26.01-py3-alps2 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps2 |
| nvcr.io/nvidia/pytorch:25.12-py3 | ngc-pytorch:25.12-py3-alps2 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:25.12-py3-alps2 |
| nvcr.io/nvidia/nemo:25.11.01 | ngc-nemo:25.11.01-alps2 | jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-nemo:25.11.01-alps2 |
Network Stack: libraries and versions

| Library | Version | Notes |
|---|---|---|
| libfabric | 2.5.0a1 | Built from commit f8262817c337d615a1acceea6cd4ecb526ce548b + patched by applying https://github.com/ofiwg/libfabric/pull/11684 |
| NCCL | 2.29.2-1* | Patched by applying https://github.com/NVIDIA/nccl/pull/1979 to the 2.29.2 release |
| aws-ofi-plugin | git-eb9877e* | Built from commit eb9877e9cfecf725dba0794a5e0fc06f8fdf7f3f + patched by applying https://github.com/aws/aws-ofi-nccl/pull/1056 |
| nvshmem | 3.4.5-0 | |
| OpenMPI | 5.0.9 | |
The alps1 images are deprecated
| Base Image | Alps Extended Image |
|---|---|
| nvcr.io/nvidia/pytorch:26.01-py3 | ngc-pytorch:26.01-py3-alps1 |
| nvcr.io/nvidia/pytorch:25.12-py3 | ngc-pytorch:25.12-py3-alps1 |
| nvcr.io/nvidia/nemo:25.11.01 | ngc-nemo:25.11.01-alps1 |
Tip
Images are continuously updated to incorporate the latest improvements. We strongly recommend periodically checking whether a newer version of an Alps Extended Image is available.
Usage¶
To use an image directly on Alps via an EDF environment file, set the `image` entry to the repository URL followed by the image name and tag.
Danger
- Do not use the `aws_ofi_nccl` hook annotation
- Explicitly disable the `cxi` hook
- Use the `--environment` flag for `srun` instead of `sbatch` (i.e. `srun --environment=my_edf.toml ...`)
- Use the `--network=disable_rdzv_get` flag for `srun` to disable the rendezvous mechanism for network initialization (i.e. `srun --network=disable_rdzv_get ...` or setting `SLURM_NETWORK=disable_rdzv_get`)
- Launch MPI applications with `PMIx` (i.e. `srun --mpi=pmix` or setting `SLURM_MPI_TYPE=pmix`)
# (1)!
image = "jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3"
mounts = [
"/capstor/",
"/iopsstor/",
]
writable = true
[env]
PMIX_MCA_psec = "native" # (2)!
[annotations]
com.hooks.cxi.enabled = "false" # (3)!
1. Images will be pulled directly from CSCS' `jfrog` artifactory.
2. Pertinent environment variables for optimal network performance are already set in the container image. `PMIX_MCA_psec = "native"` is recommended here in order to avoid warnings at initialization.
3. The `CXI` hook must be disabled so that the container image's network libraries take priority over the host system's libraries.
#!/usr/bin/env bash
#SBATCH --account=my_account
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
srun \
    --mpi=pmix \ # (1)!
    --network=disable_rdzv_get \ # (2)!
    --environment=example.edf.toml \ # (3)!
    python my_script.py
1. The `--mpi=pmix` flag is required to ensure that `PMIx` is used as the MPI launcher; without this flag you may encounter errors during initialization.
2. The `--network=disable_rdzv_get` flag is required to disable the rendezvous mechanism for network initialization. Alternatively, you can set the environment variable `SLURM_NETWORK=disable_rdzv_get` to achieve the same effect.
3. The `--environment` flag must be used with `srun`; passing it to `sbatch` will lead to errors related to missing Slurm plugins.
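As an alternative to passing the flags on every `srun` invocation, the equivalent environment variables (named in the notes above) can be exported once in the job script, for example:

```shell
# Equivalent to passing --mpi=pmix and --network=disable_rdzv_get to srun
export SLURM_MPI_TYPE=pmix
export SLURM_NETWORK=disable_rdzv_get
```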
srun errors related to missing Slurm plugins
If you are submitting a batch job with sbatch and using the --environment (i.e. #SBATCH --environment=my_edf.toml) option, this can lead to errors such as:
srun: error: plugin_load_from_file: dlopen(/usr/lib64/slurm/switch_hpe_slingshot.so): libjson-c.so.3: cannot open shared object file: No such file or directory
srun: error: Couldn't load specified plugin name for switch/hpe_slingshot: Dlopen of plugin file failed
srun: fatal: Can't find plugin for switch/hpe_slingshot
To resolve this, pass the `--environment` option to `srun` instead of `sbatch`, i.e.:
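A sketch of the corrected job script, with illustrative file names:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=example
# No `#SBATCH --environment=...` line here; pass the EDF to srun instead:
srun --environment=my_edf.toml python my_script.py
```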
PMIx/ucx errors during initialization
If you see warnings related to PMIx or `ucx` logs during initialization, this likely indicates that Slurm is not configured to use PMIx for launching MPI applications.
To resolve this, ensure that you are launching your application with the --mpi=pmix flag, for example:
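A minimal invocation, with illustrative file names:

```shell
srun --mpi=pmix --environment=my_edf.toml python my_script.py
```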
or set the environment variable SLURM_MPI_TYPE=pmix to make PMIx the default MPI launcher.
Pulling and using images with Podman¶
Extended images can also be pulled using Podman:
podman pull docker://jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3
and/or be used as base images in your own Containerfiles:
FROM jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3
RUN echo "Hello world!"
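Such a Containerfile can then be built as usual with Podman (the image tag below is illustrative):

```shell
# Build from the Containerfile in the current directory
podman build -t my-extended-image:latest .
```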
Inspecting image provenance labels¶
Alps Extended Images include OCI labels with provenance metadata (for example, source repository, commit SHA, and build time). You can inspect these labels with podman.
IMAGE="jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3"
# Pull the image
podman pull "$IMAGE"
# Show all labels (JSON)
podman image inspect "$IMAGE" --format '{{ json .Labels }}'
# Show specific labels
podman image inspect "$IMAGE" --format 'Source Repository: {{ index .Labels "org.opencontainers.image.source" }}'
podman image inspect "$IMAGE" --format 'Source Revision: {{ index .Labels "org.opencontainers.image.revision" }}'
podman image inspect "$IMAGE" --format 'Build Time: {{ index .Labels "org.opencontainers.image.created" }}'
Contributing¶
The Alps extended images are automatically built via a dedicated CI/CD pipeline hosted on GitHub:
github.com/eth-cscs/alps-swiss-ai
- Additional tests can be added to the build and test pipeline
- New images can be added to the Alps-Images folder
Note
The repository is currently private. Please open a Service Desk ticket to request access.