NVSHMEM

NVSHMEM is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.

Using NVSHMEM

uenv

Version 2.8 of the PyTorch uenv is currently the only uenv that provides NVSHMEM.

CSCS is working on a uenv build of NVSHMEM that runs efficiently on the Alps network, and will update these docs when it is available.

Containers

To use NVSHMEM, we recommend first installing OpenMPI with libfabric support in the container, or starting with an image that contains OpenMPI with libfabric.

The image recipe described here is based on the OpenMPI image for NVIDIA, and thus it is suited for hosts with NVIDIA GPUs, like Alps GH200 nodes.

Be careful with NVSHMEM provided by NVIDIA containers

Containers provided by NVIDIA on NGC typically include NVSHMEM as part of the NVHPC SDK in the image. However, this version is built for and linked against the OpenMPI and UCX libraries in the container, which are not compatible with the Slingshot network of Alps.

NVSHMEM is therefore built from source in the container, using a source tarball provided by NVIDIA.

Notice that NVSHMEM is configured with support for libfabric explicitly enabled (NVSHMEM_LIBFABRIC_SUPPORT=1).
NVSHMEM is built without support for the UCX and InfiniBand components, because they are not needed on Alps.
Since this image uses OpenMPI (which provides PMIx) as its MPI implementation, NVSHMEM is also built with PMIx support (NVSHMEM_PMIX_SUPPORT=1) and configured to default to PMIx for bootstrapping (NVSHMEM_DEFAULT_PMIX=1).
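The three configuration rules above can be expressed as a small consistency check. The following is only a sketch: the flag names and values are copied from the Containerfile on this page, while the helper name `check_build_flags` and the idea of collecting the flags in a Python dictionary are illustrative assumptions, not part of the NVSHMEM build system.

```python
# Subset of the NVSHMEM build flags used in the Containerfile on this page.
BUILD_FLAGS = {
    "NVSHMEM_LIBFABRIC_SUPPORT": 1,
    "NVSHMEM_UCX_SUPPORT": 0,
    "NVSHMEM_DEFAULT_UCX": 0,
    "NVSHMEM_IBGDA_SUPPORT": 0,
    "NVSHMEM_IBDEVX_SUPPORT": 0,
    "NVSHMEM_IBRC_SUPPORT": 0,
    "NVSHMEM_PMIX_SUPPORT": 1,
    "NVSHMEM_DEFAULT_PMIX": 1,
    "NVSHMEM_DEFAULT_PMI2": 0,
}

# Components that the text says are not needed on Alps.
DISABLED_ON_ALPS = (
    "NVSHMEM_UCX_SUPPORT",
    "NVSHMEM_DEFAULT_UCX",
    "NVSHMEM_IBGDA_SUPPORT",
    "NVSHMEM_IBDEVX_SUPPORT",
    "NVSHMEM_IBRC_SUPPORT",
)


def check_build_flags(flags):
    """Hypothetical helper: return a list of deviations from the rules above."""
    problems = []
    # Rule 1: libfabric support must be explicitly enabled.
    if flags.get("NVSHMEM_LIBFABRIC_SUPPORT") != 1:
        problems.append("libfabric support is not enabled")
    # Rule 2: UCX and InfiniBand components are switched off.
    for name in DISABLED_ON_ALPS:
        if flags.get(name, 0) != 0:
            problems.append(f"{name} should be 0")
    # Rule 3: PMIx must be built in and selected as the default bootstrap.
    if flags.get("NVSHMEM_PMIX_SUPPORT") != 1 or flags.get("NVSHMEM_DEFAULT_PMIX") != 1:
        problems.append("PMIx is not the default bootstrap")
    return problems


print(check_build_flags(BUILD_FLAGS) or "build flags match the rules above")
```

An empty list means the flag set is consistent with the description; any entry names the rule that is violated.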

Installing NVSHMEM in a container for NVIDIA nodes

The following example demonstrates how to download and install NVSHMEM from source in a Containerfile.

The container image cont

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive \
       apt-get install -y \
        python3-venv \
        python3-dev \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && rm /usr/lib/python3.12/EXTERNALLY-MANAGED

# Build NVSHMEM from source
ARG nvshmem_version=3.4.5
RUN wget -q https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/source/nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && tar -xvf nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && cd nvshmem_src \
    && NVSHMEM_BUILD_EXAMPLES=0 \
       NVSHMEM_BUILD_TESTS=1 \
       NVSHMEM_DEBUG=0 \
       NVSHMEM_DEVEL=0 \
       NVSHMEM_DEFAULT_PMI2=0 \
       NVSHMEM_DEFAULT_PMIX=1 \
       NVSHMEM_DISABLE_COLL_POLL=1 \
       NVSHMEM_ENABLE_ALL_DEVICE_INLINING=0 \
       NVSHMEM_GPU_COLL_USE_LDST=0 \
       NVSHMEM_LIBFABRIC_SUPPORT=1 \
       NVSHMEM_MPI_SUPPORT=1 \
       NVSHMEM_MPI_IS_OMPI=1 \
       NVSHMEM_NVTX=1 \
       NVSHMEM_PMIX_SUPPORT=1 \
       NVSHMEM_SHMEM_SUPPORT=1 \
       NVSHMEM_TEST_STATIC_LIB=0 \
       NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
       NVSHMEM_TRACE=0 \
       NVSHMEM_USE_DLMALLOC=0 \
       NVSHMEM_USE_NCCL=1 \
       NVSHMEM_USE_GDRCOPY=1 \
       NVSHMEM_VERBOSE=0 \
       NVSHMEM_DEFAULT_UCX=0 \
       NVSHMEM_UCX_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0 \
       NVSHMEM_IBDEVX_SUPPORT=0 \
       NVSHMEM_IBRC_SUPPORT=0 \
       LIBFABRIC_HOME=/usr \
       NCCL_HOME=/usr \
       GDRCOPY_HOME=/usr/local \
       MPI_HOME=/usr \
       SHMEM_HOME=/usr \
       NVSHMEM_HOME=/usr \
       cmake . \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -r nvshmem_src nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz

Note

The image also installs the NVSHMEM performance tests (NVSHMEM_BUILD_TESTS=1) to demonstrate performance below. The performance tests, in turn, require the installation of Python dependencies. When building images intended solely for production purposes, you may exclude both of those elements.

Expand the box below to see an example of a complete Containerfile that installs NVSHMEM and all of its dependencies in an NVIDIA container.

The full Containerfile

ARG ubuntu_version=24.04
ARG cuda_version=12.8.1
FROM docker.io/nvidia/cuda:${cuda_version}-cudnn-devel-ubuntu${ubuntu_version}

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive \
       apt-get install -y \
        build-essential \
        ca-certificates \
        pkg-config \
        automake \
        autoconf \
        libtool \
        cmake \
        gdb \
        strace \
        wget \
        git \
        bzip2 \
        python3 \
        gfortran \
        rdma-core \
        numactl \
        libconfig-dev \
        libuv1-dev \
        libfuse-dev \
        libfuse3-dev \
        libyaml-dev \
        libnl-3-dev \
        libnuma-dev \
        libsensors-dev \
        libcurl4-openssl-dev \
        libjson-c-dev \
        libibverbs-dev \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

ARG gdrcopy_version=2.5.1
RUN git clone --depth 1 --branch v${gdrcopy_version} https://github.com/NVIDIA/gdrcopy.git \
    && cd gdrcopy \
    && export CUDA_PATH=/usr/local/cuda \
    && make CC=gcc CUDA=$CUDA_PATH lib \
    && make lib_install \
    && cd ../ && rm -rf gdrcopy

# Install libfabric
ARG libfabric_version=1.22.0
RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \
    && cd libfabric \
    && ./autogen.sh \
    && ./configure --prefix=/usr --with-cuda=/usr/local/cuda --enable-cuda-dlopen \
       --enable-gdrcopy-dlopen --enable-efa \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf libfabric

# Install UCX
ARG UCX_VERSION=1.19.0
RUN wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz \
    && tar xzf ucx-${UCX_VERSION}.tar.gz \
    && cd ucx-${UCX_VERSION} \
    && mkdir build \
    && cd build \
    && ../configure --prefix=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local \
       --enable-mt --enable-devel-headers \
    && make -j$(nproc) \
    && make install \
    && cd ../.. \
    && rm -rf ucx-${UCX_VERSION}.tar.gz ucx-${UCX_VERSION}

ARG OMPI_VER=5.0.8
RUN wget -q https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VER}.tar.gz \
    && tar xf openmpi-${OMPI_VER}.tar.gz \
    && cd openmpi-${OMPI_VER} \
    && ./configure --prefix=/usr --with-ofi=/usr --with-ucx=/usr \
        --enable-oshmem --with-cuda=/usr/local/cuda \
        --with-cuda-libdir=/usr/local/cuda/lib64/stubs \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -rf openmpi-${OMPI_VER}.tar.gz openmpi-${OMPI_VER}

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive \
       apt-get install -y \
        python3-venv \
        python3-dev \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && rm /usr/lib/python3.12/EXTERNALLY-MANAGED

# Build NVSHMEM from source
ARG nvshmem_version=3.4.5
RUN wget -q https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/source/nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && tar -xvf nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz \
    && cd nvshmem_src \
    && NVSHMEM_BUILD_EXAMPLES=0 \
       NVSHMEM_BUILD_TESTS=1 \
       NVSHMEM_DEBUG=0 \
       NVSHMEM_DEVEL=0 \
       NVSHMEM_DEFAULT_PMI2=0 \
       NVSHMEM_DEFAULT_PMIX=1 \
       NVSHMEM_DISABLE_COLL_POLL=1 \
       NVSHMEM_ENABLE_ALL_DEVICE_INLINING=0 \
       NVSHMEM_GPU_COLL_USE_LDST=0 \
       NVSHMEM_LIBFABRIC_SUPPORT=1 \
       NVSHMEM_MPI_SUPPORT=1 \
       NVSHMEM_MPI_IS_OMPI=1 \
       NVSHMEM_NVTX=1 \
       NVSHMEM_PMIX_SUPPORT=1 \
       NVSHMEM_SHMEM_SUPPORT=1 \
       NVSHMEM_TEST_STATIC_LIB=0 \
       NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
       NVSHMEM_TRACE=0 \
       NVSHMEM_USE_DLMALLOC=0 \
       NVSHMEM_USE_NCCL=1 \
       NVSHMEM_USE_GDRCOPY=1 \
       NVSHMEM_VERBOSE=0 \
       NVSHMEM_DEFAULT_UCX=0 \
       NVSHMEM_UCX_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT=0 \
       NVSHMEM_IBGDA_SUPPORT_GPUMEM_ONLY=0 \
       NVSHMEM_IBDEVX_SUPPORT=0 \
       NVSHMEM_IBRC_SUPPORT=0 \
       LIBFABRIC_HOME=/usr \
       NCCL_HOME=/usr \
       GDRCOPY_HOME=/usr/local \
       MPI_HOME=/usr \
       SHMEM_HOME=/usr \
       NVSHMEM_HOME=/usr \
       cmake . \
    && make -j$(nproc) \
    && make install \
    && ldconfig \
    && cd .. \
    && rm -r nvshmem_src nvshmem_src_cuda12-all-all-${nvshmem_version}.tar.gz

Running the NVSHMEM container

The following EDF file sets the required environment variables and container hooks for NVSHMEM. It uses a pre-built container hosted on the Quay.io registry at the following reference: quay.io/ethcscs/nvshmem:3.4.5-ompi5.0.8-ofi1.22-cuda12.8.

image = "quay.io#ethcscs/nvshmem:3.4.5-ompi5.0.8-ofi1.22-cuda12.8"

[env]
PMIX_MCA_psec="native" # (1)!
NVSHMEM_REMOTE_TRANSPORT="libfabric"
NVSHMEM_LIBFABRIC_PROVIDER="cxi"
NVSHMEM_DISABLE_CUDA_VMM="1" # (2)!

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true" # (3)!
com.hooks.aws_ofi_nccl.variant = "cuda12"

(1) Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
(2) NVSHMEM’s libfabric transport does not support VMM yet, so VMM must be disabled by setting the environment variable NVSHMEM_DISABLE_CUDA_VMM=1.
(3) NCCL requires the presence of the AWS OFI NCCL plugin in order to correctly interface with Libfabric and (through the latter) the Slingshot interconnect. Therefore, for optimal performance the related CE hook must be enabled and set to match the CUDA version in the container.

Libfabric itself is usually injected by the CXI hook, which is enabled by default on several Alps vClusters.

srun -N2 --ntasks-per-node=4 \
     --mpi=pmix \
     --environment=nvshmem \
     /usr/local/nvshmem/bin/perftest/device/coll/alltoall_latency

Since NVSHMEM has been configured in the Containerfile to use PMIx for bootstrapping, the srun option --mpi=pmix must be used with this image for multi-rank jobs to run successfully.

Other bootstrapping methods (including different PMI implementations) can be specified for NVSHMEM through the related environment variables. When bootstrapping through PMI or MPI under Slurm, ensure that the PMI implementation used by Slurm (i.e. the srun --mpi option) matches the one expected by NVSHMEM or the MPI library.

NVSHMEM Performance

The following output shows the results of running the alltoall_latency benchmark from the NVSHMEM performance tests, built in the example container above, on two GH200 nodes with four ranks per node.

$ srun -N2 --ntasks-per-node=4  --mpi=pmix --environment=nvshmem /usr/local/nvshmem/bin/perftest/device/coll/alltoall_latency
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream, use_graph: 0, use_mmap: 0, mem_handle_type: 0, use_egm: 0
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 6 mype_node: 2 device name: NVIDIA GH200 120GB bus id: 1 
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    thread    116.220796        0.000         0.000       
64          16        32-bit    thread    112.700796        0.001         0.000       
128         32        32-bit    thread    113.571203        0.001         0.001       
256         64        32-bit    thread    111.123204        0.002         0.002       
512         128       32-bit    thread    111.075199        0.005         0.004       
1024        256       32-bit    thread    110.131204        0.009         0.008       
2048        512       32-bit    thread    111.030400        0.018         0.016       
4096        1024      32-bit    thread    110.985601        0.037         0.032       
8192        2048      32-bit    thread    111.039996        0.074         0.065       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    warp      89.801598         0.000         0.000       
64          16        32-bit    warp      90.563202         0.001         0.001       
128         32        32-bit    warp      89.830399         0.001         0.001       
256         64        32-bit    warp      88.863999         0.003         0.003       
512         128       32-bit    warp      89.686400         0.006         0.005       
1024        256       32-bit    warp      88.908798         0.012         0.010       
2048        512       32-bit    warp      88.819200         0.023         0.020       
4096        1024      32-bit    warp      89.670402         0.046         0.040       
8192        2048      32-bit    warp      88.889599         0.092         0.081       
16384       4096      32-bit    warp      88.972801         0.184         0.161       
32768       8192      32-bit    warp      89.564800         0.366         0.320       
65536       16384     32-bit    warp      89.888000         0.729         0.638       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
32          8         32-bit    block     89.747202         0.000         0.000       
64          16        32-bit    block     88.086402         0.001         0.001       
128         32        32-bit    block     87.254399         0.001         0.001       
256         64        32-bit    block     87.401599         0.003         0.003       
512         128       32-bit    block     88.095999         0.006         0.005       
1024        256       32-bit    block     87.273598         0.012         0.010       
2048        512       32-bit    block     88.086402         0.023         0.020       
4096        1024      32-bit    block     88.940799         0.046         0.040       
8192        2048      32-bit    block     88.095999         0.093         0.081       
16384       4096      32-bit    block     87.247998         0.188         0.164       
32768       8192      32-bit    block     88.976002         0.368         0.322       
65536       16384     32-bit    block     88.121599         0.744         0.651       
131072      32768     32-bit    block     90.579200         1.447         1.266       
262144      65536     32-bit    block     91.360003         2.869         2.511       
524288      131072    32-bit    block     101.145601        5.183         4.536       
1048576     262144    32-bit    block     111.052799        9.442         8.262       
2097152     524288    32-bit    block     137.164795        15.289        13.378      
4194304     1048576   32-bit    block     183.171201        22.898        20.036      
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    thread    111.955202        0.001         0.001       
128         16        64-bit    thread    113.420796        0.001         0.001       
256         32        64-bit    thread    108.508801        0.002         0.002       
512         64        64-bit    thread    110.204804        0.005         0.004       
1024        128       64-bit    thread    109.487998        0.009         0.008       
2048        256       64-bit    thread    109.462404        0.019         0.016       
4096        512       64-bit    thread    110.156798        0.037         0.033       
8192        1024      64-bit    thread    109.401596        0.075         0.066       
16384       2048      64-bit    thread    108.591998        0.151         0.132       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    warp      88.896000         0.001         0.001       
128         16        64-bit    warp      89.679998         0.001         0.001       
256         32        64-bit    warp      88.950402         0.003         0.003       
512         64        64-bit    warp      89.606398         0.006         0.005       
1024        128       64-bit    warp      89.775997         0.011         0.010       
2048        256       64-bit    warp      88.838398         0.023         0.020       
4096        512       64-bit    warp      90.671998         0.045         0.040       
8192        1024      64-bit    warp      89.699203         0.091         0.080       
16384       2048      64-bit    warp      89.011198         0.184         0.161       
32768       4096      64-bit    warp      89.622402         0.366         0.320       
65536       8192      64-bit    warp      88.905603         0.737         0.645       
131072      16384     64-bit    warp      89.766401         1.460         1.278       
#alltoall_device
size(B)     count     type      scope     latency(us)       algbw(GB/s)   busbw(GB/s) 
64          8         64-bit    block     89.788800         0.001         0.001       
128         16        64-bit    block     88.012803         0.001         0.001       
256         32        64-bit    block     87.353599         0.003         0.003       
512         64        64-bit    block     88.000000         0.006         0.005       
1024        128       64-bit    block     87.225598         0.012         0.010       
2048        256       64-bit    block     87.225598         0.023         0.021       
4096        512       64-bit    block     87.168002         0.047         0.041       
8192        1024      64-bit    block     88.067198         0.093         0.081       
16384       2048      64-bit    block     88.863999         0.184         0.161       
32768       4096      64-bit    block     88.723201         0.369         0.323       
65536       8192      64-bit    block     87.993598         0.745         0.652       
131072      16384     64-bit    block     88.783997         1.476         1.292       
262144      32768     64-bit    block     91.366398         2.869         2.511       
524288      65536     64-bit    block     102.060795        5.137         4.495       
1048576     131072    64-bit    block     111.846399        9.375         8.203       
2097152     262144    64-bit    block     137.107205        15.296        13.384      
4194304     524288    64-bit    block     183.100796        22.907        20.044
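
The bandwidth columns in the output above can be cross-checked from the size and latency columns. The sketch below assumes that algbw is the message size divided by the latency, and that busbw scales algbw by (n - 1) / n for n PEs, with n = 8 for this run (2 nodes with 4 tasks each). These formulas are inferred from the reported numbers rather than taken from the NVSHMEM documentation, and the function names are illustrative.

```python
def algbw_gbs(size_bytes, latency_us):
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (latency_us * 1e-6) / 1e9


def busbw_gbs(size_bytes, latency_us, n_pes):
    """Bus bandwidth in GB/s, assuming the (n - 1) / n all-to-all scaling."""
    return algbw_gbs(size_bytes, latency_us) * (n_pes - 1) / n_pes


# 2 nodes x 4 tasks per node, as in the srun command above.
N_PES = 2 * 4

# (size in bytes, latency in microseconds) taken from the 32-bit block rows.
ROWS = [
    (1048576, 111.052799),
    (4194304, 183.171201),
]

for size, latency in ROWS:
    print(
        f"{size:>8} B  "
        f"algbw={algbw_gbs(size, latency):.3f} GB/s  "
        f"busbw={busbw_gbs(size, latency, N_PES):.3f} GB/s"
    )
```

For the two rows used here, the computed values agree with the table to three decimal places (9.442 and 8.262 GB/s for 1 MiB, 22.898 and 20.036 GB/s for 4 MiB), which supports the assumed formulas for this benchmark.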