OpenMPI
Cray MPICH is the recommended MPI implementation on Alps, particularly if you are using uenv.
However, OpenMPI can be used as an alternative in some cases, with limited support from CSCS. OpenMPI is available for use in both uenv and containers.
Support for the Slingshot 11 network is provided by the libfabric library.
Using OpenMPI
uenv
OpenMPI is provided in the prgenv-gnu-openmpi uenv.
Once the uenv is loaded, compiling and linking with OpenMPI and libfabric is transparent.
At runtime, some additional options must be set to correctly use the Slingshot network.
First, when launching applications through Slurm, PMIx must be used to launch the application.
This is done with the --mpi flag of srun:
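For example, a two-node launch might look as follows (./my_app is a placeholder for your application; node and rank counts are illustrative):

```
srun --mpi=pmix -N2 -n8 ./my_app
```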
There are two primary ways to configure OpenMPI and libfabric to use the Slingshot network:
- Only using the CXI provider. This method has been found to work in more applications but uses NICs for intra-node communication which can limit performance.
- Using the LINKx provider, which combines the CXI provider for inter-node communication with the shared memory provider for intra-node communication. This provider is newer, may not support all features, and is more likely to contain bugs, but it makes full use of intra-node bandwidth.
We recommend trying the LINKx provider first, as it provides better performance in the situations where it is supported. If you encounter failures using the LINKx provider, we ask you to get in touch with us so that we can evaluate whether upstream libfabric or OpenMPI needs fixing.
Using the CXI provider
To use the CXI provider the following environment variables should be set:
export PMIX_MCA_psec="native" # (1)!
export FI_PROVIDER="cxi" # (2)!
export OMPI_MCA_pml="cm" # (3)!
export OMPI_MCA_mtl="ofi" # (4)!
- Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
- Use the CXI (Slingshot) provider.
- Use CM for point-to-point communication.
- Use libfabric for the Matching Transport Layer.
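Putting this together, a batch script using the CXI provider might look like the following sketch (the uenv name, node counts, and application name are placeholders; the --uenv option assumes the uenv Slurm plugin is available, as is the case on Alps):

```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --uenv=prgenv-gnu-openmpi

# Configure OpenMPI and libfabric for the Slingshot network (CXI provider)
export PMIX_MCA_psec="native"
export FI_PROVIDER="cxi"
export OMPI_MCA_pml="cm"
export OMPI_MCA_mtl="ofi"

# PMIx must be used to launch the application
srun --mpi=pmix ./my_app
```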
The CXI provider does all communication through the network interface cards (NICs)
When using the libfabric CXI provider, all communication goes through NICs, including intra-node communication. This means that intra-node communication cannot make use of shared memory optimizations, and the maximum bandwidth will be severely limited. Use the LINKx provider to make full use of the available intra-node bandwidth.
Using the LINKx provider
The default configuration routes all communication through the NICs. While performance may sometimes be acceptable, this mode does not make full use of the much higher intra-node bandwidth available on Grace-Hopper nodes. In particular, GPU-GPU communication is significantly faster when using the appropriate intra-node links.
The experimental LINKx libfabric provider allows composing multiple libfabric providers for inter- and intra-node communication.
The CXI provider can be used for inter-node communication while the shared memory (shm) provider can be used to take advantage of xpmem for CPU-CPU communication and GDRCopy for GPU-GPU communication.
The LINKx provider is experimental
While many basic tests work correctly using the LINKx provider we have had reports of applications failing to run with the LINKx provider. Always validate your results to ensure MPI is working correctly.
To use the LINKx provider set the following environment variables:
export PMIX_MCA_psec="native"
export FI_PROVIDER="lnx" # (1)!
export FI_LNX_PROV_LINKS="shm+cxi:cxi0|shm+cxi:cxi1|shm+cxi:cxi2|shm+cxi:cxi3" # (2)!
export FI_SHM_USE_XPMEM=1 # (3)!
export OMPI_MCA_pml="cm"
export OMPI_MCA_mtl="ofi"
export OMPI_MCA_mtl_ofi_av=table # (4)!
- Use the libfabric LINKx provider, to allow using different libfabric providers for inter- and intra-node communication.
- Specify which providers LINKx should use. Use the shared memory provider for intra-node communication and the CXI (Slingshot) provider for inter-node communication. Choose one of the four available NICs on a node in a round-robin fashion.
- Explicitly use xpmem for CPU-CPU communication. The default is to use CMA.
- The LINKx provider requires this option to be set.
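One way to confirm which provider was actually selected at runtime is to raise libfabric's log verbosity. FI_LOG_LEVEL is a standard libfabric environment variable; the command below is a sketch (application name and rank counts are placeholders, and the log output can be voluminous):

```
# Print libfabric provider selection details for a quick sanity check
export FI_LOG_LEVEL=info
srun --mpi=pmix -N2 -n2 ./my_app 2>&1 | grep -i lnx
```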
Containers
To install OpenMPI in a container, libfabric should be installed first (and possibly UCX, if the container should be portable to other centers). OpenMPI is then built and configured to use at least libfabric. Note that OpenMPI v5 is the first version with full support for libfabric, which is required for good performance.
Note
The version of MPI in the containers provided by NVIDIA is OpenMPI v4, shipped as part of NVIDIA's HPC-X toolkit. This version is not suitable for use on Alps for two reasons:
- OpenMPI version 5 is required for full libfabric support.
- It is linked against UCX only, and can’t be modified to use the system libfabric.
See the performance section below for examples of the level of performance loss caused by using HPC-X.
Installing OpenMPI in a container for NVIDIA nodes
The following Dockerfile instructions install OpenMPI from source in an Ubuntu image that already contains CUDA, libfabric and UCX.
ARG OMPI_VER=5.0.8
RUN wget -q https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VER}.tar.gz \
&& tar xf openmpi-${OMPI_VER}.tar.gz \
&& cd openmpi-${OMPI_VER} \
&& ./configure --prefix=/usr --with-ofi=/usr --with-ucx=/usr \
--enable-oshmem --with-cuda=/usr/local/cuda \
--with-cuda-libdir=/usr/local/cuda/lib64/stubs \
&& make -j$(nproc) \
&& make install \
&& ldconfig \
&& cd .. \
&& rm -rf openmpi-${OMPI_VER}.tar.gz openmpi-${OMPI_VER}
- The --with-ofi and --with-ucx flags configure OpenMPI with the libfabric and UCX back ends respectively.
- The --enable-oshmem flag builds OpenSHMEM as part of the OpenMPI installation, which is useful to support SHMEM implementations like NVSHMEM.
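After the build, it is worth sanity-checking that the OFI (libfabric) components were actually compiled in. This can be done with ompi_info, which is installed as part of OpenMPI (a quick illustrative check, not part of the Dockerfile above):

```
# List OpenMPI components and verify the OFI (libfabric) MTL is present
ompi_info | grep -i "mtl: ofi"
```

If the line is missing, OpenMPI was likely configured without finding libfabric, and the --with-ofi path should be double-checked.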
Note that this example does not enable the LINKx provider as in the uenv. We do not currently provide instructions to enable the LINKx provider in manually built container images.
Expand the box below to see an example of a full Containerfile that can be used to create an OpenMPI container on the gh200 nodes of Alps:
The full Containerfile
This is an example of a complete Containerfile that installs OpenMPI based on a "base image" that provides gdrcopy, libfabric and UCX on top of an NVIDIA container that provides CUDA:
ARG ubuntu_version=24.04
ARG cuda_version=12.8.1
FROM docker.io/nvidia/cuda:${cuda_version}-cudnn-devel-ubuntu${ubuntu_version}
RUN apt-get update \
&& DEBIAN_FRONTEND=noninteractive \
apt-get install -y \
build-essential \
ca-certificates \
pkg-config \
automake \
autoconf \
libtool \
cmake \
gdb \
strace \
wget \
git \
bzip2 \
python3 \
gfortran \
rdma-core \
numactl \
libconfig-dev \
libuv1-dev \
libfuse-dev \
libfuse3-dev \
libyaml-dev \
libnl-3-dev \
libnuma-dev \
libsensors-dev \
libcurl4-openssl-dev \
libjson-c-dev \
libibverbs-dev \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
ARG gdrcopy_version=2.5.1
RUN git clone --depth 1 --branch v${gdrcopy_version} https://github.com/NVIDIA/gdrcopy.git \
&& cd gdrcopy \
&& export CUDA_PATH=/usr/local/cuda \
&& make CC=gcc CUDA=$CUDA_PATH lib \
&& make lib_install \
&& cd ../ && rm -rf gdrcopy
# Install libfabric
ARG libfabric_version=1.22.0
RUN git clone --branch v${libfabric_version} --depth 1 https://github.com/ofiwg/libfabric.git \
&& cd libfabric \
&& ./autogen.sh \
&& ./configure --prefix=/usr --with-cuda=/usr/local/cuda --enable-cuda-dlopen \
--enable-gdrcopy-dlopen --enable-efa \
&& make -j$(nproc) \
&& make install \
&& ldconfig \
&& cd .. \
&& rm -rf libfabric
# Install UCX
ARG UCX_VERSION=1.19.0
RUN wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz \
&& tar xzf ucx-${UCX_VERSION}.tar.gz \
&& cd ucx-${UCX_VERSION} \
&& mkdir build \
&& cd build \
&& ../configure --prefix=/usr --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local \
--enable-mt --enable-devel-headers \
&& make -j$(nproc) \
&& make install \
&& cd ../.. \
&& rm -rf ucx-${UCX_VERSION}.tar.gz ucx-${UCX_VERSION}
ARG OMPI_VER=5.0.8
RUN wget -q https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VER}.tar.gz \
&& tar xf openmpi-${OMPI_VER}.tar.gz \
&& cd openmpi-${OMPI_VER} \
&& ./configure --prefix=/usr --with-ofi=/usr --with-ucx=/usr \
--enable-oshmem --with-cuda=/usr/local/cuda \
--with-cuda-libdir=/usr/local/cuda/lib64/stubs \
&& make -j$(nproc) \
&& make install \
&& ldconfig \
&& cd .. \
&& rm -rf openmpi-${OMPI_VER}.tar.gz openmpi-${OMPI_VER}
ARG omb_version=7.5.1
RUN wget -q http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-${omb_version}.tar.gz \
&& tar xf osu-micro-benchmarks-${omb_version}.tar.gz \
&& cd osu-micro-benchmarks-${omb_version} \
&& ldconfig /usr/local/cuda/targets/sbsa-linux/lib/stubs \
&& ./configure --prefix=/usr/local CC=$(which mpicc) CFLAGS="-O3 -lcuda -lnvidia-ml" \
--enable-cuda --with-cuda-include=/usr/local/cuda/include \
--with-cuda-libpath=/usr/local/cuda/lib64 \
CXXFLAGS="-lmpi -lcuda" \
&& make -j$(nproc) \
&& make install \
&& ldconfig \
&& cd .. \
&& rm -rf osu-micro-benchmarks-${omb_version} osu-micro-benchmarks-${omb_version}.tar.gz
WORKDIR /usr/local/libexec/osu-micro-benchmarks/mpi
- The container also installs the OSU MPI micro-benchmarks so that the implementation can be tested.
The EDF file for the container should contain the following:
[env]
PMIX_MCA_psec="native" # (1)!
FI_PROVIDER="cxi" # (2)!
OMPI_MCA_pml="cm" # (3)!
OMPI_MCA_mtl="ofi" # (4)!
- Ensures PMIx uses the same security domain as Slurm. Otherwise PMIx will print warnings at startup.
- Use the CXI (Slingshot) provider.
- Use CM for point-to-point communication.
- Use libfabric for the Matching Transport Layer.
Like with the uenv, the --mpi=pmix flag must be passed to srun to ensure PMIx is used for MPI initialization:
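For example, launching an application from the container image described above (the environment name omb-ompi refers to the EDF file; adjust it to match your own EDF):

```
srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw
```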
OpenMPI performance
We present some performance numbers for OpenMPI, obtained using the OSU benchmarks compiled in the above container image.
no version information available
This warning message was generated by each rank running the benchmarks below, and can safely be ignored.
The first performance benchmarks are for the OSU point-to-point bandwidth test osu_bw.
- inter-node tests place the two ranks on different nodes, so that all communication is over the Slingshot network
- intra-node tests place two ranks on the same node, but communication is still done over the Slingshot network
Note
The container is configured to only use the CXI provider of libfabric, routing intra-node communication over NICs. We currently only provide instructions on using the experimental LINKx provider, which can make use of higher intra-node bandwidth, for uenv.
Impact of disabling the CXI hook
On many Alps vClusters, the Container Engine is configured with the CXI hook enabled by default, which provides transparent access to the Slingshot interconnect.
The inter-node tests marked with (*) were run with the CXI container hook disabled, to demonstrate the effect of not using an optimised network configuration.
If you see similar performance degradation in your tests, the first thing to investigate is whether your setup is using the libfabric CXI provider.
$ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation
# OSU MPI Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.95 Pass
2 1.90 Pass
4 3.80 Pass
8 7.61 Pass
16 15.21 Pass
32 30.47 Pass
64 60.72 Pass
128 121.56 Pass
256 242.28 Pass
512 484.54 Pass
1024 968.30 Pass
2048 1943.99 Pass
4096 3870.29 Pass
8192 6972.95 Pass
16384 13922.36 Pass
32768 18835.52 Pass
65536 22049.82 Pass
131072 23136.20 Pass
262144 23555.35 Pass
524288 23758.39 Pass
1048576 23883.95 Pass
2097152 23949.94 Pass
4194304 23982.18 Pass
$ srun -N2 --mpi=pmix --environment=omb-ompi-no-cxi ./pt2pt/osu_bw --validation
# OSU MPI Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.16 Pass
2 0.32 Pass
4 0.65 Pass
8 1.31 Pass
16 2.59 Pass
32 5.26 Pass
64 10.37 Pass
128 20.91 Pass
256 41.49 Pass
512 74.26 Pass
1024 123.99 Pass
2048 213.82 Pass
4096 356.13 Pass
8192 468.55 Pass
16384 505.89 Pass
32768 549.59 Pass
65536 2170.64 Pass
131072 2137.95 Pass
262144 2469.63 Pass
524288 2731.85 Pass
1048576 2919.18 Pass
2097152 3047.21 Pass
4194304 3121.42 Pass
$ srun -N2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation D D
# OSU MPI-CUDA Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.90 Pass
2 1.82 Pass
4 3.65 Pass
8 7.30 Pass
16 14.56 Pass
32 29.03 Pass
64 57.49 Pass
128 118.30 Pass
256 227.18 Pass
512 461.26 Pass
1024 926.30 Pass
2048 1820.46 Pass
4096 3611.70 Pass
8192 6837.89 Pass
16384 13361.25 Pass
32768 18037.71 Pass
65536 22019.46 Pass
131072 23104.58 Pass
262144 23542.71 Pass
524288 23758.69 Pass
1048576 23881.02 Pass
2097152 23955.49 Pass
4194304 23989.54 Pass
$ srun -N2 --mpi=pmix --environment=omb-ompi-no-cxi ./pt2pt/osu_bw --validation D D
# OSU MPI-CUDA Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.06 Pass
2 0.12 Pass
4 0.24 Pass
8 0.48 Pass
16 0.95 Pass
32 1.91 Pass
64 3.85 Pass
128 7.57 Pass
256 15.28 Pass
512 19.87 Pass
1024 53.06 Pass
2048 97.29 Pass
4096 180.73 Pass
8192 343.75 Pass
16384 473.72 Pass
32768 530.81 Pass
65536 1268.51 Pass
131072 1080.83 Pass
262144 1435.36 Pass
524288 1526.12 Pass
1048576 1727.31 Pass
2097152 1755.61 Pass
4194304 1802.75 Pass
$ srun -N1 -n2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation
# OSU MPI Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.96 Pass
2 1.92 Pass
4 3.85 Pass
8 7.68 Pass
16 15.40 Pass
32 30.78 Pass
64 61.26 Pass
128 122.23 Pass
256 240.96 Pass
512 483.12 Pass
1024 966.52 Pass
2048 1938.09 Pass
4096 3873.67 Pass
8192 7100.56 Pass
16384 14170.44 Pass
32768 18607.68 Pass
65536 21993.95 Pass
131072 23082.11 Pass
262144 23546.09 Pass
524288 23745.05 Pass
1048576 23879.79 Pass
2097152 23947.23 Pass
4194304 23980.15 Pass
$ srun -N1 -n2 --mpi=pmix --environment=omb-ompi ./pt2pt/osu_bw --validation D D
# OSU MPI-CUDA Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s) Validation
1 0.91 Pass
2 1.83 Pass
4 3.73 Pass
8 7.47 Pass
16 14.99 Pass
32 29.98 Pass
64 59.72 Pass
128 119.13 Pass
256 241.88 Pass
512 481.52 Pass
1024 963.60 Pass
2048 1917.15 Pass
4096 3840.96 Pass
8192 6942.05 Pass
16384 13911.45 Pass
32768 18379.14 Pass
65536 21761.73 Pass
131072 23069.72 Pass
262144 23543.98 Pass
524288 23750.83 Pass
1048576 23882.44 Pass
2097152 23951.34 Pass
4194304 23989.44 Pass
Next is the all-to-all latency test osu_alltoall, for 8 ranks spread over two nodes (4 ranks per node, 1 rank per GPU).
$ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi ./collective/osu_alltoall --validation
# OSU MPI All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Validation
1 12.46 Pass
2 12.05 Pass
4 11.99 Pass
8 11.84 Pass
16 11.87 Pass
32 11.84 Pass
64 11.95 Pass
128 12.22 Pass
256 13.21 Pass
512 13.23 Pass
1024 13.37 Pass
2048 13.52 Pass
4096 13.88 Pass
8192 17.32 Pass
16384 18.98 Pass
32768 23.72 Pass
65536 36.53 Pass
131072 62.96 Pass
262144 119.44 Pass
524288 236.43 Pass
1048576 519.85 Pass
$ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi-no-cxi ./collective/osu_alltoall --validation
# OSU MPI All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Validation
1 137.85 Pass
2 133.47 Pass
4 134.03 Pass
8 131.14 Pass
16 134.45 Pass
32 135.35 Pass
64 137.21 Pass
128 137.03 Pass
256 139.90 Pass
512 140.70 Pass
1024 165.05 Pass
2048 197.14 Pass
4096 255.02 Pass
8192 335.75 Pass
16384 543.12 Pass
32768 928.81 Pass
65536 782.28 Pass
131072 1812.95 Pass
262144 2284.26 Pass
524288 3213.63 Pass
1048576 5688.27 Pass
$ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi ./collective/osu_alltoall --validation -d cuda
# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Validation
1 22.26 Pass
2 22.08 Pass
4 22.15 Pass
8 22.19 Pass
16 22.25 Pass
32 22.11 Pass
64 22.22 Pass
128 21.98 Pass
256 22.19 Pass
512 22.20 Pass
1024 22.37 Pass
2048 22.58 Pass
4096 22.99 Pass
8192 27.22 Pass
16384 28.55 Pass
32768 32.60 Pass
65536 44.88 Pass
131072 70.15 Pass
262144 123.30 Pass
524288 234.89 Pass
1048576 486.89 Pass
$ srun -N2 --ntasks-per-node=4 --mpi=pmix --environment=omb-ompi-no-cxi ./collective/osu_alltoall --validation -d cuda
# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size Avg Latency(us) Validation
1 186.92 Pass
2 180.80 Pass
4 180.72 Pass
8 179.45 Pass
16 209.53 Pass
32 181.73 Pass
64 182.20 Pass
128 182.84 Pass
256 188.29 Pass
512 189.35 Pass
1024 237.31 Pass
2048 231.73 Pass
4096 298.73 Pass
8192 396.10 Pass
16384 589.72 Pass
32768 983.72 Pass
65536 786.48 Pass
131072 1127.39 Pass
262144 2144.57 Pass
524288 3107.62 Pass
1048576 5545.28 Pass
Known issues
Some asynchronous collectives are known not to work with GPU buffers, independent of the libfabric provider used.
For example, MPI_Iallreduce fails with a segmentation fault.
Running the osu_iallreduce benchmark with GPU buffers results in:
$ srun -u --mpi=pmix -n4 osu_iallreduce -d cuda
# OSU MPI-CUDA Non-blocking Allreduce Latency Test v7.5
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait
# Datatype: MPI_INT.
# Size Overall(us) Compute(us) Pure Comm.(us) Overlap(%)
[nid006549:31808] *** Process received signal ***
[nid006549:31808] Signal: Segmentation fault (11)
[nid006549:31808] Signal code: Invalid permissions (2)
[nid006549:31808] Failing at address: 0x4002da000000
[nid006550:188198] *** Process received signal ***
[nid006550:188198] Signal: Segmentation fault (11)
[nid006550:188198] Signal code: Invalid permissions (2)
[nid006550:188198] Failing at address: 0x40029a000000
[nid006549:31808] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x400027ce07dc]
[nid006549:31808] [ 1] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(+0x19f1c8)[0x400029b0f1c8]
[nid006549:31808] [ 2] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(+0x12836c)[0x400029a9836c]
[nid006549:31808] [ 3] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(NBC_Progress+0x164)[0x400029a97bd4]
[nid006549:31808] [ 4] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(ompi_coll_libnbc_progress+0x8c)[0x400029a96a0c]
[nid006549:31808] [ 5] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libopen-pal.so.80(opal_progress+0x3c)[0x40002a23737c]
[nid006549:31808] [ 6] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(ompi_request_default_wait+0x50)[0x4000299f3810]
[nid006549:31808] [ 7] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(MPI_Wait+0x64)[0x400029a3df24]
[nid006549:31808] [ 8] /user-environment/env/default/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x40424c]
[nid006549:31808] [ 9] /lib64/libc.so.6(__libc_start_main+0xe8)[0x40002a073fa0]
[nid006549:31808] [10] /user-environment/env/default/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x404e98]
[nid006549:31808] *** End of error message ***
[nid006550:188198] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x4000026a07dc]
[nid006550:188198] [ 1] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(+0x19f1c8)[0x4000044cf1c8]
[nid006550:188198] [ 2] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(+0x12836c)[0x40000445836c]
[nid006550:188198] [ 3] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(NBC_Progress+0x164)[0x400004457bd4]
[nid006550:188198] [ 4] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(ompi_coll_libnbc_progress+0x8c)[0x400004456a0c]
[nid006550:188198] [ 5] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libopen-pal.so.80(opal_progress+0x3c)[0x400004bf737c]
[nid006550:188198] [ 6] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(ompi_request_default_wait+0x50)[0x4000043b3810]
[nid006550:188198] [ 7] /user-environment/linux-neoverse_v2/openmpi-5.0.9-leskuw5dyswfdw3eaybcyfmsrbid3uuq/lib/libmpi.so.40(MPI_Wait+0x64)[0x4000043fdf24]
[nid006550:188198] [ 8] /user-environment/env/default/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x40424c]
[nid006550:188198] [ 9] /lib64/libc.so.6(__libc_start_main+0xe8)[0x400004a33fa0]
[nid006550:188198] [10] /user-environment/env/default/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x404e98]
[nid006550:188198] *** End of error message ***
srun: error: nid006549: task 0: Segmentation fault (core dumped)
srun: Terminating StepId=2243671.21
[2025-12-17T12:59:34.342] error: *** STEP 2243671.21 ON nid006549 CANCELLED AT 2025-12-17T12:59:34 DUE TO TASK FAILURE ***
srun: error: nid006550: task 2: Segmentation fault (core dumped)
srun: error: nid006550: task 3: Terminated
srun: error: nid006549: task 1: Terminated
srun: Force Terminated StepId=2243671.21