LLM Inference Tutorial
This tutorial will guide you through the steps required to set up a PyTorch container and do ML inference. This means that we load an existing machine learning model, prompt it with some custom data, and run the model to see what output it will generate with our data.
To complete the tutorial, we get a PyTorch container from Nvidia's GPU Cloud (NGC), customize it to suit our needs, and tell the Container Engine how to run it. Finally, we set up and run a Python script that loads the machine learning model and generates some output.
The model we will be running is Google's Gemma-7B in the instruction-tuned variant. This is an LLM similar in style to popular chat assistants like ChatGPT, which can generate text responses to text prompts that we feed into it.
Gemma-7B Inference using NGC PyTorch
Prerequisites
This tutorial assumes you are able to access the cluster via SSH. To set up access to CSCS systems, follow the guide here, and read through the documentation about the ML Platform.
For clarity, we prepend all shell commands with the hostname and any active Python virtual environment they are executed in. For example, clariden-lnXXX refers to a login node on Clariden, while nidYYYYYY is a compute node (with placeholders for numeric values). The commands listed here are run on Clariden, but can be adapted slightly to run on other vClusters as well.
Note
Login nodes are a shared environment for editing files, preparing and submitting SLURM jobs as well as inspecting logs. They are not intended for running significant data processing or compute work. Any memory- or compute-intensive work should instead be done on compute nodes.
If you need to move data externally or internally, please follow the corresponding guides using Globus or the xfer queue, respectively.
Build a modified NGC PyTorch Container
In theory, we could just go ahead and use the vanilla container image to run some PyTorch code. However, chances are that we will need some additional libraries or software. For this reason, we need to build another image on top of the one provided by Nvidia. To do this, we create a new directory for this tutorial under $SCRATCH and set up a Dockerfile in it:
[clariden-lnXXX]$ cd $SCRATCH
[clariden-lnXXX]$ mkdir -p tutorials/gemma-7b
[clariden-lnXXX]$ cd tutorials/gemma-7b
Use your favorite text editor to create a file named Dockerfile here. The Dockerfile should look like this:
FROM nvcr.io/nvidia/pytorch:24.01-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get install -y python3.10-venv && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
The first line specifies that we are working on top of an existing container. In this case we start FROM an NGC PyTorch container. Next, we set an environment variable with ENV that lets apt-get run without interactive prompts in the container. Finally, we RUN the package installer apt-get to install the Python venv module for creating virtual environments. This will let us install Python packages later on without having to rebuild the container again and again. The remaining commands in this instruction (apt-get clean and removing /var/lib/apt/lists) tidy up the package caches. If you want to understand what is happening in more detail, take a look at the Docker documentation.
Recent changes in NGC releases
Starting with the 24.11 release, NGC PyTorch no longer requires the installation of the Python venv module. That is, the Dockerfile simplifies to only the first line, e.g. for the 25.06 release:
FROM nvcr.io/nvidia/pytorch:25.06-py3
The remaining steps can then be performed equivalently, replacing the version number 24.01 by the one chosen in the Dockerfile (e.g. 25.06).
It is generally recommended to use one of the most recent versions of NGC, unless your application has a strong compatibility reason to stay on an older version.
Now that we've set up the Dockerfile, we can go ahead and pass it to Podman to build a container. Podman is a tool that enables us to fetch, manipulate, and interact with containers on the cluster. For more information, please see the Container Engine page. To use Podman, we first need to configure some storage locations for it. This step is straightforward: just create the file $HOME/.config/containers/storage.conf with the following content:
[storage]
driver = "overlay"
runroot = "/dev/shm/$USER/runroot"
graphroot = "/dev/shm/$USER/root"
[storage.options.overlay]
mount_program = "/usr/bin/fuse-overlayfs-1.13"
Warning
If $XDG_CONFIG_HOME is set, place this file at $XDG_CONFIG_HOME/containers/storage.conf instead.
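The location rule above can be sketched in a few lines of Python. This is only an illustration: the helper name podman_storage_conf is made up for this example, and the default location $HOME/.config/containers/storage.conf is assumed to be the one used when $XDG_CONFIG_HOME is unset.

```python
import os
from pathlib import PurePosixPath


def podman_storage_conf(env=None):
    """Return the expected per-user storage.conf path (hypothetical helper)."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CONFIG_HOME")
    if xdg:
        # XDG_CONFIG_HOME takes precedence when it is set and non-empty
        base = PurePosixPath(xdg)
    else:
        # Assumed default: the .config directory in the home directory
        base = PurePosixPath(env.get("HOME") or os.path.expanduser("~")) / ".config"
    return base / "containers" / "storage.conf"


if __name__ == "__main__":
    path = podman_storage_conf()
    state = "exists" if os.path.isfile(str(path)) else "missing"
    print(f"{path}: {state}")
```

Running the script on a login node prints the path that applies to your environment and whether the file is already there.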
Before building the container image, we create a dedicated directory to keep track of all images used with the Container Engine (CE). Since container images are large files and the filesystem is a shared resource, we need to apply best practices for Lustre so they are properly distributed across storage nodes.
[clariden-lnXXX]$ mkdir -p $SCRATCH/ce-images
[clariden-lnXXX]$ lfs setstripe -E 4M -c 1 -E 64M -c 4 -E -1 -c -1 -S 4M \
$SCRATCH/ce-images # (1)!
- This makes sure that files subsequently stored in this directory end up on a single storage node (up to 4 MB), on 4 storage nodes (between 4 and 64 MB), or are striped across all storage nodes (above 64 MB).
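To make the striping rule more concrete, here is a minimal Python sketch of how a file of a given size maps onto the three size ranges set by the lfs setstripe command above. It assumes that each -E value marks the end of a range, that M means 2^20 bytes, and that a stripe count of -1 stands for all storage nodes. The names LAYOUT and components_used are made up for this illustration and are not part of any Lustre tooling.

```python
MIB = 1024 * 1024

# (range end in bytes, stripe count); None = up to end of file, -1 = all storage nodes
LAYOUT = [(4 * MIB, 1), (64 * MIB, 4), (None, -1)]


def components_used(file_size, layout=LAYOUT):
    """List the (start, end, stripe_count) ranges a file of this size touches."""
    used = []
    start = 0
    for end, count in layout:
        if file_size <= start:
            break  # the file ends before this range begins
        stop = file_size if end is None else min(end, file_size)
        used.append((start, stop, count))
        if end is None:
            break
        start = end
    return used


if __name__ == "__main__":
    for size_mib in (1, 10, 100):
        print(size_mib, "MiB ->", components_used(size_mib * MIB))
```

Under these assumptions, a container image of several gigabytes touches all three ranges, so most of its data is spread over all storage nodes.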
To build a container with Podman, we need to request a shell on a compute node from Slurm, pass the Dockerfile to Podman, and finally import the freshly built container to the dedicated directory using enroot. Slurm is a workload manager which distributes workloads on the cluster. Through Slurm, many people can use the supercomputer at the same time without interfering with one another.
[clariden-lnXXX]$ srun -A <ACCOUNT> --pty bash
[nidYYYYYY]$ podman build -t ngc-pytorch:24.01 . # (1)!
# ... lots of output here ...
[nidYYYYYY]$ enroot import -x mount \
-o $SCRATCH/ce-images/ngc-pytorch+24.01.sqsh \
podman://ngc-pytorch:24.01 # (2)!
# ... more output here ...
- This builds the container image with the current working directory as the build context. The Dockerfile inside that directory is implicitly used as a recipe. If it is named differently, use the -f path/to/Dockerfile option.
- The newly built container image is imported and stored under $SCRATCH/ce-images.
In the srun command, replace <ACCOUNT> with your project account ID.
At this point, you can exit the Slurm allocation by typing exit.
You should be able to see a new Squashfs file in your container image directory:
[clariden-lnXXX]$ ls $SCRATCH/ce-images
ngc-pytorch+24.01.sqsh
This Squashfs file is essentially a compressed container image, which can be run directly by the Container Engine.
We will use our freshly built container ngc-pytorch+24.01.sqsh in the following steps to run a PyTorch script that loads the Google Gemma-7B model and performs some inference with it.
Note
In order to import a container image from a registry without building additional layers on top of it, we can directly use enroot (without podman). This is useful in this tutorial if we want to use a more recent NGC PyTorch container (release 24.11 or later). Use the following syntax for importing the 25.06 release:
[nidYYYYYY]$ enroot import -x mount \
-o $SCRATCH/ce-images/ngc-pytorch+25.06.sqsh \
docker://nvcr.io#nvidia/pytorch:25.06-py3
Set up an EDF
We need to set up an EDF (Environment Definition File) which tells the Container Engine what container image to load, which paths to mount from the host filesystem, and what plugins to load. Use your favorite text editor to create a file ngc-pytorch-gemma-24.01.toml in $SCRATCH/tutorials/gemma-7b for the Container Engine. The EDF should look like this:
image = "${SCRATCH}/ce-images/ngc-pytorch+24.01.sqsh" # (1)!
mounts = [
"/capstor",
"/iopsstor"
] # (2)!
workdir = "${SCRATCH}/tutorials/gemma-7b" # (3)!
[annotations]
com.hooks.aws_ofi_nccl.enabled = "true" # (4)!
com.hooks.aws_ofi_nccl.variant = "cuda12"
[env]
NCCL_DEBUG = "INFO" # (5)!
CUDA_CACHE_DISABLE = "1" # (6)!
TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)!
MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
- It is important to use curly braces for environment variables used in the EDF.
- The path /users is not mounted since it often contains user-specific initialization scripts for the host environment, and many frameworks leave temporary data behind that can lead to non-trivial runtime errors when swapping container images. Thus, it is recommended to selectively mount specific subfolders under ${HOME} if needed.
- You can use ${PWD} as an alternative to start in the directory from which the container is launched.
- This enables NCCL installed in the container to make effective use of the Slingshot interconnect on Alps by interfacing with the AWS OFI NCCL plugin with libfabric. While not strictly needed for single-node workloads, it is good practice to keep it always on.
- This makes NCCL output debug info during initialization, which can be useful to spot communication-related issues in a distributed scenario (see later tutorials). The subsystems that emit debug logs can be configured with NCCL_DEBUG_SUBSYS.
- Avoid writing JIT-compiled binaries to the (distributed) file system, which could lead to performance issues.
- Enable asynchronous error handling: when the NCCL watchdog observes an exception, the NCCL communicator is aborted and the process is torn down.
- Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL.
If you've decided to build the container somewhere else, make sure to supply the correct path to the image variable.
The image variable defines which container we want to load. This could either be a container from an online Docker registry, like nvcr.io/nvidia/pytorch:24.01-py3, or in our case, a local Squashfs file which we built ourselves.
The mounts variable defines which directories we want to mount where in our container. In general, it's a good idea to use a directory under /capstor/scratch to store outputs from any scientific software, as this filesystem is optimized for sequential write operations as described in Alps storage. This particularly applies to e.g. checkpoints from ML training, which we will see in the next tutorials (where it also matters to apply good Lustre settings beforehand, as for container images). In this tutorial, we will not generate a lot of output, but it's a good practice to stick to anyway.
Finally, the workdir variable tells the Container Engine where to start working. If we request a shell, this is where we will find ourselves dropped initially after starting the container.
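If you follow the tutorial with a different NGC release, only the image path and the EDF file name depend on the version number. The following Python sketch fills a copy of the EDF above from a template. The helper names render_edf and edf_filename are made up for this example, and keeping the cuda12 variant of the hook for other releases is an assumption you should check against the release you use.

```python
# Template copied from the EDF above; doubled braces survive str.format as literal braces
EDF_TEMPLATE = """\
image = "${{SCRATCH}}/ce-images/ngc-pytorch+{version}.sqsh"

mounts = [
    "/capstor",
    "/iopsstor"
]

workdir = "${{SCRATCH}}/tutorials/gemma-7b"

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"

[env]
NCCL_DEBUG = "INFO"
CUDA_CACHE_DISABLE = "1"
TORCH_NCCL_ASYNC_ERROR_HANDLING = "1"
MPICH_GPU_SUPPORT_ENABLED = "0"
"""


def render_edf(version):
    """Return the EDF text for the given NGC PyTorch version (hypothetical helper)."""
    return EDF_TEMPLATE.format(version=version)


def edf_filename(version):
    """File name following the naming scheme used in this tutorial."""
    return f"ngc-pytorch-gemma-{version}.toml"


if __name__ == "__main__":
    print(f"# {edf_filename('25.06')}")
    print(render_edf("25.06"))
```

The script only prints the result, so you can review it before saving it next to your Dockerfile.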
Set up a Python Virtual Environment
This will be the first time we run our modified container. To run the container, we need to allocate some compute resources using Slurm and launch a shell, just like we already did to build the container. This time, we also use the --environment option to specify that we want to launch the shell inside the container specified by our ngc-pytorch-gemma-24.01.toml EDF:
[clariden-lnXXX]$ cd $SCRATCH/tutorials/gemma-7b
[clariden-lnXXX]$ srun -A <ACCOUNT> \
--environment=./ngc-pytorch-gemma-24.01.toml --pty bash
PyTorch is already set up in the container for us. We can verify this by asking pip for a list of installed packages:
user@nidYYYYYY$ python -m pip list | grep torch
pytorch-quantization 2.1.2
torch 2.2.0a0+81ea7a4
torch-tensorrt 2.2.0a0
torchdata 0.7.0a0
torchtext 0.17.0a0
torchvision 0.17.0a0
However, we will need to install a few more Python packages to make it easier to do inference with Gemma-7B. While it is best practice to install stable dependencies in the container image, we can maintain frequently changing packages in a virtual environment built on top of the container image. The --system-site-packages option of the Python venv creation command ensures that we install packages in addition to the existing packages and don't accidentally install a new version of PyTorch that shadows the one put in place by Nvidia. Next, we activate the environment and use pip to install the packages we need: accelerate, transformers and the huggingface_hub command-line interface:
user@nidYYYYYY$ python -m venv --system-site-packages venv-gemma-24.01
user@nidYYYYYY$ source venv-gemma-24.01/bin/activate
(venv-gemma-24.01) user@nidYYYYYY$ pip install \
accelerate==0.30.1 transformers==4.38.1 huggingface_hub[cli]
# ... pip output ...
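To check that PyTorch is still the one shipped in the container image and has not been shadowed by a copy inside the virtual environment, a small script along the following lines can be run with the environment activated. It is a sketch: the helper names is_inside and package_origin are made up for this example, and the check simply compares each package's file location with the virtual environment prefix.

```python
import importlib.util
import os
import sys


def is_inside(path, prefix):
    """True if `path` lies under the directory `prefix`."""
    path = os.path.abspath(path)
    prefix = os.path.abspath(prefix)
    return os.path.commonpath([path, prefix]) == prefix


def package_origin(name):
    """Return the file a top-level package would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Inside a virtual environment sys.prefix differs from sys.base_prefix
    in_venv = sys.prefix != sys.base_prefix
    print(f"virtual environment active: {in_venv} ({sys.prefix})")
    for name in ("torch", "transformers", "accelerate"):
        origin = package_origin(name)
        if origin is None:
            print(f"{name}: not found")
        elif in_venv and is_inside(origin, sys.prefix):
            print(f"{name}: from the virtual environment ({origin})")
        else:
            print(f"{name}: from the container image ({origin})")
```

With the setup described above, torch should be reported as coming from the container image, while accelerate and transformers should come from the virtual environment.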
Before we move on to running the Gemma-7B model, we additionally need to make an account at Hugging Face, get an API token, and accept the license agreement for the Gemma-7B model. You can save the token under $SCRATCH using the huggingface-cli:
(venv-gemma-24.01) user@nidYYYYYY$ export HF_HOME=$SCRATCH/huggingface
(venv-gemma-24.01) user@nidYYYYYY$ huggingface-cli login
At this point, you can exit the Slurm allocation again by typing exit.
If you ls the contents of the $SCRATCH/tutorials/gemma-7b folder, you will see that the venv-gemma-24.01 virtual environment folder persists outside of the Slurm job.
Note
Keep in mind that
- this virtual environment won't actually work unless you're running something from inside the PyTorch container. This is because the virtual environment ultimately relies on the resources packaged inside the container.
- every Slurm job making use of this virtual environment will need to activate it first (inside the srun command).
Since HF_HOME will not only contain the API token, but also be the storage location for the model, dataset and space caches of huggingface_hub (unless HF_HUB_CACHE is set), we also want to apply proper Lustre striping settings before it gets populated.
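The relationship between the two variables can be sketched as follows. This mirrors the defaults of huggingface_hub as far as we are aware of them and is not taken from the library itself: the hub subdirectory, the token file name and the fallback to the .cache directory in the home directory are assumptions of this example, and hf_paths is a made-up helper name.

```python
import posixpath


def hf_paths(env):
    """Resolve where the token and the hub cache would end up (illustrative only)."""
    # Assumed fallback when HF_HOME is unset: ~/.cache/huggingface
    home = env.get("HF_HOME") or posixpath.join(
        env.get("HOME", "~"), ".cache", "huggingface"
    )
    # HF_HUB_CACHE, when set, overrides the cache location only
    hub_cache = env.get("HF_HUB_CACHE") or posixpath.join(home, "hub")
    return {
        "home": home,
        "token": posixpath.join(home, "token"),
        "hub_cache": hub_cache,
    }


if __name__ == "__main__":
    import os

    for key, value in hf_paths(dict(os.environ)).items():
        print(f"{key}: {value}")
```

The directory reported as hub_cache is the one that receives the large model files, so it is the one where the striping settings matter most.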
Run Inference on Gemma-7B
Cool, now you have a working container with PyTorch and all the necessary Python packages installed! Let's move on to Gemma-7B. We write a Python script to load the model and prompt it with some custom text. Save it as gemma-inference.py in $SCRATCH/tutorials/gemma-7b. The Python script should look like this:
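from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the tokenizer and the instruction-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# device_map="auto" lets accelerate place the weights on the available GPUs
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
# Tokenize the prompt and move the resulting tensors to the GPU
input_text = "Write me a poem about the Swiss Alps."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# Generate up to 1024 new tokens and print the decoded text
outputs = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))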
Feel free to change the input_text variable to whatever prompt you like.
All that remains is to run the Python script inside the PyTorch container. There are several ways of doing this. As before, you could just use Slurm to get an interactive shell in the container. Then you would source the virtual environment and run the Python script we just wrote. There's nothing wrong with this approach per se, but consider that you might be running much more complex and lengthy Slurm jobs in the future. You'll want to document how you're calling Slurm and what commands you're running on the shell, and you might not want to (or might not be able to) keep a terminal open for the length of time the job might take. For this reason, it often makes sense to write a batch file, which enables you to document all these processes and run the Slurm job regardless of whether you're still connected to the cluster.
Create a Slurm batch file submit-gemma-inference.sh. It should look like this:
#!/bin/bash
#SBATCH --account=<ACCOUNT>
#SBATCH --job-name=gemma-inference
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=288
#SBATCH --output logs/slurm-%x-%j.out
export HF_HOME=$SCRATCH/huggingface
export TRANSFORMERS_VERBOSITY=info
cd $SCRATCH/tutorials/gemma-7b # (1)!
set -x
srun -ul --environment=./ngc-pytorch-gemma-24.01.toml bash -c "
source venv-gemma-24.01/bin/activate
python gemma-inference.py
"
- Change directory if submitted with sbatch from a different directory
The first few lines of the batch script declare the shell we want to use to run this batch file and pass several options to the Slurm scheduler. After this, we cd to our working directory and srun the command in our container environment that sources our virtual environment and finally runs our inference script.
The operations performed before the srun command largely resemble the operations performed on the login node above and, in fact, happen in the host environment. If you need to perform these steps in the container environment as well, you can alternatively use the #SBATCH --environment=path/to/ngc-pytorch-gemma-24.01.toml option instead of using --environment with srun.
#SBATCH --environment
Use of the --environment option for sbatch is still considered experimental and could result in unexpected behavior. In particular, avoid mixing #SBATCH --environment and srun --environment in the same job.
Use of --environment is currently only recommended for the srun command.
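The --output option in the batch script uses Slurm's filename placeholders, where %x stands for the job name and %j for the job ID. As a small illustration of how the log file name comes about, the following Python sketch expands just these two placeholders. The function name expand_output_pattern is made up for this example, and other placeholders supported by Slurm are deliberately not handled.

```python
import posixpath


def expand_output_pattern(pattern, job_name, job_id):
    """Expand the %x (job name) and %j (job ID) placeholders of --output."""
    return pattern.replace("%x", job_name).replace("%j", str(job_id))


def log_directory(pattern):
    """Directory part of the pattern, relative to the submission directory."""
    return posixpath.dirname(pattern)


if __name__ == "__main__":
    pattern = "logs/slurm-%x-%j.out"
    print(expand_output_pattern(pattern, "gemma-inference", 543210))
    print("log directory:", log_directory(pattern))
```

With the job name gemma-inference from the batch script, this yields the kind of file name that appears in the logs directory once the job has run.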
Once you've finished editing the batch file, you can save it and run it with Slurm. Make sure the logs directory exists in the directory you submit from, since Slurm does not create it:
[clariden-lnXXX]$ sbatch submit-gemma-inference.sh
This command returns immediately, printing only the ID of the submitted job, and hands you back your terminal.
At this point, you can follow the output in your shell using tail -f logs/slurm-gemma-inference-<job-id>.out.
Besides that, you're free to do whatever you like; you can close the terminal, keep working, or just wait for the Slurm job to finish.
You can always check on the state of your job by logging back into the cluster and running squeue -l --me.
Once your job finishes, you will find a file named something like logs/slurm-gemma-inference-<job-id>.out, relative to the directory you submitted the job from, containing the output generated by your Slurm job.
For this tutorial, you should see something like the following:
[clariden-lnXXX]$ cat logs/slurm-gemma-inference-543210.out
/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00, 1.13it/s]
/capstor/scratch/cscs/user/gemma-inference/venv-gemma-24.01/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
<bos>Write me a poem about the Swiss Alps.
In the heart of Switzerland, where towering peaks touch sky,
Lies a playground of beauty, beneath the watchful eye.
The Swiss Alps, a majestic force,
A symphony of granite, snow, and force.
Snow-laden peaks pierce the heavens above,
Their glaciers whisper secrets of ancient love.
Emerald valleys bloom with flowers,
A tapestry of colors, a breathtaking sight.
Hiking trails wind through meadows and woods,
Where waterfalls cascade, a silent song unfolds.
The crystal clear lakes reflect the sky above,
A mirror of dreams, a place of peace and love.
The Swiss Alps, a treasure to behold,
A land of wonder, a story untold.
From towering peaks to shimmering shores,
They inspire awe, forevermore.<eos>
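As the output shows, the decoded text contains the special tokens <bos> and <eos> as well as the prompt itself. The tokenizer can drop the special tokens for you via tokenizer.decode(outputs[0], skip_special_tokens=True). If you would rather post-process the text from the log, a plain string helper such as the following sketch is enough. The function name extract_completion is made up for this example, and the token strings are taken from the output above.

```python
def extract_completion(decoded, prompt, bos="<bos>", eos="<eos>"):
    """Strip the special tokens and the echoed prompt from decoded model output."""
    text = decoded
    if text.startswith(bos):
        text = text[len(bos):]
    if text.startswith(prompt):
        text = text[len(prompt):]
    if text.endswith(eos):
        text = text[:-len(eos)]
    return text.strip()


if __name__ == "__main__":
    decoded = (
        "<bos>Write me a poem about the Swiss Alps.\n"
        "In the heart of Switzerland, where towering peaks touch sky,<eos>"
    )
    print(extract_completion(decoded, "Write me a poem about the Swiss Alps."))
```

This keeps only the text the model generated, which is handy when you collect responses for several prompts.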
Congrats! You've run Google Gemma-7B inference on four GH200 chips simultaneously. Move on to the next tutorial or try the challenge.
Collaborating with Git
In order to track and exchange your progress with colleagues, you can use standard git commands on the host, i.e. in the directory $SCRATCH/tutorials/gemma-7b run
[clariden-lnXXX]$ git init .
[clariden-lnXXX]$ git remote add origin \
git@github.com:<github-username>/alps-mlp-tutorials-gemma-7b.git # (1)!
[clariden-lnXXX]$ ... # git add/commit
- Use any alternative Git hosting service instead of GitHub
where you can replace <github-username> by the owner of the GitHub repository you want to push to.
Note that for reproducibility, it is recommended to always track the Dockerfile and EDF alongside your application code in a Git repository.
Challenge
Using the same approach as in the section Set up a Python Virtual Environment, use pip to install the package nvitop. This is a tool that shows you a concise real-time summary of GPU activity. Then, run Gemma and launch nvitop at the same time:
(venv-gemma-24.01) user@nidYYYYYY$ python gemma-inference.py > gemma-output.log 2>&1 & nvitop
Note the use of the bash redirection > gemma-output.log 2>&1 to hide any output from Python.
Note also the use of the single ampersand '&', which backgrounds the first command in order to run nvitop exclusively in the foreground.
After a moment, you will see your Python script spawn on all four GPUs, after which the GPU activity will increase a bit and then go back to idle.
At this point, you can hit q to quit nvitop, and you will find the output of your Python script in gemma-output.log.