JupyterLab¶
Access and setup¶
The JupyterHub service enables the interactive execution of JupyterLab on the compute nodes of Daint, Clariden, Santis and Eiger.
The service is accessed at jupyter-daint.cscs.ch, jupyter-clariden.cscs.ch, jupyter-santis.cscs.ch and jupyter-eiger.cscs.ch, respectively. As the notebook servers are executed on compute nodes, you must have a project with compute resources available on the respective cluster.
Once logged in, you will be redirected to the JupyterHub Spawner Options form, where typical job configuration options can be selected. These options might include the type and number of compute nodes, the wall time limit, and your project account.
By default, JupyterLab servers are launched in a dedicated queue, which should ensure a start-up time of less than a few minutes. If your server is not running within 5 minutes we encourage you to first try the non-dedicated queue, and then contact us.
When resources are granted, the page redirects to the JupyterLab session, where you can browse, open and execute notebooks on the compute nodes. A new notebook with a Python 3 kernel can be created with the menu `new` and then `Python 3`. Under `new` it is also possible to create new text files and folders, as well as to open a terminal session on the allocated compute node.
Debugging
The log file of a JupyterLab server session is saved in `$HOME` in a file named `slurm-<jobid>.out`. If you encounter problems with your JupyterLab session, the contents of this file can contain clues to debug the issue.
Unexpected error while saving file: disk I/O error.
This error message indicates that you have run out of disk quota.
You can check your quota using the command `quota`.
Runtime environment¶
A Jupyter session can be started with either a uenv or a container as a base image. The JupyterHub Spawner form provides a set of default images, such as the prgenv-gnu uenv or the NGC PyTorch container, to choose from in a dropdown menu. When using a uenv, the software stack will be mounted at `/user-environment` and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with Launch JupyterLab.
Using remote uenv for the first time.
If the uenv is not present in the local repository, it will be automatically fetched. As a result, JupyterLab may take slightly longer than usual to start.
Ending your interactive session and logging out
The Jupyter servers can be shut down through the Hub. To end a JupyterLab session, please select Hub Control Panel under the File menu and then Stop My Server. By contrast, clicking Logout will log you out of the server, but the server will continue to run until the Slurm job reaches its maximum wall time.
If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or a container engine (CE) TOML file under the section Advanced options before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages.
Example of a custom PyTorch container

A container image based on a recent NGC PyTorch release requires the installation of the following additional packages to be compatible with the Jupyter service:
```dockerfile
FROM nvcr.io/nvidia/pytorch:25.05-py3

RUN pip install --no-cache-dir \
        jupyterlab \
        jupyterhub==4.1.6 \
        pyfirecrest==1.2.0 \
        SQLAlchemy==1.4.52 \
        oauthenticator==16.3.1 \
        notebook==7.3.3 \
        jupyterlab_nvdashboard==0.13.0 \
        git+https://github.com/eth-cscs/firecrestspawner.git
```
The package nvdashboard is also installed here, which makes it possible to monitor system metrics at runtime.

A corresponding TOML file can look like this:
image = "/capstor/scratch/cscs/${USER}/ce-images/ngc-pytorch+25.05.sqsh"
mounts = [
"/capstor",
"/iopsstor",
"/users/${USER}/.local/share/jupyter", # (1)!
"/etc/slurm", # (2)!
"/usr/lib64/libslurm-uenv-mount.so",
"/etc/container_engine_pyxis.conf" # (3)!
]
workdir = "/capstor/scratch/cscs/${USER}" # (4)!
writable = true
[annotations]
com.hooks.aws_ofi_nccl.enabled = "true" # (5)!
com.hooks.aws_ofi_nccl.variant = "cuda12"
[env]
CUDA_CACHE_DISABLE = "1" # (6)!
TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)!
MPICH_GPU_SUPPORT_ENABLED = "0" # (8)!
1. Avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels
2. Enable Slurm commands (together with the two subsequent mounts)
3. Currently only required on Daint and Santis, not on Clariden
4. Set the working directory of the Jupyter session (file browser root directory)
5. Use environment settings for optimized communication
6. Disable the CUDA JIT cache
7. Async error handling when an exception is observed in the NCCL watchdog: aborting the NCCL communicator and tearing down the process upon error
8. Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL
Accessing file systems with uenv
While Jupyter sessions with the CE start in the directory specified by `workdir`, a uenv session always starts in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you cannot browse directly to folders above `$HOME`. To enable access to your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to it. This can be done by issuing the following command in a terminal from your `$HOME` directory, for example:
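```bash
ln -s $SCRATCH scratch   # the link name "scratch" is illustrative; any name works
```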
Creating Jupyter kernels¶
A kernel, in the context of Jupyter, is a program together with environment settings that runs the user code within Jupyter notebooks. For Python, Jupyter kernels make it possible to access, from Jupyter notebooks, the (system) Python installation of a uenv or container, that of a virtual environment built on top, or any other custom Python installation such as Anaconda/Miniconda. Alternatively, a kernel can also be created for other programming languages such as Julia, allowing e.g. the execution of Julia code in notebook cells.
As a preliminary step to running any code in Jupyter notebooks, a kernel needs to be installed, which is described in the following for both Python and Julia.
Using Python in Jupyter¶
For Python, the recommended setup consists of a uenv or container as a base image, as described above, that includes the stable dependencies of the software stack. Additional packages can be installed in a virtual environment on top of the Python installation in the base image (mandatory for most uenvs). With the base image loaded, such a virtual environment can be created with
```bash
python -m venv --system-site-packages venv-<base-image-version>
```
where `<base-image-version>` can be replaced by an identifier uniquely referring to the base image (such virtual environments are specific to the base image and are not portable).
Jupyter kernels for Python are powered by `ipykernel`. As a result, `ipykernel` must be installed in the target environment that will be used as a kernel. That can be done with `pip install ipykernel` (either as part of a Dockerfile or in an activated virtual environment on top of a uenv/container image).
A kernel can now be created from an active Python virtual environment with the following commands
```bash
. venv-<base-image-version>/bin/activate  # (1)!
python -m ipykernel install \
    ${VIRTUAL_ENV:+--env PATH $PATH --env VIRTUAL_ENV $VIRTUAL_ENV} \
    --user --name="<kernel-name>"  # (2)!
```
1. This step is only necessary when working with a virtual environment on top of the base image
2. The expression in braces makes sure the kernel's environment is properly configured when using a virtual environment (which must be activated). The flag `--user` installs the kernel to a path under `${HOME}/.local/share/jupyter`.
The `<kernel-name>` can be replaced by a name specific to the base image/virtual environment.
Python packages from uenv shadowing those in a virtual environment
When using a uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the uenv paths being included in the `PYTHONPATH` environment variable. As a consequence, even if you install a different version of a package in the virtual environment than the one available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment.
In addition, the modified `PYTHONPATH` needs to be passed on to the Jupyter environment so that the kernel picks it up. This can be done, for example, as follows (a sketch, assuming the virtual environment created above):
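```bash
# Activate the virtual environment and prepend its site-packages to PYTHONPATH
. venv-<base-image-version>/bin/activate
export PYTHONPATH="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])'):${PYTHONPATH}"

# Re-create the kernel, passing the modified PYTHONPATH on to the Jupyter environment
python -m ipykernel install \
    --env PYTHONPATH "$PYTHONPATH" \
    --env PATH "$PATH" --env VIRTUAL_ENV "$VIRTUAL_ENV" \
    --user --name="<kernel-name>"
```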
It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application.
Using Julia in Jupyter¶
To run Julia code in Jupyter notebooks, you can use the provided uenv for this language. In particular, you need to use the following in the JupyterHub Spawner Advanced options form mentioned above: pass a `julia` uenv and the view `jupyter`.
When Julia is first used within Jupyter, IJulia and one or more Julia kernels need to be installed. Type the following command in a shell within JupyterHub to install IJulia, the default Julia kernel and, on systems with Nvidia GPUs, a Julia kernel running under Nvidia Nsight Systems:
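```bash
install_ijulia
```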
You can install additional custom Julia kernels by typing the following in a shell:
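One possible form of such a command (a sketch that calls IJulia's `installkernel` function directly; the exact invocation provided on the system may differ) is:

```bash
julia -e 'using IJulia; installkernel(<args>)'
```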
- Type `? installkernel` in the Julia REPL to learn about valid `<args>`
First time use of Julia
If you are using Julia for the very first time, executing `install_ijulia` will automatically first trigger the installation of `juliaup` and the latest `julia` version (it is also triggered if you execute `juliaup` or `julia`).
Parallel computing¶
MPI in the notebook via IPyParallel and MPI4Py¶
MPI for Python provides bindings of the Message Passing Interface (MPI) standard for Python, allowing any Python program to exploit multiple processors.
MPI can be made available on Jupyter notebooks through IPyParallel. This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: a set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them.
We provide the Python package `ipcmagic` to make the management of IPyParallel clusters easier. `ipcmagic` can be installed by the user with `pip`. The engines and another server that moderates the cluster, called the controller, can be started and stopped with the magic commands `%ipcluster start -n <num-engines>` and `%ipcluster stop`, respectively. Before running these commands, the Python package `ipcmagic` must be imported, for example:
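```python
import ipcmagic           # makes the %ipcluster magic available
%ipcluster start -n 2     # example: start the controller and two engines
```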
Information about the command can be obtained with `%ipcluster --help`.
In order to execute MPI code in JupyterLab, it is necessary to indicate that the cells have to be run on the IPyParallel engines. This is done by adding the IPyParallel magic command `%%px` to the first line of each cell.
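As an illustration, a minimal cell executed on all engines (assuming `mpi4py` is available in the kernel) could look like this:

```python
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
# Each engine prints its own MPI rank and the total number of engines
print(f"Hello from rank {comm.Get_rank()} of {comm.Get_size()}")
```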
There are two important points to keep in mind when using IPyParallel. The first is that code executed on the IPyParallel engines has no effect on non-`%%px` cells. For instance, a variable created in a `%%px` cell will not exist in a non-`%%px` cell. The opposite is also true: a variable created in a regular cell will be unknown to the IPyParallel engines. The second is that the IPyParallel engines are shared by all of the user's notebooks. This means that variables created in a `%%px` cell of one notebook can be accessed or modified by a different notebook.
The magic command `%autopx` can be used to make all the cells of the notebook `%%px` cells. `%autopx` acts like a switch: running it once activates `%%px` mode, and running it again deactivates it. If `%autopx` is used, there are no regular cells and all the code will be run on the IPyParallel engines.
Examples of notebooks with `ipcmagic` can be found here.
Distributed training and inference for ML¶
While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.
A popular approach to running multi-GPU ML workloads is with `accelerate` and `torchrun`, as demonstrated in the tutorials. In particular, the `accelerate launch` script in the LLM fine-tuning tutorial can be directly carried over to a Jupyter cell with a `%%bash` header (so that its contents are interpreted by bash). For `torchrun`, one can adapt the command from the multi-node nanotron tutorial to run on a single GH200 node using the following line in a Jupyter cell.
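A sketch of such a line (assuming nanotron's `run_train.py` training script from the tutorial; the exact arguments depend on your configuration):

```bash
!torchrun --standalone --nproc_per_node=4 run_train.py --config-file <config>.yaml
```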
torchrun with virtual environments
When using a virtual environment on top of a base image with PyTorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and the packages of the virtual environment will not be available. If you are not using a virtual environment, such as with a self-contained PyTorch container, `torchrun` is equivalent to `python -m torch.distributed.run`.
Notebook structure
In none of these scenarios are any significant memory allocations or background computations performed in the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.
Alternatively to using these launchers, it is also possible to use Slurm to obtain more control over resource mappings, e.g. by launching an overlapping Slurm step onto the same node used by the Jupyter process. An example with the container engine looks like this:
```bash
!srun --overlap -ul --environment /path/to/edf.toml \
    --container-workdir $PWD -n 4 bash -c "\
    MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \
    MASTER_PORT=29500 \
    RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NPROCS \
    python train.py ..."
```
where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options.
Concurrent usage of resources
Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources, such as spawned child processes or allocated memory, despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using them. Therefore, it is good practice to keep only one such notebook running that occupies the full node, and to restart its kernel once the notebook has completed. If in doubt, system monitoring with `htop` and nvdashboard can be helpful for debugging.
Multi-GPU training from a shared Jupyter process
Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the PyTorch docs). However, if you need to do so, e.g. to reproduce existing results, it is possible with utilities like `accelerate`'s `notebook_launcher` or `transformers`' `Trainer` class. When using these in containers, you will currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is, have something like the following in a cell at the top of the notebook:
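```python
import os

# One possible way to clear the launcher-related variables inherited from the
# Jupyter process so that notebook_launcher/Trainer can set up distributed
# training themselves
os.environ.pop("RANK", None)
os.environ.pop("LOCAL_RANK", None)
```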