Tutorial

Login, SLURM, and modules

Login

You should be able to open an SSH session on the login node of bora with:

ssh username@bora.units.it

Where username is your UniTS username, with the same credentials you use to access the other university services.

You can run this in any terminal on a Linux/macOS client, or in Windows Subsystem for Linux (WSL) on a Windows client, which you should set up as described in the User guide page.

Setting up SSH Public Key Authentication

We can log in using a pair of cryptographic keys instead of a password. If there is a copy of your public key on the server, in $HOME/.ssh/authorized_keys under your user, you can prove you have the corresponding private key and start an SSH session.

If you haven’t used this form of authentication before, or if you want to generate a new key pair just for Bora, run this:

ssh-keygen -t ed25519 

You will be given some prompts - the defaults are fine, but you may want to use a different filename if you have multiple keypairs. Providing a passphrase is optional, but prevents a third party from using your private key (should they obtain it). At the end of the procedure you should see something like:

Your identification has been saved in /home/you/.ssh/id_ed25519.
Your public key has been saved in /home/you/.ssh/id_ed25519.pub.

Then copy the public key to Bora. ssh-copy-id is a helper tool provided by the openssh-client package in most Linux distributions (and in WSL):

ssh-copy-id username@bora.units.it

If you are using more than one keypair, you should specify which public key should be copied over:

ssh-copy-id username@bora.units.it -i /home/you/.ssh/id_ed25519.pub
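
Optionally, you can add an entry to your SSH client configuration so that a short alias is enough to connect. This is a minimal sketch, assuming the key file name used above (adjust it to your own keypair):

# ~/.ssh/config
Host bora
    HostName bora.units.it
    User username
    IdentityFile ~/.ssh/id_ed25519

After that, ssh bora is enough to log in.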

Storage

Three storage locations are available to users:

  • user homes: /u/$USER, where your personal home folder is located. It is the default working directory after logging in;

  • slow: /data/slow/$USER, meant for normal I/O of jobs. It is relatively slow compared to the fast storage (the devices are HDDs), but more space is available;

  • fast: /data/fast/$USER, meant for fast I/O of jobs. Space is scarcer than on the slow storage, but faster (the devices are SSDs). A short usage sketch follows this list.
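
As a sketch of how these locations are typically used (the directory and file names are hypothetical), input data for a job can be staged on the fast storage and the results copied back afterwards:

mkdir -p /data/fast/$USER/myrun          # hypothetical working directory for a job
cp -r ~/inputs /data/fast/$USER/myrun    # stage input data before submitting
cp -r /data/fast/$USER/myrun/results ~   # copy results back to your home afterwards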

Disk quota

Disk quotas are a way to account for the storage used by each user and group, and to enforce limits on it. Quotas have been set up on Bora, with the following default limits:

  • user homes (/u): 10 GB

  • slow: (/data/slow): 60 GB

  • fast: (/data/fast): 30 GB

Quotas can also enforce a maximum number of files (more precisely: of inodes); however, at the moment no such limit is imposed. The file count is accounted for anyway, and reported in the quota output.

You can check your usage and limits with the quota command:

$ quota
Disk quotas for user exactlab-vdi (uid 592488047):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
  10.141.3.9:/u 1860436  10485760 10485760           10799       0       0
10.141.3.9:/data
                 102400  62914560 62914560               4       0       0
10.141.3.8:/data
                      0  31457280 31457280               3       0       0

Nicer formatting can be obtained with the right options: human-readable units and mount point names instead of raw device addresses:

$ quota -s --show-mntpoint --hide-device
Disk quotas for user exactlab-vdi (uid 592488047):
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
             /u   1817M  10240M  10240M           10799       0       0
     /data/slow    100M  61440M  61440M               4       0       0
     /data/fast      0K  30720M  30720M               3       0       0

A quota reminder is printed in the message-of-the-day shown before your prompt each time you log in. Should you ever run out of quota, a clear warning is printed there. If you have no files on a storage resource, its quota report for your user may still be empty.

Submitting jobs to the SLURM workload manager

Running computing jobs on shared resources means submitting them to a queue. There, the order in which jobs start is controlled by the available resources and by a priority metric that depends on various factors. Regardless of priority, a smaller job is more likely to fit in the resources available at any given moment, while a larger one may end up waiting for the required resources to free up.

Leaving the details of the scheduler strategy aside, this implies that when submitting a job we need to specify the resources it requires. Therefore, submitting a Slurm job amounts to declaring a set of resources to be allocated, with command-line options or #SBATCH directives, prior to executing the instructions in the rest of the script.

The main commands to submit jobs are:

  • sbatch, which is to be provided with a job script, for later execution (output is written to a file)

  • srun, which instead runs interactively: it blocks the current shell (unless sent to the background), and standard output and error are redirected to the current terminal

There are other subtle differences between the two, such as the capability to submit job arrays (sbatch only). srun can also be called within an existing Slurm job - in that case it won’t spawn a new job allocation; it is commonly used this way to start multiple instances of an MPI program.

Let’s start by launching a mock job that just prints the node it is running on:

srun hostname

We can do the same, asking to run multiple tasks (each one a process, in this example) of the hostname command (the -l option prepends the task number to the printout):

srun --ntasks 6 -l hostname

This will likely run on one node only:

0: bora-cpu01
1: bora-cpu01
4: bora-cpu01
2: bora-cpu01
3: bora-cpu01
5: bora-cpu01

We can ask for a given number of tasks per node - only one, for example:

srun --ntasks 6 --ntasks-per-node=1 -l hostname
0: bora-cpu01
1: bora-cpu02
2: bora-cpu03
3: bora-cpu04
4: bora-fat01
5: bora-gpu01

Unless otherwise specified - there is an option for that - one CPU per task is allocated. Therefore, since each node has 36 cores, more than one node will be allocated to satisfy the following: srun -n 72 hostname
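
As a quick check (the exact node names depend on the cluster state when you run it), you can count how many of the 72 tasks landed on each node:

srun -n 72 hostname | sort | uniq -c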

Before moving on to something more interesting, let’s see how such a simple job translates to a batch script, instead of srun.

Consider the following 💤

srun --ntasks 2 bash -c 'sleep 60; date'

Not only does this take a while to finish, but we might also have to wait a few hours for our turn in the queue, had this been a job asking for real resources. We can turn it into a batch script. Command line options become directives:

#!/bin/bash
#SBATCH --ntasks=2
sleep 60
date

If we paste this into a file, we can submit the job with sbatch zzz.sh. Now the output is not printed to the current shell: it is written to a logfile named slurm-$SLURM_JOB_ID.out in the current working directory. By default this file collects stdout, and stderr (if any) is written to the same file. This can be changed with the --output and --error options.
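
For example, a variant of zzz.sh that writes stdout and stderr to separate, custom-named files could look like this (the file names here are just an illustration):

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --output=zzz.%j.out   # stdout; %j expands to the job ID
#SBATCH --error=zzz.%j.err    # stderr goes to its own file
sleep 60
date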

So why bother with srun at all? One reason is that it is interactive. With the following we can obtain a shell on a compute node:

srun --pty bash -i

A more useful example: using 2 of the 4 shards of a physical GPU (this anticipates the MIG instances described further below):

srun -n 1 --gres=gpu:a30_1g.6gb:2 --pty bash -i

By exiting that shell session (going back to the login node), the job ends.

SLURM queues

We can inquire the queue state with squeue:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
53729      main opt_mpi_ exactlab PD       0:00      4 (Resources)
53730      main opt_mpi_ exactlab PD       0:00      4 (Priority)
53728      main opt_mpi_ exactlab  R       0:07      4 bora-cpu[01-04]

In this example it shows three jobs in the queue and their state: running (R) or pending (PD). In the latter case, the reason why the job has not started yet is provided: either it is simply waiting for resources to free up, or there are jobs higher up in the queue (priority) which will start first.
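
On a busy cluster the full list can be long; you can restrict it to your own jobs:

squeue -u $USER   # only jobs belonging to your user
squeue --me       # equivalent shorthand, available on recent Slurm versions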

Any job we own can be cancelled with scancel <job_id>:

$ sbatch zzz.sh
Submitted batch job 54854
$ scancel 54854
$ cat slurm-54854.out 
slurmstepd: error: *** JOB 54854 ON bora-cpu01 CANCELLED AT 2024-11-22T17:54:37 ***

Environment modules

Modules allow the dynamic modification of a user’s environment. They make it possible to run jobs with complex, reproducible sets of applications, libraries, compilers, and their dependency trees.

See all the available modules with module avail, and filter down the list with, for example, module avail python.
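
A few other module subcommands are useful day to day:

module avail python   # filter the list of available modules
module list           # show the modules currently loaded in this shell
module load python    # load a module (and its requirements)
module unload python  # unload it
module purge          # unload all loaded modules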

In the following example, which for the sake of this tutorial we run as a batch job, note the difference between using the Python interpreter provided by the OS and the one loaded via an environment module (we use the python3 binary in this example, since the OS does not provide a python command).

#!/bin/bash -e

#SBATCH --job-name=ModuleExample
# substitutions in names of output files
#  %x : job name
#  %j : job ID
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --time=00:10:00

# unload all currently loaded modules
module purge

echo "from the OS:"
which python3
python3 -V

echo "from the module:"
module load python
which python3
python3 -V

$ sbatch ModuleExample.sh
Submitted batch job 54855
$ cat ModuleExample.54855.out
/usr/bin/python3
Python 3.6.8
Loading python/3.10.13-gcc-13.2.0-soauhxd
  Loading requirement: bzip2/1.0.8-gcc-13.2.0-um4trw3
    libmd/1.0.4-gcc-13.2.0-72iu3hc libbsd/0.11.7-gcc-13.2.0-vr7frns
    expat/2.5.0-gcc-13.2.0-4mhdfrd ncurses/6.4-gcc-13.2.0-wlumdp4
    readline/8.2-gcc-13.2.0-cxp5mht gdbm/1.23-gcc-13.2.0-4n5clot
    libiconv/1.17-gcc-13.2.0-rtrijyj xz/5.4.1-gcc-13.2.0-ictsdhi
    zlib-ng/2.1.4-gcc-13.2.0-zlvkm4z libxml2/2.10.3-gcc-13.2.0-76f5f5u
    pigz/2.7-gcc-13.2.0-vh4n5e4 zstd/1.5.5-gcc-13.2.0-qpyi3hv
    tar/1.34-gcc-13.2.0-4tbqy2j gettext/0.22.3-gcc-13.2.0-s6bsbbd
    libffi/3.4.4-gcc-13.2.0-uf2tysn libxcrypt/4.4.35-gcc-13.2.0-xtapixq
    openssl/3.1.3-gcc-13.2.0-oh6awo7 sqlite/3.43.2-gcc-13.2.0-yqas6dx
    util-linux-uuid/2.38.1-gcc-13.2.0-pvrwuo6
/opt/spack/opt/spack/linux-rocky8-icelake/gcc-13.2.0/python-3.10.13-soauhxdtwsr4or6x3gqfyxrnqt2csq24/bin/python3
Python 3.10.13

Allocating GPU partitions (MIG devices)

The GPU node bora-gpu01 has 2 NVIDIA A30 GPUs. One can only be allocated in its entirety, while 4 MIG devices were defined on the other, making it possible to request just one shard (or more) of it, leaving the others available. Let’s see what is reported by nvidia-smi (a CLI client to the NVIDIA System Management Interface) in two different cases:

Allocate an entire GPU:

srun -n 1 --gres=gpu:1 nvidia-smi

Allocate only 2 of the available MIG devices:

srun -n 1 --gres=gpu:a30_1g.6gb:2 nvidia-smi

In a real batch job this would require passing the #SBATCH --gres=... directive (a sketch follows the listing below). The arguments that can be passed can be found by inspecting the Gres entry of the node involved, which reports what devices have been defined on it:

$ scontrol show node bora-gpu01 | grep Gres
   Gres=gpu:a30:1(S:0),gpu:a30_1g.6gb:4(S:1)
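
As mentioned above, a batch job carries the same request as a directive. A minimal sketch, reusing the MIG gres string reported for bora-gpu01 (job name and output file are illustrative):

#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --output=%x.%j.out
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a30_1g.6gb:2
#SBATCH --time=00:10:00

nvidia-smi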

Examples

Gather your own copy of the examples

Find a set of example programs and scripts in /opt/training/welcome. Ordinary users cannot write there, but you have read permission and can copy those files to your home: /u/$USER (~ is a shell shorthand for it).

cp -r /opt/training/welcome ~
cd welcome
ls

We put the examples used in this hands-on tutorial in a subdirectory: hands-on_2024-11-28. For the most part they build upon the examples in the main directory, so there is a degree of overlap in the topics covered.

A sequential job

Example job: job_sequential.sh. Compile hello if needed with make hello. This trivial example launches hello (standing in for any program with no parallel capabilities) in a SLURM job: sbatch job_sequential.sh

#!/bin/bash
#SBATCH --job-name=sequential-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

./hello

Embarrassingly parallel sequential jobs

Example job: job_embparallel.sh. This calls hello through srun, resulting in n instances (processes), one launched independently in each task. If hello were MPI capable instead, it would use the MPI library (openmpi module) - we will see this further below.

#!/bin/bash
#SBATCH --job-name=embpar-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 8
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

srun bash -c '{ ./hello; echo "${SLURM_PROCID} done"; }'

echo "All done."

Parallel job: MPI

Compile this example with make hello_mpi - in order to do so, you need to module load openmpi beforehand, since it uses the mpicc compiler wrapper; the module is also needed when the program runs.
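
In practice, on the login node (assuming the makefile target matches the binary name):

module load openmpi   # provides the mpicc compiler wrapper
make hello_mpi        # build the MPI example
sbatch job_mpi.sh     # submit the example job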

Note that its example job (job_mpi.sh) asks for more than one task - thus the workload can be distributed over more than one process.

#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --output %x.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

echo "Running on: $SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"

module load openmpi
mpirun -n $SLURM_NTASKS ./hello_mpi
sleep 10

Each instance of hello_mpi reports the rank it is assigned to, and the total number of ranks available. After that, all the rank values are summed (reduced) over all the ranks and the master rank prints out the result.

Parallel job: OpenMP

We move on to a hello version with multithreading capabilities, using OpenMP. Compile it with make hello_openmp (and check in the makefile what flags are needed to do so). Looking at the source (hello_openmp.c), see how there is a mock long-running function which sleeps for 5 seconds before reporting the thread it is running on:

#include <stdio.h>
#include <omp.h>
#include <unistd.h>

void long_process() {
  sleep(5);
  fprintf(
    stderr, 
    "Hello, world! I am thread %d out of %d\n", 
    omp_get_thread_num(), 
    omp_get_num_threads()
  );
}

int main() {
  #pragma omp parallel for
  for (int i = 0; i < 4; ++i) {
    long_process();
  }
  return 0;
}

long_process is called in a parallel for loop, which spawns a group of threads and divides the loop iterations between them. There are only 4 iterations: observe the effect of varying the number of threads available to the job.

This example’s job is job_openmp.sh, which by default allocates 4 cpus-per-task.
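
The actual script is in the examples directory; a minimal sketch of what such a job looks like, assuming it derives the thread count from the allocated CPUs:

#!/bin/bash
#SBATCH --job-name=openmp-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=4

# give OpenMP one thread per allocated CPU
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./hello_openmp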

OpenMP in Python

Some libraries in Python have multithreading capabilities (e.g. some of the functions provided in numpy.linalg, which in turn rely on BLAS and LAPACK). The next example, using numpy.dot, shows how the number of threads is controlled via an environment variable. This simple script (find it in hands-on_2024-11-28/np_omp_dotproduct/np_dotproduct.py), provided with an integer argument n, computes the dot product of two n × n arrays with randomly generated elements and reports how long the computation took.

#!/usr/bin/env python
"""
provided with argument 'n' (int),
create n-by-n random arrays
and compute their dot product
"""

import sys
from os import environ
from time import time

# set the fallback before importing numpy: some BLAS builds
# only read OMP_NUM_THREADS when the library is first loaded
if "OMP_NUM_THREADS" not in environ:
    print("OMP_NUM_THREADS not set, defaulting to 1")
    environ["OMP_NUM_THREADS"] = "1"

from numpy.random import rand

print("OMP_NUM_THREADS: {}".format(environ["OMP_NUM_THREADS"]))

if len(sys.argv) < 2:
    raise ValueError("No argument was provided")

start = time()
n = int(sys.argv[1])
print(f"Shape: {n} by {n}")

# create two n-by-n random arrays
data1 = rand(n, n)
data2 = rand(n, n)

# calculate and report duration
result = data1.dot(data2)
duration = time() - start
print(f'Duration: {duration:.3f} seconds')

A job definition is available in that directory: job_np_dotproduct.sh. It expects an argument to be provided: n, which is passed on to the Python script - for example:

sbatch job_np_dotproduct.sh 10000

In addition to that, we can pass the cpus-per-task option:

sbatch --cpus-per-task=12 job_np_dotproduct.sh 10000

This parallel job not only asks for 12 CPUs to be allocated (for one task/process), but also correctly exports the OMP_NUM_THREADS environment variable, setting it equal to the number of requested CPUs. See the effect of changing n and cpus-per-task. For the sake of this example we did not pass another resource directive: --mem, which asks for a given amount of memory to be allocated (in megabytes, unless a unit suffix is given). This way a default value is requested (see it with scontrol show config | grep DefMem), but you may soon run into problems with large arrays - again, relying on defaults is not good practice. --mem=0 is a reserved case: it requests access to all the memory on each node the job runs on.
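
For instance, to request the memory explicitly as well (the values here are only illustrative; --mem accepts K, M, G and T suffixes):

sbatch --cpus-per-task=12 --mem=8G job_np_dotproduct.sh 20000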

The following command may be helpful in diagnosing the environment (e.g. “which implementation of BLAS am I using?”, “how many threads is numpy using?”):

srun -n 4 \
  bash -lc \
  "{ module load py-threadpoolctl py-numpy;
    OMP_NUM_THREADS=4 python -m threadpoolctl -i numpy; }"

This relies on the threadpoolctl and numpy modules: both need to be available, either as loaded environment modules or in the current Python environment (e.g. via virtualenv or conda). A useful reference on the topic is Switching BLAS implementation in a conda environment.