# Tutorial

## Login, SLURM, and modules

### Login

You should be able to open an SSH session on the login node of Bora with:

```bash
ssh username@bora.units.it
```

where `username` is the UniTS username you use to access the university services, with the same credentials. You can run this in any terminal on a Linux/macOS client, or in Windows Subsystem for Linux (WSL) on a Windows client, which you should set up as described in the User guide page.

#### Setting up SSH Public Key Authentication

We can log in using a pair of cryptographic keys instead of a password. If a copy of your public key is on the server, in your user's `$HOME/.ssh/authorized_keys`, you can prove you have the corresponding private key and start an SSH session.

If you haven't used this form of authentication before, or if you want to generate a new key pair just for Bora, run this:

```bash
ssh-keygen -t ed25519
```

You will be given some prompts - the defaults are fine, but you may want to use a different filename if you have multiple key pairs. Providing a passphrase is optional, but it prevents a third party from using your private key (should they obtain it). At the end of the procedure you should see something like:

```plaintext
Your identification has been saved in /home/you/.ssh/id_ed25519.
Your public key has been saved in /home/you/.ssh/id_ed25519.pub.
```

Then copy the public key to Bora. `ssh-copy-id` is a helper tool provided by the `openssh-client` package in most Linux distributions (and in WSL):

```bash
ssh-copy-id username@bora.units.it
```

If you are using more than one key pair, you should specify which public key should be copied over:

```bash
ssh-copy-id username@bora.units.it -i /home/you/.ssh/id_ed25519.pub
```
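Optionally, you can add a host entry to your SSH client configuration so that a short alias, your username, and the right key are picked up automatically. A minimal sketch - the `bora` alias, the `username`, and the key path are placeholders to adapt to your setup:

```bash
# append a "bora" host alias to your SSH client configuration
cat >> ~/.ssh/config <<'EOF'
Host bora
    HostName bora.units.it
    User username
    IdentityFile ~/.ssh/id_ed25519
EOF

# from now on this is enough:
ssh bora
```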
### Storage

Three storage locations are available to users:

- user homes: `/u/$USER`, where your personal home folders are located. They are the (default) working directory after logging in;
- slow: `/data/slow/$USER`, meant for normal I/O of jobs. It is _relatively_ slow with respect to the fast storage (the devices are HDDs), but more space is available;
- fast: `/data/fast/$USER`, meant for fast I/O of jobs. Space is scarcer than on the slow storage, but faster (the devices are SSDs).

#### Disk quota

Disk quotas are a way to account for the storage used by each user and group and to enforce limits on it. Quotas have been set up on Bora, with the following default limits:

- user homes (`/u`): 10 GB
- slow (`/data/slow`): 60 GB
- fast (`/data/fast`): 30 GB

Quota can also enforce a maximum number of files (more precisely: of inodes) - however, at the moment no limit is imposed on this. It is accounted for anyway, and reported in the quota reports.

You can check your usage and limits with the `quota` command:

```plaintext
$ quota
Disk quotas for user exactlab-vdi (uid 592488047):
      Filesystem  blocks    quota    limit   grace   files   quota   limit   grace
   10.141.3.9:/u 1860436 10485760 10485760           10799       0       0
10.141.3.9:/data  102400 62914560 62914560               4       0       0
10.141.3.8:/data       0 31457280 31457280               3       0       0
```

A nicer formatting can be obtained with the right options: human-readable units (instead of raw 1K blocks) and the names of the mount points instead of the raw addresses:

```plaintext
$ quota -s --show-mntpoint --hide-device
Disk quotas for user exactlab-vdi (uid 592488047):
 Filesystem   space   quota   limit   grace   files   quota   limit   grace
         /u   1817M  10240M  10240M           10799       0       0
 /data/slow    100M  61440M  61440M               4       0       0
 /data/fast      0K  30720M  30720M               3       0       0
```

A _quota reminder_ is printed in the message-of-the-day shown before your prompt each time you log in. Should you ever run out of quota, a clear warning is printed. If you have no files on a storage resource, its quota report for your user may still be empty.

### Submitting jobs to the SLURM workload manager

Running computing jobs on shared resources means submitting them to a queue. There, the order in which jobs start is controlled by the available resources and by a priority metric that depends on various factors. Regardless of priority, a smaller job is more likely to fit in the resources available at any given moment, while a larger one may end up waiting for the required resources to free up. Leaving the details of the scheduling strategy aside, this implies that when submitting a job we need to specify the resources it needs. Submitting a Slurm job therefore amounts to declaring a set of resources to be allocated, with command-line options or `#SBATCH` directives, prior to executing the instructions in the rest of the script.

The main commands to submit jobs are:

- `sbatch`, which is to be provided with a job script, for later execution (output is written to a file);
- `srun`, which instead runs interactively: it is _blocking_ for the current shell, unless sent to the background, and the standard output and error are redirected to the current terminal.

There are other subtle differences between the two, such as the capability to submit job arrays (`sbatch` only). `srun` can also be called within an existing Slurm job - in that case it won't spawn a new job allocation: it is commonly used this way to start multiple instances of an MPI program.

Let's start by launching a mock job that just prints the node it is running on:

```bash
srun hostname
```

We can do the same, asking to run multiple _tasks_ (one process each, in this example) of the `hostname` command (the `-l` option prepends the task number to the printout):

```bash
srun --ntasks 6 -l hostname
```

This will likely run on one node only:

```plaintext
0: bora-cpu01
1: bora-cpu01
4: bora-cpu01
2: bora-cpu01
3: bora-cpu01
5: bora-cpu01
```

We can ask for a given number of tasks per node - only one, for example:

```bash
srun --ntasks 6 --ntasks-per-node=1 -l hostname
```

```plaintext
0: bora-cpu01
1: bora-cpu02
2: bora-cpu03
3: bora-cpu04
4: bora-fat01
5: bora-gpu01
```

Unless otherwise specified - there is an option for that - one CPU per task is allocated. Therefore, since each node has 36 cores, more than one node will be allocated to satisfy the following: `srun -n 72 hostname`
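The option alluded to above is `--cpus-per-task`, which we will meet again in the OpenMP examples. A quick sketch of its effect - here each of the two tasks gets four CPUs, so eight cores are reserved in total:

```bash
# two tasks, four CPUs each: eight cores are allocated in total
srun --ntasks 2 --cpus-per-task 4 -l hostname
```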
Before moving on to something more interesting, let's see how such a simple job translates to a batch script instead of an `srun` call. Consider the following 💤:

```bash
srun --ntasks 2 bash -c 'sleep 60; date'
```

Not only does this take a while to finish: had this been a job asking for real resources, we might also have had to wait a few hours for our turn in the queue. We can turn it into a batch script. Command-line options become directives:

```bash
#!/bin/bash
#SBATCH --ntasks=2

sleep 60
date
```

If we paste this into a file, we can submit the job with `sbatch zzz.sh`. Now the output is not printed to the current shell: it is written to a logfile named `slurm-$SLURM_JOB_ID.out` instead, in the current working directory. This file collects all the `stdout`; by default `stderr` (if any) is written to the same file. This can be changed with the `--output` and `--error` options.

So why bother with `srun` used as such? One reason is that it is interactive. With the following we can obtain a shell on a compute node:

```bash
srun --pty bash -i
```

A more useful example: using 2 of the 4 shards (this anticipates _MIG instances_) of a physical GPU:

```bash
srun -n 1 --gres=gpu:a30_1g.6gb:2 --pty bash -i
```

By exiting that shell session (going back to the login node), the job ends.

```{info}
Jobs get cancelled if they reach their maximum duration (in terms of _walltime_, i.e. actual elapsed time), recording their exit state as `timeout`. In the examples above we never passed the [`--time`](https://slurm.schedmd.com/srun.html#OPT_time) option - this is **not** a good practice: we should always estimate the maximum duration of the job we are sending. This makes it easier to obtain a slot in the schedule early on and allows for more realistic estimates of the expected start times of queueing jobs.
```

### SLURM queues

We can inspect the queue state with `squeue`:

```plaintext
$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 53729      main opt_mpi_ exactlab PD  0:00     4 (Resources)
 53730      main opt_mpi_ exactlab PD  0:00     4 (Priority)
 53728      main opt_mpi_ exactlab  R  0:07     4 bora-cpu[01-04]
```

In this example it shows three jobs in the queue and their state: running (`R`) or pending (`PD`). In the latter case, a reason why the job has not started yet is provided: whether it is only waiting for _Resources_ to free up, or whether there are jobs higher up in the queue (_Priority_) that will start first.

Any job we own can be cancelled with [`scancel`](https://slurm.schedmd.com/scancel.html):

```plaintext
$ sbatch zzz.sh
Submitted batch job 54854
$ scancel 54854
$ cat slurm-54854.out
slurmstepd: error: *** JOB 54854 ON bora-cpu01 CANCELLED AT 2024-11-22T17:54:37 ***
```

### Environment modules

Modules allow the dynamic modification of a user's environment. They make it possible to run jobs with complex, reproducible sets of applications, libraries, compilers, and their dependency trees.

See all the available modules with `module avail`, and filter down the list with, for example, `module avail python`.
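A few more `module` subcommands are handy in day-to-day use. A short sketch of a typical workflow (the `openmpi` module is one we use later in this tutorial):

```bash
module avail openmpi    # search the available modules for a name
module load openmpi     # load one into the current environment
module list             # show what is currently loaded
module unload openmpi   # unload a single module
module purge            # unload everything
```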
In the following example, which for the sake of this tutorial we run as a batch job, note the difference between using the Python interpreter provided by the OS and the one loaded via an environment module (we are using the `python3` binary in this example, since no `python` is provided by the OS).

```bash
#!/bin/bash -e
#SBATCH --job-name=ModuleExample
# substitutions in names of output files
#   %x : job name
#   %j : job ID
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --time=00:10:00

# unload all currently loaded modules
module purge

echo "from the OS:"
which python3
python3 -V

echo "from the module:"
module load python
which python3
python3 -V
```

```plaintext
$ sbatch ModuleExample.sh
Submitted batch job 54855
$ cat ModuleExample.54855.out
/usr/bin/python3
Python 3.6.8
Loading python/3.10.13-gcc-13.2.0-soauhxd
  Loading requirement: bzip2/1.0.8-gcc-13.2.0-um4trw3 libmd/1.0.4-gcc-13.2.0-72iu3hc
    libbsd/0.11.7-gcc-13.2.0-vr7frns expat/2.5.0-gcc-13.2.0-4mhdfrd
    ncurses/6.4-gcc-13.2.0-wlumdp4 readline/8.2-gcc-13.2.0-cxp5mht
    gdbm/1.23-gcc-13.2.0-4n5clot libiconv/1.17-gcc-13.2.0-rtrijyj
    xz/5.4.1-gcc-13.2.0-ictsdhi zlib-ng/2.1.4-gcc-13.2.0-zlvkm4z
    libxml2/2.10.3-gcc-13.2.0-76f5f5u pigz/2.7-gcc-13.2.0-vh4n5e4
    zstd/1.5.5-gcc-13.2.0-qpyi3hv tar/1.34-gcc-13.2.0-4tbqy2j
    gettext/0.22.3-gcc-13.2.0-s6bsbbd libffi/3.4.4-gcc-13.2.0-uf2tysn
    libxcrypt/4.4.35-gcc-13.2.0-xtapixq openssl/3.1.3-gcc-13.2.0-oh6awo7
    sqlite/3.43.2-gcc-13.2.0-yqas6dx util-linux-uuid/2.38.1-gcc-13.2.0-pvrwuo6
/opt/spack/opt/spack/linux-rocky8-icelake/gcc-13.2.0/python-3.10.13-soauhxdtwsr4or6x3gqfyxrnqt2csq24/bin/python3
Python 3.10.13
```

### Allocating GPU partitions (MIG devices)

The GPU node `bora-gpu01` has 2 NVIDIA A30 GPUs. One can only be allocated entirely, while 4 MIG devices were defined on the other, making it possible to request only one shard (or more) of it and leave the others available. Let's see what is reported by `nvidia-smi` (a CLI client to the NVIDIA System Management Interface) in two different cases.

Allocate an entire GPU:

```bash
srun -n 1 --gres=gpu:1 nvidia-smi
```

Allocate only 2 of the available MIG devices:

```bash
srun -n 1 --gres=gpu:a30_1g.6gb:2 nvidia-smi
```

In a real batch job this would require passing the `#SBATCH --gres=...` directive. The arguments that can be passed can be found by inspecting the `Gres` entry of the involved node, which reports what devices have been defined on it:

```plaintext
$ scontrol show node bora-gpu01 | grep Gres
   Gres=gpu:a30:1(S:0),gpu:a30_1g.6gb:4(S:1)
```

## Examples

### Gather your own copy of the examples

Find a set of example programs and scripts in `/opt/training/welcome`. Ordinary users cannot write there, but you have read permissions and can copy those files to your home: `/u/$USER` (`~` is a shell alias for that).

```bash
cp -r /opt/training/welcome ~
cd welcome
ls
```

We put the examples involved in this hands-on tutorial in a subdirectory: `hands-on_2024-11-28`. For the most part they build upon the examples in the main directory, so there is a degree of overlap with the topics covered.

### A sequential job

Example job: `job_sequential.sh`. Compile `hello` if needed with `make hello`. This trivial example launches `hello` (standing in for any program with no parallel capabilities) in a SLURM job: `sbatch job_sequential.sh`

```bash
#!/bin/bash
#SBATCH --job-name=sequential-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

./hello
```
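After submitting, you can watch the job in the queue and read its output once it has run. A quick sketch (the output file name follows the `%x.%j` pattern set in the directives above; the job ID will differ on your run):

```bash
sbatch job_sequential.sh
squeue -u $USER                # watch the job while it is pending or running
ls sequential-example.*.out    # the output file is named <job name>.<job ID>.out
cat sequential-example.*.out
```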
#### Embarrassingly parallel sequential jobs

Example job: `job_embparallel.sh`. This calls `hello` through `srun`, resulting in `n` instances (processes), one being launched in each _task_ independently. If `hello` were MPI-capable instead, it would use the MPI library (`openmpi` module) - we will see this further below.

```bash
#!/bin/bash
#SBATCH --job-name=embpar-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 8
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

srun bash -c '{ ./hello; echo "${SLURM_PROCID} done"; }'

echo "All done."
```

### Parallel job: MPI

Compile this example with `make hello_mpi` - to do so, you need to `module load openmpi` beforehand, since it uses the `mpicc` compiler; the module is also needed when running the job. Note that its example job (`job_mpi.sh`) asks for more than one task - the workload can thus be distributed over more than one process.

```bash
#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --output %x.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

echo "Running on: $SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"

module load openmpi

mpirun -n $SLURM_NTASKS ./hello_mpi

sleep 10
```

Each instance of `hello_mpi` reports the rank it has been assigned and the total number of ranks available. After that, all the rank values are summed (_reduced_) over all the ranks and the master rank prints out the result.

### Parallel job: OpenMP

We move on to a `hello` version with multithreading capabilities, using `OpenMP`. Compile it with `make hello_openmp` (and check in the makefile what flags are needed to do so). Looking at the source (`hello_openmp.c`), note the mock long-running function, which sleeps for 5 seconds before reporting the thread it is running on:

```C
#include <unistd.h>
#include <stdio.h>
#include <omp.h>

void long_process() {
    sleep(5);
    fprintf(
        stderr,
        "Hello, world! I am thread %d out of %d\n",
        omp_get_thread_num(),
        omp_get_num_threads()
    );
}

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i){
        long_process();
    }
    return 0;
}
```

`long_process` is called in a parallel for loop, which spawns a group of threads and divides the loop iterations between them. There are only 4 elements to iterate on: see the effect of varying the number of threads available to the job. This example's job is `job_openmp.sh`, which by default allocates 4 `cpus-per-task`.
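A minimal sketch of a job along those lines is shown below; the explicit `OMP_NUM_THREADS` export is our assumption about how the thread count is tied to the allocation, not necessarily what the provided `job_openmp.sh` does:

```bash
#!/bin/bash
#SBATCH --job-name=openmp-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=4

# give OpenMP as many threads as CPUs allocated to this task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

./hello_openmp
```

Command-line options override the corresponding `#SBATCH` directives, so (assuming the script wires `OMP_NUM_THREADS` to the allocation as in the sketch) `sbatch --cpus-per-task=2 job_openmp.sh` is a quick way to compare runs with different thread counts.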
#### OpenMP in Python

Some Python libraries have multithreading capabilities (e.g. some of the functions provided in [`numpy.linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html), which in turn rely on BLAS and LAPACK). The next example, using `numpy.dot`, shows how the number of threads is controlled via an environment variable.

This simple script (find it in `hands-on_2024-11-28/np_omp_dotproduct/np_dotproduct.py`), provided with an integer argument `n`, computes the dot product of two `n × n` arrays with randomly generated elements and reports how long the computation took.

```python
#!/usr/bin/env python
"""
provided with argument 'n' (int), create n-by-n random arrays
and compute their dot product
"""
import sys
from os import environ
from time import time
from numpy.random import rand

if "OMP_NUM_THREADS" not in environ:
    print("OMP_NUM_THREADS not set, defaulting to 1")
    environ["OMP_NUM_THREADS"] = "1"
print("OMP_NUM_THREADS: {}".format(environ["OMP_NUM_THREADS"]))

if len(sys.argv) < 2:
    raise ValueError("No argument was provided")

start = time()

n = int(sys.argv[1])
print(f"Shape: {n} by {n}")

# create two n-by-n random arrays
data1 = rand(n, n)
data2 = rand(n, n)

# compute the dot product and report the duration
result = data1.dot(data2)
duration = time() - start
print(f'Duration: {duration:.3f} seconds')
```

A job definition is available in that directory: `job_np_dotproduct.sh`. It expects an argument, `n`, which is passed on to the Python script - for example:

```bash
sbatch job_np_dotproduct.sh 10000
```

In addition to that, we can pass the `cpus-per-task` option:

```bash
sbatch --cpus-per-task=12 job_np_dotproduct.sh 10000
```

This parallel job not only asks for 12 CPUs to be allocated (for one task/process), but also correctly exports the `OMP_NUM_THREADS` environment variable, setting it equal to the number of requested CPUs. See the effect of changing `n` and `cpus-per-task`.

For the sake of this example we did not pass another resource directive: `--mem`, which asks for the amount of memory to be allocated (in megabytes, by default). In this way a default value will be requested (see it with `scontrol show config | grep DefMem`), but you may soon run into problems with large arrays - again, relying on defaults is not a good practice. `--mem=0` is a reserved case: it requests that the job be given access to all the memory on each node it runs on.

The following command may be helpful in diagnosing the environment (e.g. "which implementation of BLAS am I using?", "how many threads is numpy using?"). For example:

```bash
srun -n 4 \
  bash -lc \
  "{ module load py-threadpoolctl py-numpy; OMP_NUM_THREADS=4 python -m threadpoolctl -i numpy; }"
```

This relies on the `threadpoolctl` and `numpy` Python packages: both need to be available, either via loaded environment modules (as above) or in the current environment (e.g. a virtualenv or conda environment). The following is a useful reference on [switching the BLAS implementation](https://conda-forge.org/docs/maintainer/knowledge_base/#switching-blas-implementation) in a conda environment.
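To tie the pieces of this last example together, a job script along the lines of `job_np_dotproduct.sh` could look like the sketch below. This is a reconstruction based on the behaviour described above (forwarding the argument and exporting `OMP_NUM_THREADS` from the allocation), not the actual file shipped with the examples; in particular, the choice of the `py-numpy` module to provide numpy (and a matching interpreter) is an assumption:

```bash
#!/bin/bash
#SBATCH --job-name=np-dotproduct
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1

# match the OpenMP/BLAS thread count to the allocated CPUs
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

module purge
module load py-numpy   # provides numpy (and a matching python) - assumed module name

# forward the first argument ("n") to the Python script
python np_dotproduct.py "$1"
```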