# Tutorial

## Login, SLURM, and modules

### Login

You should be able to open an SSH session on the login node of Bora with:

```bash
ssh username@bora.units.it
```

where `username` is the UniTS username you use to access the university services, with the same credentials. You can run this in any terminal on a Linux/macOS client, or in Windows Subsystem for Linux (WSL) on a Windows client, which you should set up as described in the User guide page.

#### Setting up SSH Public Key Authentication

We can log in using a pair of cryptographic keys instead of a password. If a copy of your public key is on the server, in your user's `$HOME/.ssh/authorized_keys`, you can prove you have the corresponding private key and start an SSH session.

If you haven't used this form of authentication before, or if you want to generate a new key pair just for Bora, run this:

```bash
ssh-keygen -t ed25519
```

You will be given some prompts - the defaults are fine, but you may want to use a different filename if you have multiple key pairs. Providing a passphrase is optional, but it prevents a third party from using your private key (should they obtain it). At the end of the procedure you should see something like:

```plaintext
Your identification has been saved in /home/you/.ssh/id_ed25519.
Your public key has been saved in /home/you/.ssh/id_ed25519.pub.
```

Then copy the public key to Bora. `ssh-copy-id` is a helper tool provided by the `openssh-client` package in most Linux distributions (and in WSL):

```bash
ssh-copy-id username@bora.units.it
```

If you are using more than one key pair, you should specify which public key should be copied over:

```bash
ssh-copy-id username@bora.units.it -i /home/you/.ssh/id_ed25519.pub
```
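Optionally, you can add a host entry to your SSH client configuration so that a short alias, your username, and the right key are picked up automatically. A minimal sketch - the `bora` alias, the `username`, and the key path are placeholders to adapt to your setup:

```bash
# append a "bora" host alias to your SSH client configuration
cat >> ~/.ssh/config <<'EOF'
Host bora
    HostName bora.units.it
    User username
    IdentityFile ~/.ssh/id_ed25519
EOF

# from now on this is enough:
ssh bora
```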
### Storage

Three storage locations are available to users:

- user homes: `/u/$USER`, where your personal home folders are located. They are the (default) working directory after logging in;
- slow: `/data/slow/$USER`, meant for normal I/O of jobs. It is _relatively_ slow with respect to the fast storage (the devices are HDDs), but more space is available;
- fast: `/data/fast/$USER`, meant for fast I/O of jobs. Space is scarcer than on the slow storage, but faster (the devices are SSDs).

#### Disk quota

Disk quotas are a way to account for the storage used by each user and group and to enforce limits on it. Quotas have been set up on Bora, with the following default limits:

- user homes (`/u`): 10 GB
- slow (`/data/slow`): 60 GB
- fast (`/data/fast`): 30 GB

Quota can also enforce a maximum number of files (more precisely: of inodes) - however, at the moment no limit is imposed on this. It is accounted for anyway, and reported in the quota reports.

You can check your usage and limits with the `quota` command:

```plaintext
$ quota
Disk quotas for user exactlab-vdi (uid 592488047):
      Filesystem  blocks    quota    limit   grace   files   quota   limit   grace
   10.141.3.9:/u 1860436 10485760 10485760           10799       0       0
10.141.3.9:/data  102400 62914560 62914560               4       0       0
10.141.3.8:/data       0 31457280 31457280               3       0       0
```

A nicer formatting can be obtained with the right options: human-readable units (instead of raw 1K blocks) and the names of the mount points instead of the raw addresses:

```plaintext
$ quota -s --show-mntpoint --hide-device
Disk quotas for user exactlab-vdi (uid 592488047):
 Filesystem   space   quota   limit   grace   files   quota   limit   grace
         /u   1817M  10240M  10240M           10799       0       0
 /data/slow    100M  61440M  61440M               4       0       0
 /data/fast      0K  30720M  30720M               3       0       0
```

A _quota reminder_ is printed in the message-of-the-day shown before your prompt each time you log in. Should you ever run out of quota, a clear warning is printed. If you have no files on a storage resource, its quota report for your user may still be empty.

### Submitting jobs to the SLURM workload manager

Running computing jobs on shared resources means submitting them to a queue. There, the order in which jobs start is controlled by the available resources and by a priority metric that depends on various factors. Regardless of priority, a smaller job is more likely to fit in the resources available at any given moment, while a larger one may end up waiting for the required resources to free up. Leaving the details of the scheduling strategy aside, this implies that when submitting a job we need to specify the resources it needs. Submitting a Slurm job therefore amounts to declaring a set of resources to be allocated, with command-line options or `#SBATCH` directives, prior to executing the instructions in the rest of the script.

The main commands to submit jobs are:

- `sbatch`, which is to be provided with a job script, for later execution (output is written to a file);
- `srun`, which instead runs interactively: it is _blocking_ for the current shell, unless sent to the background, and the standard output and error are redirected to the current terminal.

There are other subtle differences between the two, such as the capability to submit job arrays (`sbatch` only). `srun` can also be called within an existing Slurm job - in that case it won't spawn a new job allocation: it is commonly used this way to start multiple instances of an MPI program.

Let's start by launching a mock job that just prints the node it is running on:

```bash
srun hostname
```

We can do the same, asking to run multiple _tasks_ (one process each, in this example) of the `hostname` command (the `-l` option prepends the task number to the printout):

```bash
srun --ntasks 6 -l hostname
```

This will likely run on one node only:

```plaintext
0: bora-cpu01
1: bora-cpu01
4: bora-cpu01
2: bora-cpu01
3: bora-cpu01
5: bora-cpu01
```

We can ask for a given number of tasks per node - only one, for example:

```bash
srun --ntasks 6 --ntasks-per-node=1 -l hostname
```

```plaintext
0: bora-cpu01
1: bora-cpu02
2: bora-cpu03
3: bora-cpu04
4: bora-fat01
5: bora-gpu01
```

Unless otherwise specified - there is an option for that - one CPU per task is allocated. Therefore, since each node has 36 cores, more than one node will be allocated to satisfy the following: `srun -n 72 hostname`
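The option alluded to above is `--cpus-per-task`, which we will meet again in the OpenMP examples. A quick sketch of its effect - here each of the two tasks gets four CPUs, so eight cores are reserved in total:

```bash
# two tasks, four CPUs each: eight cores are allocated in total
srun --ntasks 2 --cpus-per-task 4 -l hostname
```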
Before moving on to something more interesting, let's see how such a simple job translates to a batch script instead of an `srun` call. Consider the following 💤:

```bash
srun --ntasks 2 bash -c 'sleep 60; date'
```

Not only does this take a while to finish: had this been a job asking for real resources, we might also have had to wait a few hours for our turn in the queue. We can turn it into a batch script. Command-line options become directives:

```bash
#!/bin/bash
#SBATCH --ntasks=2

sleep 60
date
```

If we paste this into a file, we can submit the job with `sbatch zzz.sh`. Now the output is not printed to the current shell: it is written to a logfile named `slurm-$SLURM_JOB_ID.out` instead, in the current working directory. This file collects all the `stdout`; by default `stderr` (if any) is written to the same file. This can be changed with the `--output` and `--error` options.

So why bother with `srun` used as such? One reason is that it is interactive. With the following we can obtain a shell on a compute node:

```bash
srun --pty bash -i
```

A more useful example: using 2 of the 4 shards (this anticipates _MIG instances_) of a physical GPU:

```bash
srun -n 1 --gres=gpu:a30_1g.6gb:2 --pty bash -i
```

By exiting that shell session (going back to the login node), the job ends.

```{info}
Jobs get cancelled if they reach their maximum duration (in terms of _walltime_, i.e. actual elapsed time), recording their exit state as `timeout`. In the examples above we never passed the [`--time`](https://slurm.schedmd.com/srun.html#OPT_time) option - this is **not** a good practice: we should always estimate the maximum duration of the job we are sending. This makes it easier to obtain a slot in the schedule early on and allows for more realistic estimates of the expected start times of queueing jobs.
```

### SLURM queues

We can inspect the queue state with `squeue`:

```plaintext
$ squeue
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 53729      main opt_mpi_ exactlab PD  0:00     4 (Resources)
 53730      main opt_mpi_ exactlab PD  0:00     4 (Priority)
 53728      main opt_mpi_ exactlab  R  0:07     4 bora-cpu[01-04]
```

In this example it shows three jobs in the queue and their state: running (`R`) or pending (`PD`). In the latter case, a reason why the job has not started yet is provided: whether it is only waiting for _Resources_ to free up, or whether there are jobs higher up in the queue (_Priority_) that will start first.

Any job we own can be cancelled with [`scancel`](https://slurm.schedmd.com/scancel.html):

```plaintext
$ sbatch zzz.sh
Submitted batch job 54854
$ scancel 54854
$ cat slurm-54854.out
slurmstepd: error: *** JOB 54854 ON bora-cpu01 CANCELLED AT 2024-11-22T17:54:37 ***
```

### Environment modules

Modules allow the dynamic modification of a user's environment. They make it possible to run jobs with complex, reproducible sets of applications, libraries, compilers, and their dependency trees.

See all the available modules with `module avail`, and filter down the list with, for example, `module avail python`.
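A few more `module` subcommands are handy in day-to-day use. A short sketch of a typical workflow (the `openmpi` module is one we use later in this tutorial):

```bash
module avail openmpi    # search the available modules for a name
module load openmpi     # load one into the current environment
module list             # show what is currently loaded
module unload openmpi   # unload a single module
module purge            # unload everything
```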
In the following example, which for the sake of this tutorial we run as a batch job, note the difference between using the Python interpreter provided by the OS and the one loaded via an environment module (we are using the `python3` binary in this example, since no `python` is provided by the OS).

```bash
#!/bin/bash -e
#SBATCH --job-name=ModuleExample
# substitutions in names of output files
#   %x : job name
#   %j : job ID
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --time=00:10:00

# unload all currently loaded modules
module purge

echo "from the OS:"
which python3
python3 -V

echo "from the module:"
module load python
which python3
python3 -V
```

```plaintext
$ sbatch ModuleExample.sh
Submitted batch job 54855
$ cat ModuleExample.54855.out
/usr/bin/python3
Python 3.6.8
Loading python/3.10.13-gcc-13.2.0-soauhxd
  Loading requirement: bzip2/1.0.8-gcc-13.2.0-um4trw3 libmd/1.0.4-gcc-13.2.0-72iu3hc
    libbsd/0.11.7-gcc-13.2.0-vr7frns expat/2.5.0-gcc-13.2.0-4mhdfrd
    ncurses/6.4-gcc-13.2.0-wlumdp4 readline/8.2-gcc-13.2.0-cxp5mht
    gdbm/1.23-gcc-13.2.0-4n5clot libiconv/1.17-gcc-13.2.0-rtrijyj
    xz/5.4.1-gcc-13.2.0-ictsdhi zlib-ng/2.1.4-gcc-13.2.0-zlvkm4z
    libxml2/2.10.3-gcc-13.2.0-76f5f5u pigz/2.7-gcc-13.2.0-vh4n5e4
    zstd/1.5.5-gcc-13.2.0-qpyi3hv tar/1.34-gcc-13.2.0-4tbqy2j
    gettext/0.22.3-gcc-13.2.0-s6bsbbd libffi/3.4.4-gcc-13.2.0-uf2tysn
    libxcrypt/4.4.35-gcc-13.2.0-xtapixq openssl/3.1.3-gcc-13.2.0-oh6awo7
    sqlite/3.43.2-gcc-13.2.0-yqas6dx util-linux-uuid/2.38.1-gcc-13.2.0-pvrwuo6
/opt/spack/opt/spack/linux-rocky8-icelake/gcc-13.2.0/python-3.10.13-soauhxdtwsr4or6x3gqfyxrnqt2csq24/bin/python3
Python 3.10.13
```

### Allocating GPU partitions (MIG devices)

The GPU node `bora-gpu01` has 2 NVIDIA A30 GPUs. One can only be allocated entirely, while 4 MIG devices were defined on the other, making it possible to request only one shard (or more) of it and leave the others available. Let's see what is reported by `nvidia-smi` (a CLI client to the NVIDIA System Management Interface) in two different cases.

Allocate an entire GPU:

```bash
srun -n 1 --gres=gpu:1 nvidia-smi
```

Allocate only 2 of the available MIG devices:

```bash
srun -n 1 --gres=gpu:a30_1g.6gb:2 nvidia-smi
```

In a real batch job this would require passing the `#SBATCH --gres=...` directive. The arguments that can be passed can be found by inspecting the `Gres` entry of the involved node, which reports what devices have been defined on it:

```plaintext
$ scontrol show node bora-gpu01 | grep Gres
   Gres=gpu:a30:1(S:0),gpu:a30_1g.6gb:4(S:1)
```

## Examples

### Gather your own copy of the examples

Find a set of example programs and scripts in `/opt/training/welcome`. Ordinary users cannot write there, but you have read permissions and can copy those files to your home: `/u/$USER` (`~` is a shell alias for that).

```bash
cp -r /opt/training/welcome ~
cd welcome
ls
```

We put the examples involved in this hands-on tutorial in a subdirectory: `hands-on_2024-11-28`. For the most part they build upon the examples in the main directory, so there is a degree of overlap with the topics covered.

### A sequential job

Example job: `job_sequential.sh`. Compile `hello` if needed with `make hello`. This trivial example launches `hello` (standing in for any program with no parallel capabilities) in a SLURM job: `sbatch job_sequential.sh`

```bash
#!/bin/bash
#SBATCH --job-name=sequential-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

./hello
```
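After submitting, you can watch the job in the queue and read its output once it has run. A quick sketch (the output file name follows the `%x.%j` pattern set in the directives above; the job ID will differ on your run):

```bash
sbatch job_sequential.sh
squeue -u $USER                # watch the job while it is pending or running
ls sequential-example.*.out    # the output file is named <job name>.<job ID>.out
cat sequential-example.*.out
```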
#### Embarrassingly parallel sequential jobs

Example job: `job_embparallel.sh`. This calls `hello` through `srun`, resulting in `n` instances (processes), one being launched in each _task_ independently. If `hello` were MPI-capable instead, it would use the MPI library (`openmpi` module) - we will see this further below.

```bash
#!/bin/bash
#SBATCH --job-name=embpar-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 8
#SBATCH --cpus-per-task=1

echo "SLURM_NODELIST=$SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"
echo "SLURM_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK"

srun bash -c '{ ./hello; echo "${SLURM_PROCID} done"; }'

echo "All done."
```

### Parallel job: MPI

Compile this example with `make hello_mpi` - to do so, you need to `module load openmpi` beforehand, since it uses the `mpicc` compiler; the module is also needed when running the job. Note that its example job (`job_mpi.sh`) asks for more than one task - the workload can thus be distributed over more than one process.

```bash
#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --output %x.%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

echo "Running on: $SLURM_NODELIST"
echo "SLURM_NTASKS=$SLURM_NTASKS"

module load openmpi

mpirun -n $SLURM_NTASKS ./hello_mpi

sleep 10
```

Each instance of `hello_mpi` reports the rank it has been assigned and the total number of ranks available. After that, all the rank values are summed (_reduced_) over all the ranks and the master rank prints out the result.

### Parallel job: OpenMP

We move on to a `hello` version with multithreading capabilities, using `OpenMP`. Compile it with `make hello_openmp` (and check in the makefile what flags are needed to do so). Looking at the source (`hello_openmp.c`), note the mock long-running function, which sleeps for 5 seconds before reporting the thread it is running on:

```C
#include <unistd.h>
#include <stdio.h>
#include <omp.h>

void long_process() {
    sleep(5);
    fprintf(
        stderr,
        "Hello, world! I am thread %d out of %d\n",
        omp_get_thread_num(),
        omp_get_num_threads()
    );
}

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i){
        long_process();
    }
    return 0;
}
```

`long_process` is called in a parallel for loop, which spawns a group of threads and divides the loop iterations between them. There are only 4 elements to iterate on: see the effect of varying the number of threads available to the job. This example's job is `job_openmp.sh`, which by default allocates 4 `cpus-per-task`.
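A minimal sketch of a job along those lines is shown below; the explicit `OMP_NUM_THREADS` export is our assumption about how the thread count is tied to the allocation, not necessarily what the provided `job_openmp.sh` does:

```bash
#!/bin/bash
#SBATCH --job-name=openmp-example
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=4

# give OpenMP as many threads as CPUs allocated to this task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

./hello_openmp
```

Command-line options override the corresponding `#SBATCH` directives, so (assuming the script wires `OMP_NUM_THREADS` to the allocation as in the sketch) `sbatch --cpus-per-task=2 job_openmp.sh` is a quick way to compare runs with different thread counts.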
#### OpenMP in Python

Some Python libraries have multithreading capabilities (e.g. some of the functions provided in [`numpy.linalg`](https://numpy.org/doc/stable/reference/routines.linalg.html), which in turn rely on BLAS and LAPACK). The next example, using `numpy.dot`, shows how the number of threads is controlled via an environment variable.

This simple script (find it in `hands-on_2024-11-28/np_omp_dotproduct/np_dotproduct.py`), provided with an integer argument `n`, computes the dot product of two `n × n` arrays with randomly generated elements and reports how long the computation took.

```python
#!/usr/bin/env python
"""
provided with argument 'n' (int), create n-by-n random arrays
and compute their dot product
"""
import sys
from os import environ
from time import time
from numpy.random import rand

if "OMP_NUM_THREADS" not in environ:
    print("OMP_NUM_THREADS not set, defaulting to 1")
    environ["OMP_NUM_THREADS"] = "1"
print("OMP_NUM_THREADS: {}".format(environ["OMP_NUM_THREADS"]))

if len(sys.argv) < 2:
    raise ValueError("No argument was provided")

start = time()

n = int(sys.argv[1])
print(f"Shape: {n} by {n}")

# create two n-by-n random arrays
data1 = rand(n, n)
data2 = rand(n, n)

# compute the dot product and report the duration
result = data1.dot(data2)
duration = time() - start
print(f'Duration: {duration:.3f} seconds')
```

A job definition is available in that directory: `job_np_dotproduct.sh`. It expects an argument, `n`, which is passed on to the Python script - for example:

```bash
sbatch job_np_dotproduct.sh 10000
```

In addition to that, we can pass the `cpus-per-task` option:

```bash
sbatch --cpus-per-task=12 job_np_dotproduct.sh 10000
```

This parallel job not only asks for 12 CPUs to be allocated (for one task/process), but also correctly exports the `OMP_NUM_THREADS` environment variable, setting it equal to the number of requested CPUs. See the effect of changing `n` and `cpus-per-task`.

For the sake of this example we did not pass another resource directive: `--mem`, which asks for the amount of memory to be allocated (in megabytes, by default). In this way a default value will be requested (see it with `scontrol show config | grep DefMem`), but you may soon run into problems with large arrays - again, relying on defaults is not a good practice. `--mem=0` is a reserved case: it requests that the job be given access to all the memory on each node it runs on.

The following command may be helpful in diagnosing the environment (e.g. "which implementation of BLAS am I using?", "how many threads is numpy using?"). For example:

```bash
srun -n 4 \
  bash -lc \
  "{ module load py-threadpoolctl py-numpy; OMP_NUM_THREADS=4 python -m threadpoolctl -i numpy; }"
```

This relies on the `threadpoolctl` and `numpy` Python packages: both need to be available, either via loaded environment modules (as above) or in the current environment (e.g. a virtualenv or conda environment). The following is a useful reference on [switching the BLAS implementation](https://conda-forge.org/docs/maintainer/knowledge_base/#switching-blas-implementation) in a conda environment.
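To tie the pieces of this last example together, a job script along the lines of `job_np_dotproduct.sh` could look like the sketch below. This is a reconstruction based on the behaviour described above (forwarding the argument and exporting `OMP_NUM_THREADS` from the allocation), not the actual file shipped with the examples; in particular, the choice of the `py-numpy` module to provide numpy (and a matching interpreter) is an assumption:

```bash
#!/bin/bash
#SBATCH --job-name=np-dotproduct
#SBATCH --output %x.%j.out
#SBATCH --time=00:10:00
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task=1

# match the OpenMP/BLAS thread count to the allocated CPUs
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

module purge
module load py-numpy   # provides numpy (and a matching python) - assumed module name

# forward the first argument ("n") to the Python script
python np_dotproduct.py "$1"
```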