User guide¶
This document describes the Bora high-performance computing (HPC) cluster and how to use it.
The Bora HPC cluster and its associated resources are owned by the Physics Department of the University of Trieste. They were funded through the “Department of Excellence” grant (2023-2027).
To report issues or receive support, please write to calcolo.df AT units.it
providing all the relevant details about your request (see Access the infrastructure and Reporting issues).
Access the infrastructure¶
Access to Bora is reserved for members and students of the Physics Department at UniTS, not for their collaborators. Only researchers and professors of the Physics Department can request access to the cluster, for themselves or for students they supervise. To get access to the cluster, please contact calcolo.df AT units.it
using this template email, providing the following information about the user whose account must be activated:
Name and surname
UniTS username (the one used to get access to University services like esse3, email, moodle, …)
Account’s expiry date
Brief description of the expected cluster usage
Warning
Make sure the user has read and understood the general rules of the cluster.
You will receive a confirmation email once the requested account has been activated.
Please note:
you can login with SSH to the Bora cluster at
bora.units.it
only from the UniTS internal network or from the UniTS VPN - visit the Login with SSH section for more information about SSH
your login credentials are the same as the UniTS login credentials:
your username is the “matricola” identification number
your password is the same one you use to access UniTS services (e.g. email, esse3)
Reporting issues¶
To report issues, please contact calcolo.df AT units.it
providing:
Your username
The problem you encounter and the behavior you were expecting
Logs, error messages or screenshots illustrating the problem
preferably using the following template email.
General rules¶
Please read carefully the rules and respect them when using the cluster:
Do not run long processes on the login node; of course, short tests and code compilation are fine
Submit jobs from the login node
bora.units.it
using SLURM (see the Slurm workload manager section)
Make sure you request the resources actually needed to run your jobs (see Job resources)
Test the scalability of your code before running long parallel jobs (if you do not know what “scalability” means, ask your supervisor first!)
Login with SSH¶
$ ssh 00001@bora.units.it
Welcome to Bora!
____ ___ ____ _
| __ ) / _ \ | _ \ / \
| _ \ | | | | | |_) | / _ \
| |_) | | |_| | | _ < / ___ \
|____/ \___/ |_| \_\ /_/ \_\
HPC Cluster - Dipartimento di Fisica - UniTS
Ask for support to
calcolo.df@units.it
Last login: Mon Sep 30 15:49:32 2024 from 140.105.1.1
Disk quotas for user 00001:
Filesystem space quota limit grace files quota limit grace
/data/slow 30246M 30720M 32768M 7773 0 0
Disk quotas for group domain users: none
[00001@bora ~]$
SSH from Linux/macOS clients¶
To log in from any Linux/macOS client, use the following command from a terminal:
ssh username@bora.units.it
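If you connect frequently, you can optionally add an entry to your ~/.ssh/config file so that a short alias is enough. This is a minimal sketch, where bora is an arbitrary alias and 00001 is a placeholder for your own “matricola” username:
Host bora
    HostName bora.units.it
    User 00001
With this entry in place, ssh bora is equivalent to the full command above.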
Setup WSL on Windows¶
Windows users can set up Windows Subsystem for Linux (WSL) on their machine to log into a remote server using SSH. Please visit the official guide for more information.
Step 1: Enable WSL
Open PowerShell as Administrator:
Press
Windows Key + X
, then select Windows PowerShell (Admin).
In the PowerShell window, type the following command to enable WSL:
wsl --install
This command installs the necessary WSL components and the latest version of Ubuntu by default. You may be prompted to restart—if so, restart your system.
Step 2: Set Up the Linux Environment
Once installed, launch the Linux distribution from the Start menu.
You will be prompted to create a new user and password for your WSL instance. Follow the prompts to configure it.
Update your Linux environment by running the following command:
sudo apt update && sudo apt upgrade -y
Step 3: Install OpenSSH (if not installed)
Most Linux distributions come with OpenSSH pre-installed. To verify or install:
Open your WSL terminal.
Install the OpenSSH client by running:
sudo apt install openssh-client
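To check that the client is available and then connect to the cluster, you can run, for example (replacing 00001 with your own username):
ssh -V
ssh 00001@bora.units.it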
Data storage and transfer¶
There are different storage areas on Bora:
/u/username: your personal home folder (that’s where you are when you log in)
/data/slow/username: shared folder for normal I/O operations
/data/fast/username: shared folder for fast I/O operations
There are user-based quotas on each partition:
/u: 10 GB
/data/slow: 60 GB
/data/fast: 30 GB
These quotas may be subject to change. The login screen will always remind you about the space occupied by your files in each partition and your quota limit.
Please follow these guidelines:
for projects with limited I/O and disk space usage, just work on your home folder
for projects requiring larger disk space, use
/data/slow
for projects requiring I/O of large files, use
/data/fast
Note that the performance difference between the fast and slow partition may strongly depend on the kind of I/O operations you perform. For instance, I/O of many small files is unlikely to be faster on /data/fast
than on /data/slow
.
To copy data from your computer to the cluster, use one of these commands
scp -r <local_path> username@bora.units.it:<remote_destination_path>
rsync <local_path> username@bora.units.it:<remote_destination_path>
To copy data from the cluster to your computer
scp -r username@bora.units.it:<remote_path> <local_destination_path>
rsync username@bora.units.it:<remote_path> <local_destination_path>
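For large or repeated transfers, rsync can resume interrupted copies and skip files that are already up to date. A minimal sketch, assuming you are copying a local folder named results/ (a placeholder) into your /data/slow area:
rsync -avz --progress results/ username@bora.units.it:/data/slow/username/results/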
Submitting jobs¶
To run processes on the computing nodes of the cluster (for instance, to perform simulations or data analysis), you must submit jobs to the SLURM scheduler. In this section, we describe the basic process of submitting jobs to the scheduler, either interactively or as batch jobs. For more information on how to get started with SLURM, please visit this guide.
Interactive jobs¶
Interactive jobs allow you to run commands in real-time, directly from the command line.
srun <command>
These jobs are useful when you need to test or debug code before submitting a batch job. For example, if you want to execute a Python script or a shell command, you can run it directly on the allocated resources using this command.
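For example, the following commands run hostname on the allocated resources, first as a single task and then as 4 tasks; the output should show the name of a compute node rather than the login node:
srun hostname
srun -n 4 hostname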
To run the job in the background and detach it from the terminal
nohup srun -n 16 sleep 10 &
The nohup
command ensures that the process keeps running even if the user logs out. Here, srun
is used to run the job on 16 tasks (-n 16
), and the job will sleep for 10 seconds. The &
symbol ensures that the job runs in the background.
Batch jobs¶
Batch jobs allow you to submit jobs to the queue for execution. These jobs run without real-time user input, so they are ideal for long-running jobs or jobs that run in parallel on multiple nodes.
The first step is to create a bash script with your favorite text editor. This is a sample batch job script:
#!/bin/bash
#SBATCH --ntasks=16
mpirun -n $SLURM_NTASKS sleep 10
A few explanations:
#!/bin/bash: Specifies that the script should be executed in the bash shell.
#SBATCH --ntasks=16: This line tells SLURM to allocate 16 tasks (by default, the number of nodes is 1).
mpirun -n $SLURM_NTASKS sleep 10: Runs the job using MPI on the allocated tasks. $SLURM_NTASKS is an environment variable set by SLURM, representing the number of tasks to be run.
To submit the batch job to the SLURM scheduler
sbatch <batch_script>
SLURM will place the job in a queue and run it when resources become available.
You can even define the number of tasks when submitting the job from the command line
sbatch -n 2 <batch_script>
The -n 2
option will take precedence over the one specified in the job script. Note that -n
is just a shortcut for --ntasks
.
Of course, this is just a simple example: the script can be modified to run any program or workload and there are more options to fine tune your job.
Job resources (CPUs, GPUs, RAM, …)¶
Here are a few options you can provide to srun and sbatch to request specific resources (e.g. parallel jobs, memory requirements, …); a sample script combining several of them is sketched after the list:
--time=hh:mm:ss: maximum runtime for the job.
--mem=<amount>: memory required per node or per CPU.
--output=<filename>: file to save the job’s output.
--cpus-per-task=<num_cpus>: number of CPUs (cores) per task for your job. Useful for multi-threaded applications.
--nodes=<num_nodes>: number of computing nodes your job needs. This is useful for distributed jobs that need to run across multiple nodes.
--ntasks=<num_tasks>: total number of tasks to run. This is commonly used for MPI-based parallel programs.
--ntasks-per-node=<num_tasks>: number of tasks per node. This is commonly used for MPI-based parallel programs.
--partition=<partition_name>: the partition (queue) to run the job on (not needed right now on Bora)
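As an illustration only, here is a sketch of a batch script combining several of these options; the resource amounts, output file name and executable name are placeholders to adapt to your own workload:
#!/bin/bash
# Placeholder resource requests: adapt the values to your own workload
#SBATCH --time=02:00:00
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=16
#SBATCH --mem=32G
#SBATCH --output=my_job_%j.out

# program_name is a placeholder for your executable; %j above is replaced by the job ID
mpirun -n $SLURM_NTASKS ./program_name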
Compiling and testing code
To compile, debug or monitor the ram usage of your code you can request an interactive session passing the --pty
option to srun
, for example:
srun --pty --mem=8G --time=1:00:00 bash
The above command allocates a pseudo-terminal session for 1 hour, requesting 8 GB of memory. The bash
at the end specifies the shell (bash) that will be run when the Slurm job is started.
Once you enter the interactive session you can load the modules needed to compile or run your code using the command module load <module_name>
(see Modules and software section for more details)
Inside the interactive shell you can directly run your executable named e.g. program_name
using the command ./program_name
.
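For example, a typical compile-and-test cycle inside the interactive session might look like the following; gcc (or any other compiler module listed by module avail) and program_name.c are placeholders for whatever module and source file you actually use:
module load gcc
gcc -O2 -o program_name program_name.c
./program_name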
GPU nodes
The Bora cluster currently has one GPU node with 2 NVIDIA A30 GPUs (24 GB each), one of which is split into 4 partitions (MIG devices). It is possible to request GPU devices by passing the --gres option
to sbatch
or srun
. For example
srun -n1 --gres=gpu:1 bash
will allocate an entire physical A30 GPU. In case you want to allocate only a portion of a physical GPU (MIG devices), you can specify the desired MIG profile
srun -n1 --gres=gpu:a30_1g.6gb:2 bash
The above command will allocate 2 of the 4 MIG devices available on the second GPU, allowing you to run a multi-GPU job on a single graphics card.
The names and counts of the GPU devices can be queried via the scontrol
command as follows:
[user@bora ~]$ scontrol show node bora-gpu01 |grep -i gres
Gres=gpu:a30:1(S:0),gpu:a30_1g.6gb:4(S:1)
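As a sketch only, a batch script requesting a single MIG device could look like the following; the executable name is a placeholder, and the --gres value follows the profiles shown by scontrol above:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:a30_1g.6gb:1

# my_gpu_program is a placeholder for your GPU executable
./my_gpu_program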
Other useful commands¶
Here are a few additional tips to help you work effectively with SLURM:
Resources available: Use sinfo to get information about the resources on available nodes that make up the HPC cluster.
Monitoring jobs: You can check the status of your submitted jobs using the squeue command:
squeue -u <username>
This will show you the jobs that are currently queued or running for your user.
Canceling jobs: If you need to cancel a job that is in the queue or running, use the
scancel
command with the job ID:
scancel <job_id>
Checking job output: Once a job finishes, the output and error messages are saved in the files specified in the --output or --error options of your batch script. If no files are specified, SLURM will create default output files in the current working directory.
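For example, when no --output option is given, the default output file is typically named slurm-<job_id>.out and placed in the directory from which you ran sbatch, so you can inspect it with:
cat slurm-<job_id>.out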
Modules and software¶
The scientific software has been installed using the Spack package manager and can be loaded using environment modules. Modules allow you to load different versions of software packages (e.g. compilers, libraries, …). Here are some basic commands to get started with environment modules.
Available modules¶
You can list all available modules on the cluster using the command:
module avail
This will show a very long list! If you are looking for a specific module, you can filter the list by specifying the module name. For instance, the command
module avail python
will list all available versions of Python modules. Users can then choose which version to load based on their requirements, as described in the Loading modules section.
Warning
The default Python distribution on Bora is the one provided by the operating system, which ships a relatively old Python version (3.6). If you need a more recent Python version, load the appropriate module!
This is a short list of the available software:
C, C++, Fortran compilers (GCC suite and Intel OneApi)
OpenMPI built with InfiniBand support (GCC and Intel)
Quantum Espresso (GCC only)
Yambo (GCC only)
Gadget-4
Cern-vm filesystem
HPL
iozone
HEP SPEC
stream
Of course, to get an updated and complete list use the module avail
command.
Loading modules¶
To load a specific environment module, use the module load
command followed by the module name. For instance, to load Python 3.10:
module load python/3.10
The above command sets up the environment with Python version 3.10 and adjusts the PATH and other environment variables accordingly.
Note
If you want some modules to be loaded by default every time you log in, add the corresponding module load
command in your ~/.bashrc
file (in your home directory on Bora).
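For example, appending lines like these to your ~/.bashrc would load a compiler and a Python module at every login; the module names are placeholders, so pick the ones you need from module avail:
# Modules loaded automatically at login (placeholder names)
module load gcc
module load python/3.10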
To show the currently loaded modules use the command
module list
If you want to see detailed information about a specific module, such as environment variables it sets or paths it modifies, use the module show
command, e.g.:
module show python
This will display information about the Python module, including any environment variables that will be affected when the module is loaded.
Clearing up¶
To unload a single module
module unload <module>
To unload all currently loaded modules and reset the environment
module purge
The above command will remove all modules currently loaded in your session, which is useful if you want to start with a clean environment before loading new modules.