User guide

This document describes the Bora high-performance computing (HPC) cluster and how to use it.

The Bora HPC cluster and its associated resources are owned by the Physics Department of the University of Trieste (UniTS). They were funded through the “Department of Excellence” grant (2023-2027).

To report issues or receive support, please write to calcolo.df AT units.it providing all the relevant details about your request (see Access the infrastructure and Reporting issues).


Access the infrastructure

Access to Bora is reserved for members and students of the Physics Department at UniTS; it is not open to their collaborators. Only researchers and professors of the Physics Department can request access to the cluster, either for themselves or for students they supervise. To request access, please contact calcolo.df AT units.it using this template email, providing the following information about the user whose account must be activated:

  • Name and surname

  • UniTS username (the one used to get access to University services like esse3, email, moodle, …)

  • Account’s expiry date

  • Brief description of the expected cluster usage

Warning

Make sure the user has read and understood the general rules of the cluster.

You will receive a confirmation email once the requested account has been activated.

Please note:

  • you can log in with SSH to the Bora cluster at bora.units.it only from the UniTS internal network or from the UniTS VPN (see the Login with SSH section for more information about SSH)

  • your login credentials are the same as the UniTS login credentials:

    • your username is the “matricola” identification number

    • your password is the same one you use to access UniTS services (e.g. email, esse3)


Reporting issues

To report issues, please contact calcolo.df AT units.it providing:

  1. Your username

  2. The problem you encountered and the behavior you expected

  3. Logs, error messages or screenshots illustrating the problem

preferably using the following template email.


General rules

Please read carefully the rules and respect them when using the cluster:

  • Do not run long processes on the login node; of course, short tests and code compilation are fine

  • Submit jobs from the login node bora.units.it using SLURM (see the Slurm workload manager section)

  • Make sure you request the resources actually needed to run your jobs (see Job resources)

  • Test the scalability of your code before running long parallel jobs (if you do not know what “scalability” means, ask your supervisor first!)


Login with SSH

$ ssh 00001@bora.units.it

Welcome to Bora!
  ____     ___    ____       _    
 | __ )   / _ \  |  _ \     / \   
 |  _ \  | | | | | |_) |   / _ \  
 | |_) | | |_| | |  _ <   / ___ \ 
 |____/   \___/  |_| \_\ /_/   \_\
                                  
HPC Cluster - Dipartimento di Fisica - UniTS

Ask for support to
     calcolo.df@units.it

Last login: Mon Sep 30 15:49:32 2024 from 140.105.1.1
Disk quotas for user 00001: 
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
     /data/slow  30246M  30720M  32768M            7773       0       0        
Disk quotas for group domain users: none
[00001@bora ~]$ 

SSH from Linux/macOS clients

To log in from any Linux/macOS client, use the following command from a terminal:

ssh username@bora.units.it

Setup WSL on Windows

Windows users can set up Windows Subsystem for Linux (WSL) on their machine to log into a remote server using SSH. Please visit the official guide for more information.

Step 1: Enable WSL

  1. Open PowerShell as Administrator:

    • Press Windows Key + X, then select Windows PowerShell (Admin).

  2. In the PowerShell window, type the following command to enable WSL:

wsl --install

This command installs the necessary WSL components and the latest version of Ubuntu by default. You may be prompted to restart—if so, restart your system.

Step 2: Set Up the Linux Environment

  1. Once installed, launch the Linux distribution from the Start menu.

  2. You will be prompted to create a new user and password for your WSL instance. Follow the prompts to configure it.

  3. Update your Linux environment by running the following command:

sudo apt update && sudo apt upgrade -y

Step 3: Install OpenSSH (if not installed)

Most Linux distributions come with OpenSSH pre-installed. To verify or install:

  1. Open your WSL terminal.

  2. Install the OpenSSH client by running:

sudo apt install openssh-client
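
Once the client is installed, you can check that it is available and then connect to the cluster directly from the WSL terminal (replace username with your own UniTS username):

ssh -V                        # prints the OpenSSH client version
ssh username@bora.units.it    # connects to the Bora login node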

Data storage and transfer

There are different storage areas on Bora:

  • /u/username: your personal home folder (that’s where you are when you log in)

  • /data/slow/username: shared folder for normal I/O operations

  • /data/fast/username: shared folder for fast I/O operations

There are user-based quotas on each partition:

  • /u: 10 GB

  • /data/slow: 60 GB

  • /data/fast: 30 GB

These quotas may change in the future. The login screen always shows the space currently occupied by your files in each partition, together with your quota limits.

Please follow these guidelines:

  • for projects with limited I/O and disk space usage, just work on your home folder

  • for projects requiring larger disk space, use /data/slow

  • for projects requiring I/O of large files, use /data/fast

Note that the performance difference between the fast and slow partition may strongly depend on the kind of I/O operations you perform. For instance, I/O of many small files is unlikely to be faster on /data/fast than on /data/slow.
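
Beyond what the login screen reports, you can check the space used by your files with a standard du command; the paths below simply follow the layout described above and assume your per-user directories already exist:

du -sh /u/$USER /data/slow/$USER /data/fast/$USER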

To copy data from your computer to the cluster, use one of these commands:

scp -r <local_path> username@bora.units.it:<remote_destination_path>
rsync -av <local_path> username@bora.units.it:<remote_destination_path>

To copy data from the cluster to your computer:

scp -r username@bora.units.it:<remote_path> <local_destination_path>
rsync -av username@bora.units.it:<remote_path> <local_destination_path>
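
For example, to copy a local folder named my_project (a placeholder name) into your area on /data/slow, where 00001 stands for your own username:

scp -r my_project 00001@bora.units.it:/data/slow/00001/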

Submitting jobs

To run processes on the computing nodes of the cluster (for instance, to perform simulations or data analysis), you must submit jobs to the SLURM scheduler. In this section, we describe the basic process of submitting jobs to the scheduler, either interactively or as batch jobs. For more information on getting started with SLURM, please visit this guide.

Interactive jobs

Interactive jobs allow you to run commands in real-time, directly from the command line.

srun <command>

These jobs are useful when you need to test or debug code before submitting a batch job. For example, if you want to execute a Python script or a shell command, you can run it directly on the allocated resources using this command.
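
For instance, a quick test could run a simple command on the allocated resources; my_script.py is just a placeholder for your own script, and the appropriate module must be loaded first (see the Modules and software section):

srun hostname                   # prints the name of the compute node running the task
srun -n 1 python3 my_script.py  # runs a (hypothetical) Python script on one task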

To run the job in the background and detach it from the terminal

nohup srun -n 16 sleep 10 &

The nohup command ensures that the process keeps running even if the user logs out. Here, srun is used to run the job on 16 tasks (-n 16), and the job will sleep for 10 seconds. The & symbol ensures that the job runs in the background.

Batch jobs

Batch jobs allow you to submit jobs to the queue for execution. These jobs run without real-time user input, so they are ideal for long-running jobs or jobs that run in parallel on multiple nodes.

The first step is to create a bash script with your favorite text editor. This is a sample batch job script:

#!/bin/bash
#SBATCH --ntasks=16
mpirun -n $SLURM_NTASKS sleep 10

A few explanations:

  • #!/bin/bash: Specifies that the script should be executed in the bash shell.

  • #SBATCH --ntasks=16: This line tells SLURM to allocate 16 tasks (by default, the number of nodes is 1)

  • mpirun -n $SLURM_NTASKS sleep 10: Runs the job using MPI on the allocated tasks. $SLURM_NTASKS is an environment variable set by SLURM, representing the number of tasks to be run.

To submit the batch job to the SLURM scheduler

sbatch <batch_script>

SLURM will place the job in a queue and run it when resources become available.
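
A successful submission prints the ID assigned to the job (the script name and job ID below are only examples); you will need this ID to monitor or cancel the job later:

$ sbatch job.sh
Submitted batch job 12345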

You can even define the number of tasks when submitting the job from the command line

sbatch -n 2 <batch_script>

The -n 2 option will take precedence over the one specified in the job script. Note that -n is just a shortcut for --ntasks.

Of course, this is just a simple example: the script can be modified to run any program or workload, and there are more options to fine-tune your job.

Job resources (CPUs, GPUs, RAM, …)

Here are a few options you can provide to srun and sbatch to request specific resources (e.g. parallel jobs, memory requirements, …); a sample batch script combining several of these options follows the list.

  • --time=hh:mm:ss: maximum runtime for the job.

  • --mem=<amount>: memory required per node or per CPU.

  • --output=<filename>: file to save the job’s output.

  • --cpus-per-task=<num_cpus>: number of CPUs (cores) per task for your job. Useful for multi-threaded applications.

  • --nodes=<num_nodes>: number of computing nodes your job needs. This is useful for distributed jobs that need to run across multiple nodes.

  • --ntasks=<num_tasks>: total number of tasks to run. This is commonly used for MPI-based parallel programs.

  • --ntasks-per-node=<num_tasks>: number of tasks per node. This is commonly used for MPI-based parallel programs.

  • --partition=<partition_name>: the partition (queue) to run the job on (not needed right now on Bora)
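
As a concrete illustration, here is a sketch of a batch script combining several of these options; the module name, program name and resource values are placeholders to adapt to your own workload:

#!/bin/bash
#SBATCH --nodes=1                     # run on a single node
#SBATCH --ntasks=16                   # 16 MPI tasks in total
#SBATCH --time=02:00:00               # maximum runtime of 2 hours
#SBATCH --mem=32G                     # memory requested on the node (adjust to your needs)
#SBATCH --output=my_job_%j.out        # output file (%j is replaced by the job ID)

module load <module_name>             # load the software you need (see Modules and software)
mpirun -n $SLURM_NTASKS ./my_program  # run a (hypothetical) MPI executable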

Compiling and testing code

To compile, debug or monitor the RAM usage of your code, you can request an interactive session by passing the --pty option to srun, for example:

srun --pty --mem=8G --time=1:00:00 bash

The above command allocates a pseudo-terminal session for 1 hour, requesting 8 GB of memory. The bash at the end specifies the command (a bash shell) that will be run when the Slurm job starts.

Once you enter the interactive session, you can load the modules needed to compile or run your code using the command module load <module_name> (see the Modules and software section for more details).

Inside the interactive shell you can run your executable directly: for instance, an executable named program_name can be launched with ./program_name.
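
A complete compile-and-run session might look like the following sketch; the compiler module and source file names are placeholders, so check module avail for the exact module names on Bora:

srun --pty --mem=8G --time=1:00:00 bash   # start the interactive session
module load <compiler_module>             # e.g. a GCC module listed by "module avail"
gcc -O2 -o program_name program_name.c    # compile a (hypothetical) C source file
./program_name                            # run the freshly built executable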

GPU nodes

The Bora cluster currently has one GPU node with 2 NVIDIA A30 GPUs (24 GB each), one of which is split into 4 partitions (MIG devices). You can request GPU devices by passing the --gres option to sbatch or srun. For example,

srun -n1 --gres=gpu:1 bash

will allocate an entire physical A30 GPU. In case you want to allocate only a portion of a physical GPU (MIG devices), you can specify the desired MIG profile

srun -n1 --gres=gpu:a30_1g.6gb:2 bash 

The above command will allocate 2 of the 4 MIG devices available on the second GPU, allowing you to run multi-GPU jobs on a single graphics card.

The names and counts of the GPU devices can be queried via the scontrol command as follows:

[user@bora ~]$ scontrol show node bora-gpu01 |grep -i gres
   Gres=gpu:a30:1(S:0),gpu:a30_1g.6gb:4(S:1)
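
For batch jobs, the same --gres syntax can be used inside the job script. Here is a minimal sketch requesting one full A30; the module and program names are placeholders and depend on your software:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a30:1     # one full A30 GPU (use gpu:a30_1g.6gb:<n> for MIG devices)
#SBATCH --time=01:00:00

module load <module_name>    # load the GPU-enabled software you need
./my_gpu_program             # run a (hypothetical) GPU executable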

Other useful commands

Here are a few additional tips to help you work effectively with SLURM:

  • Resources available: Use sinfo to get information about the resources on available nodes that make up the HPC cluster.

  • Monitoring jobs: You can check the status of your submitted jobs using the squeue command:

squeue -u <username>

This will show you the jobs that are currently queued or running for your user.

  • Canceling jobs: If you need to cancel a job that is in the queue or running, use the scancel command with the job ID:

scancel <job_id>

  • Checking job output: Once a job finishes, the output and error messages are saved in the files specified with the --output and --error options of your batch script. If no files are specified, SLURM writes both to a default file named slurm-<job_id>.out in the directory from which the job was submitted.
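
For example, to inspect the default output file of a job, you can use standard tools such as cat or tail (replace 12345 with your own job ID):

cat slurm-12345.out        # print the whole output file
tail -f slurm-12345.out    # follow the output of a job that is still running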


Modules and software

The scientific software has been installed using the Spack package manager and can be loaded using environment modules. Modules allow you to load different versions of software (e.g. compilers, libraries, …). Here are some basic commands to get started with environment modules.

Available modules

You can list all available modules on the cluster using the command:

module avail

This will show a very long list! If you are looking for a specific module, you can filter the list by specifying the module name. For instance, the command

module avail python

will list all available versions of Python modules. Users can then choose which version to load based on their requirements, as described in the Loading modules section

Warning

The default Python distribution on Bora is the one provided by the operating system, which ships a relatively old Python version (3.6). If you need a more recent Python version, load the appropriate module!

This is a short list of the available software:

  • C, C++, Fortran compilers (GCC suite and Intel oneAPI)

  • OpenMPI built with InfiniBand support (GCC and Intel)

  • Quantum Espresso (GCC only)

  • Yambo (GCC only)

  • Gadget-4

  • CernVM-FS (CernVM File System)

  • HPL

  • iozone

  • HEP SPEC

  • stream

Of course, to get an updated and complete list use the module avail command.

Loading modules

To load a specific environment module, use the module load command followed by the module name. For instance, to load Python 3.10:

module load python/3.10

The above command sets up the environment with Python version 3.10 and adjusts the PATH and other environment variables accordingly.
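
After loading the module, you can verify which interpreter is picked up; the exact binary name (python vs python3) and version string depend on the module actually installed:

module load python/3.10
which python3        # should point to the module's installation path
python3 --version    # prints the Python version provided by the module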

Note

If you want some modules to be loaded by default every time you log in, add the corresponding module load command in your ~/.bashrc file (in your home directory on Bora).
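
For instance, appending a line like the following to your ~/.bashrc (the module name is just an example) loads that module automatically at every login:

# in ~/.bashrc on Bora
module load python/3.10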

To show the currently loaded modules, use the command

module list

If you want to see detailed information about a specific module, such as environment variables it sets or paths it modifies, use the module show command, e.g.:

module show python

This will display information about the Python module, including any environment variables that will be affected when the module is loaded.

Clearing up

To unload a single module

module unload <module>

To unload all currently loaded modules and reset the environment

module purge

The above command will remove all modules currently loaded in your session, which is useful if you want to start with a clean environment before loading new modules.