Slurm is Albedo's job scheduling system. It is used to submit jobs from the login nodes to the compute nodes.

Jobs

Submitting jobs

  • To work interactively on a compute node, use salloc.
    You can use all options (more CPU, RAM, time, partition, QOS, ...) described in the next section.
    To enable working with graphical interfaces (X forwarding), add the option --x11 (see the example below).
  • Job scripts are submitted via sbatch.
  • You can ssh to a node where a job of yours is running, if (and only if) you have a valid ssh key pair (e.g., on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1).
    Make sure your key is secured with a passphrase!
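
For example, an interactive session with X forwarding could be requested like this (account name and resource values are placeholders):

salloc --account=<account> --partition=smp --time=01:00:00 --ntasks=1 --x11

# within the allocation, start your program (e.g. with srun):
srun ./my_program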


 Slurm keeps some environment variables that are set when submitting your job, but changes others. This might break your environment.

To be on the safe side:

When using environment modules in your job script (or interactive job), run module purge to unload all currently loaded modules before loading the modules required by your job!
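
For example, at the top of a job script (the module name is just a placeholder):

module purge                   # unload all currently loaded modules
module load <module/version>   # load exactly what the job needs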


Specifying job resources

Job resources are defined in the header of your job script (or passed as command-line arguments to sbatch or salloc). For a full list see https://slurm.schedmd.com/sbatch.html#SECTION_OPTIONS. Here is a list of the most common options:

#SBATCH --account=<account>          # Your account
#SBATCH --partition=<partition>      # Slurm Partition; Default: smp
#SBATCH --time=<time>                # time limit for job; Default: 0:30:00
#SBATCH --qos=<QOS>                  # Slurm QOS; Default: 30min
#SBATCH --nodes=<#Nodes>             # Number of nodes
#SBATCH --ntasks=<#Tasks>            # Number of tasks (MPI) tasks to be launched
#SBATCH --mem=<memory>               # If more than the default memory is needed;
                                     # Default: <#Cores> * <mem per node>/<cores per node>
#SBATCH --ntasks-per-node=<ntasks>   # Number of tasks per node
#SBATCH --mail-user=<email address>  # Your email address if you want to get notifications
#SBATCH --mail-type=<email type>     # Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --job-name=<jobname>         # Job name
#SBATCH --output=<filename_pattern>  # File where the standard output is written to (*)
#SBATCH --error=<filename_pattern>   # File where the error messages are written to (*)

 *) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E
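
Putting the most common options together, a minimal serial job script could look like this (account name and executable are placeholders):

#!/bin/bash
#SBATCH --account=<account>      # your Slurm account (placeholder)
#SBATCH --partition=smp
#SBATCH --qos=30min
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=out_%x.%j       # out_<jobname>.<jobid>

srun ./my_program                # placeholder executable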

Details about specific parameters

Account (-A)

Compute resources are attributed to (primary) sections and projects at AWI. Therefore it is mandatory to specify an account.

This is new on Albedo, compared to ollie.

The Slurm accounts you may use are listed after login, or can be shown via

sacctmgr -s show user name=$USER format=user,account%-30

Note: The account noaccount is just a dummy account that cannot be used for computing.

You can change your default account yourself:

sacctmgr modify user $USER set DefaultAccount=<account>

Partitions (-p)

Nodes with identical hardware are grouped into partitions. More information about the hardware specification of each node can be found in the System Overview.

smp (nodes prod-[001-120])

  • default partition
  • MaxNodes=1 → MaxCores=128
  • default RAM: 1900 MB/core
  • jobs can share a node

smpht (nodes prod-[121-240])

  • like smp, but with hyperthreading (more precisely: simultaneous multithreading, SMT, see https://en.wikipedia.org/wiki/Simultaneous_multithreading)

mpp (nodes prod-[001-120])

  • exclusive access to nodes
  • MaxNodes=240

mppht (nodes prod-[121-240])

  • like mpp, but with hyperthreading

fat (nodes fat-00[1-2])

  • like smp, but for jobs with an extensive need of RAM
  • default RAM: 30000 MB/core

fatht (nodes fat-00[3-4])

  • like fat, but with hyperthreading

gpu (nodes gpu-00[1-2])

  • like smp, but you have to specify the type and number of desired GPUs via --gpus=<GpuType>:<GpuNumber> (see the example below)
  • the two GPU nodes each contain a different number and type of GPUs:
    • gpu-001: 2x A40
    • gpu-002: 4x A100
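
For example, a job requesting a single A100 GPU could use the following header lines; the lowercase type name a100 is an assumption here, so verify the exact GRES type names on the system:

#SBATCH --partition=gpu
#SBATCH --gpus=a100:1    # type name "a100" is an assumption; check with: scontrol show node gpu-002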

Quality of service (--qos)

The QOS of a job determines its maximum walltime and influences its scheduling priority (see the table below). A higher priority means your job is scheduled before other jobs. In addition, during working hours 10 nodes are reserved exclusively for jobs using qos=30min (to facilitate development and testing). For longer runs, another QOS (and walltime) has to be specified. Note: long-running jobs (longer than 12 hours, up to 48 hours) “cost” more in terms of fairshare (meaning your priority will decrease for subsequent jobs).

QOS      max. walltime   UsageFactor   Priority QOS factor   Notes
30min    00:30:00        1             50                    default
12h      12:00:00        1             0
48h      48:00:00        2             0

Job Scheduling

Priority

For job scheduling, Slurm assigns each job a priority, which is calculated based on several factors (Multifactor Priority Plugin). Jobs with higher priority run first. (In principle: the backfill scheduling plugin makes the best use of available resources by filling up resources that are reserved, and thus idle, for large higher-priority jobs with small, lower-priority jobs.)

On Albedo, the priority is mainly influenced by

  • the fairshare factor (which is based on the user’s recent use of resources),
  • the QOS priority factor, and
  • the time your job has been waiting in the queue.

Job size (RAM, cores), partitions and/or associations have no influence.
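
To see how these factors add up for your own pending jobs, and to check your current fairshare value, you can run, for example:

sprio -l -u $USER   # priority of your pending jobs, broken down into its factors
sshare -u $USER     # your current fairshare value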

Fairshare

On Albedo all users have the same share of resources, independent of the account used. … TODO…


Accounting

TODO...


Useful Slurm commands

  • sinfo shows the existing partitions and their state
  • scontrol show job <JobID> shows information about a specific job
  • sstat <JobID> shows resources used by a specific (running) job
  • squeue shows information about queued and running jobs
  • smap shows a curses-based graphic of queues and nodes
  • sbatch <script> submits a batch job
  • salloc <resources> requests access to compute nodes for interactive use
  • scancel <JobID> cancels a batch job
  • srun <resources> <executable> starts a (parallel) code
  • sshare and sprio give information on fairshare values and job priorities
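
A typical sequence might look like this (the job ID is a placeholder; sbatch prints the actual ID at submission):

sbatch my_job.sh            # submit the job script; prints the job ID
squeue -u $USER             # list your pending and running jobs
scontrol show job <JobID>   # detailed information about one job
scancel <JobID>             # cancel the job if something went wrong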


Example Scripts

Job arrays

Job arrays in Slurm are an easy way to submit multiple similar jobs (e.g., executing the same script with multiple input data sets). See https://slurm.schedmd.com/job_array.html for further details.

#!/bin/bash

#SBATCH --account=<account>          # Your account
#SBATCH --partition=smp
#SBATCH --time=0:10:00
#SBATCH --ntasks=1

# run 100 tasks, but only run 10 at a time
#SBATCH --array=1-100%10
#SBATCH --output=result_%A_%a.out    # gives result_<jobID>_<taskID>.out

echo "SLURM_JOBID:         $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID:  $SLURM_ARRAY_JOB_ID"

# Here we "translate" the $SLURM_ARRAY_TASK_ID (which takes values from 1-100)
# into an input file that we want to analyze.
# Suppose 'input_files.txt' is a text file that has 100 lines, each containing
# the respective input file.

INPUT_LIST=input_files.txt

# Read the (SLURM_ARRAY_TASK_ID)th input file
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" < ${INPUT_LIST})

srun my_executable $INPUT_FILE


How you “translate” your task ID into the srun command line is up to you. You could, for example, also select and execute different scripts based on the task ID.

MPI

full node
#!/bin/bash

#SBATCH --account=<account>          # Your account 
#SBATCH --time 0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node 128
#SBATCH --cpus-per-task 1
#SBATCH --job-name=mpi
#SBATCH --output=out_%x.%j

# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load    xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0    intel-oneapi-mpi
# module load    xthi/1.0-openmpi4.1.3-gcc8.5.0   openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
##  e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun  xthi | sort -g -k 4


partially filled node
#!/bin/bash

#SBATCH --account=<account>          # Your account 
#SBATCH --time 0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node 31
#SBATCH --job-name=mpi_partial_node
#SBATCH --output=out_%x.%j

# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load    xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0   intel-oneapi-mpi
# module load    xthi/1.0-openmpi4.1.3-gcc8.5.0   openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
##  e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

# The --cpu-bind=rank_ldom option distributes the tasks over the node's cores,
# respecting the node's NUMA domains
srun --cpu-bind=rank_ldom xthi | sort -g -k 4

OpenMP

#!/bin/bash

#SBATCH --account=<account>          # Your account 
#SBATCH --time 0:10:00
#SBATCH -p smp
#SBATCH --tasks-per-node 1
#SBATCH --cpus-per-task 64
#SBATCH --job-name=openMP
#SBATCH --output=out_%x.%j

# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load    xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0   intel-oneapi-mpi
# module load    xthi/1.0-openmpi4.1.3-gcc8.5.0   openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
##  e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4

Hybrid (MPI+OpenMP)

#!/bin/bash

#SBATCH --account=<account>          # Your account 
#SBATCH --time 0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node 8
#SBATCH --cpus-per-task 16
#SBATCH --job-name=hybrid
#SBATCH --output=out_%x.%j

# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load    xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0   intel-oneapi-mpi
# module load    xthi/1.0-openmpi4.1.3-gcc8.5.0   openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
##  e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4



