Slurm is albedo's job scheduling system. It is used to submit jobs from the login nodes to the compute nodes.
...
*) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_filename-pattern
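For example, to get one log file per job, named after the job name and job ID (a minimal illustration using the %x and %j patterns described there):

```bash
#SBATCH --output=out_%x.%j   # %x expands to the job name, %j to the job ID
```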
Job enforcements
We implemented some enforcements to improve albedo's overall performance.
- Jobs requesting --partition=fat but only low memory are rejected.
- Jobs requesting fewer than 40 nodes are forced to use only nodes connected to the same InfiniBand switch (if this is feasible within 15 minutes).
Details about specific parameters
Account (-A)
Compute resources are attributed to (primary) sections and projects at AWI. Therefore it is mandatory to specify an account.
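For example (the account name is a placeholder; use one of the accounts you are associated with):

```bash
#SBATCH --account=<account>
# or, equivalently, on the command line:
# sbatch -A <account> jobscript.sh
```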
...
Partition | Nodes | Description
---|---|---
smp | prod-[001-120] |
smpht | prod-[121-240] |
mpp | prod-[001-120] |
mppht | prod-[121-240] | like mpp but with HT
fat | fat-00[1-2] |
matlab | fat-00[3-4] | currently reserved for matlab users (as personal matlab licenses are node-bound). This might change later. Note: to prohibit single users from allocating entire resources on these dedicated nodes, we limit the resources per user in this partition to 32 CPUs and 1 TB RAM. Please get in touch with us if these limitations conflict with your use case!
gpu | gpu-00[1-2] | like smp, but the two gpu nodes each contain a different number and type of GPU: gpu-001: 2x A40, gpu-002: 4x A100. You have to request GPUs with --gpus=<GpuType>:<GpuQuantity> (otherwise no GPU will be allocated for you).
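For example, to get an interactive session with one of the A40 GPUs (a minimal sketch; the account is a placeholder):

```bash
salloc --account=<account> -p gpu --gpus=a40:1 --time=00:30:00
```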
Quality of service (--qos)
Slurm's QOS is a way for us to influence a job's priority (priority QOS_factor) and "cost" (UsageFactor) based on the job's size (we only take walltime into account here!). We therefore created the different QOS, which are listed below.
The default QOS is 30min; for a job with a walltime >30min you have to select and set an appropriate QOS in addition to your walltime!
A higher priority means your job is scheduled before other jobs. In addition, during working hours 10 nodes are reserved exclusively for jobs using qos=30min (to facilitate development and testing). For longer runs, another QOS (and walltime) has to be specified. Note: long running jobs (longer than 12 hours, up to 48 hours) "cost" more in terms of fairshare (meaning your priority will decrease for further jobs).
QOS | max. walltime | max. Nodes/User | UsageFactor | Priority QOS_factor | Notes
---|---|---|---|---|---
30min | 00:30 | - | 1 | 1 | default
12h | 12:00 | 120 | 1 | 0 |
48h | 48:00 | 80 | 2 | 0 |
1wk | 7-00:00:00 (168h) | 1 | 10 | 0 | only available for users upon request; whenever possible try to adapt your workflow to allow for shorter walltime!
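For example, a job that needs 10 hours of walltime must request a QOS that allows more than 30 minutes, e.g. 12h (a minimal sketch of the relevant header lines):

```bash
#SBATCH --time=10:00:00
#SBATCH --qos=12h
```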
Job Scheduling
Priority
Jobs on albedo are scheduled based on a priority that is computed by Slurm depending on multiple factors (https://slurm.schedmd.com/priority_multifactor.html).
The higher the priority, the sooner your job begins. (In principle, at least: the backfill scheduling plugin helps to make the best use of available resources by filling up resources that are reserved, and thus idle, for large higher-priority jobs with small lower-priority jobs.)
At AWI, only a few of the possible factors are taken into account:
...
The factors, except for the nice_factor (default zero), which can be set by the user to lower the job's priority via --nice=..., are numbers in the range from 0 to 1.
They are briefly explained in the following.
...
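One way to inspect the resulting priority and its individual factors for pending jobs is Slurm's sprio command, e.g.:

```bash
# long format, restricted to your own jobs
sprio -l -u $USER
```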
You can check the recent usage of albedo with this command:
```bash
sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 week ago") Format=used,login,account
```

FairShare
The fairshare factor is the most important factor here, but also the most difficult to understand. It is calculated using the "classic" fairshare algorithm of Slurm (https://slurm.schedmd.com/classic_fair_share.html), which computes the fairshare for each user based on their recent usage of the system.
Note: unlike on ollie, the usage of your associated account is *not* taken into account here!
Usage is basically "CPU seconds", but weighted using the UsageFactor of the QOS used (see section QOS). Furthermore, the usage taken into account here decays with time (with a half-life of 7 days).
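As a rough sketch of what this decay means (assuming a purely exponential decay with the 7-day half-life mentioned above): usage that is Δt days old only contributes with a weight of 0.5^(Δt/7), i.e. usage from one week ago counts half, usage from two weeks ago counts a quarter, and so on.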
Fairshare is then calculated by
...
- sinfo shows existing queues
  For example, to check how many nodes are available in a given partition (mpp, fat, gpu, ...): sinfo -p <partition_name>
- scontrol show job <JobID> shows information about a specific job
- sstat <JobID> shows resources used by a specific job
- squeue shows information about queues and used nodes
- smap shows a curses-based graphic of queues and nodes
- sbatch <script> submits a batch job
- salloc <resources> requests access to compute nodes for interactive use
- scancel <JobID> cancels a batch job
- srun <resources> <executable> starts a (parallel) code
- sshare and sprio give information on fair share value and job priority
- sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 month ago") Format=used,login,account | head -20 shows the top users during the last month
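As a small usage sketch combining some of these commands (account and partition are placeholders):

```bash
# request one node interactively for 30 minutes (default QOS)
salloc --account=<account> -p smp -N 1 --time=00:30:00
# run a command inside the allocation
srun hostname
# check the state of your jobs
squeue -u $USER
# release the allocation again
exit
```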
Do's & Don'ts
- Do not use srun for simple non-parallel jobs like cp, ln, rm, cat, g[un]zip
- Do not write loops in your slurm script to start several instances of similar jobs → See Job arrays below
- Make use of the parallel srun [un]pigz instead of g[un]zip if you have already allocated more than one CPU (see the sketch after this list)
- Do not allocate costly resources (like fat/gpu nodes) if you do not need them. Check the CPU/memory efficiency of your jobs with info.sh -S
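A minimal sketch of the [un]pigz hint above, assuming the job step has several CPUs allocated (the file name is a placeholder):

```bash
# inside a job script that has e.g. #SBATCH --cpus-per-task=8:
# compress a large file with as many threads as CPUs were allocated per task
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK pigz -p $SLURM_CPUS_PER_TASK large_file
```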
Example Scripts
Job arrays
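As a minimal sketch (account, program and file names are placeholders), a job array that processes ten numbered input files could look like this:

```bash
#!/bin/bash
#SBATCH --account=<account>      # Your account
#SBATCH --time=0:10:00
#SBATCH -p smp
#SBATCH --ntasks=1
#SBATCH --job-name=array
#SBATCH --output=out_%x.%A_%a    # %A = array job ID, %a = array task index
#SBATCH --array=1-10             # start 10 array tasks with indices 1..10

# each array task processes its own input file
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```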
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --job-name=mpi
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=31
#SBATCH --job-name=mpi_partial_node
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

# The --cpu-bind=rank_ldom option distributes the tasks over the node's cores
# respecting the node's NUMA domains
srun --cpu-bind=rank_ldom xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p smp
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --job-name=openMP
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
# export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --job-name=hybrid
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
# export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p gpu
#SBATCH --ntasks=1
#SBATCH --gpus=a100:2         # allocate 2 (out of 4) A100 GPUs; to get 2 (out of 2) A40 GPUs use --gpus=a40:2
#SBATCH --job-name=gpu
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun your_code_that_runs_on_GPUs
```
...