Slurm is albedo's job scheduling system. It is used to submit jobs from the login nodes to the compute nodes.
...
*) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_filename-pattern
Job enforcements
We have implemented some enforcements to improve albedo's overall performance.
- Jobs requesting --partition=fat but only little memory are rejected.
- Jobs requesting fewer than 40 nodes are restricted to nodes connected to the very same InfiniBand switch (if this is feasible within 15 minutes).
Details about specific parameters
Account (-A)
Compute resources are attributed to (primary) sections and projects at AWI. Therefore it is mandatory to specify an account.
...
The Slurm accounts you may use are listed after login, with info.sh -s, or can be shown via

```
sacctmgr -s show user name=$USER format=user,account%-30
```
...
Identical compute nodes are combined in partitions. More information about the hardware specification of each node can be found in the System Overview.

| Partition | Nodes | Description |
|---|---|---|
| smp | prod-[001-200] | |
| mpp | prod-[001-120] | exclusive access to nodes, MaxNodes=240 |
| mppht | prod-[121-240] | like mpp but with hyperthreading (HT) enabled |
| fat | fat-00[1-2] | like smp but for jobs with extensive need of RAM; default RAM: 30000 MB/core |
| fatht | fat-00[3-4] | like fat but with HT |
| matlab | fat-00[3-4] | currently reserved for MATLAB users (as personal MATLAB licenses are node-bound). This might change later. |
| gpu | gpu-00[1-5] | |

To request a specific GPU node you can specify -w gpu-00x (or --nodelist=gpu-00x).
Quality of service (--qos)
Slurm's QOS is a way for us to influence a job's priority (Priority QOS_factor) and "cost" (UsageFactor) based on the job's size (we only take walltime into account here!). A higher priority means your job is scheduled before other jobs. We therefore created the different QOS listed below.
The default QOS is 30min; for a job with a walltime >30min you have to select and set an appropriate QOS in addition to your walltime! Note that long-running jobs (longer than 12 hours, up to 48 hours) "cost" more in terms of fairshare, meaning your priority for subsequent jobs will decrease.
To facilitate development and testing, we have reserved 20 nodes during working hours exclusively for jobs with QOS=30min.
| QOS | max. walltime | max. Nodes/User | UsageFactor | Priority QOS_factor | Notes |
|---|---|---|---|---|---|
| 30min | 00:30 | - | 1 | 1 | default |
| 12h | 12:00 | 120 | 1 | 0 | |
| 48h | 48:00 | 80 | 2 | 0 | |
| 1wk | 7-00:00:00 (168h) | 1 | 10 | 0 | only available for users upon request; whenever possible try to adapt your workflow to allow for shorter walltimes! |
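For example, a job that needs ten hours of walltime fits the 12h QOS; both the QOS and the walltime have to be requested (the values below are illustrative):

```shell
#SBATCH --qos=12h
#SBATCH --time=10:00:00
```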
Job Scheduling
Priority
Jobs on albedo are scheduled based on a priority that is computed by Slurm depending on multiple factors (https://slurm.schedmd.com/priority_multifactor.html).
The higher the priority, the sooner your job begins. (In principle – the backfill scheduling plugin helps making best use of available resources by filling up resources that are reserved (and thus idle) for large higher priority jobs with small (lower priority) jobs.)
At AWI, only few of the possible factors are taken into account:
```
Job_priority = (PriorityWeightAge) * (age_factor)
             + (PriorityWeightFairshare) * (fair-share_factor)
             + (PriorityWeightQOS) * (QOS_factor)
             - nice_factor
```
The weights in this formula are set to balance the different factors and may become subject to tuning.
The current values can be checked by running
...
The factors are numbers in the range from 0 to 1, except for the nice_factor (default: zero), which can be set by the user via --nice=... to lower a job's priority.
They are briefly explained in the following.
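As a toy numeric illustration of how the weighted sum combines the factors (all weights and factor values below are made up, not albedo's actual configuration):

```python
# Toy model of Slurm's multifactor priority: a weighted sum of factors
# in [0, 1], minus the user-set nice value. All numbers are invented.
def job_priority(weights, factors, nice=0):
    return sum(weights[name] * factors[name] for name in weights) - nice

weights = {"age": 1000, "fairshare": 10000, "qos": 5000}
factors = {"age": 0.5, "fairshare": 0.2, "qos": 1.0}  # e.g. QOS 30min -> 1

print(job_priority(weights, factors))            # 7500.0
print(job_priority(weights, factors, nice=100))  # 7400.0
```

A job whose user has consumed little recent compute time gets a fairshare factor close to 1 and thus, with these weights, a much higher priority.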
...
You can check the recent usage of albedo with this command:

```
sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 week ago") Format=used,login,account
```

FairShare
The fairshare factor is the most important factor here, but also the most difficult to understand. It is calculated using the "classic" fairshare algorithm of Slurm (https://slurm.schedmd.com/classic_fair_share.html), which computes the fairshare for each user based on the recent usage of the system.
Note that the usage of your associated account is *not* taken into account here, as was the case on ollie!
Usage is basically "CPU seconds", but weighted using the UsageFactor depending on the QOS used (see section QOS). Furthermore, the usage taken into account here decays with time (with a half-life of 7 days).
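The decay can be sketched as plain exponential decay with a 7-day half-life (a simplified model of the mechanism described above, not Slurm's exact bookkeeping):

```python
# Usage counted for fairshare decays with a half-life of 7 days:
# after 7 days only half of the original "CPU seconds" still count.
def decayed_usage(cpu_seconds, days_ago, half_life_days=7.0):
    return cpu_seconds * 0.5 ** (days_ago / half_life_days)

print(decayed_usage(1000.0, 0))    # 1000.0
print(decayed_usage(1000.0, 7))    # 500.0
print(decayed_usage(1000.0, 14))   # 250.0
```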
Fairshare is then calculated by
...
- sinfo shows existing queues. For example, to check how many nodes are available in a given partition (mpp, fat, gpu, ...): sinfo -p <partition_name>
- scontrol show job <JobID> shows information about a specific job
- sstat <JobID> shows resources used by a specific job
- squeue shows information about queues and used nodes
- smap curses-graphic of queues and nodes
- sbatch <script> submits a batch job
- salloc <resources> requests access to compute nodes for interactive use
- scancel <JobID> cancels a batch job
- srun <resources> <executable> starts a (parallel) code
- sshare and sprio give information on fair share value and job priority
...
- sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 month ago") Format=used,login,account | head -20 shows the top usage users during the last month
Do's & Don'ts
- Do not use srun for simple non-parallel jobs like cp, ln, rm, cat, g[un]zip.
- Do not write loops in your Slurm script to start several instances of similar jobs → see Job arrays below.
- Make use of parallel srun [un]pigz instead of g[un]zip if you have already allocated more than one CPU.
- Do not allocate costly resources (like fat/gpu nodes) if you do not need them. Check the CPU/memory efficiency of your jobs with info.sh -S.
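As a sketch of the job-array point above (the script name is hypothetical), one array submission replaces a whole loop of sbatch calls:

```shell
# Don't: submit 100 separate jobs in a loop
# for i in $(seq 1 100); do sbatch job.sh "$i"; done

# Do: submit a single array job (at most 10 tasks running at once)
sbatch --array=1-100%10 job.sh
```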
Example Scripts
Job arrays
Job arrays in Slurm are an easy way to submit multiple similar jobs (e.g. executing the same script with multiple input data). See here for further details.
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --partition=smp
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
# run 100 tasks, but only run 10 at a time
#SBATCH --array=1-100%10
#SBATCH --output=result_%A_%a.out   # gives result_<jobID>_<taskID>.out

echo "SLURM_JOBID: $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"

# Here we "translate" the $SLURM_ARRAY_TASK_ID (which takes values from 1-100)
# into an input file that we want to analyze.
# Suppose 'input_files.txt' is a text file with 100 lines, each containing
# the respective input file.
INPUT_LIST=input_files.txt

# Read the (SLURM_ARRAY_TASK_ID)th input file
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" < "${INPUT_LIST}")

srun my_executable "$INPUT_FILE"
```
Info: How you “translate” your task ID into the srun command line is up to you. You could, for example, also have different scripts that you select in some way and execute.
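For instance, selecting a different (hypothetical) script per task ID might look like this:

```shell
#!/bin/bash
# Sketch: pick a script based on the array task ID (the script names are
# hypothetical). Inside a real array job Slurm sets SLURM_ARRAY_TASK_ID;
# the default here only makes the sketch runnable outside Slurm.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-2}

case "$SLURM_ARRAY_TASK_ID" in
    1) TASK_SCRIPT=preprocess.sh ;;
    2) TASK_SCRIPT=analyze.sh ;;
    *) TASK_SCRIPT=postprocess.sh ;;
esac

echo "task $SLURM_ARRAY_TASK_ID runs $TASK_SCRIPT"
# in a real job you would now execute it, e.g.: srun ./"$TASK_SCRIPT"
```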
MPI
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --job-name=mpi
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun xthi | sort -g -k 4
```
...
MPI (partial node)
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=31   # use only part of the node's cores
#SBATCH --job-name=mpi_partial_node
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

# The --cpu-bind=rank_ldom distributes the tasks via the node's cores
# respecting the node's NUMA domains
srun --cpu-bind=rank_ldom xthi | sort -g -k 4
```
OpenMP
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p smp
#SBATCH -N 2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --job-name=openMP
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
Hybrid (MPI+OpenMP)
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --job-name=hybrid
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
...
Usage of GPU
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p gpu
#SBATCH --ntasks=1
#SBATCH --gpus=a100:2   # allocate 2 (out of 4) A100 GPUs; to get 2 (out of 2) A40 GPUs use --gpus=a40:2
#SBATCH --job-name=gpu
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
# load the modules your GPU code needs here

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

srun your_code_that_runs_on_GPUs
```