
Account

On Albedo it is necessary to specify an account, so that the usage of computing resources can be attributed to the groups and projects of AWI, which is needed for reporting.
This is done by setting

-A, --account=<account>


Possible Slurm accounts are listed after login. To enforce setting an account, no (valid) default account is set.
Users are, however, able to change this setting on their own:

sacctmgr modify user <user> set DefaultAccount=<account>


The account specified is only used for reporting purposes. No account gets privileged access to compute resources compared to others!
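If you want to list your accounts again later, or pass the account explicitly at submission time, something like the following should work (a sketch; the account name and job script are placeholders):

sacctmgr show associations user=$USER format=Account,User   # list the accounts you may use
sbatch --account=<account> jobscript.sh                     # charge this job to a specific account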

Partitions

Albedo’s compute nodes are divided into the partitions shown in the table below.

The smp partition is the default and is intended for jobs that need only single cores or a share of one node (up to 128 cores).

Nodes in the mpp partition are reserved exclusively for one job. This partition is used when one or more full nodes are needed.

The fat nodes can be selected via the fat partition. This partition resembles the smp partition but each node has much more memory.

Similarly, the GPU nodes can be accessed via the gpu partition. Note that the type and number of GPUs need to be specified. More information about the hardware specification of each node can be found in the System Overview (TODO: Link).


Partition | Nodes          | Description
----------|----------------|------------------------------------------------------------
smp       | prod-[001-240] | default partition; MaxNodes=1 → MaxCores=128; jobs can share a node
mpp       | prod-[001-240] | exclusive access to nodes; MaxNodes=240
fat       | fat-00[1-4]    | MaxNodes=1; jobs can share a node
gpu       | gpu-00[1-2]    | MaxNodes=1; jobs can share a node; the type and number of GPUs must be specified with --gpus=<GpuType>:<GpuNumber> (gpu-001: 2x a40, gpu-002: 4x a100)
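For example, a job header requesting one of the A100 GPUs could look like the following sketch (the executable name is a placeholder):

#SBATCH --partition=gpu
#SBATCH --gpus=a100:1                # <GpuType>:<GpuNumber>, see table above
#SBATCH --ntasks=1

srun ./my_gpu_executable             # placeholder for your GPU program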


Quality of service (QOS)

By default, the QOS 30min is used. It has a maximum walltime of 30 minutes; jobs with this QOS get a higher priority and have access to a special Slurm reservation during working hours (TODO: add details when set up), to facilitate development and testing. For longer runs, another QOS (and walltime) has to be specified, see the table below. Note: long-running jobs (longer than 12 hours, up to 48 hours) “cost” more in terms of fairshare.

QOS   | max. walltime | UsageFactor | Priority QOS_factor
------|---------------|-------------|--------------------
30min | 0:30:00       | 1           | 50
12h   | 12:00:00      | 1           | 0
48h   | 48:00:00      | 2           | 0

A short note on the definitions:

UsageFactor: a float that is factored into a job’s TRES usage (e.g. RawUsage), where RawUsage = CPU seconds (#CPUs * seconds). Jobs using the 48h QOS are therefore twice as expensive when job priorities are calculated (see Scheduling (TODO: Link)).
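For example, a job that needs up to ten hours could request the 12h QOS in its job header (a minimal sketch):

#SBATCH --qos=12h
#SBATCH --time=10:00:00              # must stay within the 12 h walltime limit of this QOS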


Specifying job resources

Job resources are defined in the header of your job script (or as command-line arguments for sbatch or salloc). For a full list see https://slurm.schedmd.com/sbatch.html#SECTION_OPTIONS. Here is a list of the most common ones:

#SBATCH --account=<account>          # Your account
#SBATCH --partition=<partition>      # Slurm Partition; Default: smp
#SBATCH --time=<time>                # time limit for job; Default: 0:30:00
#SBATCH --qos=<QOS>                  # Slurm QOS; Default: 30min
#SBATCH --nodes=<#Nodes>             # Number of nodes
#SBATCH --ntasks=<#Tasks>            # Number of tasks (MPI) tasks to be launched
#SBATCH --mem=<memory>               # If more than the default memory is needed;
                                     # Default: <#Cores> * <mem per node>/<cores per node>
#SBATCH --ntasks-per-node=<ntasks>   # Number of tasks per node
#SBATCH --mail-user=<email address>  # Your mail address if you want to get notifications
#SBATCH --mail-type=<email type>     # Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --job-name=<jobname>         # Job name
#SBATCH --output=<filename_pattern>  # File where the standard output is written to (*)
#SBATCH --error=<filename_pattern>   # File where the error messages are written to (*)

 *) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E
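Putting the most common options together, a minimal serial job script could look like the following sketch (the account name and the executable are placeholders):

#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=smp
#SBATCH --qos=12h
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --job-name=example
#SBATCH --output=example_%j.out      # %j is replaced by the job ID

srun ./my_executable                 # placeholder for your program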


Scheduling

Priority

For job scheduling, Slurm assigns each job a priority, which is calculated based on several factors (Multifactor Priority Plugin). Jobs with higher priority run first. (In principle: the backfill scheduling plugin helps make the best use of the available resources by filling up resources that are reserved, and thus idle, for large high-priority jobs with small lower-priority jobs.)

On Albedo, Slurm is configured such that the priority is mainly influenced by the fairshare factor (which is based on the user’s recent use of resources; see Fairshare (TODO: Link)), while favoring short jobs (QOS 30min). With longer waiting time in the queue, a job’s priority increases. Job size, partitions, or associations are not directly taken into account.
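The resulting priorities and their individual factors can be inspected with Slurm’s sprio command, for example:

sprio -l             # long listing of the priority factors of all pending jobs
sprio -j <jobid>     # priority factors of a specific pending job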

Fairshare

On Albedo all users have the same share of resources, independent of the account used. … TODO…
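Your current fairshare values can be inspected with Slurm’s sshare command, for example:

sshare               # fairshare information for your own associations
sshare -a            # fairshare information for all users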


Accounting

TODO...
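Accounting data of your (finished) jobs can be queried with Slurm’s sacct command, for example:

sacct -u $USER -S <start_date>                                 # your jobs since a given date
sacct -j <jobid> --format=JobID,JobName,Account,Elapsed,State  # details of a single job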


Information of jobs and nodes

TODO...
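The usual Slurm commands can be used to get information about jobs and nodes, for example:

squeue -u $USER              # your jobs currently in the queue
scontrol show job <jobid>    # detailed information about a specific job
sinfo -p smp,mpp,fat,gpu     # state of the nodes in the listed partitions
scontrol show node <node>    # detailed information about a specific node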


Example Scripts

Job arrays

Job arrays in Slurm are an easy way to submit multiple similar jobs (e.g. executing the same script on multiple input files). See here for further details.

#!/bin/bash

#SBATCH --account=<account>          # Your account
#SBATCH --partition=smp
#SBATCH --time=0:10:00
#SBATCH --ntasks=1

# run 100 tasks, but only run 10 at a time
#SBATCH --array=1-100%10
#SBATCH --output=result_%A_%a.out    # gives result_<jobID>_<taskID>.out

echo "SLURM_JOBID:         $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID:  $SLURM_ARRAY_JOB_ID"

# Here we "translate" the $SLURM_ARRAY_TASK_ID (which takes values from 1-100)
# into an input file that we want to analyze.
# Suppose 'input_files.txt' is a text file with 100 lines, each containing
# the name of the respective input file.

INPUT_LIST=input_files.txt

# Read the (SLURM_ARRAY_TASK_ID)th line of the input list
INPUT_FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${INPUT_LIST}")

srun my_executable "${INPUT_FILE}"


How you “translate” your task ID into the concrete command that is executed is up to you. You could, for example, also select one of several different scripts based on the task ID and execute it.



MPI
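A minimal sketch of a pure MPI job on the mpp partition (the account and the executable are placeholders; adjust nodes and tasks to your application):

#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=mpp              # exclusive nodes, see partition table
#SBATCH --qos=12h
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128        # one MPI task per core

srun ./my_mpi_executable             # placeholder; srun launches the MPI tasks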



OpenMP
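A minimal sketch of a pure OpenMP (shared-memory) job on the smp partition (placeholders as above):

#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=smp
#SBATCH --qos=12h
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16           # number of OpenMP threads

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./my_openmp_executable   # placeholder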



Hybrid
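A minimal sketch of a hybrid MPI + OpenMP job; with 8 MPI tasks per node and 16 threads per task, all 128 cores of a node are used (placeholders as above):

#!/bin/bash

#SBATCH --account=<account>
#SBATCH --partition=mpp
#SBATCH --qos=12h
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8          # MPI tasks per node
#SBATCH --cpus-per-task=16           # OpenMP threads per MPI task

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./my_hybrid_executable   # placeholder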



