Slurm is albedo's job scheduling system. It is used to submit jobs from the login nodes to the compute nodes.
...
*) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
Job enforcements
We have implemented some enforcements to improve albedo's overall performance.
...
Partition | Nodes | Description
---|---|---
smp | prod-[001-200] |
mpp | prod-[001-200] |
fat | fat-00[1-2] |
matlab | fat-00[3-4] | currently reserved for Matlab users (as personal Matlab licenses are node-bound); this might change later
gpu | gpu-00[1-25] |
Quality of service (--qos)
...
To facilitate development and testing, we have reserved 20 nodes during working hours exclusively for jobs with QOS=30min.
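As a minimal sketch, a short test job could request this QOS like so (the account name, partition choice, and program name are placeholders; the requested time must stay within the QOS limit):

```shell
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --qos=30min           # short-job QOS for development and testing
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 1

srun ./my_test_program
```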
...
The factors are numbers in the range 0 to 1, except for the nice factor (default: zero), which users can set via --nice=... to lower their job's priority.
They are briefly explained in the following.
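As a sketch of how this can be used (the job script name is a placeholder), a job's priority can be lowered at submission time, and the resulting priority factors inspected afterwards:

```shell
# Submit with lowered priority (a higher nice value means lower priority)
sbatch --nice=100 my_job_script.sh

# Show the priority factors (including fairshare) of your pending jobs
sprio -u $USER
```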
...
You can check the recent usage of albedo with this command:

```bash
sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 week ago") Format=used,login,account
```
FairShare
The fairshare factor is the most important factor here, but also the most difficult one to understand. It is calculated using Slurm's "classic" fairshare algorithm (https://slurm.schedmd.com/classic_fair_share.html), which computes the fairshare for each user based on the recent usage of the system.
Note: the usage of your associated account is *not* taken into account here, as was the case on ollie!
Usage is basically "CPU seconds", weighted with the UsageFactor of the QOS used (see section QOS). Furthermore, the usage taken into account here decays over time, with a half-life of 7 days.
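Assuming the standard Slurm half-life decay, the effective usage entering the fairshare calculation can be sketched as follows, where $u_i$ denotes the raw (QOS-weighted) usage accrued a time $\Delta t_i$ ago:

$$
U_{\text{eff}} = \sum_i u_i \cdot 2^{-\Delta t_i / T_{1/2}}, \qquad T_{1/2} = 7\ \text{days}
$$

That is, usage from one week ago counts half as much as usage from today, usage from two weeks ago a quarter as much, and so on.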
Fairshare is then calculated by
...
- sinfo shows existing queues
  For example, to check how many nodes are available in a given partition (mpp, fat, gpu, ...): sinfo -p <partition_name>
- scontrol show job <JobID> shows information about a specific job
- sstat <JobID> shows resources used by a specific job
- squeue shows information about queues and used nodes
- smap curses-graphic of queues and nodes
- sbatch <script> submits a batch job
- salloc <resources> requests access to compute nodes for interactive use
- scancel <JobID> cancels a batch job
- srun <resources> <executable> starts a (parallel) code
- sshare and sprio give information on fair share value and job priority
- sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 month ago") Format=used,login,account | head -20 shows the top usage users during the last month
Do's & Don'ts
- Do not use srun for simple non-parallel jobs like cp, ln, rm, cat, g[un]zip
- Do not write loops in your slurm script to start several instances of similar jobs → see Job arrays below
- Use the parallel p[gu]igz (via srun) instead of g[un]zip if you have already allocated more than one CPU
- Do not allocate costly resources (like fat/gpu nodes) if you do not need them. Check the CPU/memory efficiency of your jobs with info.sh -S
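As a sketch (assuming pigz/unpigz are available on the compute nodes and you are inside an allocation where $SLURM_CPUS_PER_TASK is set; the file name is a placeholder), parallel compression could look like:

```shell
# Compress using all CPUs of the current allocation instead of a single core
srun pigz -p "$SLURM_CPUS_PER_TASK" large_file.nc

# Decompress in parallel
srun unpigz -p "$SLURM_CPUS_PER_TASK" large_file.nc.gz
```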
Example Scripts
Job arrays
...
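A minimal job-array sketch (the account, partition, and input file names are placeholders) might look like this; each array task receives its own index in SLURM_ARRAY_TASK_ID:

```shell
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p smp
#SBATCH --array=1-10          # start 10 similar tasks instead of a loop
#SBATCH --output=out_%A_%a    # %A = array job ID, %a = array task index

# Each task processes its own input file, selected via the array index
srun ./process_input input_${SLURM_ARRAY_TASK_ID}.dat
```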
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --job-name=mpi
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=31
#SBATCH --job-name=mpi_partial_node
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

# The --cpu-bind=rank_ldom distributes the tasks via the node's cores
# respecting the node's NUMA domains
srun --cpu-bind=rank_ldom xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p smp
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --job-name=openMP
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
# export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p mpp
#SBATCH -N 2
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --job-name=hybrid
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

module purge
module load xthi/1.0-intel-oneapi-mpi2021.6.0-oneapi2022.1.0 intel-oneapi-mpi
# module load xthi/1.0-openmpi4.1.3-gcc8.5.0 openmpi/4.1.3

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited
# export OMP_STACKSIZE=128M

# This binds each thread to one core
export OMP_PROC_BIND=TRUE

# OpenMP and srun both need to know the number of CPUs per task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun xthi | sort -g -k 4
```
...
```bash
#!/bin/bash
#SBATCH --account=<account>   # Your account
#SBATCH --time=0:10:00
#SBATCH -p gpu
#SBATCH --ntasks=1
#SBATCH --gpus=a100:2   # allocate 2 (out of 4) A100 GPUs; to get 2 (out of 2) A40 GPUs use --gpus=a40:2
#SBATCH --job-name=gpu
#SBATCH --output=out_%x.%j
# disable hyperthreading
#SBATCH --hint=nomultithread

## Uncomment the following line to enlarge the stacksize if needed,
## e.g., if your code crashes with a spurious segmentation fault.
# ulimit -s unlimited

# To be on the safe side, we emphasize that it is pure MPI, no OpenMP threads
export OMP_NUM_THREADS=1

srun your_code_that_runs_on_GPUs
```
...