...

 *) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
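For instance, the %j pattern in an output filename is replaced by the job ID (a one-line illustration; %j is a standard sbatch pattern, the filename itself is a placeholder):

    #SBATCH --output=slurm-%j.out    # %j expands to the numeric job ID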

Job enforcements

We enforce some limits on jobs to improve albedo's overall performance.

...

smp (nodes: prod-[001-200])

  • default partition
  • MaxNodes=1 → MaxCores=128
  • default RAM: 1900 MB/core
  • jobs can share a node

mpp (nodes: prod-[001-200])

  • exclusive access to nodes
  • MaxNodes=240

fat (nodes: fat-00[1-2])

  • like smp, but for jobs with an extensive need of RAM
  • default RAM: 30000 MB/core

matlab (nodes: fat-00[3-4])

  • currently reserved for matlab users (as personal matlab licenses are node-bound); this might change later

Note: To prevent single users from allocating too many resources on these dedicated nodes, we limit the resources per user in this partition to 32 CPUs and 1 TB RAM. Please get in touch with us if these limitations conflict with your use case!

gpu (nodes: gpu-00[1-5])

  • like smp, but ...
  • ...
  • the 5 gpu nodes each contain a different number and type of GPU:
    • gpu-001: 2x a40
    • gpu-00[2-5]: 4x a100
  • you have to specify the type and number of desired GPUs via
    --gpus=<GpuType>:<GpuQuantity>
    (otherwise no GPU will be allocated for you).
    Example for requesting 2 a40 GPUs with salloc:

    salloc --partition=gpu --gpus=a40:2
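To tie the partition settings together, a minimal batch script for the default smp partition could look like the following sketch (the time limit, output name, and program are placeholders; the memory request simply mirrors the smp default listed above):

    #!/bin/bash
    #SBATCH --partition=smp          # default partition, jobs may share a node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=1900M      # smp default RAM per core
    #SBATCH --time=01:00:00          # placeholder walltime
    #SBATCH --output=myjob_%j.out    # %j expands to the job ID

    ./my_program                     # placeholder executable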


Quality of service (--qos)

...

To facilitate development and testing, we have reserved 20 nodes during working hours exclusively for jobs with QOS=30min.
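For example, a short test run could be submitted under this QOS like so (the script name is a placeholder, and the explicit time limit assumes the 30min QOS caps walltime at 30 minutes):

    sbatch --qos=30min --time=00:30:00 my_test.sh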

...

  • sinfo shows existing queues.
    For example, to check how many nodes are available in a given partition (mpp, fat, gpu, ...):

    sinfo -p <partition_name>

  • scontrol show job <JobID> shows information about a specific job
  • sstat <JobID> shows resources used by a specific job
  • squeue shows information about queues and used nodes
  • smap shows a curses graphic of queues and nodes
  • sbatch <script> submits a batch job
  • salloc <resources> requests access to compute nodes for interactive use
  • scancel <JobID> cancels a batch job
  • srun <resources> <executable> starts a (parallel) code
  • sshare and sprio give information on fair-share value and job priority
  • sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 month ago") Format=used,login,account | head -20 shows the top usage users during the last month
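Put together, a typical round trip with these commands might look like this (the script name and the job ID 123456 are placeholders):

    sbatch my_job.sh            # prints the new JobID on submission
    squeue -u $USER             # watch your jobs in the queue
    sstat 123456                # resources used by the running job
    scontrol show job 123456    # full details for the job
    scancel 123456              # cancel it if something went wrong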

Do's & Don'ts

  • Do not use srun for simple non-parallel jobs like cp, ln, rm, cat, g[un]zip
  • Do not write loops in your Slurm script to start several instances of similar jobs → see Job arrays below
  • Make use of parallel pigz/unpigz via srun instead of g[un]zip if you have allocated more than one CPU already (see the sketch below)
  • Do not allocate costly resources (like fat/gpu nodes) if you do not need them. Check the CPU/memory efficiency of your jobs with info.sh -S
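A small sketch of the pigz hint above, run inside a job that already has several CPUs allocated (the filename is a placeholder):

    srun pigz -p ${SLURM_CPUS_PER_TASK} big_result_file    # one compression thread per allocated CPU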

...