...

 *) For filename patterns see: https://slurm.schedmd.com/sbatch.html#SECTION_FILENAME-PATTERN
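For instance, the %j pattern in an output filename is replaced by the job ID (a one-line illustration; %j is a standard sbatch pattern, the filename itself is a placeholder):

    #SBATCH --output=slurm-%j.out    # %j expands to the numeric job ID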

Job enforcements

We enforce some limits on jobs to improve albedo's overall performance.

...

smp (nodes: prod-[001-200])

  • default partition
  • MaxNodes=1 → MaxCores=128
  • default RAM: 1900 MB/core
  • jobs can share a node

mpp (nodes: prod-[001-200])

  • exclusive access to nodes
  • MaxNodes=240

fat (nodes: fat-00[1-2])

  • like smp, but for jobs with an extensive need of RAM
  • default RAM: 30000 MB/core

matlab (nodes: fat-00[3-4])

  • currently reserved for matlab users (as personal matlab licenses are node-bound); this might change later

Note: To prevent single users from allocating too many resources on these dedicated nodes, we limit the resources per user in this partition to 32 CPUs and 1 TB RAM. Please get in touch with us if these limitations conflict with your use case!

gpu (nodes: gpu-00[1-5])

  • like smp, but ...
  • ...
  • the 5 gpu nodes each contain a different number and type of GPU:
    • gpu-001: 2x a40
    • gpu-00[2-5]: 4x a100
  • you have to specify the type and number of desired GPUs via
    --gpus=<GpuType>:<GpuQuantity>
    (otherwise no GPU will be allocated for you).
    Example for requesting 2 a40 GPUs with salloc:

    salloc --partition=gpu --gpus=a40:2
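To tie the partition settings together, a minimal batch script for the default smp partition could look like the following sketch (the time limit, output name, and program are placeholders; the memory request simply mirrors the smp default listed above):

    #!/bin/bash
    #SBATCH --partition=smp          # default partition, jobs may share a node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=1900M      # smp default RAM per core
    #SBATCH --time=01:00:00          # placeholder walltime
    #SBATCH --output=myjob_%j.out    # %j expands to the job ID

    ./my_program                     # placeholder executable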


Quality of service (--qos)

...

To facilitate development and testing, we have reserved 20 nodes during working hours exclusively for jobs with QOS=30min.
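For example, a short test run could be submitted under this QOS like so (the script name is a placeholder, and the explicit time limit assumes the 30min QOS caps walltime at 30 minutes):

    sbatch --qos=30min --time=00:30:00 my_test.sh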

...

  • sinfo shows existing queues.
    For example, to check how many nodes are available in a given partition (mpp, fat, gpu, ...):

    sinfo -p <partition_name>

  • scontrol show job <JobID> shows information about a specific job
  • sstat <JobID> shows resources used by a specific job
  • squeue shows information about queues and used nodes
  • smap shows a curses graphic of queues and nodes
  • sbatch <script> submits a batch job
  • salloc <resources> requests access to compute nodes for interactive use
  • scancel <JobID> cancels a batch job
  • srun <resources> <executable> starts a (parallel) code
  • sshare and sprio give information on fair-share value and job priority
  • sreport -t Percent cluster UserUtilizationByAccount Start=$(date +%FT%T -d "1 month ago") Format=used,login,account | head -20 shows the top usage users during the last month
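Put together, a typical round trip with these commands might look like this (the script name and the job ID 123456 are placeholders):

    sbatch my_job.sh            # prints the new JobID on submission
    squeue -u $USER             # watch your jobs in the queue
    sstat 123456                # resources used by the running job
    scontrol show job 123456    # full details for the job
    scancel 123456              # cancel it if something went wrong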

Do's & Don'ts

  • Do not use srun for simple non-parallel jobs like cp, ln, rm, cat, g[un]zip
  • Do not write loops in your Slurm script to start several instances of similar jobs → see Job arrays below
  • Make use of parallel pigz/unpigz via srun instead of g[un]zip if you have allocated more than one CPU already (see the sketch below)
  • Do not allocate costly resources (like fat/gpu nodes) if you do not need them. Check the CPU/memory efficiency of your jobs with info.sh -S
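A small sketch of the pigz hint above, run inside a job that already has several CPUs allocated (the filename is a placeholder):

    srun pigz -p ${SLURM_CPUS_PER_TASK} big_result_file    # one compression thread per allocated CPU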

...