Compute nodes & Slurm

  • To work interactively on a compute node use salloc. All options described below (more CPUs, RAM, time, partition, QOS, ...) can be used with salloc as well.
  • You can ssh to a node where a job of yours is running, if (and only if) you have a valid ssh key pair (e.g., on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1). A short example is sketched below.

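A minimal sketch of this interactive workflow. The resource values and the node name are placeholders; use your own account and the node name reported for your job:

    # One-time setup on a login node: create an ssh key pair and authorize it,
    # so that you can later ssh into nodes where your jobs are running.
    ssh-keygen -t ed25519
    ssh-copy-id albedo1

    # Request an interactive allocation (example: 1 node, 4 cores, 2 hours, smp partition).
    salloc --nodes=1 --cpus-per-task=4 --time=02:00:00 --partition=smp --account=clidyn.clidyn

    # While a job of yours is running you can also log in to its node directly:
    squeue -u $USER        # shows which node(s) your job is running on
    ssh <nodename>         # replace <nodename> with the node shown by squeue
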
  • To submit a job from the login nodes to the compute nodes you need slurm (job scheduler, batch queueing system and workload manager)
  • A submitted job needs the following information/resources:

    What                             | Option                               | Default                     | Comment
    ---------------------------------|--------------------------------------|-----------------------------|--------
    Name                             | -J or --job-name=                    |                             |
    Account                          | -A or --account=                     | primary section, e.g. clidyn.clidyn or computing.computing | New on albedo (was not necessary on ollie). You can add a project (defined in eResources; you must be a member of the project) with -A <section>.<project>, e.g. clidyn.clidyn, computing.tsunami, clidyn.p_fesom. This is helpful for reporting.
    Number of nodes                  | -N or --nodes=                       | 1                           |
    Number of (MPI) tasks (per node) | -n or --ntasks= / --ntasks-per-node= | 1                           | Needed for MPI
    Number of cores/threads per task | -c or --cpus-per-task=               | 1                           | Needed for OpenMP. If -n N and -c C are given, you get N x C cores.
    Memory / RAM                     | --mem=                               | ntasks x nthreads x 1.6 GB  | Only needed for smp jobs; mpp jobs get whole nodes (all cores and memory).
    Maximum walltime                 | -t or --time=                        | 01:00                       |
    Partition                        | -p or --partition=                   | smp                         |
    QOS (quality of service)         | -q or --qos=                         | normal                      |
  • Please take a look at our example scripts (from ollie): SLURM Example Scripts. A minimal batch script is sketched below.
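
A minimal sketch of a batch script using the options above. The job name, account, resource values and the executable are placeholders; adjust them to your project and code:

    #!/bin/bash
    #SBATCH --job-name=my_job           # placeholder name
    #SBATCH --account=clidyn.clidyn     # use your own section/project
    #SBATCH --partition=mpp             # see the partition table below
    #SBATCH --qos=normal
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=36        # MPI tasks per node (example value)
    #SBATCH --cpus-per-task=1
    #SBATCH --time=08:00:00             # maximum walltime (must fit the QOS limit)

    # srun starts the (MPI-)parallel program on the allocated resources;
    # ./my_model is a placeholder for your executable.
    srun ./my_model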

Partitions

Partition | Nodes | Purpose
----------|-------|--------
smp       | 240   | For OpenMP or serial jobs with up to 36 (OpenMP) cores. Jobs may share nodes (until all cores are occupied).
mpp       | 240   | For MPI-parallel jobs, typically >= 36 cores. Jobs get nodes exclusively.
fat       | 4     | For OpenMP or serial jobs that need more than 256 GiB RAM.
gpu       | 1     | For jobs that can take advantage of an Nvidia A100/80.
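
As an illustration, a serial or OpenMP job that needs more than 256 GiB of RAM could request the fat partition roughly as follows (the memory value, thread count and executable are placeholders):

    #!/bin/bash
    #SBATCH --job-name=big_memory_job   # placeholder name
    #SBATCH --account=clidyn.clidyn     # use your own section/project
    #SBATCH --partition=fat             # large-memory nodes
    #SBATCH --cpus-per-task=8           # OpenMP threads (example value)
    #SBATCH --mem=500G                  # more than 256 GiB, hence the fat partition
    #SBATCH --time=06:00:00

    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    ./my_openmp_program                 # placeholder executable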

QOS

QOS    | Max. time | Max. nodes (job / user / total) | Fairshare usage factor | Comment
-------|-----------|---------------------------------|------------------------|--------
short  | 00:30:00  | 128 / 312 / 312                 | 1                      | High priority, jobs in this QOS run first
normal | 12:00:00  | 128 / 312 / 312                 | 1                      | Default
large  | 96:00:00  | 8 / 16 / 64                     | 2                      |
xlarge | 400:00:00 | 1 / 2 / 8                       | 10                     | Only on request. Use at your own risk; consider short jobs with restarts if possible.
knurd  | -         | -                               | 1                      | For admins only
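
A job that needs more than the default 12 hours of walltime has to request a different QOS. A sketch (account, resource values and executable are placeholders):

    #!/bin/bash
    #SBATCH --job-name=long_run         # placeholder name
    #SBATCH --account=clidyn.clidyn     # use your own section/project
    #SBATCH --partition=mpp
    #SBATCH --qos=large                 # allows up to 96 h, at a higher fairshare usage factor
    #SBATCH --nodes=4                   # within the 8-node-per-job limit of qos=large
    #SBATCH --ntasks-per-node=36        # example value
    #SBATCH --time=48:00:00             # must stay within the QOS time limit

    srun ./my_model                     # placeholder executable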


Useful SLURM commands

  • sinfo shows existing queues
  • scontrol show job <JobID> shows information about a specific job
  • sstat <JobID> shows resources used by a specific job
  • squeue shows information about queues and used nodes
  • smap shows a curses graphic of queues and nodes
  • sbatch <script> submits a batch job (a short example session is sketched below)
  • salloc <resources> requests access to compute nodes for interactive use
  • scancel <JobID> cancels a batch job
  • srun <resources> <executable> starts a (parallel) code
  • sshare and sprio give information on fair-share values and job priority
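
A short example session combining some of these commands. The job ID 123456 and the script name are placeholders:

    sbatch my_job.sh            # submit the batch script; prints the job ID
    squeue -u $USER             # list your pending and running jobs
    scontrol show job 123456    # detailed information about one job
    sstat -j 123456             # resource usage of a running job
    scancel 123456              # cancel the job if something went wrong
    sshare                      # your current fair-share standing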