Support

You can open a support ticket at helpdesk.awi.de or write an email to hpc@awi.de. Please do not send personal emails to individual admins!

Storage

Local user storage

  • The local storage is a parallel GxFS storage appliance from NEC, based on GPFS (https://en.wikipedia.org/wiki/GPFS).
  • All nodes are connected via a 100 Gb Mellanox/InfiniBand network.

Personal directories

  /albedo/home/$USER
      Quota (soft):  100 GB
      Quota (hard):  100 GB
      Delete:        90 days after user account expired
      Security:      snapshots for 6 months
      Owner:         $USER:hpc_user
      Permission:    2700 → rwx--S---
      Focus:         many small files

  /albedo/work/user/$USER
      Quota (soft):  3 TB
      Quota (hard):  15 TB for 60 days
      Delete:        90 days after user account expired
      Security:      --
      Owner:         $USER:hpc_user
      Permission:    2700 → rwx--S---
      Focus:         large files, large bandwidth

  /albedo/scratch/user/$USER
      Quota (soft):  50 TB
      Quota (hard):  50 TB
      Delete:        all data older than 90 days
      Security:      --
      Owner:         $USER:hpc_user
      Permission:    2700 → rwx--S---
      Focus:         large files, large bandwidth

Project directories

  /albedo/work/projects/$PROJECT
      Quota (soft):  variable, 30 €/TB/yr
      Quota (hard):  2x soft quota for 60 days
      Delete:        90 days after project expired
      Security:      snapshots for 6 months
      Owner:         $OWNER:$PROJECT
      Permission:    2770 → rwxrws---
      Focus:         large files, large bandwidth

  /albedo/scratch/projects/$PROJECT
      Quota (soft):  variable, 10 €/TB/yr
      Quota (hard):  2x soft quota for 60 days
      Delete:        all data older than 90 days
      Security:      --
      Owner:         $OWNER:$PROJECT
      Permission:    2770 → rwxrws---
      Focus:         large files, large bandwidth

Burst buffer

  /albedo/burst
      Quota (soft):  --
      Quota (hard):  --
      Delete:        after 10 days
      Security:      --
      Owner:         root:root
      Permission:    1777 → rwxrwxrwt
      Focus:         low latency, huge bandwidth
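
A small shell sketch of how these file systems are typically combined; the directory name exp01 is only an example:

    # Keep source code and configuration in HOME, put large temporary output on
    # SCRATCH (cleaned automatically after 90 days) and results worth keeping on WORK.
    mkdir -p /albedo/scratch/user/$USER/exp01 /albedo/work/user/$USER/exp01
    du -sh /albedo/work/user/$USER    # check how much of the quota is currently used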

Remote user storage (/isibhv)

  • You can access your online space on the Isilon in Bremerhaven (see https://spaces.awi.de/x/a13-Eg for more information) via the NFS mount points listed below; a copy example follows after this list.
    /isibhv/projects
    /isibhv/projects-noreplica
    /isibhv/netscratch
    /isibhv/platforms
    /isibhv/home
  • albedo is connected to the AWI backbone (including the Isilon and the HSM) via four 100 Gb Ethernet interfaces.
    Each individual albedo node has a 10 Gb interface.
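
For example, input data can be copied from the Isilon project space to the local work file system with rsync; the project name myproject is a placeholder:

    # Each albedo node reaches the Isilon through its 10 Gb interface (see above).
    rsync -av /isibhv/projects/myproject/input/ /albedo/work/user/$USER/input/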

Compute nodes & Slurm

  • To work interactively on a compute node use salloc. You can use all options (more CPUs, RAM, time, partition, QOS, ...) described below; see the interactive example after this list.
  • To submit a job from the login nodes to the compute nodes you use Slurm (the job scheduler, batch queueing system and workload manager).
  • A submitted job has/needs the following information/resources (a complete batch script sketch follows after this list):

    Job name
        Option:  -J or --job-name=
    Account
        Option:  -A or --account=
        Default: your primary section, e.g. clidyn.clidyn or computing.computing
        Comment: New on albedo (this was not necessary on ollie). You can additionally give a
                 project (defined in eResources; you must be a member of that project) with
                 -A <section>.<project>, which is helpful for reporting, e.g. clidyn.clidyn,
                 computing.tsunami, clidyn.fesom.
    Number of nodes
        Option:  -N or --nodes=
        Default: 1
    Number of (MPI) tasks (per node)
        Option:  -n or --ntasks= / --ntasks-per-node=
        Default: 1
        Comment: Needed for MPI.
    Number of cores/threads per task
        Option:  -c or --cpus-per-task=
        Default: 1
        Comment: Needed for OpenMP. If -n N and -c C are given, you get N x C cores.
    Memory/RAM
        Option:  --mem=
        Default: ntasks x nthreads x 1.6 GB
        Comment: Only needed for smp jobs; mpp jobs get whole nodes (all cores and memory).
    Maximum walltime
        Option:  -t or --time=
        Default: 01:00
    Partition
        Option:  -p or --partition=
        Default: smp
    QOS (quality of service)
        Option:  -q or --qos=
        Default: normal
  • Please take a look at our example scripts (from ollie): SLURM Example Scripts
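
To illustrate how the options above fit together, here is a minimal batch script sketch. The account clidyn.clidyn, the resource values and the executable name my_mpi_program are examples/placeholders only and must be adapted:

    #!/bin/bash
    #SBATCH --job-name=my_job            # -J
    #SBATCH --account=clidyn.clidyn      # -A, your primary section (optionally <section>.<project>)
    #SBATCH --partition=mpp              # -p
    #SBATCH --qos=normal                 # -q
    #SBATCH --nodes=2                    # -N (example value)
    #SBATCH --ntasks-per-node=36         # MPI tasks per node (example value)
    #SBATCH --time=01:00:00              # -t, maximum walltime

    srun ./my_mpi_program                # placeholder executable, started as an MPI job

Submit the script with sbatch <script>; squeue then shows it in the queue (see the command list below).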
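
For interactive work with salloc, a minimal sketch (resource values are examples only):

    # Request 4 cores on one node for 30 minutes in the smp partition ...
    salloc --partition=smp --nodes=1 --ntasks=1 --cpus-per-task=4 --time=00:30:00
    # ... then start your program on the allocated node and release the allocation when done.
    srun ./my_program                    # placeholder executable
    exit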

Partitions

Partition   Nodes   Purpose
smp         240     For OpenMP or serial jobs with up to 36 (OpenMP) cores. Jobs may share nodes (until all cores are occupied).
mpp         240     For MPI-parallel jobs, typically >= 36 cores. Jobs get their nodes exclusively.
fat         4       For OpenMP or serial jobs that need more than 256 GiB RAM.
gpu         1       For jobs that can take advantage of an Nvidia A100/80.
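
The partition can be set in the batch script (#SBATCH --partition=...) or on the command line. A hedged sketch with the placeholder script job.sh and example resources:

    sbatch --partition=smp --ntasks=1 --cpus-per-task=8 job.sh   # shared-node OpenMP/serial job
    sbatch --partition=mpp --nodes=4                    job.sh   # MPI job, nodes are exclusive
    sbatch --partition=fat --mem=500G                   job.sh   # needs more than 256 GiB RAM
    sbatch --partition=gpu                              job.sh   # wants the Nvidia A100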

QOS

QOS      max. time   max. nodes (job / user / total)   Fairshare usage factor   Comment
short    00:30:00    128 / 312 / 312                   1                        High priority, jobs in this qos run first
normal   12:00:00    128 / 312 / 312                   1                        Default
large    96:00:00    8 / 16 / 64                       2
xlarge   400:00:00   1 / 2 / 8                         10                       Only on request. Use at own risk, consider short jobs with restarts if possible
knurd    --                                            1                        For admins only
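
A QOS is selected per job, for example (job.sh is again a placeholder script):

    sbatch --qos=short --time=00:20:00 job.sh   # short test job, high priority, max. 30 minutes
    sbatch --qos=large --time=72:00:00 job.sh   # long run up to 96 h, fairshare usage factor 2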


Useful SLURM commands

  • sinfo shows existing partitions (queues)
  • scontrol show job <JobID> shows information about a specific job
  • sstat <JobID> shows resources used by a specific job
  • squeue shows information about queues and used nodes
  • smap shows a curses-based graphic of queues and nodes
  • sbatch <script> submits a batch job
  • salloc <resources> requests access to compute nodes for interactive use
  • scancel <JobID> cancels a batch job
  • srun <resources> <executable> starts a (parallel) code
  • sshare and sprio give information on fair share values and job priorities
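
A typical sequence using these commands might look like this; the JobID 123456 and the script name job.sh are placeholders:

    sbatch job.sh              # submit the batch script, Slurm prints the JobID
    squeue -u $USER            # show only your own pending and running jobs
    scontrol show job 123456   # detailed information about that job
    scancel 123456             # cancel it if something went wrong
    sshare -u $USER            # check your fair share afterwards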