Login

  • You have to be a member of HPC_user (membership can be applied for on id.awi.de).
  • The login nodes can be accessed via
    ssh albedo0.dmawi.de and ssh albedo1.dmawi.de
    → If you are not familiar with ssh and/or bash, you should start here for a basic introduction.
  • Please do not use the login nodes for computing; use the compute nodes instead (and take a look at our hardware and slurm documentation).
  • For security reasons, HPC resources cannot be accessed remotely (VPN is possible).
  • By using albedo you accept our HPC data policy.
  • You can ssh to a node where a job of yours is running if (and only if) you have a valid ssh key pair (e.g. on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1).
    Make sure your key is secured with a passphrase!
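
To check whether an existing key is actually protected, a sketch like the following may help. It relies on ssh-keygen -y (print the public key) combined with an empty passphrase passed via -P, which should only succeed for an unencrypted private key. The function name key_has_passphrase and the key path in the usage line are illustrative assumptions, not part of the albedo setup.

```shell
# key_has_passphrase FILE: succeeds if the private key FILE is passphrase-protected.
# (Hypothetical helper name; not an albedo tool.)
key_has_passphrase() {
    [ -f "$1" ] || { echo "no such key: $1" >&2; return 2; }
    # Deriving the public key with an empty passphrase only works for unprotected keys.
    if ssh-keygen -y -P "" -f "$1" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage, assuming the default ed25519 key location:
# key_has_passphrase ~/.ssh/id_ed25519 && echo "protected" || echo "NOT protected"
```

If the key turns out to be unprotected, ssh-keygen -p -f <keyfile> lets you set a passphrase without generating a new key.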

Copy data from ollie

  1. Log in to ollie
  2. on ollie:

    rsync -Pauv --no-g /work/ollie/$USER/your-data albedo0:/albedo/work/projects/$YOURPROJECT/

Short story: The other way round (running rsync on albedo instead of on ollie) does not work because of a specific route set on ollie.
Long story: ollie has two Ethernet cards: a 10 Gb card for the "normal" AWI network (172.18.20.0/24) and a later-added 40 Gb card for a high-speed connection to the Isilon and to albedo in the newer 10.100.0.0/16 network. If you access albedo from ollie, the route setting on ollie ensures that you automatically use the fast 40 Gb card in the 10.100.0.0/16 network. :-) However, if you access ollie from albedo, you reach ollie via the default route on ollie's 10 Gb interface (172.18.20.0/24), but ollie replies via the 40 Gb card in the 10.100.0.0/16 network. In other words, ollie's reply to albedo never gets through.
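
To see which interface ollie would use to reach albedo, a check along these lines can be run on ollie. This is only a sketch: it assumes that getent can resolve the host name albedo0 and that the ip tool from iproute2 is installed there; the function name show_route_to is made up for illustration.

```shell
# show_route_to HOST: print the route (and thus the outgoing interface) used to reach HOST.
# (Hypothetical helper name.)
show_route_to() {
    local addr
    # Resolve the host name to its first address.
    addr=$(getent hosts "$1" | awk '{print $1; exit}')
    [ -n "$addr" ] || { echo "cannot resolve $1" >&2; return 1; }
    ip route get "$addr"
}

# On ollie:
# show_route_to albedo0
```

The dev and src fields of the output name the interface and source address chosen by the routing table; according to the explanation above, they should belong to the 40 Gb card in the 10.100.0.0/16 network.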

Software

  • Albedo is running the operating system Rocky Linux release 8.6 (Green Obsidian).
  • Slurm 22.05 is used as the job scheduling system. Important details on its configuration on Albedo are given here: Slurm.
  • Details on the user software can be found here: Software.

Environment modules

On albedo we use environment modules to load/unload specific versions of software. Loading a module modifies environment variables so that, for example, the shell knows where to look for binaries.

You get an overview of all software installed by typing

module avail

To load and unload a module use

# load
module load <module>

# unload
module unload <loaded module>


Sometimes it might be useful to unload all loaded modules at once. This is done with

module purge
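
In job scripts it can help to combine these commands so that every run starts from the same module set. The following is a minimal sketch; the function name load_clean_env is hypothetical, and the module names you pass must be taken from the output of module avail.

```shell
# load_clean_env MODULE...: unload everything, then load exactly the given modules.
# (Hypothetical helper name; module names are whatever `module avail` lists.)
load_clean_env() {
    module purge || return 1
    for m in "$@"; do
        module load "$m" || { echo "could not load module: $m" >&2; return 1; }
    done
    # Show what is loaded now, as a record in the job output.
    module list
}

# Usage in a job script (replace the placeholders with real module names):
# load_clean_env <module-a> <module-b>
```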


Usage of a node's internal NVMe storage

All compute nodes (including fat and gpu nodes) have a local NVMe disk mounted as /tmp. The GPU nodes have an additional storage area, /scratch. See System overview for the exact sizes. We strongly encourage you to use this node-internal storage, which is faster than the global /albedo storage, if your job does a lot of reading/writing. In particular, it might be beneficial to write your job output to the local disk and copy it to /albedo after your job has finished.

# Copy input data to the node where your main MPI task (rank 0) runs
rsync -ur $INPUT_DATA /tmp/

# If you need the input data on every node, put `srun` in front of the copy command
srun --ntasks-per-node=1 rsync -ur $INPUT_DATA /tmp/

# Do the main calculation
srun $MY_GREAT_PROGRAM

# Copy your results from the node where the main MPI task (rank 0) runs to global storage
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
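
In a job script that stops at the first error (e.g. one using set -e), the final copy step would be skipped if the program fails. One way around this is to register the copy with trap so that it runs whenever the script exits. This is a minimal sketch under that assumption, not a tested albedo recipe; stage_out is a hypothetical helper name.

```shell
# stage_out SRC DEST: copy the contents of the node-local directory SRC to DEST on global storage.
# (Hypothetical helper name.)
stage_out() {
    local src=$1 dest=$2
    # -a preserves timestamps and permissions; the trailing slash copies the directory contents.
    mkdir -p "$dest" && rsync -a "$src"/ "$dest"/
}

# Usage in a job script:
# trap 'stage_out /tmp/output "/albedo/scratch/$MYPROJECT/output"' EXIT
# srun $MY_GREAT_PROGRAM
```

As with the commands above, this only copies from the node on which the job script itself runs.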


Monitoring

Files

  • info.sh -f <file> shows if a file is on NVMe or HDD

Node usage monitoring

  • Try info.sh -l to get the output of cat /proc/loadavg and vmstat -t -a -w -S M for all nodes on which your jobs are running. Use info.sh -L to add the output of top -b -n1 -u$USER
  • ssh to a node (e.g. prod-xyz) where a job of yours is running and try something like [h]top or vmstat -t -a -w -S M 1
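
To find out which nodes your jobs are running on before using ssh, Slurm's squeue can list them. A small sketch: the function name my_job_nodes is made up here, while the options used (-u, -t, -h, and -o with %i for the job ID and %N for the node list) are standard squeue options.

```shell
# my_job_nodes: print "<job id> <node list>" for each of your running jobs.
# (Hypothetical helper name.)
my_job_nodes() {
    # -h suppresses the header line; %i is the job ID, %N the allocated nodes.
    squeue -u "$USER" -t RUNNING -h -o "%i %N"
}

# Usage on a login node:
# my_job_nodes
```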

GPU monitoring

When using the GPUs you can monitor their usage with

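# Option 1: gpustat
ssh gpu-00[12]  # log in to gpu-001 or gpu-002
module load gpustat
gpustat -i1 --show-user --show-cmd -a

# Option 2: nvidia-smi
ssh gpu-00[12]  # log in to gpu-001 or gpu-002
watch -d -n 1 nvidia-smi   # -d highlights differences between updates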
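
For a record of GPU usage over a whole job rather than a live view, nvidia-smi can also print selected fields as CSV at a fixed interval. This is a sketch; the function name log_gpu_usage and the default interval of 5 seconds are arbitrary choices.

```shell
# log_gpu_usage [SECONDS]: print one CSV line per GPU every SECONDS (default 5) until interrupted.
# (Hypothetical helper name.)
log_gpu_usage() {
    nvidia-smi \
        --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
        --format=csv -l "${1:-5}"
}

# Usage on a GPU node, writing to a file in the background:
# log_gpu_usage 10 > gpu_usage.csv &
```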