Login

  • You have to be a member of HPC_user (membership can be applied for on id.awi.de).
  • The login nodes can be accessed via
    ssh albedo0.dmawi.de and ssh albedo1.dmawi.de
    → If you are not familiar with ssh and/or bash, you should start here for a basic introduction.
  • Please do not use the login nodes for computing; use the compute nodes instead (and take a look at our hardware and Slurm documentation).
  • For security reasons, HPC resources are not available remotely (VPN is possible).
  • By using albedo you accept our HPC data policy
  • You can ssh to a node where a job of yours is running, if (and only if) you have a valid ssh key pair (e.g., on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1); see the sketch below this list.
    Make sure your key is secured with a password!
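
    A minimal sketch of the whole procedure (the node name prod-001 and the squeue output format are only examples) could look like this:

    Code Block
    languagebash
    # on a login node: create a key pair (choose a password when prompted)
    ssh-keygen -t ed25519
    # authorize the key for the other login node
    ssh-copy-id albedo1
    # find the node(s) on which your job is running
    squeue --me -o "%.10i %.20j %N"
    # jump onto one of the listed nodes (prod-001 is a placeholder name)
    ssh prod-001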

Copy data from ollie

  • Log in to ollie.
  • Copying data might take some time. To make sure your copy is not disrupted when your ssh connection fails (or you want to turn off your computer), you can use a terminal multiplexer such as tmux or screen. These tools enable you to detach a shell session and attach it again later (see the tmux sketch at the end of this section).
  • on ollie:

    Code Block
    languagebash
    rsync -Pauv --no-g /work/ollie/$USER/your-data albedo0:/albedo/work/projects/$YOURPROJECT/

    Note: Of course you can also copy your data via albedo1 (instead of only albedo0).

  • If you want to use up to six streams at once (one for each directory) to copy your complete work directory from ollie, you can try something like

    Code Block
    languagebash
        ssh albedo0 "mkdir /albedo/work/projects/YOUR-PROJECT/fromollie"
        ssh ollie0
        cd /work/ollie/$USER
        find . -maxdepth 1 -print0 | parallel -j6 -0 'rsync -auvP --no-g {} albedo0:/albedo/work/projects/YOUR-PROJECT/fromollie/'

    Since every rsync establishes its own ssh connection from ollie to albedo, you need to have set up an ssh key to enable password-less login (see the Login section above). Otherwise the parallel rsync commands will sit idle, waiting for your password...

    Note
    If you and/or other users raise the total number of streams significantly, this will probably create a bottleneck for everybody ...

    Short story: The other way round (rsync from albedo instead of from ollie) does not work because of a specific route set on ollie.
    Long story: ollie has two Ethernet cards, a 10 Gb card for the "normal" AWI network (172.18.20.0/24) and a (later added) 40 Gb card for a high-speed connection to the Isilon and to albedo in the newer 10.100.0.0/16 network. If you access albedo from ollie, the route setting on ollie ensures that you automatically use the fast 40 Gb card in the 10.100.0.0/16 network. :-)  However, if you want to access ollie from albedo, you will reach ollie via the default route on ollie's 10 Gb interface (172.18.20.0/24), but ollie replies on the 10.100.0.0/16 40 Gb card. In other words, ollie's reply to albedo never gets through.
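
    As mentioned above, a terminal multiplexer keeps a transfer alive even if your ssh connection drops. A minimal tmux sketch (the session name "copy" is arbitrary; screen works similarly) could look like this:

    Code Block
    languagebash
    # on ollie: start a named tmux session and run the copy inside it
    tmux new -s copy
    rsync -Pauv --no-g /work/ollie/$USER/your-data albedo0:/albedo/work/projects/$YOURPROJECT/
    # detach with Ctrl-b d; the rsync keeps running on ollie
    # later: re-attach to check the progress
    tmux attach -t copy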

    Software

    • Albedo is running the operating system Rocky Linux release 8.6 (Green Obsidian).
    • Slurm 22.05 is used as the job scheduling system. Important details on its configuration on Albedo are given here: Slurm.
    • Details on the user software can be found here: Software.

    Environment modules

    On albedo we use environment modules to load/unload specific versions of software. Loading a module modifies environment variables so that the shell knows, for example, where to look for binaries.

    You get an overview of all installed software by typing

    Code Block
    languagebash
    module avail

    To load and unload a module use

    Code Block
    languagebash
    # load
    module load <module>
    
    # unload
    module unload <loaded module>


    Sometimes it might be useful to unload all loaded modules at once. This is done with

    Code Block
    languagebash
    module purge
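
    If a package is installed in several versions, you can load a specific one by appending the version string. The git version below is only an example (taken from the Python example further down); check module avail for what is actually available:

    Code Block
    languagebash
    # load a specific version instead of the default (example version)
    module load git/2.35.2
    
    # list the modules that are currently loaded
    module list
    
    # display what a module changes in your environment
    module show git/2.35.2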


    It is also possible to use the module command from some scripting languages. For example, in Python you can do:

    Code Block
    languagepy
    titleModule Command from Python
    linenumberstrue
    $ python
    Python 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55)
    [GCC 11.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> module_python_init = "/usr/share/Modules/init/python.py"
    >>> exec(open(module_python_init).read())
    >>> result = module("list")
    Currently Loaded Modulefiles:
     1) git/2.35.2   2) conda/22.9.0-2
    >>> result is True
    True


    Usage of a node's internal NVMe storage

    All compute (including fat and gpu) nodes have a local NVMe disk mounted as /tmp. The GPU nodes have an additional storage /scratch. See System overview for the exact sizes. We strongly encourage you to use this node-internal storage, which is faster than the global /albedo storage, if your job does lots of reading/writing. In particular, it might be beneficial to write your job output to the local disk and copy it to /albedo after your job has finished.

    Code Block
    languagebash
    # Copy input data to the node where your main MPI (rank 0) task runs
    rsync -ur $INPUT_DATA /tmp/
    
    # If you need the input data on every node, you have to add `srun` in front of the copy command
    srun --ntasks-per-node=1 rsync -ur $INPUT_DATA /tmp/
    
    # do the main calculation
    srun $MY_GREAT_PROGRAM
    
    # Copy your results from the node where the main MPI (rank 0) task runs to the global storage
    # If data is written on all nodes, start rsync using srun, as above
    rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
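
    The steps above can be wrapped in a regular Slurm batch script. The following is only a minimal sketch; job name, node/task counts and the time limit are placeholders, not albedo defaults:

    Code Block
    languagebash
    #!/bin/bash
    #SBATCH --job-name=nvme-example   # placeholder values throughout
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16
    #SBATCH --time=02:00:00
    
    # stage the input data on every node's local NVMe disk
    srun --ntasks-per-node=1 rsync -ur $INPUT_DATA /tmp/
    
    # run the actual computation; let it write its output to /tmp
    srun $MY_GREAT_PROGRAM
    
    # collect the results from every node back to the global file system
    srun --ntasks-per-node=1 rsync -r /tmp/output/ /albedo/scratch/$MYPROJECT/output/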


    CPU, Memory, and Process Time Restrictions on a Login Node


    On the login nodes albedo0 and albedo1, there are limits on what a process is allowed to do. Please note that the login nodes are not available for compute jobs and should be used for simple shell usage only! You get a total of 2048 processes (PIDs), 9 logins, and further per-user limits.

    Have a look at /etc/security/limits.conf for further details.
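
    The limits in effect for your current shell can be inspected with the bash builtin ulimit:

    Code Block
    languagebash
    # show all resource limits in effect for the current shell
    ulimit -a
    
    # show only the maximum number of user processes
    ulimit -u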


    Monitoring

    Files

    • info.sh -f <file> shows whether a file is stored on NVMe or HDD

    Node usage monitoring

    • Try info.sh -l to get the output of cat /proc/loadavg and vmstat -t -a -w -S M for all nodes your jobs are running on. Use info.sh -L to additionally get the output of top -b -n1 -u$USER.
    • ssh to a node where a job of yours is running (e.g., ssh prod-xyz) and try something like htop, top, or vmstat -t -a -w -S M 1 (see the sketch after this list).
    • Use info.sh -S to see running jobs and the resources used by finished Slurm jobs.
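
    A minimal interactive sketch (prod-xyz is a placeholder for one of the nodes listed for your job):

    Code Block
    languagebash
    ssh prod-xyz              # jump to a node where your job runs
    htop -u $USER             # interactive overview of your own processes
    vmstat -t -a -w -S M 1    # memory/CPU statistics, refreshed every second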

    GPU monitoring

    When using the GPUs, you can monitor their usage with gpustat:


    Code Block
    languagebash
    ssh gpu-00[12]  # login
    module load gpustat
    gpustat -i1 --show-user --show-cmd -a



    Alternatively, you can watch the output of nvidia-smi directly:

    Code Block
    languagebash
    ssh gpu-00[12]  # login 
    watch -d -n 1 nvidia-smi   # -d shows differences