Login
- You have to be a member of HPC_user (membership can be applied for on id.awi.de)
- The login nodes can be accessed via ssh albedo0.dmawi.de and ssh albedo1.dmawi.de
→ If you are not familiar with ssh and/or bash, you should start here for a basic introduction.
- Please do not use the login nodes for computing; use the compute nodes instead (and take a look at our hardware and slurm documentation)
- HPC resources are not accessible remotely for security reasons (VPN is possible).
- By using albedo you accept our HPC data policy
- You can ssh-to-a-node where a job of yours is running, if (and only if) you have a valid ssh key pair (e.g. on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1); see the sketch below.
Make sure your key is secured with a password!
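A minimal sketch of that workflow (squeue and its flags are standard Slurm; the node name prod-123 is just a placeholder for wherever your job runs):
```
# one-time setup on a login node: create a password-protected key pair
ssh-keygen -t ed25519
# authorize the key for the cluster-internal hosts
ssh-copy-id albedo1

# find the node(s) where your jobs are running (%i = job id, %N = nodes)
squeue --me -o "%i %N"

# hop onto one of those nodes (placeholder name)
ssh prod-123
```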
Copy data from ollie
- Log in to ollie.
On ollie, run:
```
rsync -Pauv --no-g /work/ollie/$USER/your-data albedo0:/albedo/work/projects/$YOURPROJECT/
```
Short story: The other way round (rsync from albedo instead of from ollie) does not work because of a specific route set on ollie.
Long story: ollie has two ethernet cards: a 10 Gb card for the "normal" AWI network (172.18.20.0/24) and a (later added) 40 Gb card for a high-speed connection to the Isilon and to albedo in the newer 10.100.0.0/16 network. If you access albedo from ollie, the route setting on ollie ensures that you automatically use the fast 40 Gb card in the 10.100.0.0/16 network. :-) However, if you access ollie from albedo, you reach ollie via the default route on its 10 Gb interface (172.18.20.0/24), but ollie replies on the 10.100.0.0/16 40 Gb card. In other words, ollie's reply to albedo never gets through.
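If you want to check which interface a host picks for a given destination, you can ask the kernel's routing table with ip route get (the address below is a made-up example, not ollie's or albedo's real IP):
```
# print the route (and thus the outgoing interface) the kernel would
# choose for this destination; 10.100.1.23 is a made-up example address
ip route get 10.100.1.23
```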
Software
- Albedo is running the operating system Rocky Linux release 8.6 (Green Obsidian).
- Slurm 22.05 is used as the job scheduling system. Important details on its configuration on Albedo are given here: Slurm.
- Details on the user software can be found here: Software.
Environment modules
On albedo we use environment modules to load/unload specific versions of software. Loading a module modifies environment variables (e.g. PATH) so that the shell knows where to look for binaries.
You get an overview of all installed software by typing:
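```
# list all available modules (standard environment-modules command)
module avail
```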
To load and unload a module use
```
# load
module load <module>
# unload
module unload <loaded module>
```
Sometimes it might be useful to unload all loaded modules at once. This is done with
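```
# unload all currently loaded modules at once
module purge
```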
Usage of the nodes' internal storage
All compute nodes (including the fat and GPU nodes) have a local NVMe disk mounted as /tmp. The GPU nodes have an additional storage /scratch. See System overview for the exact sizes. We strongly encourage you to use this node-internal storage, which is faster than the global /albedo storage, if your job does lots of reading/writing. In particular, it might be beneficial to write your job output to the local disk and copy it to /albedo after your job has finished.
```
# copy input data to the node where your main MPI (rank 0) task runs
rsync -ur $INPUT_DATA /tmp/
# if you need the input data on every node, add `srun` in front of the copy command
srun rsync -ur $INPUT_DATA /tmp/
# do the main calculation
srun $MY_GREAT_PROGRAM
# copy your results from the node where the main MPI (rank 0) task runs to the global storage
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
```
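For orientation, a minimal sketch of how these steps could be embedded in a Slurm batch script (the #SBATCH values are placeholders, adjust them to your job):
```
#!/bin/bash
#SBATCH --nodes=2            # placeholder value
#SBATCH --time=01:00:00      # placeholder value

# stage input data on every node's local NVMe disk
srun rsync -ur $INPUT_DATA /tmp/

# run the actual computation
srun $MY_GREAT_PROGRAM

# copy results from the rank-0 node back to the global storage
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
```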
Monitoring
Node usage monitoring
- Try info.sh -l to get the output of cat /proc/loadavg and vmstat -t -a -w -S M for all nodes on which your jobs are running.
- ssh to a node where a job of yours is running (e.g. ssh prod-xyz) and try something like top, htop, or vmstat -t -a -w -S M 1
GPU monitoring
When using the GPUs you can monitor their usage directly on the GPU node, e.g. with nvidia-smi (assuming NVIDIA's standard tooling):
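```
# show current utilization and memory usage of all GPUs on the node
nvidia-smi
# or refresh the view every second
watch -n 1 nvidia-smi
```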