Storage

| Storage | Quota | Cost |
|---|---|---|
| Users storage | variable | 30 €/TB/yr |
| … | variable | 10 €/TB/yr |
| … | 2x soft quota for 90 days | -- |

…
Login
- You have to be member of HPC_user (can be applied for on id.awi.de, Start a new request > IT Services > select High-Performance-Computing (HPC) > Add to cart). See HPC account for more info.
- The login nodes can be accessed via
ssh albedo0.dmawi.de and ssh albedo1.dmawi.de
- If you are not familiar with ssh and/or bash, you should start here for a basic introduction.
- Please do not use these login nodes for computing; use the compute nodes instead (and take a look at our hardware and slurm documentation).
- HPC resources are not available from remote for security reasons (VPN is possible).
- By using albedo you accept our HPC data policy
- You can ssh to a node where a job of yours is running, if (and only if) you have a valid ssh key pair (e.g. on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1).
Make sure your key is secured with a password!
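The key setup mentioned above can be sketched end-to-end. Note this is an illustration: the snippet writes the key pair to a throwaway directory so it is safe to try anywhere, whereas on albedo you would normally accept ssh-keygen's default location under ~/.ssh:

```shell
# Generate an ed25519 key pair; -N sets the passphrase (use a real one!)
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -N 'use-a-real-passphrase' -f "$keydir/id_ed25519" -q
ls "$keydir"
# On albedo you would then install the public key on a login node:
#   ssh-copy-id -i "$keydir/id_ed25519.pub" albedo1
```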
Software
- Albedo is running the operating system Rocky Linux release 8.6 (Green Obsidian).
- Slurm 22.05 is used as the job scheduling system. Important details on its configuration on Albedo are given here: Slurm.
- Details on the user software can be found here: Software.
Environment modules
On albedo we use environment modules to load/unload specific versions of software. Loading a module modifies environment variables so that the shell e.g. knows where to look for binaries.
You get an overview of all software installed by typing
```shell
module avail
```
To load and unload a module use
```shell
# load
module load <module>
# unload
module unload <loaded module>
```
Sometimes it might be useful to unload all loaded modules at once. This is done with
```shell
module purge
```
It is also possible to use the module command from some scripting languages. For example, in Python you can do:
```shell
$ python
Python 3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> module_python_init = "/usr/share/Modules/init/python.py"
>>> exec(open(module_python_init).read())
>>> result = module("list")
Currently Loaded Modulefiles:
 1) git/2.35.2   2) conda/22.9.0-2
>>> result is True
True
```
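For use inside scripts, the same initialization can be wrapped defensively so the script still runs on machines without environment modules. This is a minimal sketch, assuming the init path shown in the interactive example above; `init_modules` is a hypothetical helper name:

```python
import os

# Path used in the interactive example above (environment-modules' Python init)
MODULE_PY_INIT = "/usr/share/Modules/init/python.py"

def init_modules(path=MODULE_PY_INIT):
    """Make the `module` function available in this process, if possible.

    Returns True when the init file was found and executed, False otherwise.
    """
    if not os.path.exists(path):
        return False
    exec(open(path).read(), globals())
    return True

if init_modules():
    module("list")  # noqa: F821 -- defined by the exec'd init file
else:
    print("environment-modules Python init not found on this machine")
```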
Usage of a node's internal NVMe storage
All compute (including fat and gpu) nodes have a local NVMe disk mounted as /tmp. The GPU nodes have an additional storage /scratch. See System overview for the exact sizes. We strongly encourage you to use this node-internal storage, which is faster than the global /albedo storage, if your job does lots of reading/writing. In particular, it might be beneficial to write your job output to the local disk and copy it to /albedo after your job is finished.
```shell
# Copy input data to the node where your main MPI (rank 0) task runs
rsync -ur $INPUT_DATA /tmp/
# If you need the input data on every node, you have to add `srun` in front of the copy command
srun --ntasks-per-node=1 rsync -ur $INPUT_DATA /tmp/
# do the main calculation
srun $MY_GREAT_PROGRAM
# Copy your results from the node where the main MPI (rank 0) task runs to global storage
# If data is written on all nodes, start rsync using srun, as above
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
```
CPU, Memory, and Process Time Restrictions on a Login Node
On the login nodes albedo0 and albedo1, limits apply to what a process is allowed to do. Please note that the login nodes are not meant for computing and should be used for simple shell usage only! You get a total of 2048 processes (PIDs) and 9 logins.
Have a look at /etc/security/limits.conf for further details.
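You can inspect the limits currently in effect for your own shell with the standard ulimit builtin (shown here as a general sketch; the exact numbers depend on the node you are logged in to):

```shell
# Show the maximum number of processes (PIDs) your user may create
ulimit -u
# Show every resource limit in effect for the current shell
ulimit -a
```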
Monitoring
Files
- info.sh -f <file> shows if a file is on NVMe or HDD
Node usage monitoring
- Try info.sh -l to get the output of cat /proc/loadavg and vmstat -t -a -w -S M for all nodes where your jobs are running. Use info.sh -L to add the output of top -b -n1 -u$USER.
- ssh to a node (prod-xyz) where a job of yours is running and try something like [h]top or vmstat -t -a -w -S M 1.
- info.sh -S shows running jobs and the resources used by finished Slurm jobs.
GPU monitoring
When using the GPUs you can monitor their usage with
```shell
ssh gpu-00[1-5] # login
module load gpustat
gpustat -i1 --show-user --show-cmd -a
```
```shell
ssh gpu-00[1-5] # login
watch -d -n 1 nvidia-smi # -d shows differences
```
...
low latency, high bandwidth
System storage
It is installed and maintained in /albedo/home/soft/:
- ./AWIsoft → binaries
- ./AWIbuild → sources
- ./AWImodules → additional modules
...