Login
- You have to be a member of HPC_user (can be applied for on id.awi.de: Start a new request > IT Services > select High-Performance-Computing (HPC) > Add to cart). See HPC account for more info.
- The login nodes can be accessed via
ssh albedo0.
...
...
...
- for a basic introduction.
- Please do not use these login nodes for computing; use the compute nodes instead (
...
- and take a look at our hardware and slurm documentation)
- For security reasons, HPC resources are not accessible remotely (VPN is possible
...
- ).
- By using albedo you accept our HPC data policy
...
Support
You can open a support ticket on helpdesk.awi.de or by writing an email to hpc@awi.de. Please do not send a personal email to an admin!
Storage
Local user storage
- The local storage is a GxFS Storage Appliance from NEC, based on GPFS (https://en.wikipedia.org/wiki/GPFS).
- All nodes are connected via a 100 Gb OPA/Mellanox/InfiniBand network.
...
- You can ssh to a node where a job of yours is running, if (and only if) you have a valid ssh-key pair. (e.g. on a login node: ssh-keygen -t ed25519; ssh-copy-id albedo1)
Make sure your key is secured with a password!
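For example, a minimal sketch of this workflow (the node name is illustrative; take the real one from squeue's NODELIST column):
```
# On a login node: create a key pair (set a passphrase!) and install it for albedo
ssh-keygen -t ed25519
ssh-copy-id albedo1
# Find out where your job runs, then ssh to that node
squeue --me        # the NODELIST column shows the node(s) of your running jobs
ssh prod-001       # illustrative node name, use the one reported by squeue
```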
Software
- Albedo is running the operating system Rocky Linux release 8.6 (Green Obsidian).
- Slurm 22.05 is used as the job scheduling system. Important details on its configuration on Albedo are given here: Slurm.
- Details on the user software can be found here: Software.
Environment modules
On albedo we use environment modules to load/unload specific versions of software. Loading a module modifies environment variables so that the shell knows, for example, where to look for binaries.
You get an overview of all installed software by typing
```
module avail
```
To load and unload a module use
```
# load
module load <module>
# unload
module unload <loaded module>
```
Sometimes it might be useful to unload all loaded modules at once. This is done with
```
module purge
```
It is also possible to use the module command from some scripting languages. For example, in Python you can do:
```
$ python
Python 3.8.16 | packaged by conda-forge | (default, Feb 1 2023, 16:01:55)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> module_python_init = "/usr/share/Modules/init/python.py"
>>> exec(open(module_python_init).read())
>>> result = module("list")
Currently Loaded Modulefiles:
 1) git/2.35.2   2) conda/22.9.0-2
>>> result is True
True
```
Usage of a node's internal NVMe storage
All compute (including fat and gpu) nodes have a local NVMe disk mounted as /tmp. The GPU nodes have an additional /scratch storage. See System overview for the exact sizes. We strongly encourage you to use this node-internal storage, which is faster than the global /albedo storage, if your job does lots of reading/writing. In particular, it might be beneficial to write your job output to the local disk and copy it to /albedo after your job has finished.
```
# Copy input data to the node where your main MPI (rank 0) task runs
rsync -ur $INPUT_DATA /tmp/
# If you need the input data on every node, you have to add `srun` in front of the copy command
srun
```
...
variable
30 €/TB/yr
...
variable
10 €/TB/yr
...
2x soft quota for 90 days
...
--
...
low latency, huge bandwidth
System storage
System-wide software is installed and maintained in /albedo/home/soft/:
- ./AWIsoft → binaries
- ./AWIbuild → sources
- ./AWImodules → additional/customized modules
If you need space here, please contact hpc@awi.de.
Remote user storage
- You can access your online space on the Isilon in Bremerhaven (see https://spaces.awi.de/x/a13-Eg for more information) via the mountpoints
/isibhv/projects
/isibhv/projects-noreplica
/isibhv/netscratch
/isibhv/platforms
/isibhv/home
- albedo is connected to the AWI backbone (including the Isilon) via four 100 Gb interfaces. Each individual albedo node has a 10 Gb interface.
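For example, data can be copied between the Isilon mounts and /albedo with standard tools (the project name and paths below are purely illustrative):
```
# Illustrative paths; replace <myproject> with your actual project directory
rsync -av /isibhv/projects/<myproject>/input/ /albedo/scratch/<myproject>/input/
```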
Compute nodes & Slurm
...
A submitted job needs the following information/resources (a complete example script is sketched further below):
...
-J or --job-name=
...
-N or --nodes=
...
-n or --ntasks=
--ntasks-per-node=
...
-c or --cpus-per-task=
...
Needed for OpenMP
If -n N and -c C are given, you get N x C cores.
...
Memory/RAM per CPU
...
--mem=
...
Maximum walltime
...
-t or --time=
...
-p or --partition=
...
Partitions
...
QOS
...
```
rsync -ur $INPUT_DATA /tmp/
# do the main calculation
srun $MY_GREAT_PROGRAM
# Copy your results from the node where the main MPI (rank 0) task runs to global storage
# If data is written on all nodes, start rsync using srun, as above
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
```
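Putting the options above together with the node-local /tmp staging, a batch script might look like the following sketch. Partition, QOS, module, and program names are placeholders, not actual albedo settings; check the Slurm page for the real values.
```
#!/bin/bash
#SBATCH --job-name=example          # -J
#SBATCH --nodes=1                   # -N
#SBATCH --ntasks=32                 # -n, number of MPI tasks
#SBATCH --cpus-per-task=1           # -c, cores per task; raise this for OpenMP/hybrid jobs
#SBATCH --mem=16G                   # memory
#SBATCH --time=01:00:00             # -t, maximum walltime
#SBATCH --partition=<partition>     # -p, see the Slurm page for available partitions
#SBATCH --qos=<qos>                 # see the Slurm page for available QOS

module load <module>                # load whatever your program needs

# For OpenMP/hybrid jobs: export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Stage input on the node-local NVMe, run, and copy results back (see above)
rsync -ur $INPUT_DATA /tmp/
srun $MY_GREAT_PROGRAM
rsync -r /tmp/output/* /albedo/scratch/$MYPROJECT/output/
```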
CPU, Memory, and Process Time Restrictions on a Login Node
On the login nodes albedo0 and albedo1, there are limits on what a process is allowed to do. Please note that the login nodes are not intended for computing and should be used for simple shell usage only! You get a total of 2048 processes (PIDs) and 9 logins.
Have a look at /etc/security/limits.conf for further details.
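A quick way to inspect the limits that apply to your shell (standard Linux commands, nothing albedo-specific):
```
ulimit -a        # all per-process limits of the current shell
ulimit -u        # maximum number of processes (PIDs)
cat /etc/security/limits.conf
```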
Monitoring
Files
- info.sh -f <file> shows if a file is on NVMe or HDD
Node usage monitoring
- Try info.sh -l to get the output of cat /proc/loadavg and vmstat -t -a -w -S M for all nodes where your jobs are running. Use info.sh -L to add the output of top -b -n1 -u$USER.
- ssh to a node (e.g. ssh prod-xyz) where a job of yours is running and try something like htop, top, or vmstat -t -a -w -S M 1
- Use info.sh -S to see running jobs and the resources used by finished Slurm jobs.
GPU monitoring
When using the GPUs you can monitor their usage with
```
ssh gpu-00[1-5]   # login
module load gpustat
gpustat -i1 --show-user --show-cmd -a
```
Alternatively, watch the output of nvidia-smi:
```
ssh gpu-00[1-5]   # login
watch -d -n 1 nvidia-smi   # -d shows differences
```
Useful SLURM commands
...