Login nodes

The two login nodes are the only nodes accessible from the AWI intranet.

Quantity: 2x
Name: albedo[0|1]
Specification:
  • 2x AMD Rome Epyc 7702 (64 cores each) → 128 cores
  • 512 GB RAM
  • internal storage: 1.7 TB SSD
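
You can open a session from inside the AWI intranet with ssh, for example (the short host names follow the table above; depending on your setup you may have to append the AWI domain suffix):

  ssh <awi-username>@albedo0    # or albedo1; replace <awi-username> with your AWI account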

Compute nodes

Quantity: 240x
Name: prod-[001-240]
Partitions: smp, smpht, mpp, mppht
Specification:
  • 2x AMD Rome Epyc 7702 (64 cores each) → 128 cores
  • 256 GB RAM
  • internal storage: /tmp: 314 GB NVMe
Notes:
For our test phase, we have split the compute nodes into two sets:
  • prod-[001-200] (smp, mpp): hyperthreading disabled (one thread per core)
  • prod-[201-240] (smpht, mppht): hyperthreading enabled (two threads per core)
More information can be found in the Slurm documentation; a minimal job-script sketch is given below the table.

Quantity: 4x
Name: fat-00[1-4]
Partitions: fat, matlab
Specification:
  • like prod, but with
  • 4 TB RAM
  • internal storage: /tmp: 6.5 TB NVMe
Notes:
fat-00[3,4] are currently reserved for MATLAB users; this might change later.

Quantity: 1x
Name: gpu-001
Partition: gpu
Specification:
  • like prod, but with
  • 1 TB RAM
  • internal storage:
    • /tmp: 3 TB NVMe
    • /scratch: 6.3 TB
  • 2x Nvidia A40 GPU (48 GB)
Notes:
A comparison of the two GPU models can be found here:
https://askgeek.io/en/gpus/vs/NVIDIA_A40-vs-NVIDIA_A100-SXM4-80-GB
As a rule of thumb:
  • How big are your models? Very, very big ⟹ A100
  • Do you mainly work with mixed-precision training (TensorFloat-32)? ⟹ A100
  • Is FP32 more important? ⟹ A40
  • Is FP64 more important? ⟹ A100

Quantity: 4x
Name: gpu-00[2-5]
Partition: gpu
Specification:
  • like prod, but with
  • 1 TB RAM
  • internal storage:
    • /tmp: 3 TB NVMe
    • /scratch: 6.3 TB
  • 4x Nvidia A100 GPU (80 GB)

Filesystem

Local user storage

  • Tier 1 "system": ~213 TiB NVMe as fast cache and/or burst buffer
  • Tier 2 "data": ~5030 TiB NL-SAS HDD (NetApp EF300)
  • You can check in which storage pool your data resides with mmlsattr -L <file> or info.sh -f <file>.
  • If your data is in Tier 2 but you need it on NVMe for your jobs, you can migrate it with sudo /albedo/soft/sbin/info.sh -m <file|dir> (see the example after this list). Note that any use of this service is logged and will be limited if we encounter misuse.
  • All nodes are connected via a 100 Gb Mellanox/Infiniband network.
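
For example, checking which pool a file currently resides in and requesting its migration to the NVMe tier could look like this (the file path is a placeholder; the commands are the ones named above):

  mmlsattr -L /albedo/work/user/$USER/some_large_input.nc                        # shows the storage pool among other attributes
  info.sh -f /albedo/work/user/$USER/some_large_input.nc                         # same information via the albedo helper script
  sudo /albedo/soft/sbin/info.sh -m /albedo/work/user/$USER/some_large_input.nc  # ask for migration to Tier 1 (NVMe)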

Personal and project directories

Mountpoints
  • Personal directories: /albedo/home/$USER, /albedo/work/user/$USER, /albedo/scratch/user/$USER
  • Project directories: /albedo/work/projects/$PROJECT, /albedo/scratch/projects/$PROJECT
  • Burst buffer: /albedo/burst

Comes with
  • Personal directories: the HPC_user account; order it at https://id.awi.de (Start a new request/Bestellung → IT Service → HPC → Add to chart/In den Einkaufswagen)
  • Project directories: apply for quota here: https://cloud.awi.de/#/projects
  • /albedo/burst: --

Block quota
  • /albedo/home/$USER: 100 GB (fixed)
  • /albedo/work/user/$USER: 3 TB (fixed)
  • /albedo/scratch/user/$USER: 50 TB (fixed)
  • /albedo/work/projects/$PROJECT: 30 €/TB/yr (variable)
  • /albedo/scratch/projects/$PROJECT: 10 €/TB/yr (variable)

File quota
  • /albedo/home/$USER: 1e6 (fixed)
  • /albedo/work/user/$USER: 3e6 (fixed)
  • /albedo/work/projects/$PROJECT: max(1, log(1.5*BlockQuota)) * 1e6
  • /albedo/scratch/projects/$PROJECT: 3 * max(1, log(1.5*BlockQuota)) * 1e6

Deletion
  • /albedo/home/$USER and /albedo/work/user/$USER: 90 days after the user account expired
  • /albedo/scratch/user/$USER: all data older than 90 days
  • /albedo/work/projects/$PROJECT: 90 days after the project expired
  • /albedo/scratch/projects/$PROJECT: all data older than 90 days
  • /albedo/burst: after 10 days

Security (snapshots)
  • /albedo/home/$USER: snapshots for 100 days, under /albedo/home/.snapshots/
  • /albedo/work/user/$USER: snapshots for 100 days, under /albedo/work/user/.snapshots/
  • /albedo/scratch/user/$USER: --
  • /albedo/work/projects/$PROJECT: snapshots for 100 days, under /albedo/work/projects/.snapshots/
  • /albedo/scratch/projects/$PROJECT: --
  • /albedo/burst: --

Owner:Group
  • Personal directories: $USER:hpc_user
  • Project directories: $OWNER:$PROJECT
  • /albedo/burst: root:root

Permissions
  • Personal directories: 2700 → drwx--S---
  • Project directories: 2770 → drwxrws---
  • /albedo/burst: 1777 → drwxrwxrwt

Focus
  • /albedo/home/$USER: many small files
  • /albedo/work and /albedo/scratch (user and project): large files, large bandwidth
  • /albedo/burst: low latency, huge bandwidth
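
As a worked example (assuming the BlockQuota in the formula is given in TB and log is the base-10 logarithm; both are assumptions, ask hpc@awi.de if in doubt): a project with a 30 TB block quota would get max(1, log(1.5*30)) * 1e6 = max(1, log(45)) * 1e6 ≈ 1.65e6 files on /albedo/work/projects, and three times that, about 5e6 files, on /albedo/scratch/projects.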

Data Pools

A data pool refers to a centralized storage space where large datasets are stored and managed for easy, shared access by users. Data stored in pools is typically input data that is common to many users and is not expected to change. For instance, input data, boundary conditions and meshes for numerical models are typically stored in data pools, as are reanalysis data and historical observations that are accessed and processed by many users. Having a pool common to all users is very advantageous, since it avoids identical copies of large datasets spread throughout the file system.

On albedo, data pools are located in /albedo/pool/<pool_name>. That directory contains soft links to folders in project directories (/albedo/work/projects/p_<project_name>). If you would like to turn your project data into a pool, please contact us (hpc@awi.de). If you would like to know more about project directories and how to create one, please read https://spaces.awi.de/display/HELP/HPC+Data+Policy.
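
For example, to see which project directory a pool actually points to (the pool name is a placeholder):

  ls -l /albedo/pool/                    # lists the available pools and their link targets
  readlink -f /albedo/pool/<pool_name>   # resolves the soft link to the underlying project directory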

Remote user storage (/isibhv)

  • You can access your online space on the Isilon in Bremerhaven (see https://spaces.awi.de/x/a13-Eg for more information) via the NFS mount points
    /isibhv/projects
    /isibhv/projects-noreplica
    /isibhv/netscratch
    /isibhv/platforms
    /isibhv/home
  • Tape storage (HSM) is not mounted. However, you could archive your results with something like
    rsync -Pauv /albedo/work/projects/$PROJECT/my_valuable_results/* hssrv1:/hs/projects/$PROJECT/my_valuable_results_from_albedo/

Considerations: Where should I store my data?

  • Accessibility: /work is reachable from albedo only, but via 100 Gb Infiniband; /isibhv is available from everywhere (inside AWI), but only via 10 Gb Ethernet.
  • Latency: low on /work, higher on /isibhv.
  • Cost: about 10-30 €/TB/yr on /work, about 100-125 €/TB/yr on /isibhv.
  • Security: snapshots on /work; snapshots and optional automatic tape backup (+25 €/TB/yr) on /isibhv.










Network

Fast interconnect (between albedo's nodes):

  • HDR Infiniband

Ethernet:

  • albedo is connected to the AWI backbone (including the Isilon and the HSM) via four 100 Gb Ethernet interfaces.
    Each individual albedo node has a 10 Gb Ethernet interface.