Albedo is the tier-3 High Performance Computing (HPC) platform hosted and supported at AWI. This documentation covers the basics of how to operate it. Please be aware that basic knowledge of Linux and HPC systems is expected from Albedo's users, so this documentation does not cover everything there is to know about HPC, Linux, user permissions, data management, etc.

That being said, the Albedo documentation is a living document and we will do our best to improve it and answer the most common user questions, either by expanding the documentation or by pointing to external sources.

Hardware

  • 2 login nodes
  • 1 interactive node with 2x NVIDIA A40 for testing, Jupyter notebooks, MATLAB, ...
  • 240 compute nodes with
    • 2x AMD Epyc 7702 (Rome), 2 GHz (3.3 GHz boost), 64 cores each, cTDP reduced from 200 W to 165 W
    • 256 GB RAM
    • 500 GB SSD
  • 4 "fat" nodes as above, 4TB RAM, 500GB SSD + 7.5TB SSD
  • 1 GPU node with 1 TB RAM, 4x NVIDIA A100 (80 GB)
    • More GPU nodes will follow later, once we have gained first experience of what you really need, and to offer the most recent hardware
  • Our small test node with NEC's new vector engine "SX-Aurora TSUBASA" can be integrated
  • Fast interconnect: HDR InfiniBand or OmniPath
  • 5 PB work/scratch NEC GxFS (IBM Spectrum Scale)
    • 220 TB of NVMe SSDs as fast cache and/or burst buffer
    • extension (capacity, bandwidth) possible
  • All nodes connected to /isibhv (NFS, 10GbE)
  • AlmaLinux ("free Red Hat", version 8.x)
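
To put these numbers into perspective, here is a back-of-the-envelope calculation (a minimal Python sketch based only on the figures listed above) of the aggregate core count and memory of the compute partition:

    # Aggregate resources of the Albedo compute partition,
    # using only the figures from the hardware list above.
    CORES_PER_CPU = 64        # AMD Epyc 7702
    CPUS_PER_NODE = 2
    COMPUTE_NODES = 240
    RAM_PER_NODE_GB = 256
    FAT_NODES = 4
    FAT_RAM_GB = 4 * 1024     # 4 TB per fat node

    cores_total = COMPUTE_NODES * CPUS_PER_NODE * CORES_PER_CPU
    ram_total_tb = (COMPUTE_NODES * RAM_PER_NODE_GB + FAT_NODES * FAT_RAM_GB) / 1024

    print(f"cores on standard compute nodes: {cores_total}")          # 30720
    print(f"RAM incl. fat nodes:             {ram_total_tb:.0f} TB")  # 76 TB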

In the FESOM2 benchmark we used for the procurement, 240 Albedo nodes deliver roughly the performance of 800 Ollie nodes.
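For orientation, that corresponds to roughly 800 / 240 ≈ 3.3 Ollie nodes per Albedo node for this particular workload; this is a rough ratio derived from the benchmark statement above, not a general speedup guarantee.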

Preliminary schedule for the transition Ollie → Albedo

Disclaimer: We cannot guarantee the following time frame! It is an optimistic view and can be disrupted by further lockdowns in China, a ship blocking the Suez Canal, or just small technical issues that prevent Albedo from running smoothly even with all hardware at AWI in time. Any step can be delayed by weeks or months! Except the first "Now" step, which is in your hands ;-)

  • Now: Please start to prepare your directories on /work/ollie for transfer to Albedo (see the sketch after this list)!
    • Delete data you do not need to keep,
    • copy tarballs of valuable data to the tape archive (you should do this anyway; /work has no backup),
    • place a copy of data which you need continuously with fast access on /isibhv.
    • We will not transfer data automatically from Ollie to Albedo!! This is a chance to clean up ;-)
  • until March: NEC investigates whether OmniPath is an alternative to Mellanox InfiniBand as the fast network in Albedo, because InfiniBand would be delivered with a large delay
  • April
    Albedo installed at AWI with "slow" Gigabit Ethernet (10 Gb/s) network and, if recommended by NEC, fast OmniPath (100 Gb/s)
  • May
    Albedo open for power users
    Start to copy data from Ollie to Albedo (Note: this must be done by yourself, there is no automatic data migration!)
    The more Albedo nodes are powered on, the more Ollie nodes have to be switched off. We have no schedule for this yet, but would decide depending on how useful Albedo already is for how many users. Of course, Albedo has a far better ratio of computing performance to electrical power.
  • June
    Albedo open for all users
  • 30 June
    Ollie hardware support ends. As a prolongation would cost approx. 200,000 €, we hope that Albedo is already in stable operation. As a fallback, we would keep Ollie running without support, but with the option to have basic components (file system, network) repaired in case of an emergency (maintenance on demand).
  • (August: downtime to add InfiniBand cards, if we decide against OmniPath)
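
For the "Now" step above, the following is a minimal Python sketch of how a /work/ollie directory could be bundled into a compressed tarball before copying it to the tape archive. The paths are placeholders for illustration only; the command-line tool tar (or any other tool you prefer) works just as well.

    # Minimal sketch: bundle an Ollie /work directory into a compressed tarball
    # before copying it to the tape archive. The paths are placeholders --
    # adapt them to your own project layout.
    import tarfile
    from pathlib import Path

    src = Path("/work/ollie/myuser/myproject")          # placeholder source directory
    dst = Path("/work/ollie/myuser/myproject.tar.gz")   # placeholder tarball location

    with tarfile.open(dst, "w:gz") as tar:
        # Store the directory under its own name so the archive unpacks cleanly.
        tar.add(src, arcname=src.name)

    print(f"created {dst} ({dst.stat().st_size / 1e9:.2f} GB)")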

More details

Basically, all hardware is in place at NEC's technical center in Germany except the CPUs (expected end of March) and the Mellanox InfiniBand cards (expected end of July or even later). Concerning the CPUs, NEC is confident that the delivery date will be met; however, there can always be a bad surprise. For Mellanox, the situation is worse: as Mellanox now belongs to NVIDIA, all customers for NVIDIA's flagship DGX are served first. Therefore, NEC is currently investigating whether OmniPath offers similar performance for our benchmarks and for the parallel filesystem GxFS (5 PB work/scratch). Depending on the outcome, we would start with OmniPath in April, or with just the Gigabit network and a downtime in August to add InfiniBand.

Mellanox InfiniBand or Cornelis Networks OmniPath? You know OmniPath from Ollie. We have mixed experiences: the MPI performance is very good for the typical codes at AWI; however, Intel declared it deprecated and, as a consequence, bugs in BeeGFS in the interplay with OmniPath were never fixed. At the time of the procurement, InfiniBand was the only option. However, after NVIDIA acquired Mellanox, Intel decided to revive OmniPath with a spin-off, Cornelis Networks. Due to NVIDIA's policy and the long delays in obtaining the InfiniBand hardware, OmniPath with similar performance seems to have a good new start and is already part of new installations. NEC is careful and is testing OmniPath thoroughly on new hardware at their test center before giving a recommendation for Albedo. Differences matter: OmniPath puts more load onto the CPUs of the compute nodes and file servers, but comes with better latencies in return. Also, internal configuration options differ.
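
The latency difference mentioned above is typically measured with a simple ping-pong test between two nodes. Purely for illustration, here is a minimal mpi4py sketch of such a measurement; it assumes mpi4py is available and is not part of NEC's evaluation or our procurement benchmarks.

    # Minimal MPI ping-pong latency sketch (illustration only, assumes mpi4py).
    # Run with one rank per node, e.g.:  mpirun -n 2 python pingpong.py
    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    N_ITER = 1000
    msg = bytearray(8)        # tiny message, so the result is dominated by latency

    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(N_ITER):
        if rank == 0:
            comm.Send(msg, dest=1)
            comm.Recv(msg, source=1)
        elif rank == 1:
            comm.Recv(msg, source=0)
            comm.Send(msg, dest=0)
    t1 = time.perf_counter()

    if rank == 0:
        # One iteration is a full round trip; half of that is the one-way latency.
        print(f"one-way latency: {(t1 - t0) / N_ITER / 2 * 1e6:.2f} us")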