Disclaimer

The information contained here is not intended to be a complete guide to "What's an HPC and how to use it". Instead, it provides a very basic introduction to the concepts you need to become familiar with in order to use AWI's HPC (and, in general, any other HPC). Please make sure you are familiar with all these concepts before starting to use AWI's HPC. To learn more about HPCs and how to use them for your specific application, there is plenty of information on the internet, but also ask your colleagues and PIs who are dealing with similar tasks.

What's an HPC

A High Performance Computer (HPC) is a computer cluster designed to solve complex computational problems that require a significant amount of computing power. This is achieved by putting together many "smaller" computers (nodes) and allowing them to share information very quickly via high-speed interconnect hardware and protocols (in the case of Albedo, InfiniBand).

More about AWI's HPC system configuration

Because an HPC is used by many people at the same time, a queueing system and job scheduler is required to make sure everyone's jobs are queued in a fair way, depending on the resources they request. On AWI's HPC the job scheduler is SLURM.

More about SLURM in Albedo

In order to submit jobs to SLURM and to prepare the environment for running your computations, HPCs typically have nodes dedicated to logging in and submitting jobs. These are known as login nodes. The nodes that run the computations are called computing nodes and are accessed via SLURM. HPCs typically have several different types of computing nodes depending on their hardware, for instance GPU nodes, fat nodes (with larger memory than the regular ones), postprocessing nodes... but they are all some sort of computing node. Please never use the login nodes for computing, because that could compromise the access of other users to the HPC!

Parallelization

To achieve high performance on an HPC, a computing task needs to be broken down into smaller pieces. These pieces can then be distributed across multiple computing nodes (or cores/threads/devices...), which scales up the speed of the computation, IF everything is configured/programmed correctly, and IF the problem you are trying to solve scales well. This concept of dividing a computing task into smaller pieces that are then distributed to a number of devices (nodes, CPUs, cores, threads, NUMA domains, GPUs, ...) is known as parallelization. To use the full potential of any HPC you need to parallelize your code, in one way or another.
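For instance, within a single node you can already parallelize an independent loop over the CPU cores allocated to your job. The following is a minimal sketch in Python using the built-in multiprocessing module; process_item and its inputs are placeholders for your actual workload, and SLURM_CPUS_PER_TASK is only set by SLURM if you requested CPUs per task for your job.

    import multiprocessing as mp
    import os

    def process_item(x):
        # Placeholder for one independent, expensive computation
        return x * x

    if __name__ == "__main__":
        # Prefer the CPU count SLURM allocated to this job, if available
        n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", mp.cpu_count()))
        inputs = range(100)
        with mp.Pool(processes=n_workers) as pool:
            results = pool.map(process_item, inputs)
        print(sum(results))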

One simple example of parallelization is a loop where each iteration is independent of the others. An easy way of parallelizing a simple script with such a loop is to break the loop into smaller subloops (even single iterations). You can then submit that script with different inputs, so that each submission runs a different subloop; SLURM Job Arrays are made for exactly this. A very simple example of how to do this for a Python script can be found at https://rcpedia.stanford.edu/topicGuides/jobArrayPythonExample.html, and there are many more examples online if you search the internet for "SLURM Job Arrays". A minimal sketch of such a script is shown below.
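This is a hypothetical sketch of a Python script meant to be run as a SLURM Job Array: each array task reads its SLURM_ARRAY_TASK_ID and works only on its own subloop. The chunk size and the work done per iteration are placeholders.

    import os

    # SLURM sets SLURM_ARRAY_TASK_ID to a different value (0, 1, 2, ...) for
    # each task of the array; we use it to select this task's subloop.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
    chunk_size = 10  # iterations handled by each array task (adjust as needed)

    start = task_id * chunk_size
    stop = start + chunk_size

    for i in range(start, stop):
        # Placeholder for one independent iteration of the original loop
        print(f"Task {task_id} is processing iteration {i}")

Such a script would typically be called from a small batch script submitted with, for example, sbatch --array=0-9, so that SLURM starts ten independent copies of it.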

Another way of parallelizing code (for example FORTRAN, C, C++...) is by using parallelization standards such as MPI, OpenMP, OpenACC...
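The MPI standard is also accessible from Python, for example through the mpi4py package. The sketch below (assuming mpi4py is available on the system) splits the iterations of a loop across MPI processes and combines the partial results on one rank; the actual computation is a placeholder.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # ID of this MPI process
    size = comm.Get_size()   # total number of MPI processes

    # Each rank takes every size-th iteration, starting at its own rank
    my_iterations = range(rank, 100, size)

    # Placeholder for the real work done on each iteration
    local_result = sum(i * i for i in my_iterations)

    # Combine the partial results on rank 0
    total = comm.reduce(local_result, op=MPI.SUM, root=0)
    if rank == 0:
        print("Total:", total)

On a SLURM system such a script is typically launched with srun (or mpirun), so that one process is started per requested task.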

How to parallelize Python Code?

Other ways of taking advantage of the high performance of HPCs in Python are to use parallelized libraries such as "Dask", or to use "IPython Parallel" to parallelize your Python code. There are of course many other solutions. It's a personal journey to find out how you want to parallelize your Python code (or any code in general).
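As a small, hypothetical illustration, the sketch below uses dask.delayed to turn an independent loop into a task graph that Dask then executes in parallel; process_item and its inputs are placeholders for your actual workload.

    import dask
    from dask import delayed

    @delayed
    def process_item(x):
        # Placeholder for an expensive, independent computation
        return x * x

    # Build a graph of independent tasks (nothing is computed yet)
    tasks = [process_item(i) for i in range(100)]

    # Execute the tasks in parallel. By default Dask uses a local scheduler,
    # but it can also be connected to a multi-node cluster (e.g. with the
    # dask-jobqueue package, which can talk to SLURM).
    results = dask.compute(*tasks)
    print(sum(results))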

Final remark

HPCs like Albedo are all about parallelization: if you don't parallelize, you cannot get the speedup benefits of using such a system. Parallelizing comes at the cost of having to learn how to do it and how you want to do it, the same way you learnt Bash, Python, R, Julia... Information about SLURM and the other terms mentioned here can be found in our documentation for the system and, in general, on the internet.
