Here we describe a generic workflow that can be followed when using machine learning within the AWI infrastructure. For now, we focus on classification tasks: other applications such as regression, clustering, etc can be considered in the future.

This document is organized as follows:

The general data flow is illustrated in the image below.

Storage

Data analytics is performed on some initial raw data which has been produced elsewhere, e.g. by sensors installed on a research vessel. We distinguish between two categories of data storage:

The workspace

While it is possible to use the local disk of a personal computer as a workspace, it is recommended to use the file service (see the documentation) provided by the Computing and Data Centre at the AWI. In the documentation page you find information on how to mount / access files on the file service from your personal computer. The file service has the following advantages:

Projects on the file server are found in the projects folder. A project can be created under cloud.awi.de (My projects).

Each project directory should contain at least:

The data archives

The input raw data for an analytics task can come from different sources, called data archives in this document. Within the AWI, there are three main data sources:

Additionally, external data archives can also be used by a project, e.g. Ecotaxa.

Currently, users have to transfer the raw data to the workspace manually. We plan to develop a library for transferring data between data archives and a project workspace using a unified API.

Model Repository

ML libraries and frameworks such as Keras normally support saving models to file for later use. Models are made of two different components:

It is important to have archive both model architectures and parameters in order to provide reproducibility. Additionally, one should store metadata such as:

We are currently still investigating the requirements for archiving ML models and evaluating appropriate software tools.

Computing Resources

Going from the raw data to an ML model involves several data processing steps:

All these data processing steps require computing resources. Since the workspace can be mounted as a network folder, users working inside the AWI-intranet can run their computations on their PCs. An alternative is to use computing resources of the data centre, which can be provided on request.