Contact: Stefan Pinkernell, Giorgio Busatto
Machine learning (ML) methods allow a computer system to perform a task by means of a model that has been trained with example data from the domain of interest.
The goal of our work is to identify ML methods and applications that are relevant for the scientific work carried out at the AWI, and to understand how they can be implemented using available software libraries within the AWI (storage and computing) infrastructure.
Typical ML applications include:

- Classification
- Regression
- Clustering
- Dimensionality reduction
The following table shows a few popular ML methods (or method families) that can be used for the above-mentioned applications:
| | Classification | Regression | Clustering | Dimensionality reduction |
|---|---|---|---|---|
| Linear models | X | X | | |
| Random forests | X | X | | |
| Principal component analysis (PCA) | | | | X |
| Neural networks and deep learning | X | X | | |
| k-Means | | | X | |
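To give a rough idea of how such methods are invoked in practice, the following minimal scikit-learn sketch fits a random forest classifier, applies PCA, and runs k-means clustering. It operates on synthetic placeholder data, not on AWI data, and all parameter values are illustrative only.

```python
# Minimal sketch: three of the methods from the table, applied to synthetic
# placeholder data (not AWI data); parameter values are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Classification with a random forest
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Dimensionality reduction with PCA (keep the first two components)
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering with k-means on the reduced data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
```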
The term deep learning refers to neural networks using larger and more complex architectures than traditional ones. Deep learning has been successfully applied in areas like image and sound classification.
It is outside the scope of this document to describe these methods in detail. In the following sections we describe the ML technologies that we have considered so far, and concrete applications / case studies.
Nowadays there is a wide choice of technologies for machine learning and therefore we restrict our attention to a subset that we consider relevant for users working at the AWI. Our supported technologies are chosen according to the following criteria:
Here is an overview of the frameworks and libraries that we have considered:
Based on their popularity and relevance for potential users at AWI, we currently focus mainly on the following frameworks:
We are evaluating other frameworks like Dask and Spark for applications that involve large models and / or data sets, and therefore require more powerful computing resources.
Drawing from our experience with the processing of LOKI images and whale acoustic data (see the Projects section below), we are developing a generic workflow for data classification, which is integrated into the AWI computing infrastructure.
We can identify three phases within data analysis based on ML methods:
These phases are summarized in the following diagram:
An important aspect when working with machine learning is the data flow, i.e. how the data move through the different storage and processing steps.
A typical data flow for a ML classification within the AWI infrastructure can be sketched as follows (see also the diagram):
This workflow is described in more detail in a separate page.
Python and Pandas are often used for data analytics tasks. A typical scenario is that a user starts working on a small dataset on their PC and then wants to move on to larger datasets once their code is working. If the dataset size exceeds the capabilities of a personal computer, it is necessary to move to more powerful computing resources, possibly using parallel computation on a computer cluster. Dask is a framework that provides an API very similar to Pandas' and makes it possible to work on large datasets on a cluster.
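To illustrate how similar the two APIs are, here is a small sketch that computes a grouped mean, first with Pandas and then with Dask; the file and column names are hypothetical placeholders.

```python
# Pandas: the whole dataset has to fit into the memory of a single machine.
import pandas as pd

df = pd.read_csv("measurements.csv")                     # hypothetical file name
mean_temp = df.groupby("station")["temperature"].mean()  # hypothetical columns

# Dask: almost identical code, but the data may be split over many files and
# the computation can be scheduled on a cluster.
import dask.dataframe as dd

ddf = dd.read_csv("measurements-*.csv")                  # partitions are read lazily
mean_temp = ddf.groupby("station")["temperature"].mean().compute()  # triggers execution
```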
Since both Python and Pandas are widespread among data scientists, we have evaluated Dask on a few example data sets of different sizes to see how well a Dask application scales. The results of this evaluation are documented in a separate page.
In cooperation with workgroup Niehoff
Tools / methods: Machine learning and deep learning (convolutional neural networks), Spark, Elephas.
Stella Mahler, internship and bachelor thesis (2019): “Automatic Classification of Zooplankton Images through Convolutional Neural Networks”
Goal: Comparison of the InceptionV3 and ResNet101 architectures, as pre-trained and self-trained models.
Image data and taxonomy information were obtained from the Ecotaxa system and provided by Nicole Hildebrandt.
The data are strongly imbalanced with respect to class size. Therefore, classes with many images were reduced by sub-sampling and classes with few images were extended by augmentation. Very small classes were discarded entirely or merged into new classes at a higher taxonomic level. The preprocessed data set consisted of 11 classes of 1000 images each. Furthermore, all images were normalized with respect to size, brightness, and so on.
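A rough sketch of this kind of class balancing is shown below. The directory layout, the image size, and the use of Keras' ImageDataGenerator for augmentation are assumptions for illustration; they are not necessarily the exact procedure used in the thesis.

```python
# Sketch: balance an image data set to a fixed number of images per class.
# Large classes are sub-sampled, small classes are extended by augmentation.
# Paths, image size, and the transformation parameters are placeholders.
import random
from pathlib import Path

from tensorflow.keras.preprocessing.image import (ImageDataGenerator, img_to_array,
                                                  load_img)

TARGET = 1000
augmenter = ImageDataGenerator(rotation_range=20, horizontal_flip=True, zoom_range=0.1)

def balance_class(class_dir: Path):
    # Load the original images of one class, normalized to a common size.
    originals = [img_to_array(load_img(p, target_size=(299, 299)))
                 for p in class_dir.glob("*.png")]
    if len(originals) >= TARGET:
        # Sub-sampling: keep a random subset of the original images.
        return random.sample(originals, TARGET)
    # Augmentation: add randomly transformed copies of the originals.
    balanced = list(originals)
    while len(balanced) < TARGET:
        x = random.choice(originals)[None, ...]            # add a batch dimension
        balanced.append(next(augmenter.flow(x, batch_size=1))[0])
    return balanced
```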
Partitioning of the data: Training set 60%, test and validation set 20% each.
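The pre-trained versus self-trained comparison can be sketched with Keras roughly as follows; the frozen-base strategy, optimizer, and input size are illustrative assumptions rather than the settings actually used in the thesis, and ResNet101 can be swapped in the same way.

```python
# Sketch: build InceptionV3 either with pre-trained ImageNet weights or
# randomly initialized ("self-trained"); ResNet101 can be swapped in analogously.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

def build_model(pretrained: bool, num_classes: int = 11, input_shape=(299, 299, 3)):
    base = InceptionV3(
        weights="imagenet" if pretrained else None,  # pre-trained vs. from scratch
        include_top=False,
        input_shape=input_shape,
    )
    if pretrained:
        base.trainable = False                       # use the base as a fixed feature extractor
    x = layers.GlobalAveragePooling2D()(base.output)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

pretrained_model = build_model(pretrained=True)
self_trained_model = build_model(pretrained=False)
```

The 60/20/20 partitioning described above can be obtained, for example, with two successive calls to scikit-learn's train_test_split.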
Results:
The details of this project are described in a separate page.
Classification of plankton images by means of machine learning methods based on image descriptors.
Comparison of different machine learning frameworks: pySpark, scikit-learn, TensorFlow. In particular, evaluation of Spark for ML:
Original data set: about 120,000 samples, 25 classes.
Reduced data set (to reduce class imbalance): 7,000 samples, 7 classes (1,000 samples per class).
Preliminary results of image classification with descriptors show an accuracy comparable to that obtained by CNNs working directly on the image data. These results must be further evaluated.
pySpark implementation: long training times (several hours on a virtual machine with 16 cores and 32 GB RAM). This could be due to a configuration error and shows that more effort is needed to tune a pySpark model/installation.
scikit-learn implementation: training times on the order of a few minutes (a minimal sketch of such a descriptor-based pipeline follows below).
TensorFlow: offers other methods and is therefore difficult to compare with the previous two with respect to performance.
Open tasks: code consolidation, further evaluation of results, Spark environment configuration.
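As a minimal sketch of what such a descriptor-based scikit-learn pipeline could look like: HOG descriptors from scikit-image are used here only as a stand-in, since the descriptors actually used are not specified in this document, and the directory layout and all parameters are hypothetical placeholders.

```python
# Sketch: classify plankton images from hand-crafted descriptors with scikit-learn.
# HOG is used here only as a stand-in for the (unspecified) image descriptors;
# the directory layout and all parameters are hypothetical placeholders.
import numpy as np
from pathlib import Path

from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def describe(path: Path) -> np.ndarray:
    image = resize(imread(path, as_gray=True), (128, 128))  # normalize image size
    return hog(image, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

# Hypothetical layout: one sub-directory per class, e.g. data/copepoda/*.png
X, y = [], []
for class_dir in (d for d in Path("data").iterdir() if d.is_dir()):
    for img_path in class_dir.glob("*.png"):
        X.append(describe(img_path))
        y.append(class_dir.name)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```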
The details of this project can be found in a separate page.
Cooperation with workgroup O. Böbel
Tests with Urika-CS AI and analytics applications.
A recurring issue in machine learning is that of reproducibility: once a model has been developed and tested, it should be possible to run the same model with the same data on another system with little or no effort. There are two main scenarios in which this is important:
In both cases, the model, its runtime environment (libraries, frameworks, and so on), and possibly some example test data should be easy to install and get to work.
For these reasons, we started to experiment with Docker. The goal is to package a machine learning model together with its example data sets and required libraries in a Docker container that can be imported and run out of the box.
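A minimal sketch of what such an image could look like is shown below; the base image, file names, and entry point are hypothetical placeholders rather than our actual packaging.

```dockerfile
# Sketch of a container image bundling a model, its libraries, and example data.
# All file names below are hypothetical placeholders.
FROM python:3.10-slim

WORKDIR /app

# Required libraries (e.g. scikit-learn, tensorflow) pinned in requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The trained model, the inference script, and a small example data set
COPY model/ ./model/
COPY example_data/ ./example_data/
COPY run_model.py .

# Running the container executes the model on the example data out of the box
CMD ["python", "run_model.py", "--data", "example_data/"]
```

Such an image can then be built with `docker build -t ml-model-example .` and started with `docker run ml-model-example`, so that the packaged model runs on the bundled example data without any further setup.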
Questions, or interested in using our AI infrastructure and tools? Contact: o2a-support (at) awi.de