Popular image databases like Ecotaxa provide computer-aided annotation of images:

  • the system suggests a classification for each image by means of a machine-learning method,
  • the scientist validates / corrects this automatically generated classification in order to produce the final classification.

For the automatic classification, Ecotaxa uses descriptors / features extracted from the images, e.g. shape area, convexity, etc.
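
As a hedged illustration of what such descriptors look like (this is not Ecotaxa's actual extraction pipeline), a few shape features can be computed from a binary object mask with scikit-image:

    import numpy as np
    from skimage.measure import label, regionprops

    def shape_descriptors(mask: np.ndarray) -> dict:
        """Simple shape descriptors for the largest object in a binary mask."""
        largest = max(regionprops(label(mask)), key=lambda r: r.area)
        return {
            "area": largest.area,
            "eccentricity": largest.eccentricity,
            "solidity": largest.solidity,  # area / convex-hull area, a convexity measure
            "perimeter": largest.perimeter,
        }

    # Toy example: a 3x3 blob in a 5x5 mask
    mask = np.zeros((5, 5), dtype=bool)
    mask[1:4, 1:4] = True
    print(shape_descriptors(mask))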

This project has two objectives:

  1. To explore machine learning methods based on image descriptors and compare them with methods based on deep learning,
  2. To understand how these methods can scale when applied to very large data sets.

For training our models we used the same data set as for the deep-learning methods. The input data is stored in comma-separated files exported from Ecotaxa.
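
A minimal sketch of loading such an export with pandas; the file name and the label column ("object_annotation_category") are assumptions and have to be adapted to the actual export:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("export.csv")  # hypothetical file name

    label_col = "object_annotation_category"  # assumed label column
    feature_cols = [c for c in df.columns if c != label_col]

    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[label_col],
        test_size=0.2, stratify=df[label_col], random_state=0,
    )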

Training the model

You can check out the project from GitLab with

    git clone git@gitlab.awi.de:gbusatto/de.awi.analytics.lokides.git

In order to train the scikit-learn model, change to the project directory and run

    python src/starred-randomforest-classifier.py
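
The script itself lives in the repository; purely as a hedged sketch (continuing from the loading snippet above, and not necessarily identical to src/starred-randomforest-classifier.py), the core of a descriptor-based random forest in scikit-learn is:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # X_train / y_train as prepared in the loading sketch above
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))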

Preliminary results of image classification with descriptors and random forests show an accuracy comparable to that obtained by CNNs working directly on the image data. The training time for this model was on the order of a few minutes. These results must be further evaluated and validated on larger data sets.

In order to train the PySpark model, you first have to install Spark.
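
Since the Spark model is not yet confirmed to run (see the TODO below), the following is only a hedged sketch of what an equivalent Spark ML pipeline could look like; column names are assumptions, as in the scikit-learn sketch above:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("lokides-rf").getOrCreate()
    df = spark.read.csv("export.csv", header=True, inferSchema=True)

    label_col = "object_annotation_category"  # assumed label column
    feature_cols = [c for c in df.columns if c != label_col]

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol=label_col, outputCol="label"),
        VectorAssembler(inputCols=feature_cols, outputCol="features"),
        RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100),
    ])

    train, test = df.randomSplit([0.8, 0.2], seed=0)
    model = pipeline.fit(train)
    evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
    print("accuracy:", evaluator.evaluate(model.transform(test)))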

TODO: see if we can get the spark model to run. If it works, describe it here.

Comparison of different machine learning frameworks: PySpark, scikit-learn, TensorFlow. In particular, evaluation of Spark for machine learning:

  1. Is it possible to use the same ML models with Scikit-learn and Spark?
  2. Do these models scale to large data sets?

Original data set: about 120,000 samples, 25 classes.

Reduced data set (with the purpose of reducing class-imbalance bias): 7,000 samples, 7 classes (1,000 samples per class).
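
The exact selection procedure is not documented here; as a sketch, such a class-balanced subset could be drawn with pandas as follows (picking the 7 largest classes is an assumption):

    import pandas as pd

    df = pd.read_csv("export.csv")  # hypothetical file name
    label_col = "object_annotation_category"  # assumed label column
    top7 = df[label_col].value_counts().nlargest(7).index
    balanced = (
        df[df[label_col].isin(top7)]
        .groupby(label_col)
        .sample(n=1000, random_state=0)  # 1000 samples per class
    )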


PySpark implementation: long training times (several hours on a virtual machine with 16 cores and 32 GB RAM). This could be due to a configuration error and shows that more effort is needed to tune the PySpark model / installation.
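
One place to start tuning is to set the session resources explicitly; the values below mirror the virtual machine above (16 cores, 32 GB RAM) and are assumptions, not a validated configuration:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lokides-rf-tuned")
        .master("local[16]")                   # use all 16 cores locally
        .config("spark.driver.memory", "24g")  # leave headroom for the OS
        .config("spark.sql.shuffle.partitions", "64")
        .getOrCreate()
    )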

scikit-learn implementation: training times on the order of a few minutes.

TensorFlow: offers other methods (e.g. CNNs working directly on the image data) and is therefore difficult to compare with the previous two with respect to performance.
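
For reference, a minimal Keras CNN of the kind the descriptor-based models are compared against; it consumes raw images rather than descriptors, which is part of why a direct comparison is difficult. Input shape and layer sizes are placeholders:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64, 64, 1)),                # grayscale plankton image
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(25, activation="softmax"),  # 25 classes in the full data set
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])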

Open tasks: code consolidation, further evaluation of results, Spark environment configuration.
