Popular image databases like Ecotaxa provide computer-aided annotation of images:
For automatic classification, Ecotaxa uses descriptors (features) extracted from the images, e.g. shape area and convexity.
This project has two objectives:
We trained our models on the same dataset that was used with the Deep Learning methods. The input data is stored in comma-separated files exported from Ecotaxa.
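As a rough illustration of how such a comma-separated descriptor file could be read into features and labels, here is a minimal sketch using only the Python standard library. The column names (`category`, `area`, `convexity`) and the sample rows are made up for this example; a real Ecotaxa export uses its own headers and contains many more descriptors.

```python
import csv
import io

# Hypothetical column names; a real Ecotaxa export will use its own headers.
LABEL_COLUMN = "category"
FEATURE_COLUMNS = ["area", "convexity"]


def load_descriptors(handle):
    """Read descriptor rows from an open CSV file and return (features, labels)."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        labels.append(row[LABEL_COLUMN])
    return features, labels


# Tiny in-memory stand-in for an exported file, for illustration only.
sample_csv = io.StringIO(
    "category,area,convexity\n"
    "copepod,1520,0.91\n"
    "diatom,310,0.78\n"
)
features, labels = load_descriptors(sample_csv)
```

For an actual export, the same function could be called with `open(path, newline="")` instead of the in-memory sample.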
You can check out the model from GitLab with
git clone git@gitlab.awi.de:gbusatto/de.awi.analytics.lokides.git
In order to train the scikit-learn model, change to the project directory and run
python src/starred-randomforest-classifier.py
Preliminary results of image classification with descriptors and random forests show an accuracy comparable to that obtained by CNNs working directly on the image data. The training time for this model was on the order of a few minutes. These results must be further evaluated and validated on larger datasets.
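The following sketch shows one way the accuracy and training time of a random forest on descriptor data could be measured with scikit-learn. It is not the project's training script: the data is synthetic (generated with `make_classification`), and the number of features, the number of trees, and the 80/20 split are assumptions chosen only for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the descriptor table (assumption: 7 classes, 20 features).
X, y = make_classification(
    n_samples=700,
    n_features=20,
    n_informative=10,
    n_classes=7,
    random_state=0,
)

# Hold out 20% of the samples, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Time only the fit call, so the figure reflects training and not data loading.
start = time.perf_counter()
model.fit(X_train, y_train)
training_seconds = time.perf_counter() - start

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"training time: {training_seconds:.2f} s, accuracy: {accuracy:.3f}")
```

The numbers printed for synthetic data say nothing about the real dataset; the point is only the measurement pattern, which can be reused once the descriptors are loaded from the exported files.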
In order to train the PySpark model, you first have to install Spark.
Comparison of different machine learning frameworks: PySpark, scikit-learn, TensorFlow. In particular, evaluation of Spark for ML:
Original data set: about 120000 samples, 25 classes.
Reduced data set (with the purpose of reducing bias): 7000 samples, 7 classes (1000 samples per class)
PySpark implementation: long training times (several hours on a virtual machine with 16 cores and 32 GB RAM). This may be due to a configuration error and indicates that more effort is needed to tune the PySpark model and installation.
scikit-learn implementation: training times on the order of a few minutes.
TensorFlow: offers other methods and is therefore difficult to compare with the previous two with respect to performance.
Open tasks: code consolidation, further evaluation of results, Spark environment configuration.
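The reduced data set mentioned above has the same number of samples for each class. How the classes and samples were actually selected is not documented here; the sketch below shows one plausible approach, drawing a fixed number of random samples per class. The function name `balanced_subsample` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsample(labels, classes, per_class, seed=0):
    """Return sorted row indices with exactly `per_class` samples for each class in `classes`."""
    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = []
    for name in classes:
        candidates = by_class[name]
        if len(candidates) < per_class:
            raise ValueError(f"class {name!r} has only {len(candidates)} samples")
        # Sample without replacement within the class.
        selected.extend(rng.sample(candidates, per_class))
    return sorted(selected)


# Toy labels: an imbalanced set with three hypothetical classes.
labels = ["a"] * 50 + ["b"] * 8 + ["c"] * 5
indices = balanced_subsample(labels, classes=["a", "b", "c"], per_class=5)
```

With `per_class=1000` and seven class names, the same call would yield the 7000 row indices of a balanced subset, provided each chosen class has at least 1000 samples.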