Popular image databases like Ecotaxa provide computer-aided annotation of images:
For automatic classification, Ecotaxa uses descriptors (features) extracted from the images, such as shape area and convexity.
This project has two objectives:
To train our models we used the same dataset as for the Deep Learning methods. The input data is stored in comma-separated files exported from Ecotaxa.
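As a minimal sketch of how such a comma-separated export could be turned into a feature matrix and a label list, the snippet below reads a small inline sample with Python's standard csv module. The column names (object_id, object_annotation_category, area, convexity) and the three rows are made up for illustration; the actual export layout may differ.

```python
import csv
import io

# Made-up rows standing in for an Ecotaxa export.
# Column names and values are assumptions, not the project's real data.
SAMPLE_EXPORT = """object_id,object_annotation_category,area,convexity
img_001,copepod,1520.0,0.91
img_002,diatom,310.5,0.78
img_003,copepod,1498.2,0.88
"""


def load_descriptors(handle, label_column, feature_columns):
    """Read a comma-separated file handle and return (features, labels)."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        # Convert the selected descriptor columns to floats, in a fixed order.
        features.append([float(row[name]) for name in feature_columns])
        labels.append(row[label_column])
    return features, labels


features, labels = load_descriptors(
    io.StringIO(SAMPLE_EXPORT),
    label_column="object_annotation_category",
    feature_columns=["area", "convexity"],
)
print(len(features), "samples,", len(set(labels)), "classes")
```

For a real export, the io.StringIO object would be replaced by an opened file, and feature_columns by the descriptor columns actually present in it.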
You can clone the project from GitLab with
git clone git@gitlab.awi.de:gbusatto/de.awi.analytics.lokides.git
In order to train the scikit-learn model, change to the project directory and run
python src/starred-randomforest-classifier.py
Preliminary results of image classification with descriptors and random forests show an accuracy comparable to that obtained by CNNs working directly on the image data. The training time for this model was on the order of a few minutes. These results must be further evaluated and validated on larger datasets.
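The general approach, a random forest trained on per-image descriptors, can be sketched with scikit-learn as follows. This is not the project's training script: the data is synthetic (generated with make_classification to mimic 7 classes and 7000 samples), and the number of features and all hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for descriptor data (assumption: 20 numeric descriptors).
X, y = make_classification(
    n_samples=7000,
    n_features=20,
    n_informative=10,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Hold out 20% of the samples, keeping the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are illustrative defaults, not tuned values.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

The accuracy printed here says nothing about the plankton data; it only shows where the evaluation step sits in the workflow.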
To train the PySpark model, you first have to install Spark.
Comparison of different machine learning frameworks: PySpark, scikit-learn, TensorFlow. In particular, evaluation of Spark for ML:
Original data set: about 120000 samples, 25 classes.
Reduced data set (with the purpose of reducing bias): 7000 samples, 7 classes (1000 samples per class)
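One way to build such a balanced subset is to draw the same number of samples at random from each selected class. The sketch below uses only the standard library; the function name balanced_subsample, the toy labels, and the fixed seed are hypothetical, and the procedure actually used to produce the reduced data set may have been different.

```python
import random
from collections import Counter, defaultdict


def balanced_subsample(labels, classes, per_class, seed=0):
    """Return sorted indices with exactly `per_class` samples for each class."""
    rng = random.Random(seed)

    # Group sample indices by label, ignoring classes that were not selected.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        if label in classes:
            by_class[label].append(index)

    selected = []
    for label in classes:
        indices = by_class[label]
        if len(indices) < per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} samples")
        # Sample without replacement so that no image is used twice.
        selected.extend(rng.sample(indices, per_class))
    return sorted(selected)


# Toy, deliberately imbalanced labels (made up for illustration).
labels = ["a"] * 50 + ["b"] * 10 + ["c"] * 5
subset = balanced_subsample(labels, classes=["a", "b"], per_class=5)
counts = Counter(labels[i] for i in subset)
print(counts)
```

With per_class=1000 and seven selected classes, the same function would yield a 7000-sample subset, provided each class has at least 1000 samples.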