Real-time arthropod taxonomy

Identifying arthropods in real time is a hard problem. First, arthropods comprise an extremely large number of species, even when the search is narrowed to the biodiversity of a specific biome. Furthermore, they come in a huge variety of forms, so classifying at a higher level, such as order, is still difficult. Look at the incredible variety within the order Coleoptera, for example.

Yet some applications require just that: real-time arthropod taxonomy. Examples include agriculture, biodiversity calculations, collection scanning and, more generally, any application that benefits from large-scale automation.

So here is a first try, comparing several machine learning methods, and evaluating the results in terms of accuracy and speed.

Continue reading “Real-time arthropod taxonomy”

Hearing in Penguins – Hörfähigkeiten von Pinguinen

New publication, at the Umweltbundesamt (German Environment Agency)!

The project investigated the hearing ability of Humboldt penguins. In addition, an animal audiogram database was developed that allows the published hearing curves of different marine animals to be compared. The project laid the foundation for future studies on the hearing of diving birds and thus contributed to a better understanding of the extent to which seabirds are affected by underwater noise.

https://www.umweltbundesamt.de/publikationen/hearing-in-penguins-hoerfaehigkeiten-von-pinguinen

Plant leaf classification

There are important potential applications for a machine learning system that can classify plants. For example, current research on crop protection is using machine learning for precision weed and plant disease detection. The importance of moving away from traditional pesticide-based crop protection methods cannot be overstated, as demonstrated by the alarming rate at which flowering plants are evolving away from insect pollination.

I obtained a dataset of plant leaf images (Hussain, 2023) and compared three machine learning algorithms for classification: multilayer perceptron, random forest and support vector machine. The classifiers’ accuracies vary between 74% and 84%.

The Jupyter notebook with the full Python code can be accessed on Kaggle.

Refactoring the Animal Sound Archive



For my latest project at Museum für Naturkunde Berlin, I refactored the Animal Sound Archive search interface and API. The Animal Sound Archive contains thousands of high-quality, scientifically checked recordings, which can be used freely for science or any other purpose. Thanks to the wonderful colleagues of the Animal Sound Archive Team, it has been a pleasure.

Check it out on GBIF!

Storing a taxonomic tree in a relational database

Taxonomic trees are ubiquitous in biodiversity software. A very common application is using a tree to let users browse the data. Other applications include training classification models, curating a collection and visualizing research results.

Data is often stored in a relational database, such as MySQL. Unfortunately, relational databases are not particularly well suited for storing tree structures. Yet the choice of a database may be guided by more important requirements, and so the taxonomic tree is sometimes implemented as an afterthought. The result can be a structure that is difficult to query, takes more work than expected to maintain and ultimately yields a less satisfying experience for the end user.

I will show some counterexamples, and how to get a better result by using a data structure called “nested set”.
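As a rough illustration of the nested set idea, here is a minimal sketch that is not taken from the full article. It assumes a hypothetical `taxon` table with `lft` and `rgt` columns, a tiny hand-numbered tree, and SQLite (through Python's built-in `sqlite3` module) as a stand-in for MySQL. Each node stores a left and a right number, and a node's descendants are exactly the rows whose interval lies inside its own.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE taxon (
        id   INTEGER PRIMARY KEY,
        name TEXT    NOT NULL,
        lft  INTEGER NOT NULL,
        rgt  INTEGER NOT NULL
    )
    """
)

# Hypothetical tree, numbered by a depth-first walk:
# Arthropoda (1, 10)
#   Insecta (2, 7)
#     Coleoptera (3, 4)
#     Lepidoptera (5, 6)
#   Arachnida (8, 9)
rows = [
    ("Arthropoda", 1, 10),
    ("Insecta", 2, 7),
    ("Coleoptera", 3, 4),
    ("Lepidoptera", 5, 6),
    ("Arachnida", 8, 9),
]
conn.executemany("INSERT INTO taxon (name, lft, rgt) VALUES (?, ?, ?)", rows)


def descendants(conn, name):
    """Return the names of all taxa strictly below `name`, in tree order."""
    sql = """
        SELECT child.name
        FROM taxon AS parent
        JOIN taxon AS child
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
    """
    return [row[0] for row in conn.execute(sql, (name,))]


def ancestors(conn, name):
    """Return the names of all taxa strictly above `name`, root first."""
    sql = """
        SELECT parent.name
        FROM taxon AS node
        JOIN taxon AS parent
          ON parent.lft < node.lft AND parent.rgt > node.rgt
        WHERE node.name = ?
        ORDER BY parent.lft
    """
    return [row[0] for row in conn.execute(sql, (name,))]


insecta_children = descendants(conn, "Insecta")
coleoptera_path = ancestors(conn, "Coleoptera")
```

Both lookups are single queries without recursion, which is the main appeal of this layout. The trade-off is that inserting or moving a taxon means renumbering the `lft` and `rgt` values of other rows.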

Continue reading “Storing a taxonomic tree in a relational database”

Computing α-diversity

Diversity indices are a common descriptive statistic in biodiversity informatics. They typically express the species richness of a given habitat or area. The α-diversity index is suitable when studying a single habitat and is expressed by a single number. Several equations are commonly used to compute α-diversity. In this example, I will be using Simpson’s diversity index, which is computed by the formula:

    \[D = 1 - \sum_{i=1}^{S}p_i^2\]

where S is the number of species in the sample and p_i is the proportion of individuals that belong to species i. Because the proportions are squared, Simpson’s diversity index is influenced more by common species than by rare ones, and it is often considered to be an index reflecting the actual species diversity in a sample.
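The formula can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the function name `simpson_diversity` and the sample list are hypothetical, and the input is assumed to contain one species name per observed individual.

```python
from collections import Counter


def simpson_diversity(observations):
    """Compute D = 1 - sum(p_i^2) from an iterable with one species name per individual."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    # p_i is the share of individuals that belong to species i.
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical sample: 6 individuals from 3 species.
sample = ["Parus major"] * 3 + ["Turdus merula"] * 2 + ["Sciurus vulgaris"]
d = simpson_diversity(sample)
```

For this sample the proportions are 3/6, 2/6 and 1/6, so D = 1 - 14/36 = 11/18, or about 0.61. A sample containing a single species gives D = 0.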

To illustrate this, I will use data obtained from GBIF. Remember, α-diversity is suitable for expressing the diversity within a single habitat, so I will obtain data accordingly. Here I chose the Tiergarten, a large (210 hectare) park in central Berlin.

Continue reading “Computing α-diversity”

Using neural networks to classify 3D scans

For my capstone project in machine learning at EPFL, I wrote a classifier capable of sorting 3D scans of archaeological objects by culture.

Digitizing collections is currently a major challenge for cultural heritage and natural history museums. Museums are expected to digitize their collections to improve not only the documentation of artifacts, but also their availability for research, reconstruction and outreach activities, and to make these digital representations available online.

Machine learning setup

Continue reading “Using neural networks to classify 3D scans”