Essential Biodiversity Variables: extracting plant phenological data from specimen labels using machine learning

Maria Mora-Cross; Adriana Morales-Carmiol; Te Chen-Huang; María Barquero-Pérez

doi:10.3897/rio.8.e86012

Research Ideas and Outcomes (Aug 2022)

Affiliations

Maria Mora-Cross: School of Computer Engineering, Costa Rica Institute of Technology
Adriana Morales-Carmiol: School of Computer Engineering, Costa Rica Institute of Technology
Te Chen-Huang: School of Computer Engineering, Costa Rica Institute of Technology
María Barquero-Pérez: School of Computer Engineering, Costa Rica Institute of Technology

DOI: https://doi.org/10.3897/rio.8.e86012
Journal volume & issue: Vol. 8
pp. 1 – 24

Abstract

Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate and standardize biodiversity data from varied sources. This document presents a mechanism to obtain baseline data to feed the Species Traits variable Phenology, or other biodiversity indicators, by extracting species character names and structure names from morphological descriptions of specimens and classifying such descriptions using machine learning (ML).

A workflow that performs Named Entity Recognition (NER) and classification of morphological descriptions using ML algorithms was evaluated. It was implemented using Python, PyTorch, scikit-learn, pomegranate, python-crfsuite, and other libraries, and applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). Text classification reached F1 scores between 96% and 99% using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, the extraction of names of morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) using NER algorithms was competitive, with F1 scores between 95% and 98% using Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short-Term Memory networks with CRF (BI-LSTM-CRF).
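To make the classification step and the reported metric concrete, the following is a minimal sketch of a Multinomial Naive Bayes text classifier with Laplace smoothing, together with the F1 score computed from precision and recall. It is not the authors' implementation, which relied on scikit-learn. It uses only the Python standard library, and the class name `TinyMultinomialNB`, the labels `morphological` and `other`, and the example label texts are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


class TinyMultinomialNB:
    """Illustrative Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)     # token frequencies per class
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for tok in tokenize(text):
                if tok in self.vocab:                # ignore unseen tokens
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f1_score(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy label texts (not taken from the INBio records).
TRAIN_TEXTS = [
    "Tree 5 m tall, flowers white, petals fused",
    "Shrub with red fruits and pubescent leaves",
    "Herb, sepals green, flowers yellow",
    "Collected along the road near the river",
    "Growing in secondary forest beside the trail",
    "Found near the station in pasture",
]
TRAIN_LABELS = ["morphological"] * 3 + ["other"] * 3

if __name__ == "__main__":
    clf = TinyMultinomialNB().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(clf.predict("Flowers white with green sepals"))
    print(clf.predict("Found beside the road near the forest"))
```

In practice, a library implementation with a proper train/test split over the full set of records would replace this sketch; the toy data here only shows how token counts per class drive the prediction and how F1 combines precision and recall.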

Published in Research Ideas and Outcomes

ISSN: 2367-7163 (Online)
Publisher: Pensoft Publishers
Country of publisher: Bulgaria
LCC subjects: Science
Website: http://rio.pensoft.net
