ELIE – Entomological Label Information Extraction
Source: DataCite Commons | Updated: 2025-12-15 | Indexed: 2026-05-04
Download link:
https://zenodo.org/records/15835907
Description:
Natural history museums curate billions of insect specimens, forming a vast but underutilized resource for biodiversity research. While digitization initiatives have increased the availability of high-resolution specimen images, extracting structured metadata from specimen labels remains a significant bottleneck, often requiring manual transcription.
To address this challenge, we developed ELIE (Entomological Label Information Extraction), a semi-automated pipeline that combines computer vision, convolutional neural networks (CNNs), optical character recognition (OCR), and clustering algorithms to streamline the extraction of entomological label data. ELIE operates in three stages:
1. Label detection and classification (e.g., printed vs. handwritten)
2. OCR-based text extraction from printed labels using Tesseract and the Google Vision API
3. Text-based clustering of OCR output using the K-Medoids algorithm at a 0.9 similarity threshold, allowing for optional human validation of clustered outliers.
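The third stage can be illustrated with a small sketch. This is not ELIE's implementation: it swaps K-Medoids for a simpler greedy grouping around a running medoid, uses Python's standard-library `difflib.SequenceMatcher` as the similarity measure (an assumption, since the description does not name one), and runs on invented sample strings. The function names `similarity`, `medoid`, and `group_labels` are hypothetical. Only the 0.9 threshold comes from the description above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalised strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def medoid(members: list[str]) -> str:
    """Pick the member with the highest total similarity to its cluster."""
    return max(members, key=lambda m: sum(similarity(m, o) for o in members))


def group_labels(texts: list[str], threshold: float = 0.9) -> list[list[str]]:
    """Greedy grouping: join the first cluster whose medoid is similar enough."""
    clusters: list[list[str]] = []
    for text in texts:
        for cluster in clusters:
            if similarity(text, medoid(cluster)) >= threshold:
                cluster.append(text)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([text])
    return clusters


# Invented OCR strings for illustration only (not taken from the dataset).
ocr_texts = [
    "Philippines, Luzon, Mt. Banahao, 1914",
    "Philippines, Luzon, Mt. Banahao, 1914.",
    "Phi1ippines, Luzon, Mt. Banahao, 1914",  # typical OCR confusion: l -> 1
    "Costa Rica, Heredia, La Selva, 1992",
]

clusters = group_labels(ocr_texts)
# Singleton clusters are candidates for optional human validation.
outliers = [c[0] for c in clusters if len(c) == 1]
```

In this sketch, near-duplicate transcriptions collapse into one cluster, and any text that matches no medoid at the threshold is left as a singleton for manual review.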
This dataset release supports the ELIE pipeline and includes annotated JPEG images and corresponding XML files, structured into training (80%), validation (20%), and testing (10%) subsets. All annotations are based on the “label” class, enabling robust model training for multi-label image (MLI) detection and object segmentation.
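As a rough illustration of how the image/XML pairs might be consumed, the sketch below reads bounding boxes for the "label" class from an annotation file. It assumes a Pascal VOC-style layout (`object`, `name`, `bndbox` elements), which the description does not confirm. The sample XML, the file name, and the function name `read_boxes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in an assumed Pascal VOC-style layout.
SAMPLE_XML = """<annotation>
  <filename>specimen_0001.jpg</filename>
  <object>
    <name>label</name>
    <bndbox><xmin>34</xmin><ymin>50</ymin><xmax>210</xmax><ymax>120</ymax></bndbox>
  </object>
  <object>
    <name>label</name>
    <bndbox><xmin>40</xmin><ymin>140</ymin><xmax>220</xmax><ymax>190</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text: str, wanted_class: str = "label"):
    """Return the image file name and (xmin, ymin, xmax, ymax) boxes for one class."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        if obj.findtext("name") != wanted_class:
            continue  # skip annotations of other classes
        bb = obj.find("bndbox")
        boxes.append(
            tuple(int(bb.findtext(key)) for key in ("xmin", "ymin", "xmax", "ymax"))
        )
    return filename, boxes


filename, boxes = read_boxes(SAMPLE_XML)
```

If the actual files use a different schema, only the element names inside `read_boxes` would need to change.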
In addition to image and XML data, this repository includes derived OCR output files (.json) and clustering results (.csv) for selected datasets. These resources facilitate downstream tasks such as label text parsing, automated record linkage, metadata deduplication, and large-scale content analysis.
The data spans seven digitization projects, totaling over 43,000 labeled images from diverse insect orders and geographic regions, including:
• AntWeb – Formicidae labels from global collections
• Bees Bytes – Apoidea labels digitized by the Museum für Naturkunde Berlin
• LEPPHIL – Lepidoptera labels from the Philippines by the Museum für Naturkunde Berlin
• MCZ_ENT_Boston – Hexapoda labels from the Museum of Comparative Zoology, Harvard
• MfN_LEP_SEASIA – Pyraloidea labels from Southeast Asia digitized by the Museum für Naturkunde Berlin
• Picturae_MfN – Hexapoda labels from the Museum für Naturkunde Berlin
• USNM_COL_CAM – Beetle labels from South and Central America digitized by the Smithsonian National Museum of Natural History
Benchmarking on this diverse dataset showed that ELIE successfully detected and clustered up to 98% of printed labels, significantly reducing manual effort in digitization workflows. By integrating AI-driven methods with structured OCR output and automated clustering, our approach enhances label metadata capture, accelerates biodiversity data accessibility, and supports scalable research in ecology, taxonomy, and biodiversity informatics.
Provider:
Anonymous
Created:
2025-07-07



