Data fusion for integrative species identification using deep learning

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.4qrfj6qjk

Description:

DNA analyses have revolutionized species identification and taxonomic work. Yet persistent challenges arise from little differentiation among species and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing of molecular and image preprocessing and fusion techniques, including practical advice for biologists, have been missing so far. We introduce a machine learning scheme that integrates both molecular and morphological data for species identification. First, we systematically assess and compare three DNA arrangement methods and two encoding methods. Then, we use artificial neural networks to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies on four eukaryotic datasets, comprising two plant families (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae), using Leave-One-Out Cross-Validation. In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as vectors of decimal numbers achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%).

Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with the plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae, where identification accuracy based on molecular data alone was already high (>96%), a statistically significant improvement was observed (+2.1%). Detailed analysis of confused samples shows that DNA tends to identify the genus correctly but fails to recognize the species. This shortcoming is alleviated by including morphological data in training, hinting at a hierarchical role of the modalities. We systematically showed and explained, for the first time, that optimal preprocessing and integration of molecular and image data offer significant benefits, particularly for genetically similar and morphologically indistinguishable species, enhancing species identification by reducing modality-specific failure rates and information gaps. Our results can inform integration efforts for various organism groups, improving automated identification across a wide range of eukaryotic species.
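
The three fusion strategies named in the description can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the feature sizes, the random placeholder weights, the ReLU activation, and the helper names (`dense`, `fuse_features`, `fuse_after_dense`, `fuse_scores`, `random_layer`) are all hypothetical, and averaging the scores in strategy III is an assumption, since the description does not say how the unimodal scores are combined.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer with ReLU (the activation is an assumption)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(dna_feat, img_feat):
    """Strategy I: concatenate features directly after extraction."""
    return dna_feat + img_feat

def fuse_after_dense(dna_feat, img_feat, dna_layer, img_layer):
    """Strategy II: one dense layer per modality, then concatenate."""
    return dense(dna_feat, *dna_layer) + dense(img_feat, *img_layer)

def fuse_scores(dna_scores, img_scores):
    """Strategy III: combine unimodal scores (averaging is an assumption)."""
    return [(a + b) / 2 for a, b in zip(dna_scores, img_scores)]

def random_layer(n_in, n_out, rng):
    """Hypothetical placeholder weights; a real model would learn these."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
dna_feat = [rng.random() for _ in range(8)]  # stand-in for extracted DNA features
img_feat = [rng.random() for _ in range(6)]  # stand-in for extracted image features

fused_i = fuse_features(dna_feat, img_feat)
fused_ii = fuse_after_dense(dna_feat, img_feat,
                            random_layer(8, 4, rng), random_layer(6, 4, rng))

# Strategy III works on the class scores of two already-trained unimodal models.
dna_scores = softmax([2.0, 1.0, 0.1])  # made-up scores for three species
img_scores = softmax([0.5, 2.5, 0.3])
fused_iii = fuse_scores(dna_scores, img_scores)
predicted = max(range(len(fused_iii)), key=fused_iii.__getitem__)
```

In strategies I and II the fused vector would feed a shared classifier trained on both modalities, whereas strategy III leaves the two unimodal models untouched and only combines their outputs.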

Created:

2025-11-12
