BIOSCAN-1M Insect Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8015205
Description:
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect Dataset. Each record is taxonomically classified by an expert and has associated genetic information, including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment. However, the dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community. Owing to its biological nature, the dataset exhibits a characteristic long-tailed class-imbalance distribution. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity.
Created:
2023-06-08



