BIOSCAN-1M Insect Dataset
Mendeley Data. Updated 2024-06-26; indexed 2024-06-29.
Download link:
https://zenodo.org/records/8030065
Resource description:
Overview
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-1M Insect Dataset. Each record is taxonomically classified by an expert and also has associated genetic information, including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically based proxies for species classification.
This is a curated million-image dataset intended primarily to train computer-vision models capable of image-based taxonomic assessment. However, the dataset also has characteristics whose study would interest the broader machine learning community. Because of its biological nature, the dataset exhibits a characteristic long-tailed class imbalance. Furthermore, taxonomic labelling is a hierarchical classification scheme, which presents a highly fine-grained classification problem at the lower levels.
Beyond spurring interest in biodiversity research within the machine learning community, progress on an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity.
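As a rough illustration of the long-tailed imbalance and hierarchical labelling described in the overview, the sketch below counts labels at two taxonomic levels and derives inverse-frequency class weights, one common way to compensate for rare classes during training. The records, the two-level (order, family) layout, and the function names are hypothetical and are not taken from the dataset's actual files or metadata schema.

```python
from collections import Counter

# Hypothetical (order, family) label pairs, for illustration only.
# They are NOT drawn from the BIOSCAN-1M metadata.
records = [
    ("Diptera", "Cecidomyiidae"),
    ("Diptera", "Cecidomyiidae"),
    ("Diptera", "Cecidomyiidae"),
    ("Diptera", "Sciaridae"),
    ("Diptera", "Sciaridae"),
    ("Hymenoptera", "Braconidae"),
    ("Coleoptera", "Staphylinidae"),
]


def class_counts(rows, level):
    """Count how many records carry each label at the given hierarchy level."""
    return Counter(row[level] for row in rows)


def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


order_counts = class_counts(records, level=0)    # coarse level
family_counts = class_counts(records, level=1)   # finer level
family_weights = inverse_frequency_weights(family_counts)

for label, n in family_counts.most_common():
    print(f"{label}: count={n}, weight={family_weights[label]:.2f}")
```

The finer level has more classes with fewer examples each, which is why the lower taxonomic ranks form the harder, more imbalanced problem.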
Created:
2023-06-28



