Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14051520
Description:
This dataset contains supporting data for a research project that analyses herbarium samples from the New England area at large scale with deep learning techniques. Details of the methodology are given in the accompanying paper (to be published).
Content:
dataset600k_withAI.csv: A dataset of over 600,000 herbarium samples with their record metadata and the corresponding AI phenological annotations with matching confidence scores. All record headers are provided, extracted directly from the NEVP portal. In addition, the AI labels are defined by the following headers. These 8 columns represent 4 binary classifiers, giving the presence/absence of each of the 4 traits and the corresponding confidence (described as a percentage; the presence and absence values for each trait sum to 1).
Flowering
Not Flowering
Budding
Not Budding
Fruiting
Not Fruiting
Reproductive
Not Reproductive
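The confidence columns above lend themselves to a simple threshold filter. The following is a minimal sketch, not the authors' method: it assumes the confidences are stored as fractions in [0, 1], and the `id` column, the sample values, the 0.9 threshold and the `label_row` helper are all hypothetical.

```python
import csv
import io

# The four traits named in the description; each has a "Not <trait>" twin column.
TRAITS = ["Flowering", "Budding", "Fruiting", "Reproductive"]


def label_row(row, threshold=0.9):
    """Map each trait to True (present), False (absent) or None (low confidence).

    `threshold` is a hypothetical cut-off, not a value taken from the dataset.
    """
    labels = {}
    for trait in TRAITS:
        present = float(row[trait])
        absent = float(row[f"Not {trait}"])
        # The description states that each presence/absence pair sums to 1.
        if abs(present + absent - 1.0) > 1e-3:
            raise ValueError(f"{trait}: confidences do not sum to 1")
        confidence = max(present, absent)
        labels[trait] = (present >= absent) if confidence >= threshold else None
    return labels


# Made-up rows for illustration only; the real file also carries NEVP metadata.
SAMPLE_CSV = """id,Flowering,Not Flowering,Budding,Not Budding,Fruiting,Not Fruiting,Reproductive,Not Reproductive
a,0.97,0.03,0.40,0.60,0.02,0.98,0.99,0.01
b,0.55,0.45,0.05,0.95,0.91,0.09,0.93,0.07
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labelled = [label_row(row) for row in rows]
```

Returning `None` for low-confidence predictions keeps uncertain records distinguishable from confident absences, so they can be dropped or sent for manual review.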
data_species_with_statuses.csv: A processed dataset summarizing flowering period shift at a species level. Two types of headers are provided.
First, metadata concerning the flowering shift and the data used to compute that value:
genus: Genus of the species
genus_species: Binomial name of the species
slope: Regression slope defining the flowering shift
nb_specimens: Number of herbarium specimens used to compute the shift
p_value_significance: P-value significance of the slope being non-zero ('Non Significant'/'Significant')
trend_category: Summary of the shift as a binary characteristic ('Earlier'/'Later')
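To illustrate how a slope and a trend category can relate, here is a hedged sketch. It assumes the regression is flowering day of year against collection year, that a negative slope means 'Earlier', and that a slope of exactly zero falls under 'Later'. The description does not confirm any of these points, and the function names and sample values are hypothetical.

```python
def least_squares_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def trend_category(slope):
    """Assumed mapping: negative slope -> 'Earlier', otherwise 'Later'."""
    return "Earlier" if slope < 0 else "Later"


# Made-up observations: collection year and flowering day of year.
years = [1900, 1950, 2000]
flowering_doy = [160, 155, 150]

slope = least_squares_slope(years, flowering_doy)  # days per year
category = trend_category(slope)
```

Under these assumptions the made-up series loses ten days over a century, so it is classed as 'Earlier'. The significance flag in p_value_significance would need a separate test that this sketch does not cover.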
Second, metadata summarizing various traits associated with each species:
lifeform_status: Growth form from the USDA PLANTS Database. 'Forb_Herb', 'Shrub_Tree' or 'Vine'
native_introduced_status: 'Native'/'Introduced' status from the USDA PLANTS Database.
wetland_status: National Wetland Plant List (NWPL) Wetland Indicator Status within the Northcentral and Northeast Region. 'OBL'/'FACW'/'FAC'/'FACU'/'UPL'
seasonality_average: A characteristic of the flowering season of the species based on the mean Day of Year of the analysed specimens: if <=180: 'Early', else 'Late'
seasonality_spread: A characteristic of the flowering season of the species based on the spread of the flowering season. Less than 28 days: 'Narrow', larger: 'Large'.
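The two seasonality thresholds can be written as a pair of small functions. This is a sketch under stated assumptions: the description does not say how the spread is measured or how a spread of exactly 28 days is classed, so the spread is taken here as an already-computed number of days and 28 is treated as 'Large'. The function names are hypothetical.

```python
def seasonality_average(mean_doy):
    """'Early' if the mean flowering day of year is <= 180, else 'Late'."""
    return "Early" if mean_doy <= 180 else "Late"


def seasonality_spread(spread_days):
    """'Narrow' if the spread is under 28 days, else 'Large'.

    Treating exactly 28 days as 'Large' is an assumption.
    """
    return "Narrow" if spread_days < 28 else "Large"


# Made-up values for a single species.
example = (seasonality_average(172.5), seasonality_spread(35))
```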
phylogenetic_tree.tre: The raw data used to generate the visualization of the flowering seasonality character and the detected flowering shift for each species on a phylogenetic tree.
phylogenetic_processed_dataset.csv: The processed dataset resulting from the phylogenetic signal analysis. For each trait, an associated binary significance value is provided.
Created:
2024-11-22



