A curated benchmark dataset for molecular identification based on genome skimming

Name: A curated benchmark dataset for molecular identification based on genome skimming
Creator: Harvard Dataverse
Published: 2025-04-03 14:19:04
License: No description available

DataCite Commons: updated 2025-04-03, indexed 2025-04-15

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/IMOX0S

Description:

Genome skimming is an emerging tool that allows DNA barcoding efforts to scale across numerous biodiversity science applications. Despite its growing importance, there are few standardized datasets for benchmarking genome skimming tools, which makes it challenging to evaluate new methods (e.g., those using machine learning) and to compare them with existing ones (e.g., conventional barcoding loci). As part of the development of varKoder, a new tool for DNA-based identification, we curated four datasets designed for comparing molecular identification tools using low-coverage genomes. These datasets span a wide range of phylogenetic and taxonomic diversity, from closely related species to all taxa currently represented on NCBI SRA. One of them consists of novel sequences from taxonomically verified samples in the plant clade Malpighiales, while the other three compile publicly available data. All include raw genome skim sequences to enable comprehensive testing and validation of a variety of molecular species identification methods. We also provide the two-dimensional graphical representations of genomic data (chaos game representations and varKodes) that have been used to develop and test varKoder. These datasets represent a reliable resource for researchers to assess the accuracy, efficiency, and robustness of new tools relative to varKoder and other methods in a consistent and reproducible manner. See README.md for details on data organization.
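
For readers unfamiliar with the chaos game representations mentioned above, the following is a minimal illustrative sketch of the general technique, not the code used to produce the files in this dataset. The corner assignment, the function names `cgr_points` and `cgr_grid`, and the choice to skip ambiguous bases are all assumptions made for illustration; varKoder's own layout and preprocessing may differ.

```python
# Generic chaos game representation (CGR) of a DNA sequence.
# Assumption: corners follow one common convention (A, C, G, T placed
# counter-clockwise from the origin as listed below); other tools may
# use a different layout.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: one (x, y) point per unambiguous base."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ambiguous bases such as N are skipped
        cx, cy = CORNERS[base]
        # Move halfway from the current position towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, size=4):
    """Bin the CGR points into a size x size grid of counts (a 2D image)."""
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Calling `cgr_grid(read_sequence, size=64)` on a hypothetical read string would yield a small count matrix that can be rendered as a grayscale image, which is the general kind of two-dimensional representation the description refers to.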

Provider:

Harvard Dataverse

Created:

2024-12-10
