Benchmarking Dataset
Source: DataCite Commons. Updated: 2024-08-23. Indexed: 2024-08-26.
Download link:
https://figshare.com/articles/dataset/Benchmarking_Dataset/26816035
Description:
Extract from the Global Ocean Eukaryotic Viral (GOEV) database, Gaïa, M. et al. https://www.nature.com/articles/s41586-023-05962-4
The dataset comprises:
591 MAGs from Schulz, F. et al. https://doi.org/10.1038/s41586-020-1957-x (2020)
445 MAGs from Sunagawa, S. et al. https://doi.org/10.1038/s41579-020-0364-5 (2020)
218 MAGs from Moniruzzaman, M. et al. https://doi.org/10.1038/s41467-020-15507-2
158 reference viral assemblies
Last accessed: July 20, 2024.
Source: GOEV_DB_CONTIGS.db.zip at https://doi.org/10.6084/m9.figshare.20284713
Selection Criteria: Only data labeled at the Order taxon level were kept.
Sampling Method: NA
Taxonomic Assignment: Original labels as indicated in the GOEV database.
--
Known Viral Sequence Clusters (kVSCs) from Zolfo, M. et al. https://doi.org/10.1101/2024.02.19.580813
Source: DNA sequences VSC5_rep_fnas_nr99_45k_metaphlanDB.fna.gz and metadata file VSCs_groups.csv, downloaded from https://zenodo.org/records/10512460, last accessed on June 28, 2024.
Selection Criteria: Starting from the 45,872 representative sequences included in the MetaPhlAn 4.1 module, we selected the kVSCs, i.e. the sequences that cluster together with a RefSeq representative.
Sampling Method: The RefSeq accessions contained in the metadata table were matched to those in the International Committee on Taxonomy of Viruses (ICTV) Release #39 to ensure correct labels. Final number of eligible samples: 2,232.
Taxonomic Assignment: Taken from ICTV Release #39 via the linked RefSeq accessions.
--
Extract from the International Committee on Taxonomy of Viruses (ICTV) Release #39
Source: Data downloaded using ICTVdump https://github.com/christopher-riccardi/ICTVdump on July 17, 2024.
Selection Criteria: Entries for viruses present in both VMR releases #37 and #39, with at least two representatives per family.
Sampling Method: Up to 5 viral exemplar genomes per family, drawn at random with pandas' sample() method. 192 families are represented.
Taxonomic Assignment: The ICTV-ratified taxonomic lineage, Lefkowitz, E. J. et al. https://doi.org/10.1093/nar/gkx932
--
Extract from RefSeq
Source: NCBI Virus resource at https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/
Selection Criteria: We started from all RefSeq viral sequences and their annotation file (18,670 sequences, last accessed July 16, 2024 from the NCBI Virus resource https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&SourceDB_s=RefSeq). We computed the Mash distance between all NCBI Virus sequences and all viruses listed in ICTV Release #39. Only sequences with a minimum Mash distance of 0.1 from the ICTV data were kept, to ensure a non-overlapping dataset.
Sampling Method: See Selection Criteria. Final number of eligible samples: 1,127.
Taxonomic Assignment: Family-level taxon as specified in the RefSeq annotation file.

= Reduction Study Dataset =
(Another) extract from the International Committee on Taxonomy of Viruses (ICTV) Release #39
Source: Data downloaded using ICTVdump https://github.com/christopher-riccardi/ICTVdump on July 17, 2024.
Selection Criteria: 1,000 randomly sampled viruses.
Sampling Method: pandas' sample() method.
Taxonomic Assignment: The ICTV-ratified taxonomic lineage, Lefkowitz, E. J. et al. https://doi.org/10.1093/nar/gkx932
Notes: Starting data for the reduction study. We provide the source code for generating the fragmented genomes.
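The family-wise sampling described for the ICTV benchmarking extract can be sketched with pandas as follows. This is a minimal illustration, not the authors' code: it reads "up to 5" as a per-family cap, and the column names Family and Accession, the helper name sample_exemplars, the seed, and the toy table are all assumptions made for the example. The filter on presence in both VMR releases #37 and #39 is not modeled here.

```python
import pandas as pd


def sample_exemplars(vmr: pd.DataFrame, max_per_family: int = 5,
                     min_per_family: int = 2, seed: int = 0) -> pd.DataFrame:
    """Drop under-represented families, then draw up to max_per_family rows from each."""
    # Number of rows available for the family of each entry.
    family_sizes = vmr["Family"].map(vmr["Family"].value_counts())
    eligible = vmr[family_sizes >= min_per_family]
    # DataFrame.sample() raises if n exceeds the group size, so cap n per group.
    parts = [
        group.sample(n=min(len(group), max_per_family), random_state=seed)
        for _, group in eligible.groupby("Family")
    ]
    if not parts:
        return eligible
    return pd.concat(parts).sort_index()


# Hypothetical toy table: FamA has 7 entries, FamB has 3, FamC has 1.
toy = pd.DataFrame({
    "Family": ["FamA"] * 7 + ["FamB"] * 3 + ["FamC"],
    "Accession": [f"ACC{i:03d}" for i in range(11)],
})
picked = sample_exemplars(toy)
print(picked["Family"].value_counts())
```

With the toy table, FamA is capped at five entries, FamB keeps all three, and FamC is dropped for having a single representative. Fixing random_state makes the draw repeatable; the seed used for the published dataset is not stated in the description.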
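The Mash-distance filter described for the RefSeq extract can be sketched in the same way. The example assumes the pairwise distances were already written by `mash dist` in its tab-separated layout (reference, query, distance, p-value, shared hashes), that the ICTV genomes were used as references and the RefSeq genomes as queries, and that the 0.1 threshold is inclusive. The function name, column labels, and the toy table are hypothetical.

```python
import io

import pandas as pd

# Column labels chosen for this sketch; `mash dist` output has no header row.
MASH_COLUMNS = ["reference", "query", "distance", "p_value", "shared_hashes"]


def select_distant_queries(mash_table, min_distance: float = 0.1) -> list:
    """Return queries whose nearest reference is at least min_distance away."""
    dist = pd.read_csv(mash_table, sep="\t", names=MASH_COLUMNS)
    # Distance from each query to its closest reference genome.
    nearest = dist.groupby("query")["distance"].min()
    return sorted(nearest[nearest >= min_distance].index)


# Hypothetical distances for three RefSeq queries against two ICTV references.
EXAMPLE_TABLE = (
    "ictv_A.fna\trefseq_1.fna\t0.02\t0\t800/1000\n"
    "ictv_B.fna\trefseq_1.fna\t0.30\t1e-10\t5/1000\n"
    "ictv_A.fna\trefseq_2.fna\t0.25\t1e-20\t10/1000\n"
    "ictv_B.fna\trefseq_2.fna\t0.12\t1e-50\t90/1000\n"
    "ictv_A.fna\trefseq_3.fna\t1\t1\t0/1000\n"
    "ictv_B.fna\trefseq_3.fna\t0.1\t1e-60\t120/1000\n"
)
kept = select_distant_queries(io.StringIO(EXAMPLE_TABLE))
print(kept)
```

Here refseq_1.fna is discarded because one ICTV genome lies at distance 0.02, while the other two queries are kept. Taking the minimum over all references is what makes the retained set non-overlapping with the ICTV data: a single close match is enough to exclude a sequence.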
Provider:
figshare
Created:
2024-08-23



