Scorpio Gene-Taxa Benchmark Dataset2 (Short Fragments)
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14042229
Description:
We used the Woltka pipeline to compile the complete Basic genome dataset, consisting of 4634 genomes, with each genus represented by a single genome. After downloading all coding sequences (CDS) from the NCBI database, we extracted 8 million distinct CDS, focusing on bacteria and archaea and excluding viruses and fungi due to inadequate gene information.
To maintain accuracy, we excluded hypothetical proteins, uncharacterized proteins, and sequences without gene labels. We addressed issues with gene name inconsistencies in NCBI by keeping only genes with more than 1000 samples and ensuring each phylum had at least 350 sequences. This resulted in a curated dataset of 800,318 gene sequences from 497 genes across 2046 genera.
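The curation rules above can be sketched as a small filter. This is only an illustration under stated assumptions: the record keys `gene`, `product`, and `phylum` are hypothetical (the original pipeline's field names are not given), and the order in which the gene and phylum thresholds are applied is a guess.

```python
from collections import Counter

# Product labels that the description says were excluded.
EXCLUDED_LABELS = ("hypothetical protein", "uncharacterized protein")


def has_usable_label(record):
    """True if the record has a gene label and a characterized product.

    The keys 'gene' and 'product' are hypothetical field names.
    """
    gene = record.get("gene", "").strip()
    product = record.get("product", "").lower()
    return bool(gene) and not any(term in product for term in EXCLUDED_LABELS)


def filter_records(records, min_gene_samples=1000, min_phylum_sequences=350):
    """Keep genes with more than `min_gene_samples` records, then keep
    phyla that still have at least `min_phylum_sequences` records.

    Applying the gene filter before the phylum filter is an assumption.
    """
    labeled = [r for r in records if has_usable_label(r)]

    gene_counts = Counter(r["gene"] for r in labeled)
    frequent = [r for r in labeled if gene_counts[r["gene"]] > min_gene_samples]

    phylum_counts = Counter(r["phylum"] for r in frequent)
    return [r for r in frequent
            if phylum_counts[r["phylum"]] >= min_phylum_sequences]
```

The thresholds default to the values quoted in the description, but they are ordinary parameters, so the same function can be tried on a small sample with lower cut-offs.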
We created four datasets to evaluate our model: a training set (Train_set); a test set (Test_set) containing different samples from the same genera and genes as the training set; a Taxa_out_set containing sequences from 18 phyla that were excluded from the training set; and a Gene_out_set containing 60 genes that were excluded from the training set but come from the same phyla. We ensured that each CDS had only one representation per genome by removing genes with multiple representations within the same species.
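A hold-out partition of this kind can be sketched as follows. The record keys `phylum` and `gene` are hypothetical, the held-out phyla and genes are passed in rather than listed (the description does not name them), and giving the phylum check precedence over the gene check is an assumption. The further division of the remaining records into Train_set and Test_set is not shown.

```python
def partition_holdouts(records, held_out_phyla, held_out_genes):
    """Route each record to 'taxa_out', 'gene_out', or 'in_distribution'.

    A record from a held-out phylum goes to 'taxa_out' even if its gene is
    also held out; that precedence is an assumption of this sketch.
    """
    held_out_phyla = set(held_out_phyla)
    held_out_genes = set(held_out_genes)
    buckets = {"in_distribution": [], "taxa_out": [], "gene_out": []}

    for record in records:
        if record["phylum"] in held_out_phyla:
            buckets["taxa_out"].append(record)
        elif record["gene"] in held_out_genes:
            buckets["gene_out"].append(record)
        else:
            buckets["in_distribution"].append(record)
    return buckets
```

With this routing, the `in_distribution` bucket never contains a held-out phylum or gene, which is the property the two out-of-distribution sets rely on.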
We derived 400 bp fragments from the 800k-sequence gene dataset. For each gene sequence, we randomly selected 400 bp segments from different regions, ensuring a gap of at least 50 bp between consecutive fragments. Start positions were drawn from the range [0, length(gene_sequence)] in steps of 50 bp. We chose this approach to avoid fragments that differ by only a few base pairs and to mimic sequences that do not necessarily begin at the start of an open reading frame.
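One possible reading of this sampling scheme is sketched below: candidate start positions lie on a 50 bp grid, and a random subset of them is chosen. The function name, the `max_fragments` parameter, and the decision to stop the grid early enough that every fragment is a full 400 bp are assumptions made for illustration, not details taken from the original pipeline.

```python
import random


def sample_fragments(gene_sequence, fragment_length=400, gap=50,
                     max_fragments=3, seed=None):
    """Return (start, fragment) pairs with starts on a grid of `gap` bp.

    Because distinct grid points differ by at least `gap`, any two sampled
    fragments start at least `gap` bp apart.
    """
    rng = random.Random(seed)

    # Last start position that still yields a full-length fragment
    # (assumption: truncated fragments are not wanted).
    last_start = len(gene_sequence) - fragment_length
    if last_start < 0:
        return []  # gene is shorter than one fragment

    candidate_starts = list(range(0, last_start + 1, gap))
    k = min(max_fragments, len(candidate_starts))
    starts = sorted(rng.sample(candidate_starts, k))
    return [(s, gene_sequence[s:s + fragment_length]) for s in starts]
```

Since start positions are multiples of 50 rather than multiples of 3, most sampled fragments are out of frame relative to the gene's start codon, which fits the stated goal of not always beginning at an open reading frame.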
Technical info
test.fasta: Contains sequences for model testing.
gene_out.fasta: Includes sequences excluded based on gene criteria for model evaluation.
taxa_out.fasta: Includes sequences excluded based on taxonomic criteria for model evaluation.
val.fasta: Contains sequences for model validation.
train.fasta: Contains sequences for model training.
meta_data.csv: Contains metadata information for sequences in the FASTA files.
hierarchical-level.txt: Determines hierarchical levels for triplet training and hierarchical sampling required for Scorpio training.
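The FASTA files listed above can be read with a few lines of standard-library Python. This is a minimal sketch that assumes conventional FASTA layout (a `>` header line followed by one or more sequence lines); the helper names are hypothetical, and nothing here assumes a particular header format or any column names in meta_data.csv.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_fasta(path):
    """Read a whole FASTA file, e.g. load_fasta("train.fasta")."""
    with open(path, encoding="utf-8") as handle:
        return list(read_fasta(handle))
```

For the metadata, `csv.DictReader` from the standard library exposes whatever column names the file's header row defines, so the columns can be inspected before deciding how to match rows to sequences.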
@article{refahi2024scorpio,
title={Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences},
author={Refahi, Mohammadsaleh and Sokhansanj, Bahrad A and Mell, Joshua Chang and Brown, James and Yoo, Hyunwoo and Hearne, Gavin and Rosen, Gail},
journal={bioRxiv},
pages={2024--07},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
Created:
2024-12-08



