
Data from: Identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/8120879
Description:
This repository contains the data and Docker image needed to reproduce the results of our paper, "Identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning."

Authors: Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu*

This work has been accepted by GigaScience.

Citation: Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu. "IPEV: Identification of Prokaryotic and Eukaryotic Virus-Derived Sequences in Virome Using Deep Learning." GigaScience 13 (2024): giae018. https://doi.org/10.1093/gigascience/giae018.

Background: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.

Findings: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, with a 2D convolutional neural network combining trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by up to 1,225-fold compared to existing methods under the same computing configuration.
We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.

Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.

5_fold_cross_validation.zip: Dataset for the cross-validation of IPEV
Eukaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for cross-validation Dataset-1 (eukaryotic viruses)
Prokaryotic_virus_CV_Dataset-1.csv: GI and accession IDs for cross-validation Dataset-1 (prokaryotic viruses)
Test_Prokaryotic_virus_Dataset-1.fasta: Independent test set for IPEV (prokaryotic viruses)
Test_Eukaryotic_virus_Dataset-1.fasta: Independent test set for IPEV (eukaryotic viruses)

Dataset_sequencing_error.zip: Simulated dataset with sequencing errors
Cap_enzyme_sequence.fasta: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article
Dataset_runtime_evaluation.zip: Dataset for evaluating the runtime of IPEV
Receptor_binding_protein_accession_id: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected by our article

archaea_ID.txt: Accession ID information for the reference archaea dataset
bacteria_ID.txt: Accession ID information for the reference bacterial dataset
marine_virome_id.csv: Ocean virome data information used in our paper
gut_virome.csv: Gut virome data information used in our paper
fungi.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
bacteria.txt: Negative sequence information used to train, validate, and test the model in the decontamination function

Reproduce the results of our paper from a Docker image
We also provide a Docker image that requires no environment configuration. You can use it to reproduce the results of our paper (e.g., train and test our IPEV model).

1. Pull the dryinhc/ipev_v1 image from Docker Hub. Open a terminal window and run the following command:

    docker pull dryinhc/ipev_v1

   This will download the image to your local machine.

2. Run the dryinhc/ipev_v1 image. In the same terminal window, run the following command:

    docker run -it --rm dryinhc/ipev_v1

   This will start a container based on the image and run the IPEV tool. Inside the container, you can run cd train or cd into any of the other folders. To exit the container, press Ctrl+D or type exit.

The container holds four directories: 5-fold cross-validation, independent set, marine virome, and gut virome. The 5-fold cross-validation directory holds the scripts required to run the 5-fold cross-validation. The independent set directory contains the scripts for working with the independent test set. The marine virome and gut virome directories store the scripts for analyzing real datasets.

We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.

Contact: If you have any questions, please don't hesitate to ask: yinhengchuang@pku.edu.cn or hqzhu@pku.edu.cn
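
For readers who want to inspect the independent test sets named in the file inventory before running the Docker image, the following is a minimal sketch, assuming the files are standard nucleotide FASTA. It reads records and computes plain trinucleotide frequencies. This is not the IPEV encoding: per the description, IPEV combines trinucleotide pair relative distance and frequency in a 2D convolutional network, whereas this sketch covers only the simpler frequency idea. The function names `read_fasta`, `trinucleotide_frequencies`, and `summarize_fasta` are hypothetical and are not part of the IPEV code base.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterator, TextIO, Tuple

# All 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def trinucleotide_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each overlapping trinucleotide in seq.

    Windows containing anything other than A, C, G, T (for example N)
    are ignored. A sequence with no valid window yields all zeros.
    """
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts[t] for t in TRINUCLEOTIDES)
    return {t: (counts[t] / total if total else 0.0) for t in TRINUCLEOTIDES}


def summarize_fasta(path: str) -> None:
    """Print header, length, and most frequent trinucleotide per record.

    Illustrative helper only; the path is supplied by the caller.
    """
    with open(path) as handle:
        for header, seq in read_fasta(handle):
            freqs = trinucleotide_frequencies(seq)
            print(header, len(seq), max(freqs, key=freqs.get))
```

Calling `summarize_fasta("Test_Prokaryotic_virus_Dataset-1.fasta")` would print one line per record. The file's location after download and extraction is an assumption here, so adjust the path to wherever the file actually sits.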
Created: 2024-10-08