Data from: Identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning
Source: Zenodo. Updated: 2024-10-08. Indexed: 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.8120879
Description:
This repository contains the data and Docker image needed to reproduce the results of our paper, "Identification of Prokaryotic and Eukaryotic Virus-Derived Sequences in Virome Using Deep Learning".
Authors: Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu*
This work has been accepted by GigaScience.
Hengchuang Yin, Shufang Wu, Jie Tan, Qian Guo, Mo Li, Jinyuan Guo, Yaqi Wang, Xiaoqing Jiang, and Huaiqiu Zhu. "IPEV: Identification of Prokaryotic and Eukaryotic Virus-Derived Sequences in Virome Using Deep Learning." GigaScience 13 (2024): giae018. https://doi.org/10.1093/gigascience/giae018.
Background: The virome obtained through virus-like particle enrichment contains a mixture of prokaryotic and eukaryotic virus-derived fragments. Accurate identification and classification of these elements are crucial to understanding their roles and functions in microbial communities. However, the rapid mutation rates of viral genomes pose challenges in developing high-performance tools for classification, potentially limiting downstream analyses.
Findings: We present IPEV, a novel method to distinguish prokaryotic and eukaryotic viruses in viromes, with a 2D convolutional neural network combining trinucleotide pair relative distance and frequency. Cross-validation assessments of IPEV demonstrate its state-of-the-art precision, significantly improving the F1-score by approximately 22% on an independent test set compared to existing methods when query viruses share less than 30% sequence similarity with known viruses. Furthermore, IPEV outperforms other methods in accuracy on marine and gut virome samples based on annotations by sequence alignments. IPEV reduces runtime by at most 1,225 times compared to existing methods under the same computing configuration. We also utilized IPEV to analyze longitudinal samples and found that the gut virome exhibits a higher degree of temporal stability than previously observed in persistent personal viromes, providing novel insights into the resilience of the gut virome in individuals.
Conclusions: IPEV is a high-performance, user-friendly tool that assists biologists in identifying and classifying prokaryotic and eukaryotic viruses within viromes. The tool is available at https://github.com/basehc/IPEV.
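The Findings paragraph says the model input combines trinucleotide pair relative distance and frequency. As a rough illustration of the frequency ingredient only, the sketch below counts overlapping trinucleotides in a sequence. It is not IPEV's actual encoding: the function name `trinucleotide_frequencies`, the handling of ambiguous bases, and the normalization are assumptions made for this example, and the pair relative distance component is not modeled. See the GitHub repository for the real implementation.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_frequencies(seq):
    """Return the relative frequency of each overlapping trinucleotide.

    Hypothetical helper for illustration. Windows containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length 3.
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    # Keep only the 64 unambiguous trinucleotides.
    valid = {k: counts.get(k, 0) for k in TRINUCLEOTIDES}
    total = sum(valid.values())
    if total == 0:
        return {k: 0.0 for k in TRINUCLEOTIDES}
    return {k: v / total for k, v in valid.items()}


# "ATGATGCC" has six windows: ATG, TGA, GAT, ATG, TGC, GCC.
freqs = trinucleotide_frequencies("ATGATGCC")
```

A fixed ordering of the 64 trinucleotides keeps the output comparable across sequences of different lengths, which is the usual reason to normalize counts into frequencies.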
5_fold_cross_validation.zip: Dataset of cross-validation of IPEV
Eukaryotic_virus_CV_Dataset-1.csv: GI and accession ID for the cross-validation Dataset-1 (eukaryotic virus)
Prokaryotic_virus_CV_Dataset-1.csv: GI and accession ID for the cross-validation Dataset-1 (prokaryotic virus)
Test_Prokaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (prokaryotic virus)
Test_Eukaryotic_virus_Dataset-1.fasta: An independent test set of IPEV (eukaryotic virus)
Dataset_sequencing_error.zip: Simulated dataset with sequencing errors
Cap_enzyme_sequence.fasta: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected in our article
Dataset_runtime_evaluation.zip: Dataset for evaluating the runtime of IPEV
Receptor_binding_protein_accession_id: Accession IDs of Receptor Binding Proteins (RBPs) in phages collected in our article
archaea_ID.txt: Accession ID information for the reference archaea dataset
bacteria_ID.txt: Accession ID information for the reference bacterial dataset
marine_virome_id.csv: Ocean virome data information used in our paper
gut_virome.csv: Gut virome data information used in our paper
fungi.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
bacteria.txt: Negative sequence information used to train, validate, and test the model in the decontamination function
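Before running anything on the independent test sets listed above, it can help to check that the FASTA files downloaded and extracted correctly. The sketch below is a minimal pure-Python reader that reports the number of records and total bases per file. It assumes the two test FASTA files sit in the current working directory; the helper names `read_fasta` and `summarize` are made up for this example and are not part of IPEV.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file (hypothetical helper)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def summarize(path):
    """Return the record count and total sequence length for one file."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {"records": len(lengths), "total_bases": sum(lengths)}


if __name__ == "__main__":
    # File names are taken from the list above; their location is assumed.
    for name in (
        "Test_Prokaryotic_virus_Dataset-1.fasta",
        "Test_Eukaryotic_virus_Dataset-1.fasta",
    ):
        if Path(name).exists():
            print(name, summarize(name))
        else:
            print(name, "not found in the current directory")
```

The reader streams one record at a time, so it stays light on memory even for large files. For anything beyond a quick sanity check, an established parser such as Biopython's `SeqIO` is probably the better choice.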
Reproduce the results of our paper from a Docker image.
We also provide a Docker image that requires no environment configuration. You can reproduce the results of our paper (e.g., train and test our IPEV model) in a container started from this image.
Pull the dryinhc/ipev_v1 image from Docker Hub. Open a terminal window and run the following command:
docker pull dryinhc/ipev_v1
This will download the image to your local machine.
Run the dryinhc/ipev_v1 image. In the same terminal window, run the following command:
docker run -it --rm dryinhc/ipev_v1
This will start a container based on the image and run the IPEV tool.
Inside the container, you can run cd train, or cd into any of the other folders.
To exit the container, press Ctrl+D or type exit. Because the container is started with --rm, it is removed on exit, so any files created inside it are discarded unless you copy them out first.
The image contains four directories: 5-fold cross-validation, independent set, marine virome, and gut virome. The 5-fold cross-validation directory holds the scripts for running 5-fold cross-validation. The independent set directory contains the scripts for evaluating on the independent test set. The marine virome and gut virome directories store the scripts for analyzing real datasets.
We hereby confirm that the dataset associated with the research described in this work is made available to the public under the Creative Commons Zero (CC0) license.
Contact
If you have any questions, please contact yinhengchuang@pku.edu.cn or hqzhu@pku.edu.cn.
Provider:
Zenodo
Created:
2023-07-06



