zakiasalod/VPAgs-Dataset4ML

Name: zakiasalod/VPAgs-Dataset4ML
Creator: zakiasalod
Published: 2024-03-24 11:12:08
License: CC BY 4.0 (as stated in the dataset card)

Hugging Face · Updated 2024-03-24 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/zakiasalod/VPAgs-Dataset4ML


Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
tags:
- public health
- bioinformatics
- virus
- proteomics
- vaccine development
- antigen
- machine learning
- reverse vaccinology
- viral proteins
- protegen
- uniprot
pretty_name: VPAgs-Dataset4ML
size_categories:
- 1K<n<10K
---

# Dataset Card for VPAgs-Dataset4ML

## Dataset Details

### Dataset Description

**VPAgs-Dataset4ML** comprises 2,145 viral protein sequences, curated to facilitate the development of machine learning models capable of predicting viral protective antigens (PAgs). These antigens are crucial for designing vaccines against various viral pathogens. The dataset is divided into two categories: 210 protective antigens (positive class) and 1,935 non-protective protein sequences (negative class), derived from the Protegen database and UniProt, respectively. This collection aims to support and accelerate research in reverse vaccinology, providing a valuable resource for bioinformatics and public health.

- **Curated by:** Zakia Salod from the University of KwaZulu-Natal and Ozayr Mahomed from the University of KwaZulu-Natal and Dasman Diabetes Institute.
- **Funded by:** National Research Foundation (NRF) of South Africa (grant number 130187) and College of Health Sciences (CHS) of the University of KwaZulu-Natal (UKZN) in Durban, KwaZulu-Natal, South Africa.
- **Language(s) (NLP):** English.
- **License:** [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

### Dataset Sources

- **Repository:** Mendeley Data - [VPAgs-Dataset4ML](https://doi.org/10.17632/w78tyrjz4z.1)
- **Paper:** Salod, Z.; Mahomed, O. VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology. Data 2023, 8, 41. [https://doi.org/10.3390/data8020041](https://doi.org/10.3390/data8020041).

## Uses

### Direct Use

This dataset serves as an invaluable asset for developing and testing machine learning algorithms aimed at identifying potential vaccine candidates.
Its application extends beyond academic research, offering insights that could significantly impact vaccine development strategies, particularly in the realm of emerging viral threats.

## Dataset Structure

### Data Instances

```
{
  "sequence": "MATLLRSLALFKRNKDKPPITSGSGGAIRGIKHIIIVPIPGDSSITTRSRLLDRLVRLIGNPDVSGPKLTGALIGILSLFVESPGQLIQRITDDPDVSIRLLEVVQSDQSQSGLTFASRGTNMEDEADQYFSHDDPSSSDQSRSGWFENKEISDIEVQDPEGFNMILGTILAQIWVLLAKAVTAPDTAADSELRRWIKYTQQRRVVGEFRLERKWLDVVRNRIAEDLSLRRFMVALILDIKRTPGNKPRIAEMICDIDTYIVEAGLASFILTIKFGIETMYPALGLHEFAGELSTLESLMNLYQQMGETAPYMVILENSIQNKFSAGSYPLLWSYAMGVGVELENSMGGLNFGRSYFDPAYFRLGQEMVRRSAGKVSSTLASELGITAEDARLVSEIAMHTTEDRISRAVGPRQAQVSFLHGDQSENELPGLGGKEDRRVKQGRGEARESYRETGSSRASDARAAHPPTSMPLDIDTASESGQDPQDSRRSADALLRLQAMAGILEEQGSDTDTPRVYNDRDLLD",
  "label": "1"
}
```

### Data Fields

- `sequence`: A string representing the amino acid sequence of a viral protein.
- `label`: An integer indicating whether the sequence is a protective antigen (1) or not (0).

### Data Splits

The dataset has not been split into training and testing sets, to allow for flexibility. You may split the dataset into training and testing sets, based on your preferred ratio.

## Dataset Creation

### Curation Rationale

The dataset was curated to address the need for a machine learning-ready dataset containing labeled protective (positive) and non-protective (negative) viral protein sequences. This dataset facilitates the development of machine learning models for predicting viral protective antigens, which are crucial for reverse vaccinology and the development of effective vaccines against viral pathogens.

### Source Data

#### Data Collection and Processing

The dataset was compiled through a meticulous process involving the retrieval of viral PAgs with experimental evidence from the [Protegen](https://violinet.org/protegen/) database, followed by computational steps carried out on viral protein sequences in [UniProt](https://www.uniprot.org/) to select non-protective protein sequences.

## Bias, Risks, and Limitations

Given the imbalanced nature of the dataset, with a greater number of non-protective than protective sequences, there's a risk that machine learning models may become biased towards predicting the majority class. To mitigate this, researchers are encouraged to implement strategies such as balanced sampling or weighted loss functions during model training. Additionally, the dataset's focus on viral proteins from specific databases might limit its coverage of all potential protective antigens across the viral kingdom, which should be considered when generalizing findings.

## Citation

**BibTeX:**

```bibtex
@article{salod2023vpags,
  title={VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology},
  author={Salod, Zakia and Mahomed, Ozayr},
  journal={Data},
  volume={8},
  number={41},
  year={2023},
  publisher={MDPI},
  doi={10.3390/data8020041}
}
```

**APA:**

Salod, Z., & Mahomed, O. (2023). VPAgs-Dataset4ML: A Dataset to Predict Viral Protective Antigens for Machine Learning-Based Reverse Vaccinology. Data, 8(41). https://doi.org/10.3390/data8020041

## More Information

This dataset is a crucial step towards leveraging machine learning in the field of vaccinology. By providing a high-quality, curated dataset, VPAgs-Dataset4ML facilitates the development of predictive models that can identify promising vaccine candidates, potentially accelerating vaccine development and deployment in response to emerging viral threats.

## Dataset Card Authors

Zakia Salod, Ozayr Mahomed

## Dataset Card Contact

For any inquiries regarding this dataset, please contact Zakia Salod at [zakia.salod@gmail.com](mailto:zakia.salod@gmail.com).

Provided by:

zakiasalod

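Summary of Original Information

VPAgs-Dataset4ML Dataset Overview

Dataset Description

VPAgs-Dataset4ML contains 2,145 viral protein sequences, intended to support the development of machine learning models that predict viral protective antigens (PAgs). These antigens are crucial for designing vaccines against various viral pathogens. The dataset has two classes: 210 protective antigens (positive class) and 1,935 non-protective protein sequences (negative class), drawn from the Protegen database and UniProt, respectively. It aims to support and accelerate reverse vaccinology research and to serve as a resource for bioinformatics and public health.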

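Dataset sources:
- Protegen database
- UniProt
Dataset size: 2,145 sequences
Dataset classes:
- Protective antigens (positive class): 210
- Non-protective protein sequences (negative class): 1,935
Dataset use: developing and testing machine learning algorithms that identify potential vaccine candidates.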

Dataset Structure

Data Instances

```json
{
  "sequence": "MATLLRSLALFKRNKDKPPITSGSGGAIRGIKHIIIVPIPGDSSITTRSRLLDRLVRLIGNPDVSGPKLTGALIGILSLFVESPGQLIQRITDDPDVSIRLLEVVQSDQSQSGLTFASRGTNMEDEADQYFSHDDPSSSDQSRSGWFENKEISDIEVQDPEGFNMILGTILAQIWVLLAKAVTAPDTAADSELRRWIKYTQQRRVVGEFRLERKWLDVVRNRIAEDLSLRRFMVALILDIKRTPGNKPRIAEMICDIDTYIVEAGLASFILTIKFGIETMYPALGLHEFAGELSTLESLMNLYQQMGETAPYMVILENSIQNKFSAGSYPLLWSYAMGVGVELENSMGGLNFGRSYFDPAYFRLGQEMVRRSAGKVSSTLASELGITAEDARLVSEIAMHTTEDRISRAVGPRQAQVSFLHGDQSENELPGLGGKEDRRVKQGRGEARESYRETGSSRASDARAAHPPTSMPLDIDTASESGQDPQDSRRSADALLRLQAMAGILEEQGSDTDTPRVYNDRDLLD",
  "label": "1"
}
```

Data Fields

sequence: a string representing the amino acid sequence of a viral protein.
label: an integer indicating whether the sequence is a protective antigen (1) or a non-protective one (0).
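
The field description says `label` is an integer, yet the sample instance stores it as the string "1". A minimal sketch of how a record might be normalized before training is given here. The helper name `parse_record` and the set of accepted residue letters (the 20 standard amino acids plus common ambiguity codes) are assumptions for illustration, not part of the dataset card, and the sample sequence is truncated for brevity.

```python
import json

# Assumption: 20 standard amino acids plus common ambiguity/rare codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ")


def parse_record(raw: str) -> tuple[str, int]:
    """Parse one JSON record into (sequence, label) with basic validation."""
    record = json.loads(raw)
    sequence = record["sequence"].strip().upper()
    label = int(record["label"])  # accepts both "1" and 1
    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label!r}")
    unexpected = set(sequence) - VALID_RESIDUES
    if unexpected:
        raise ValueError(f"unexpected residue codes: {sorted(unexpected)}")
    return sequence, label


# Prefix of the sample instance shown above (truncated for brevity).
raw = '{"sequence": "MATLLRSLALFKRNKDKPPITSGSGGAIRGIKHIIIVPIPGDSS", "label": "1"}'
sequence, label = parse_record(raw)
```

Casting the label once at load time avoids having string and integer labels treated as different classes by downstream tooling.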

Data Splits

The dataset has not been split into training and testing sets; users may split it as needed.
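
Because only about 10% of the sequences are positive, a purely random split can leave the test set with few protective antigens. One option is a stratified split, sketched below with the standard library only. The function name `stratified_split`, the 80/20 ratio, and the seed are illustrative assumptions. The label list is a placeholder that reproduces the class counts from the card, not the real data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels matching the class counts on the card:
# 210 protective antigens (1) and 1,935 non-protective sequences (0).
labels = [1] * 210 + [0] * 1935
train_idx, test_idx = stratified_split(labels)
```

With these settings the test set holds 429 sequences, 42 of them positive, so both partitions keep the original class ratio.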

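Dataset Creation

Curation Rationale

The dataset aims to provide a machine learning-ready collection of labeled protective (positive) and non-protective (negative) viral protein sequences, to support the development of machine learning models that predict viral protective antigens.

Source Data

Data Collection and Processing

The dataset was compiled by retrieving viral PAgs with experimental evidence from the Protegen database, then applying computational steps to viral protein sequences in UniProt to select non-protective protein sequences.

Bias, Risks, and Limitations

The class imbalance may bias machine learning models toward predicting the majority class. To mitigate this, researchers are advised to use balanced sampling or weighted loss functions during model training. In addition, because the dataset draws viral proteins from specific databases, it may not cover all potential protective antigens; this should be taken into account when using it.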
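
As one way to set up the weighted loss mentioned above, per-class weights can be taken inversely proportional to class frequency, $w_c = N / (K \cdot n_c)$, where $N$ is the total number of samples, $K$ the number of classes, and $n_c$ the count for class $c$. This is a common heuristic rather than a method prescribed by the dataset authors. The function name is hypothetical, and the labels are a placeholder matching the published class counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


# Placeholder labels matching the class counts on the card.
labels = [1] * 210 + [0] * 1935
weights = balanced_class_weights(labels)
# weights[1] is about 5.11 and weights[0] about 0.55, so each class
# contributes the same total weight (1072.5) to the loss.
```

The resulting dictionary can be passed to most training frameworks that accept per-class weights; the exact parameter name depends on the library in use.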
