as-cle-bert/VirBiCla-training
Source: Hugging Face (updated 2024-03-20, indexed 2024-06-11)
Download link:
https://hf-mirror.com/datasets/as-cle-bert/VirBiCla-training
---
license: mit
---
# Dataset Card for VirBiCla-training
VirBiCla is an ML-based viral DNA detector designed for long-read sequencing metagenomics.
This dataset is a support dataset for training the base ML model.
## Dataset Details
### Dataset Description
- **Curated by:** [Astra Bertelli](https://astrabert.vercel.app/)
- **License:** MIT License
### Dataset Sources
- **Repository:** [GitHub repository for VirBiCla](https://github.com/AstraBert/VirBiCla)
## Uses
This dataset is intended as support for training the base VirBiCla model.
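
As a rough illustration of preparing the data for training, the sketch below parses CSV text and separates a label column from the numeric feature columns. It assumes that the `Domain` field acts as the class label and that every other column is numeric; the card does not state either point explicitly. The column names and values in the sample are made up for the example and are not taken from the real file.

```python
import csv
import io


def split_features_and_label(csv_text, label_column="Domain"):
    """Parse CSV text and separate the label column from numeric features.

    `label_column` is an assumption: the card lists "Domain" as a field,
    but the exact header spelling in the real file may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        labels.append(row.pop(label_column))
        # Every remaining column is treated as a numeric feature.
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


# Tiny made-up sample with hypothetical column names, for illustration only.
sample = "Domain,A_freq,Entropy\nV,0.25,1.9\nNV,0.31,1.7\n"
X, y = split_features_and_label(sample)
```

To use this with the actual dataset, read the downloaded CSV file into a string (or adapt the function to accept an open file object) and check the real header names first.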
## Dataset Structure
The dataset is a CSV file composed of 60,003 sequence records (drawn from RefSeq 16S bacterial rRNA, 18S fungal rRNA, SSU eukaryotic rRNA and RefSeq viral genomes), each evaluated on 13 features.
Features are:
- Domain
- A, T, C and G proportion
- Percentage of A, T, C and G homopolymeric regions
- Gene density
- Entropy
- Effective Number of Codons (codon usage metrics)
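
To make a few of these features concrete, here is a minimal sketch of how base proportions, homopolymer percentages and entropy could be computed for a single sequence. It is not the VirBiCla implementation: the function names, the `min_run` threshold that defines a homopolymeric region, and the choice of Shannon entropy over base composition are all assumptions made for illustration. Gene density and Effective Number of Codons are left out because they require gene prediction and codon tables.

```python
import math
import re
from collections import Counter

BASES = "ATCG"


def base_proportions(seq):
    """Fraction of A, T, C and G over the (non-empty) sequence length."""
    seq = seq.upper()
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in BASES}


def homopolymer_percentages(seq, min_run=3):
    """Percentage of the sequence covered by single-base runs.

    `min_run` is an assumed threshold; the card does not say how long
    a run must be to count as a homopolymeric region.
    """
    seq = seq.upper()
    result = {}
    for b in BASES:
        # Match runs of base `b` that are at least `min_run` long.
        runs = re.findall(f"{b}{{{min_run},}}", seq)
        covered = sum(len(run) for run in runs)
        result[b] = 100.0 * covered / len(seq)
    return result


def shannon_entropy(seq):
    """Shannon entropy, in bits per base, of the base composition."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


example = "AAAATTCG"
props = base_proportions(example)
homo = homopolymer_percentages(example)
entropy = shannon_entropy(example)
```

For `"AAAATTCG"`, half of the bases are A, the four-base A run covers 50% of the sequence, and the composition entropy is 1.75 bits per base.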
## Dataset Creation
Everything needed to recreate the dataset is documented on the [VirBiCla website](https://astrabert.github.io/VirBiCla).
## Bias, Risks, and Limitations
The dataset is mainly geared towards amplicon sequencing and long-read sequencing, which are the best use cases for VirBiCla.
## Citation
Please consider citing the author of this work (Astra Bertelli) and the VirBiCla [GitHub repository](https://github.com/AstraBert/VirBiCla) when using this dataset or the associated model.
Provided by: as-cle-bert