Hack90/virus_dna_dataset
Hugging Face · Updated 2023-08-26 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Hack90/virus_dna_dataset
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: sequence
    dtype: string
  - name: name
    dtype: string
  - name: description
    dtype: string
  - name: features
    dtype: int64
  - name: seq_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 6621468623
    num_examples: 2602437
  download_size: 2319826398
  dataset_size: 6621468623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
[Needs More Information]
# Dataset Card for virus_dna_dataset
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [Needs More Information]
- **Repository:** [Needs More Information]
- **Paper:** [Needs More Information]
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
A collection of complete virus genome DNA sequences, built from NCBI data.
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
DNA
## Dataset Structure
### Data Instances
{ 'Description': 'NC_030848.1 Haloarcula californiae icosahedral...', 'dna_sequence': 'TCATCTC TCTCTCT CTCTCTT GTTCCCG CGCCCGC CCGCCC...',
'sequence_length': '35787', 'organism_id': ' AB063393.2' }
### Data Fields
{ 'Description': 'the description of the DNA sequence, as given in the NCBI record', 'dna_sequence': 'the DNA sequence, split into groups of 7 nucleotides',
'sequence_length': 'the length of the DNA sequence' }
### Data Splits
The card metadata lists a single `train` split with 2,602,437 examples (6,621,468,623 bytes).
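The metadata at the top of this card lists one `train` split and its sizes. The sketch below derives two rough figures from those numbers and shows one possible way to peek at a few records without downloading everything. The helper names `split_stats` and `peek_records` are hypothetical. The loader assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID shown in the download link.

```python
from itertools import islice

# Figures copied from the card metadata (train split).
NUM_EXAMPLES = 2_602_437
DATASET_BYTES = 6_621_468_623
DOWNLOAD_BYTES = 2_319_826_398


def split_stats():
    """Return (average bytes per example, download size / dataset size)."""
    return DATASET_BYTES / NUM_EXAMPLES, DOWNLOAD_BYTES / DATASET_BYTES


def peek_records(n=3, repo_id="Hack90/virus_dna_dataset"):
    """Stream the first n training records (hypothetical helper).

    The import is local so that defining this function does not
    require the `datasets` library to be installed.
    """
    from datasets import load_dataset

    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records(1)` should return a list with one dict whose keys follow the metadata schema (`id`, `sequence`, `name`, `description`, `features`, `seq_length`). Streaming is used here because the full download is over 2 GB.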
## Dataset Creation
### Curation Rationale
The goal of this dataset is to make it easier to train an LLM on virus DNA.
### Source Data
#### Initial Data Collection and Normalization
DNA sequences were split into groups of 7 nucleotides to make them easier to tokenize. Only complete genomes were selected.
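The grouping step could look roughly like the sketch below. It assumes non-overlapping, space-separated chunks, which matches the sample under Data Instances. The function names are hypothetical, and how the original pipeline handled a final chunk shorter than 7 nucleotides is not documented; here it is simply kept as-is.

```python
def group_nucleotides(sequence: str, k: int = 7) -> str:
    """Split a DNA string into non-overlapping, space-separated chunks of k.

    Assumption: the last chunk may be shorter than k and is kept unchanged.
    """
    # Drop any existing whitespace and normalise case before chunking.
    cleaned = "".join(sequence.split()).upper()
    return " ".join(cleaned[i:i + k] for i in range(0, len(cleaned), k))


def ungroup_nucleotides(grouped: str) -> str:
    """Inverse of group_nucleotides: remove the separators."""
    return "".join(grouped.split())
```

With whitespace-based tokenization, each 7-nucleotide group then becomes a single token candidate; `ungroup_nucleotides` recovers the contiguous sequence when it is needed for other tools.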
#### Who are the source language producers?
Viruses :)
### Annotations
#### Annotation process
NCBI
#### Who are the annotators?
NCBI
### Personal and Sensitive Information
N/A
## Considerations for Using the Data
### Social Impact of Dataset
Makes it easier to train LLMs on virus DNA.
### Discussion of Biases
Only virus data that has been sequenced and uploaded to NCBI is included.
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
Hassan Ahmed
### Licensing Information
[Needs More Information]
### Citation Information
[Needs More Information]
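Provider:
Hack90
Summary of original information
Dataset overview
Dataset description
- Dataset name: virus_dna_dataset
- Dataset summary: The dataset contains complete virus genome DNA, sourced from NCBI.
- Supported tasks and leaderboards: [No details provided]
- Language: DNA
Dataset structure
- Data instances: Each instance contains a description of the DNA sequence, the DNA sequence itself, the sequence length, and an organism ID.
- Data fields: Description, DNA sequence, and sequence length.
- Data splits: The training set contains 2602437 examples, with a total size of 6621468623 bytes.
Dataset creation
- Curation rationale: To make it easier to train large language models (LLMs) on virus DNA data.
- Source data: The data comes from NCBI and includes only complete genomes.
- Annotations: Annotated by NCBI.
Considerations for using the data
- Social impact: Makes it easier to train LLMs on virus DNA data.
- Discussion of biases: The dataset only contains virus data that has been sequenced and uploaded to NCBI.
Additional information
- Dataset curators: Hassan Ahmed
- Licensing information: [No details provided]
- Citation information: [No details provided]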



