
wyxu/Genome_database

Hugging Face, updated 2023-06-19, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/wyxu/Genome_database
Resource description:
---
task_categories:
- conversational
- fill-mask
language:
- en
tags:
- biology
- medical
pretty_name: genome database
size_categories:
- 100M<n<1B
- 10M<n<100M
viewer: false
---

# Dataset Card for Dataset Name

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This dataset contains data collected from GenBank. It separates the genes of each DNA record into individual segments and classifies each segment by region type and coding type, so that users can get more detailed information about each DNA sequence. The dataset also includes the source, which is the whole DNA sequence, so that users can compare it with each segment to find the segment's exact location.

The dataset contains 937 files with about 200 million records and takes 300-400 GB of storage. Users can therefore specify how many files to use, according to their own needs, with the code below. To download all of the files, pass 937 as the second argument.

```python
import datasets
dataset = datasets.load_dataset('wyxu/Genome_database', num_urls=10)  # num_urls: number of files to use (10 is only an example)
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

```python
{'DNA id': 'AP013063.1',
 'Organism': 'Serratia marcescens SM39',
 'year': 2017,
 'region type': 'coding',
 'specific_class': 'Protein',
 'Product': 'thr operon leader peptide',
 'sequence': 'ATGCGCAACATCAGCCTGAAAACCACAATTATTACCACCACCGATACCACAGGTAACGGGGCGGGCTGA',
 'gc_content': 0.52173913,
 'translation code': 'MRNISLKTTIITTTDTTGNGAG',
 'start_position': 207,
 'end_position': 276}
```

### Data Fields

- __DNA id__: ID of the whole DNA sequence. Sequences with the same DNA id come from the same DNA.
- __Organism__: organism the DNA belongs to.
- __year__: year of the DNA sequence.
- __region type__: general type of the sequence. Every type that is typically classified as a coding region is named coding. Other types, including those that are case dependent, are named after their own type, such as regulator, repeat_region, gap, intron, exon, etc. (__Note__: when classifying the coding type, all CDS, mRNA, tmRNA, tRNA, rRNA and others such as propeptide, sig_peptide and mat_peptide are classified as coding. To minimize missed coding regions, every other category that has an associated product is also classified as coding.)
- __specific class__: if the sequence is a coding sequence, it is classified by the type of its product, such as RNA or Protein. Regulators are also classified by their own class, such as terminator or ribosome.
- __Product__: if the sequence produces a protein, the product name is listed.
- __sequence__: the actual sequence.
- __gc_content__: GC content of the sequence.
- __translation code__: if the sequence produces a protein, the translation is provided as a reference.
- __start_position__: start position of the segment.
- __end_position__: end position of the segment.

### Data Splits

The first 80% of the files are used as the training set, and the last 20% are used as the test set.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

All data were collected from GenBank release 255, the most recent release at the time of collection.

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
Provider:
wyxu
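Summary of original information

Dataset overview

Dataset description

  • Dataset name: genome database
  • Dataset summary: The dataset collects data from GenBank. It separates all the genes in each DNA and classifies them by region and coding type, so users can get detailed information about each DNA sequence. The dataset contains 937 files with about 200 million records and takes 300-400 GB of storage.

Dataset structure

Data instance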

{DNA id: AP013063.1
 Organism: Serratia marcescens SM39
 year: 2017
 region type: coding
 specific_class: Protein
 Product: thr operon leader peptide
 sequence: ATGCGCAACATCAGCCTGAAAACCACAATTATTACCACCACCGATACCACAGGTAACGGGGCGGGCTGA
 gc_content: 0.52173913
 translation code: MRNISLKTTIITTTDTTGNGAG
 start_position: 207
 end_position: 276}

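Data fields

  • DNA id: ID of the whole DNA sequence; sequences with the same DNA id come from the same DNA.
  • Organism: organism the DNA belongs to.
  • year: year of the DNA sequence.
  • region type: general type of the sequence; coding regions are named coding, and other types are named after their own type.
  • specific class: if the sequence is a coding sequence, it is classified by the type of its product.
  • Product: if the sequence produces a protein, the product name is listed.
  • sequence: the actual sequence.
  • gc_content: GC content of the sequence.
  • translation code: if the sequence produces a protein, the translation is provided as a reference.
  • start_position: start position of the segment.
  • end_position: end position of the segment.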

Data splits

  • The first 80% of the files are used as the training set, and the last 20% are used as the test set.
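
A file-level split like this can be sketched as follows. This is only an illustration: the function name `split_file_indices` is hypothetical, and rounding the cut point down is an assumption, since the source does not say how the boundary between the two sets is computed or how it interacts with a reduced file count.

```python
def split_file_indices(num_files: int, train_fraction: float = 0.8):
    """Return (train, test) index lists: leading files train, trailing files test."""
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(num_files * train_fraction)  # rounding down is an assumption
    indices = list(range(num_files))
    return indices[:cut], indices[cut:]


train_files, test_files = split_file_indices(937)
print(len(train_files), len(test_files))
```

Under these assumptions, the 937 files divide into 749 training files and 188 test files.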

Dataset creation

Source data

  • Data source: all data come from the latest GenBank release at the time of collection, GenBank 255.

Dataset usage

Data loading

datasets.load_dataset('wyxu/Genome_database', num_urls=N)  # N: number of files to use
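
Because the full dataset is described as 937 files and 300-400 GB, it can help to validate the requested file count before starting a download. The following is a minimal sketch rather than part of the dataset's own tooling: `build_load_kwargs` and `load_subset` are hypothetical helper names, and only the `num_urls` argument and the repository ID come from the source.

```python
TOTAL_FILES = 937  # file count stated in the dataset description
REPO_ID = "wyxu/Genome_database"


def build_load_kwargs(num_files: int) -> dict:
    """Validate the requested file count and build load_dataset arguments."""
    if isinstance(num_files, bool) or not isinstance(num_files, int):
        raise TypeError("num_files must be an integer")
    if not 1 <= num_files <= TOTAL_FILES:
        raise ValueError(f"num_files must be between 1 and {TOTAL_FILES}")
    return {"path": REPO_ID, "num_urls": num_files}


def load_subset(num_files: int):
    """Load num_files files; needs the datasets package and network access."""
    import datasets  # imported lazily so validation works without the package

    return datasets.load_dataset(**build_load_kwargs(num_files))


print(build_load_kwargs(10))
```

Calling `load_subset(10)` would then fetch ten files. Depending on the installed `datasets` version, datasets that rely on a custom loading script may need `trust_remote_code=True` or may not load at all. The source does not address this, so treat it as something to check.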

Collected summary
Dataset introduction
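Background and challenges
Background overview
This dataset is a genome database collected from GenBank. It contains about 200 million records, takes 300-400 GB of storage, and classifies the genes in DNA sequences by region and coding type to provide detailed sequence information. The dataset consists of 937 files, lets users download only as many files as they need, and is split into 80% training and 20% test data. It is suited to fill-mask tasks in the biology and medical domains.
The content above was collected and summarized by 遇见数据集.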