shefat16/DNA_coding_regions

Source: Hugging Face, updated 2026-01-13, indexed 2026-03-29

Download link: https://hf-mirror.com/datasets/shefat16/DNA_coding_regions

Description:

---
license: cc-by-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: accession
    dtype: string
  - name: organism
    dtype: string
  - name: sequence
    dtype: string
  - name: introns
    list:
    - name: after
      dtype: string
    - name: before
      dtype: string
    - name: end
      dtype: int64
    - name: gene
      dtype: string
    - name: sequence
      dtype: string
    - name: start
      dtype: int64
  - name: exons
    list:
    - name: after
      dtype: string
    - name: before
      dtype: string
    - name: end
      dtype: int64
    - name: gene
      dtype: string
    - name: sequence
      dtype: string
    - name: start
      dtype: int64
  - name: proteins
    list:
    - name: end
      dtype: int64
    - name: gene
      dtype: string
    - name: sequence
      dtype: string
    - name: start
      dtype: int64
  splits:
  - name: train
    num_bytes: 11536678696
    num_examples: 1677609
  download_size: 5448417115
  dataset_size: 11536678696
task_categories:
- text-classification
- token-classification
- translation
tags:
- Exons
- Introns
- Proteins
- DNA
pretty_name: DNA Coding Regions
size_categories:
- 1M<n<10M
---

# DNA Coding Regions Dataset

This is a curated collection of genomic sequences extracted directly from **NCBI GenBank**, designed to support research in **intron and exon classification**, **DNA-to-protein translation**, **gene structure analysis**, and **biological sequence modeling** with deep learning architectures.

---

## Source and Extraction Pipeline

All records were extracted from **GenBank** using [Biopython](https://biopython.org/). The dataset was built with a reproducible data-processing pipeline written in Python, which:

- Downloads and parses GenBank records.
- Extracts **genomic DNA sequences** and their associated **exons**, **introns**, and **coding sequences (CDS)**.
- Processes the `strand` orientation to produce normalized sequences.
- Removes duplicate entries based on `(sequence, organism)` pairs.
- Assembles each record into a structured JSONL format suitable for machine learning models.
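
The last three pipeline steps can be illustrated with a short, dependency-free sketch. This is not the author's pipeline code: the helper names (`reverse_complement`, `normalize_strand`, `deduplicate`, `to_jsonl`) are hypothetical, the toy records are made up, and "normalization" is assumed here to mean reverse-complementing sequences annotated on the minus strand (`-1`, the encoding Biopython uses for feature strands).

```python
import json

# Complement table for unambiguous bases plus N.
# Assumption: any other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def normalize_strand(seq: str, strand: int) -> str:
    """Assumed normalization: reverse-complement minus-strand (-1) sequences."""
    return reverse_complement(seq) if strand == -1 else seq


def deduplicate(records):
    """Keep the first record seen for each (sequence, organism) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["sequence"], rec["organism"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(rec) for rec in records)


# Toy records with made-up accessions (not taken from the dataset).
records = [
    {"accession": "A1", "organism": "Homo sapiens", "sequence": normalize_strand("AACG", 1)},
    {"accession": "A2", "organism": "Homo sapiens", "sequence": normalize_strand("CGTT", -1)},
    {"accession": "A3", "organism": "Mus musculus", "sequence": "AACG"},
]
unique = deduplicate(records)
print(to_jsonl(unique))
```

In this toy input, the second record becomes identical to the first after strand normalization and is dropped, while the third is kept because its organism differs.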

The GenBank **search query** used for data collection was:

```
"genomic DNA"[Filter] AND ("exon"[Feature Key] OR "intron"[Feature Key]) AND "CDS"[Feature Key] AND ("3"[SLEN] : "16384"[SLEN])
```

You can find more information about the pipeline in the [DNA Coding Regions](https://github.com/GustavoHCruz/CodingDNATransformers) repository on GitHub.

---

## Dataset Structure

Each entry in the dataset corresponds to a **unique DNA sequence**, identified by its **GenBank accession**. The dataset is serialized in JSON Lines (`.jsonl`) format and can be loaded with the Hugging Face `datasets` library.

### Example record

```json
{
  "accession": "NC_045512.2",
  "organism": "Homo sapiens",
  "sequence": "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGAT...",
  "introns": [
    {
      "sequence": "TTGTAGACCAGTGCAGTA...",
      "start": 1450,
      "end": 1783,
      "gene": "ORF1ab",
      "before": "ATGCCDG",
      "after": "TAACAFG"
    }
  ],
  "exons": [
    {
      "sequence": "ATGGACACAAGTCAGG...",
      "start": 1,
      "end": 1449,
      "gene": "ORF1ab",
      "before": null,
      "after": "GT"
    }
  ],
  "proteins": [
    {
      "sequence": "MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVL...",
      "start": 1,
      "end": 4405,
      "gene": "ORF1ab"
    }
  ]
}
```

---

## Field Descriptions

| Field | Type | Description |
| ------------- | ------ | ----------- |
| **accession** | `str` | GenBank accession number for the DNA sequence. |
| **organism** | `str` | Name of the organism from which the sequence was derived. |
| **sequence** | `str` | The full genomic DNA sequence (processed strand). |
| **introns** | `list` | List of intronic regions associated with this DNA sequence. Each item contains: <ul><li>`sequence`: only the nucleotide sequence of the intron</li><li>`start`, `end`: coordinates relative to the DNA sequence</li><li>`gene`: gene name when annotated</li><li>`before`, `after`: short flanking sequences</li></ul> |
| **exons** | `list` | List of exonic regions associated with this DNA sequence. Same structure as `introns`. |
| **proteins** | `list` | List of coding sequences (CDS) translated to amino acid sequences, with: <ul><li>`sequence`: protein sequence</li><li>`start`, `end`: coordinates in the DNA sequence</li><li>`gene`: gene name</li></ul> |

---

## Applications

This dataset can be used directly for:

* **DNA-to-protein translation modeling**
* **Exon and intron classification**
* **Splicing prediction**
* **Genomic representation learning**
* **Bioinformatics-focused LLM pretraining (DAPT)**

---

## Loading Example

```python
from datasets import load_dataset

# Downloads the dataset on first use; the card defines a single "train" split.
dataset = load_dataset("gu-dudi/DNA_coding_regions")

print(dataset)              # split names, features, and row counts
print(dataset["train"][0])  # first record, as a Python dict
```

---

## Dataset Metadata

* **Source:** NCBI GenBank
* **Processed with:** Biopython, Pandas, tqdm
* **Maintainer:** [Gustavo Henrique Ferreira Cruz](https://huggingface.co/GustavoHCruz)
* **License:** Open for research and educational use
* **Format:** JSON Lines (UTF-8)

---

## Disclaimer on data completeness

Not all genomic entries in this dataset contain every type of annotation (exons, introns, and proteins). Although the GenBank search required a "CDS" feature key together with an "exon" or "intron" feature key, the underlying annotations in GenBank are not always consistent or complete. Some sequences may include exons or introns without corresponding protein-coding regions, or vice versa. This reflects the inherent variability and curation differences across submissions in the GenBank database.
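
Given this variability, it can help to check which annotation types a record actually carries before using it for a particular task. The following is a minimal sketch that assumes records shaped like the example record above; `annotation_profile` and `has_all_annotations` are hypothetical helper names, not part of the dataset or of the `datasets` library, and the toy records are made up.

```python
ANNOTATION_FIELDS = ("exons", "introns", "proteins")


def annotation_profile(record: dict) -> dict:
    """Map each annotation field to True if the record has at least one entry for it."""
    # A missing key, None, or an empty list all count as "not annotated".
    return {field: bool(record.get(field)) for field in ANNOTATION_FIELDS}


def has_all_annotations(record: dict) -> bool:
    """True only when exons, introns, and proteins are all present."""
    return all(annotation_profile(record).values())


# Toy records (illustrative values only, not taken from the dataset).
complete = {
    "accession": "X1",
    "exons": [{"sequence": "ATGGCC", "start": 0, "end": 6}],
    "introns": [{"sequence": "GTAAGT", "start": 6, "end": 12}],
    "proteins": [{"sequence": "MA", "start": 0, "end": 6}],
}
partial = {"accession": "X2", "exons": [{"sequence": "ATG", "start": 0, "end": 3}],
           "introns": [], "proteins": None}

print(annotation_profile(complete))
print(annotation_profile(partial))
print(has_all_annotations(complete), has_all_annotations(partial))
```

A predicate like `has_all_annotations` can be passed to `Dataset.filter` from the `datasets` library to keep only fully annotated records, at the cost of one pass over the split.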

---

### Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{gustavo_henrique_ferreira_cruz_2025,
  author    = {Gustavo Henrique Ferreira Cruz},
  title     = {DNA\_coding\_regions (Revision 16f4e3a)},
  year      = 2025,
  url       = {https://huggingface.co/datasets/GustavoHCruz/DNA_coding_regions},
  doi       = {10.57967/hf/7238},
  publisher = {Hugging Face}
}
```

---

### Version and Integrity

* **Version:** 1.0
* **Total entries:** 1,677,609
* **Deduplication:** duplicates removed based on `(sequence, organism)` pair
* **Strand normalization:** handled during extraction

---

### Notes

* Coordinates (`start`, `end`) are **relative to the parent DNA sequence**.
* `sequence` inside each **intron/exon** corresponds *only to that region* (not the full DNA).
* Protein sequences are already **translated amino acid chains**, not nucleotide fragments.

---

*Developed as part of the Master’s research project on DNA sequence understanding and translation using deep learning models.*
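
The Notes state that coordinates are relative to the parent sequence, but not whether they are 0-based or 1-based. One way to find out for a given record is to test which slicing convention reproduces a region's own `sequence` from the parent. The sketch below assumes the region lies on the same strand as the stored parent sequence; `coordinate_convention` is a hypothetical helper and the toy sequences are made up.

```python
from typing import Optional


def coordinate_convention(parent: str, region: dict) -> Optional[str]:
    """Report which slicing convention reproduces region["sequence"] from the parent.

    Returns "0-based half-open", "1-based inclusive", or None if neither matches.
    """
    start, end, seq = region["start"], region["end"], region["sequence"]
    if parent[start:end] == seq:
        return "0-based half-open"
    # Guard against start == 0, where start - 1 would wrap around to the string end.
    if start >= 1 and parent[start - 1:end] == seq:
        return "1-based inclusive"
    return None


# Toy parent sequence and regions (illustrative only, not taken from the dataset).
parent = "ATGGCCGTAAGTTTTCAGGGA"
exon = {"sequence": "ATGGCC", "start": 0, "end": 6}
intron = {"sequence": "GTAAGTTTTCAG", "start": 7, "end": 18}
mismatch = {"sequence": "CCCC", "start": 2, "end": 6}

print(coordinate_convention(parent, exon))
print(coordinate_convention(parent, intron))
print(coordinate_convention(parent, mismatch))
```

Running such a check over a sample of real records before building per-nucleotide labels avoids off-by-one errors at exon and intron boundaries.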
Provider: shefat16