Shenzhi-Chen/DeepSTARR-Mouse-dataset
Source: Hugging Face (updated 2026-04-08, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/Shenzhi-Chen/DeepSTARR-Mouse-dataset
---
license: cc-by-4.0
language:
- en
tags:
- dna-sequence-model
- deepStarr
- genomics
- mouse
pretty_name: DeepSTARR-mouse-dataset
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
---
**Dataset Summary**
This dataset was used for training the DeepSTARR-Mouse model for predicting chromatin accessibility and enhancer activity.
**Source Data**
Sequences were derived from **Gorkin et al.**, *Nature* (2020) (https://doi.org/10.1038/s41586-020-2093-3) and the **VISTA Enhancer Browser** (https://enhancer.lbl.gov/). The data were further processed to generate tissue-specific activity and accessibility annotations.
**License**
CC BY 4.0 (Attribution required). For research use only.
**Intended Use**
This dataset is provided for scientific research and reproducibility purposes, accompanying our publication in *Nature Genetics*.
**Citation**
Please cite our paper:
[citation once published]
**Accessibility Model Dataset**
The Accessibility_model_dataset provides the training data for the sequence-to-accessibility model.
It includes data from six tissues (**heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**), each organized into three cross-validation folds.
For each tissue and fold, the dataset contains three subsets: **training, validation, and test**.
Each subset includes:
- a **FASTA file** containing DNA sequences, and
- a **TXT file** containing the corresponding chromatin accessibility values.
In total, there are 18 files per tissue (3 folds × 3 subsets × 2 file types).
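
The file count above can be checked with a short enumeration. This is only a sketch: the card does not state how files are named, so the tissue labels, fold numbering, subset labels, and the `{tissue}_fold{n}_{subset}.{ext}` pattern below are hypothetical placeholders, not the repository's actual names.

```python
from itertools import product

# Hypothetical labels: the dataset card does not specify the real file names.
TISSUES = ["heart", "limb", "midbrain", "forebrain", "hindbrain", "neural_tube"]
FOLDS = [1, 2, 3]
SUBSETS = ["train", "valid", "test"]
EXTENSIONS = ["fa", "txt"]  # FASTA sequences and TXT values


def expected_files(tissue):
    """List the files one tissue would have under the assumed naming pattern."""
    return [
        f"{tissue}_fold{fold}_{subset}.{ext}"
        for fold, subset, ext in product(FOLDS, SUBSETS, EXTENSIONS)
    ]


files_per_tissue = {tissue: expected_files(tissue) for tissue in TISSUES}
total_files = sum(len(files) for files in files_per_tissue.values())

print(len(files_per_tissue["heart"]))  # 18 = 3 folds x 3 subsets x 2 file types
print(total_files)                     # 108 = 18 files x 6 tissues
```

Comparing such a list against the actual directory listing is one way to confirm that a download is complete, once the real naming scheme is known.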
**Activity Model Dataset**
The Activity_model_dataset provides the training data for the sequence-to-activity model.
It follows the same structure as the accessibility dataset, including data for **heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**, each with three cross-validation folds.
For each fold, there are **training, validation, and test** subsets.
Each subset contains:
- a **FASTA file** with DNA sequences, and
- a **TXT file** with the corresponding enhancer activity values.
This organization results in 18 files per tissue (3 folds × 3 subsets × 2 file types).
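
Since each subset pairs a FASTA file with a TXT file, a loader might look like the sketch below. It assumes the TXT file holds one numeric value per line, with no header, in the same order as the FASTA records; the card does not confirm this layout, so check it against the real files. The demo file names are made up.

```python
import tempfile
from pathlib import Path


def read_fasta(path):
    """Return the sequences in file order, joining multi-line records."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                records.append([])  # a header line starts a new record
            elif records:
                records[-1].append(line.upper())
    return ["".join(parts) for parts in records]


def read_values(path):
    """Assumption: one value per non-empty line, in the first column, no header."""
    with open(path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def load_pair(fasta_path, values_path):
    """Pair each sequence with its value and fail loudly on a count mismatch."""
    sequences = read_fasta(fasta_path)
    values = read_values(values_path)
    if len(sequences) != len(values):
        raise ValueError(
            f"{len(sequences)} sequences but {len(values)} values"
        )
    return list(zip(sequences, values))


# Small self-contained demo with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    fasta_file = Path(tmp) / "demo.fa"
    values_file = Path(tmp) / "demo.txt"
    fasta_file.write_text(">seq1\nACGT\nacgt\n>seq2\nTTGA\n")
    values_file.write_text("0.5\n1.25\n")
    pairs = load_pair(fasta_file, values_file)

print(pairs)  # [('ACGTACGT', 0.5), ('TTGA', 1.25)]
```

The length check matters because the pairing is purely positional: a silent mismatch would shift every label onto the wrong sequence.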
**Test Dataset**
The test dataset provides the 600k randomly generated sequences that were used to evaluate the enhancer activity model. `all_vista_seq.fa` contains all VISTA sequences that were used to randomly generate the design starting sequences.
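
To fetch only part of the repository, `snapshot_download` from the `huggingface_hub` package can be used. This is a sketch, not a tested recipe: it assumes that `Accessibility_model_dataset` and `Activity_model_dataset` are top-level folders in the repository, which the card does not confirm, and the local directory name is arbitrary.

```python
REPO_ID = "Shenzhi-Chen/DeepSTARR-Mouse-dataset"


def folder_patterns(folders):
    """Build glob patterns that select everything under the given folders."""
    return [f"{name}/*" for name in folders]


def download_folders(folders, local_dir="deepstarr_mouse"):
    """Download only the selected folders; requires `pip install huggingface_hub`."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=folder_patterns(folders),
        local_dir=local_dir,
    )


# Assumed folder name; verify it against the repository's file listing.
print(folder_patterns(["Accessibility_model_dataset"]))
# To download: download_folders(["Accessibility_model_dataset"])
```

Restricting the download with `allow_patterns` avoids pulling all folds for all tissues when only one model's data is needed.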
Provider: Shenzhi-Chen



