Shenzhi-Chen/DeepSTARR-Mouse-dataset

Name: Shenzhi-Chen/DeepSTARR-Mouse-dataset
Creator: Shenzhi-Chen
Published: 2026-04-08 18:13:05
License: CC BY 4.0 (as stated in the dataset card)

Source: Hugging Face. Updated 2026-04-08; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Shenzhi-Chen/DeepSTARR-Mouse-dataset


Description:

---
license: cc-by-4.0
language:
- en
tags:
- dna-sequence-model
- deepStarr
- genomics
- mouse
pretty_name: DeepSTARR-mouse-dataset
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
---

**Dataset Summary**

This dataset was used to train the DeepSTARR-Mouse model, which predicts chromatin accessibility and enhancer activity.

**Source Data**

Sequences were derived from **Gorkin et al.**, *Nature* (2020) (https://doi.org/10.1038/s41586-020-2093-3) and the **VISTA Enhancer Browser** (https://enhancer.lbl.gov/). The data were further processed to generate tissue-specific activity and accessibility annotations.

**License**

CC BY 4.0 (Attribution required). For research use only.

**Intended Use**

This dataset is provided for scientific and reproducibility purposes and accompanies our publication in *Nature Genetics*.

**Citation**

Please cite our paper: [citation once published]

**Accessibility Model Dataset**

The Accessibility_model_dataset provides the training data for the sequence-to-accessibility model. It includes data from six tissues (**heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**), each organized into three cross-validation folds. For each tissue and fold, the dataset contains three subsets: **training, validation, and test**. Each subset includes a **FASTA file** containing DNA sequences and a **TXT file** containing the corresponding chromatin accessibility values. In total, there are 18 files per tissue (3 folds × 3 subsets × 2 file types).

**Activity Model Dataset**

The Activity_model_dataset provides the training data for the sequence-to-activity model. It follows the same structure as the accessibility dataset and covers the same six tissues (**heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**), each with three cross-validation folds. For each fold, there are **training, validation, and test** subsets. Each subset contains a **FASTA file** with DNA sequences and a **TXT file** with the corresponding enhancer activity values. This organization results in 18 files per tissue (3 folds × 3 subsets × 2 file types).

**Test dataset**

The test dataset provides 600k randomly generated sequences, which were used to evaluate the enhancer activity model. The file all_vista_seq.fa contains all VISTA sequences, which were used to randomly generate the starting sequences for design.
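
The fold, subset, and file-type layout described for the accessibility and activity datasets can be sketched in Python as follows. This is a minimal illustration, not the authors' loader: the file-naming pattern, the tissue identifiers, and the helper names (`expected_files`, `read_fasta`, `read_values`, `load_subset`) are all hypothetical, and the code assumes that each TXT file holds one numeric value per line in the same order as the FASTA records. Check the actual repository layout before relying on any of these assumptions.

```python
from itertools import product

# Tissue identifiers are illustrative; the repository may spell them differently.
TISSUES = ["heart", "limb", "midbrain", "forebrain", "hindbrain", "neural_tube"]
FOLDS = [1, 2, 3]
SUBSETS = ["train", "valid", "test"]
FILE_TYPES = ["fa", "txt"]


def expected_files(tissue):
    """List hypothetical file names for one tissue (fold x subset x file type)."""
    return [
        f"{tissue}_fold{fold}_{subset}.{ext}"
        for fold, subset, ext in product(FOLDS, SUBSETS, FILE_TYPES)
    ]


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_values(path):
    """Read one float per non-empty line (assumed TXT layout)."""
    with open(path) as handle:
        return [float(line.strip()) for line in handle if line.strip()]


def load_subset(fasta_path, values_path):
    """Pair each sequence with its value, failing loudly on a length mismatch."""
    records = read_fasta(fasta_path)
    values = read_values(values_path)
    if len(records) != len(values):
        raise ValueError(
            f"{len(records)} sequences but {len(values)} values; "
            "the files do not appear to be aligned"
        )
    return [(seq, value) for (_, seq), value in zip(records, values)]
```

Under this naming scheme, `len(expected_files("heart"))` is 18, matching the per-tissue count given in the card, so six tissues give 108 files per model dataset. The explicit length check in `load_subset` is a deliberate choice: a silent `zip` would truncate to the shorter file and hide a misaligned FASTA/TXT pair.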


Provider:

Shenzhi-Chen
