
Shenzhi-Chen/DeepSTARR-Mouse-dataset

Source: Hugging Face. Updated 2026-04-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Shenzhi-Chen/DeepSTARR-Mouse-dataset
Resource description:
---
license: cc-by-4.0
language:
- en
tags:
- dna-sequence-model
- deepStarr
- genomics
- mouse
pretty_name: DeepSTARR-mouse-dataset
size_categories:
- 10M<n<100M
task_categories:
- feature-extraction
---

**Dataset Summary**

This dataset was used to train the DeepSTARR-Mouse model, which predicts chromatin accessibility and enhancer activity.

**Source Data**

Sequences were derived from **Gorkin et al.**, *Nature* (2020) (https://doi.org/10.1038/s41586-020-2093-3) and the **VISTA Enhancer Browser** (https://enhancer.lbl.gov/). The data were further processed to generate tissue-specific activity and accessibility annotations.

**License**

CC BY 4.0 (Attribution required). For research use only.

**Intended Use**

This dataset is provided for scientific and reproducibility purposes accompanying our publication in *Nature Genetics*.

**Citation**

Please cite our paper: [citation once published]

**Accessibility Model Dataset**

The Accessibility_model_dataset provides the training data for the sequence-to-accessibility model. It includes data from six tissues: **heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**. Each tissue is organized into three cross-validation folds, and each fold contains three subsets: **training, validation, and test**.

Each subset includes:

- a **FASTA file** containing DNA sequences, and
- a **TXT file** containing the corresponding chromatin accessibility values.

In total, there are 18 files per tissue (3 folds × 3 subsets × 2 file types).

**Activity Model Dataset**

The Activity_model_dataset provides the training data for the sequence-to-activity model. It follows the same structure as the accessibility dataset, covering the same six tissues (**heart, limb, midbrain (CNS), forebrain, hindbrain, and neural tube**), each with three cross-validation folds. Each fold contains **training, validation, and test** subsets.

Each subset contains:

- a **FASTA file** with DNA sequences, and
- a **TXT file** with the corresponding enhancer activity values.

This organization results in 18 files per tissue (3 folds × 3 subsets × 2 file types).

**Test Dataset**

The test dataset provides 600k randomly generated sequences that were used to evaluate the enhancer activity model. `all_vista_seq.fa` contains all VISTA sequences, which were used to randomly generate the design starting sequences.
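
The per-tissue layout described above (3 folds × 3 subsets × 2 file types) and the pairing of each FASTA file with a TXT file of values can be sketched in Python as follows. This is a minimal illustration, not the authors' loading code: the file-naming scheme (`{tissue}_fold{n}_{subset}.fa` / `.txt`), the tissue and subset labels, and the assumption that each TXT file holds one numeric value per line in the same order as the FASTA records are all hypothetical and should be checked against the actual repository contents.

```python
from itertools import product

# Hypothetical labels; the real directory and file names may differ.
TISSUES = ["heart", "limb", "midbrain", "forebrain", "hindbrain", "neural_tube"]
FOLDS = [1, 2, 3]
SUBSETS = ["train", "valid", "test"]
EXTENSIONS = [".fa", ".txt"]


def expected_files(tissue):
    """List the file names for one tissue under the assumed naming scheme."""
    return [
        f"{tissue}_fold{fold}_{subset}{ext}"
        for fold, subset, ext in product(FOLDS, SUBSETS, EXTENSIONS)
    ]


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_values(path):
    """Read one float per non-empty line (assumed TXT layout)."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def load_subset(fasta_path, values_path):
    """Pair each sequence with its value, checking that the counts agree."""
    records = read_fasta(fasta_path)
    values = read_values(values_path)
    if len(records) != len(values):
        raise ValueError(
            f"{len(records)} sequences but {len(values)} values; "
            "the files may not be aligned as assumed"
        )
    return [(sequence, value) for (_, sequence), value in zip(records, values)]
```

Under these assumptions, `expected_files("heart")` enumerates the 18 files for one tissue, and `load_subset` raises an error rather than silently truncating when a FASTA file and its TXT file disagree in length.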

Provider:
Shenzhi-Chen