Genentech/GM12878_dnase-data
Source: Hugging Face. Updated 2026-02-23; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/Genentech/GM12878_dnase-data
Resource description:
---
license: mit
task_categories:
- tabular-regression
tags:
- biology
- genomics
pretty_name: "GM12878 DNase regression data"
size_categories:
- 100K<n<1M
---
# GM12878_dnase-data
## Dataset Summary
This dataset contains genomic intervals used to train a regression model on GM12878 DNase data, described in Lal et al. 2025 (https://www.nature.com/articles/s41592-025-02868-z). Genome coordinates correspond to the hg38 reference genome.
## Repository Content
The repository includes one BED file and one Jupyter notebook:
1. `intervals.bed`: Genomic intervals stored in BED format.
2. `1_process_GM12878_data.ipynb`: Jupyter notebook containing the preprocessing steps used to generate the `intervals.bed` file.
## Dataset Structure
### Statistics
- **Number of intervals:** 435,055
- **Interval length:** 2,114 bp (all intervals)
- **Genome build:** hg38
### Intervals file (`intervals.bed`)
BED format (tab-separated). There are three columns with no header:
- Chromosome name
- Start position
- End position
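
As a minimal sketch of the layout described above, the snippet below parses headerless, tab-separated three-column BED text using only the Python standard library. The coordinates in `EXAMPLE_BED` and the helper name `parse_bed3` are hypothetical and serve only as illustration; they are not taken from `intervals.bed`. The length calculation assumes the file follows the standard BED convention of 0-based, half-open coordinates, so that length is `end - start`.

```python
import csv
import io

# Hypothetical coordinates for illustration only; not taken from intervals.bed.
EXAMPLE_BED = "chr1\t10000\t12114\nchr2\t50000\t52114\n"


def parse_bed3(text):
    """Parse headerless, tab-separated BED3 text into (chrom, start, end) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    intervals = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        intervals.append((row[0], int(row[1]), int(row[2])))
    return intervals


intervals = parse_bed3(EXAMPLE_BED)

# Assumes standard BED coordinates (0-based start, exclusive end),
# so the interval length is simply end - start.
lengths = [end - start for _, start, end in intervals]

print(intervals)
print(lengths)
```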
## Usage
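```python
from huggingface_hub import hf_hub_download
import pandas as pd

# Download intervals.bed from the dataset repository (cached locally)
file_path = hf_hub_download(
    repo_id="Genentech/GM12878_dnase-data",
    filename="intervals.bed",
    repo_type="dataset"
)

# The file has no header row, so column names are supplied explicitly
df = pd.read_csv(file_path, sep='\t', header=None, names=['chrom', 'start', 'end'])
```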
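
After loading, it can be useful to check the data against the figures listed under Statistics. The following is a rough sketch rather than part of the repository: `summarize_intervals` is a hypothetical helper, the rows in the example are made up, and the default `expected_length=2114` simply mirrors the interval length stated above. It again assumes length is computed as `end - start`.

```python
from collections import Counter


def summarize_intervals(intervals, expected_length=2114):
    """Count (chrom, start, end) tuples and collect any with an unexpected length."""
    per_chrom = Counter()
    unexpected = []
    total = 0
    for chrom, start, end in intervals:
        total += 1
        per_chrom[chrom] += 1
        if end - start != expected_length:
            unexpected.append((chrom, start, end))
    return {
        "n_intervals": total,
        "per_chrom": dict(per_chrom),
        "unexpected_length": unexpected,
    }


# Hypothetical rows for illustration; the last one deliberately has the wrong length.
rows = [("chr1", 0, 2114), ("chr1", 5000, 7114), ("chr2", 100, 2000)]
summary = summarize_intervals(rows)

print(summary["n_intervals"])
print(summary["per_chrom"])
print(summary["unexpected_length"])
```

With the `df` loaded in the snippet above, the same helper could be called as `summarize_intervals(df.itertuples(index=False, name=None))`. If the statistics in this card hold, `n_intervals` should come out as 435,055 and `unexpected_length` should be empty.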
Provider:
Genentech



