DeweyWang/ETH_ML4G_Project-1
Source: Hugging Face · Updated 2026-03-17 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/DeweyWang/ETH_ML4G_Project-1
Resource description:
---
license: mit
task_categories:
- feature-extraction
tags:
- genomics
- gene-expression
- epigenetics
- bioinformatics
- machine-learning
size_categories: 100K<n<1M
language:
- en
pretty_name: Chromatin Landscape
---
# Gene Expression Prediction Dataset
## 📌 Overview
This dataset is designed for predicting gene expression levels from chromatin landscape data, including histone modifications and chromatin accessibility.
It is part of a machine learning project in genomics, where the goal is to model the relationship between epigenetic signals and gene expression.
👉 Full project code (including preprocessing and prediction):
https://github.com/Dewey-Wang/Gene-expression-prediction/tree/main
---
## 📂 Dataset Structure
The dataset consists of two main components:
### 1. Raw Data
- Total size: **18.66 GB**
- Number of files: **72**
- Includes:
  - Histone modification data (ChIP-seq)
  - Chromatin accessibility (DNase-seq)
  - Gene expression (CAGE)
  - Gene annotation (TSS, gene body, RefSeq)
---
### 2. Preprocessed Data
- Total size: **6.36 GB**
- Number of files: **53**
- Includes:
  - Feature matrices for machine learning
  - Aggregated signals around genomic regions (e.g. TSS windows)
  - Normalized inputs ready for model training
👉 Full preprocessing code is available in the GitHub repository above.
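
To make "aggregated signals around TSS windows" concrete, here is a minimal sketch of one common way to build such features: take a fixed window centred on the TSS, split it into equal-width bins, and average the signal in each bin. This is an illustration only, not the repository's preprocessing code. The function name `tss_window_features`, the window size, the bin count, and the zero-padding at chromosome ends are all assumptions.

```python
import numpy as np


def tss_window_features(signal, tss, strand="+", half_window=1000, n_bins=20):
    """Mean signal per bin across [tss - half_window, tss + half_window).

    `signal` is a 1-D per-base track for one chromosome (hypothetical input;
    the real files may be stored in a different form).
    """
    width = 2 * half_window
    if width % n_bins != 0:
        raise ValueError("2 * half_window must be divisible by n_bins")

    start, end = tss - half_window, tss + half_window
    # Assumption: positions outside the chromosome contribute zero signal.
    window = np.zeros(width, dtype=float)
    lo, hi = max(start, 0), min(end, len(signal))
    if lo < hi:
        window[lo - start:hi - start] = signal[lo:hi]

    if strand == "-":
        # Flip minus-strand genes so upstream is always on the left.
        window = window[::-1]

    # One row per bin, then average within each bin.
    return window.reshape(n_bins, -1).mean(axis=1)


# Toy track, not taken from the dataset.
toy_signal = np.arange(100, dtype=float)
features = tss_window_features(toy_signal, tss=50, half_window=10, n_bins=4)
print(features)
```

Repeating this per histone mark and per gene, then stacking the results, would give one feature matrix per cell line. Whether the provided preprocessed files follow this exact layout should be checked against the repository.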
---
## 🎯 Task
The main task is:
**Predict gene expression levels from chromatin features**
- Input: epigenetic signals (ChIP-seq, DNase-seq)
- Output: gene expression values
---
## 📊 Evaluation
Typical evaluation metrics:
- Spearman correlation (primary)
- Pearson correlation
- R² score
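
The three metrics above can be computed as in the sketch below, which uses `scipy.stats` for the correlations and the standard coefficient-of-determination formula for R². The helper name `evaluate` and the toy values are assumptions for illustration; the project's own evaluation script may differ.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr


def evaluate(y_true, y_pred):
    """Return Spearman, Pearson, and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # R^2 as the coefficient of determination: 1 - SS_res / SS_tot.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    return {
        "spearman": float(spearmanr(y_true, y_pred)[0]),
        "pearson": float(pearsonr(y_true, y_pred)[0]),
        "r2": float(1.0 - ss_res / ss_tot),
    }


# Toy values, not taken from the dataset.
observed = [0.0, 1.2, 3.5, 0.4, 2.1]
predicted = [0.1, 1.0, 3.0, 0.9, 2.5]
scores = evaluate(observed, predicted)
print(scores)
```

Spearman depends only on the ranking of genes, so it is unaffected by any monotonic rescaling of the predictions, whereas R² penalises errors in scale and offset. Note that R² computed this way is not the same as the squared Pearson correlation and can be negative for poor predictions.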
---
## 🧬 Data Details
- Genome version: hg38 / GRCh38
- Multiple cell lines included
- Data normalized for cross-cell-line comparison
---
## 🚀 Usage
You can either:
1. Use preprocessed data directly for ML models
2. Reproduce preprocessing using provided code
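
For the first option, one possible way to fetch only part of the dataset is sketched below with the `huggingface_hub` client. The card does not document the folder layout, so the sketch lists the repository's files first and filters by a keyword you supply. The helper names `pick_files` and `download_matching` are hypothetical.

```python
REPO_ID = "DeweyWang/ETH_ML4G_Project-1"


def pick_files(all_files, keyword):
    """Return repository paths containing `keyword` (case-insensitive), sorted."""
    needle = keyword.lower()
    return sorted(path for path in all_files if needle in path.lower())


def download_matching(keyword, repo_id=REPO_ID):
    """Download every file whose path matches `keyword`; return local paths."""
    # Imported here so pick_files works even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = list_repo_files(repo_id, repo_type="dataset")
    return [
        hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")
        for name in pick_files(files, keyword)
    ]


# Offline demonstration of the filter on made-up paths.
example_paths = ["Raw/track_a.bw", "Preprocessed/features_a.npy", "README.md"]
print(pick_files(example_paths, "preprocessed"))
```

Calling `download_matching("preprocess")` would require network access, and the keyword is a guess; inspect the output of `list_repo_files` first to see how the raw and preprocessed files are actually organised.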
---
## ⚠️ Notes
- Raw data is large (~18.66 GB)
- Preprocessed data is recommended for quick experimentation
- Suitable for machine learning and bioinformatics research
Provider:
DeweyWang



