
DeweyWang/ETH_ML4G_Project-1

Source: Hugging Face. Updated: 2026-03-17. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/DeweyWang/ETH_ML4G_Project-1
Description:
---
license: mit
task_categories:
- feature-extraction
tags:
- genomics
- gene-expression
- epigenetics
- bioinformatics
- machine-learning
size_categories: 100K<n<1M
language:
- en
pretty_name: Chromatin Landscape
---

# Gene Expression Prediction Dataset

## 📌 Overview

This dataset is designed for predicting gene expression levels from chromatin landscape data, including histone modifications and chromatin accessibility. It is part of a machine learning project in genomics, where the goal is to model the relationship between epigenetic signals and gene expression.

👉 Full project code (including preprocessing and prediction): https://github.com/Dewey-Wang/Gene-expression-prediction/tree/main

---

## 📂 Dataset Structure

The dataset consists of two main components:

### 1. Raw Data

- Total size: **18.66 GB**
- Number of files: **72**
- Includes:
  - Histone modification data (ChIP-seq)
  - Chromatin accessibility (DNase-seq)
  - Gene expression (CAGE)
  - Gene annotation (TSS, gene body, RefSeq)

---

### 2. Preprocessed Data

- Total size: **6.36 GB**
- Number of files: **53**
- Includes:
  - Feature matrices for machine learning
  - Aggregated signals around genomic regions (e.g. TSS windows)
  - Normalized inputs ready for model training

👉 Full preprocessing code is available in the GitHub repository above.

---

## 🎯 Task

The main task is:

**Predict gene expression levels from chromatin features**

- Input: epigenetic signals (ChIP-seq, DNase-seq)
- Output: gene expression values

---

## 📊 Evaluation

Typical evaluation metrics:

- Spearman correlation (primary)
- Pearson correlation
- R² score

---

## 🧬 Data Details

- Genome version: hg38 / GRCh38
- Multiple cell lines included
- Data normalized for cross-cell-line comparison

---

## 🚀 Usage

You can either:

1. Use preprocessed data directly for ML models
2. Reproduce preprocessing using provided code

---

## ⚠️ Notes

- Raw data is large (~18.66 GB)
- Preprocessed data is recommended for quick experimentation
- Suitable for machine learning and bioinformatics research
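
The three evaluation metrics named in the dataset card (Spearman correlation, Pearson correlation, R² score) can be sketched in plain Python as follows. This is a minimal illustration only, not the project's own evaluation code: the function names and the toy `y_true` / `y_pred` values are assumptions made for the example, and in practice a library such as SciPy or scikit-learn would normally be used instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values standing in for measured and predicted expression (hypothetical).
y_true = [0.0, 1.0, 2.0, 3.0, 4.0]
y_pred = [0.1, 0.9, 2.5, 2.4, 4.2]

print("Spearman:", spearman(y_true, y_pred))
print("Pearson: ", pearson(y_true, y_pred))
print("R2:      ", r2_score(y_true, y_pred))
```

Spearman depends only on the ordering of the predictions, so it is insensitive to any monotonic rescaling of the predicted expression values. Pearson and R² are sensitive to scale and offset.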
Provider:
DeweyWang