leannmlindsey/GUE
Source: Hugging Face. Updated 2024-05-22; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/leannmlindsey/GUE
Resource description:
---
configs:
- config_name: emp_H3
  data_files:
  - split: train
    path: "GUE/emp_H3/train.csv"
  - split: test
    path: "GUE/emp_H3/test.csv"
  - split: dev
    path: "GUE/emp_H3/dev.csv"
- config_name: emp_H3K14ac
  data_files:
  - split: train
    path: "GUE/emp_H3K14ac/train.csv"
  - split: test
    path: "GUE/emp_H3K14ac/test.csv"
  - split: dev
    path: "GUE/emp_H3K14ac/dev.csv"
- config_name: emp_H3K36me3
  data_files:
  - split: train
    path: "GUE/emp_H3K36me3/train.csv"
  - split: test
    path: "GUE/emp_H3K36me3/test.csv"
  - split: dev
    path: "GUE/emp_H3K36me3/dev.csv"
- config_name: emp_H3K4me1
  data_files:
  - split: train
    path: "GUE/emp_H3K4me1/train.csv"
  - split: test
    path: "GUE/emp_H3K4me1/test.csv"
  - split: dev
    path: "GUE/emp_H3K4me1/dev.csv"
- config_name: emp_H3K4me2
  data_files:
  - split: train
    path: "GUE/emp_H3K4me2/train.csv"
  - split: test
    path: "GUE/emp_H3K4me2/test.csv"
  - split: dev
    path: "GUE/emp_H3K4me2/dev.csv"
- config_name: emp_H3K4me3
  data_files:
  - split: train
    path: "GUE/emp_H3K4me3/train.csv"
  - split: test
    path: "GUE/emp_H3K4me3/test.csv"
  - split: dev
    path: "GUE/emp_H3K4me3/dev.csv"
- config_name: emp_H3K79me3
  data_files:
  - split: train
    path: "GUE/emp_H3K79me3/train.csv"
  - split: test
    path: "GUE/emp_H3K79me3/test.csv"
  - split: dev
    path: "GUE/emp_H3K79me3/dev.csv"
- config_name: emp_H3K9ac
  data_files:
  - split: train
    path: "GUE/emp_H3K9ac/train.csv"
  - split: test
    path: "GUE/emp_H3K9ac/test.csv"
  - split: dev
    path: "GUE/emp_H3K9ac/dev.csv"
- config_name: emp_H4
  data_files:
  - split: train
    path: "GUE/emp_H4/train.csv"
  - split: test
    path: "GUE/emp_H4/test.csv"
  - split: dev
    path: "GUE/emp_H4/dev.csv"
- config_name: emp_H4ac
  data_files:
  - split: train
    path: "GUE/emp_H4ac/train.csv"
  - split: test
    path: "GUE/emp_H4ac/test.csv"
  - split: dev
    path: "GUE/emp_H4ac/dev.csv"
- config_name: human_tf_0
  data_files:
  - split: train
    path: "GUE/human_tf_0/train.csv"
  - split: test
    path: "GUE/human_tf_0/test.csv"
  - split: dev
    path: "GUE/human_tf_0/dev.csv"
- config_name: human_tf_1
  data_files:
  - split: train
    path: "GUE/human_tf_1/train.csv"
  - split: test
    path: "GUE/human_tf_1/test.csv"
  - split: dev
    path: "GUE/human_tf_1/dev.csv"
- config_name: human_tf_2
  data_files:
  - split: train
    path: "GUE/human_tf_2/train.csv"
  - split: test
    path: "GUE/human_tf_2/test.csv"
  - split: dev
    path: "GUE/human_tf_2/dev.csv"
- config_name: human_tf_3
  data_files:
  - split: train
    path: "GUE/human_tf_3/train.csv"
  - split: test
    path: "GUE/human_tf_3/test.csv"
  - split: dev
    path: "GUE/human_tf_3/dev.csv"
- config_name: human_tf_4
  data_files:
  - split: train
    path: "GUE/human_tf_4/train.csv"
  - split: test
    path: "GUE/human_tf_4/test.csv"
  - split: dev
    path: "GUE/human_tf_4/dev.csv"
- config_name: mouse_0
  data_files:
  - split: train
    path: "GUE/mouse_0/train.csv"
  - split: test
    path: "GUE/mouse_0/test.csv"
  - split: dev
    path: "GUE/mouse_0/dev.csv"
- config_name: mouse_1
  data_files:
  - split: train
    path: "GUE/mouse_1/train.csv"
  - split: test
    path: "GUE/mouse_1/test.csv"
  - split: dev
    path: "GUE/mouse_1/dev.csv"
- config_name: mouse_2
  data_files:
  - split: train
    path: "GUE/mouse_2/train.csv"
  - split: test
    path: "GUE/mouse_2/test.csv"
  - split: dev
    path: "GUE/mouse_2/dev.csv"
- config_name: mouse_3
  data_files:
  - split: train
    path: "GUE/mouse_3/train.csv"
  - split: test
    path: "GUE/mouse_3/test.csv"
  - split: dev
    path: "GUE/mouse_3/dev.csv"
- config_name: mouse_4
  data_files:
  - split: train
    path: "GUE/mouse_4/train.csv"
  - split: test
    path: "GUE/mouse_4/test.csv"
  - split: dev
    path: "GUE/mouse_4/dev.csv"
- config_name: prom_300_all
  data_files:
  - split: train
    path: "GUE/prom_300_all/train.csv"
  - split: test
    path: "GUE/prom_300_all/test.csv"
  - split: dev
    path: "GUE/prom_300_all/dev.csv"
- config_name: prom_300_notata
  data_files:
  - split: train
    path: "GUE/prom_300_notata/train.csv"
  - split: test
    path: "GUE/prom_300_notata/test.csv"
  - split: dev
    path: "GUE/prom_300_notata/dev.csv"
- config_name: prom_300_tata
  data_files:
  - split: train
    path: "GUE/prom_300_tata/train.csv"
  - split: test
    path: "GUE/prom_300_tata/test.csv"
  - split: dev
    path: "GUE/prom_300_tata/dev.csv"
- config_name: prom_core_all
  data_files:
  - split: train
    path: "GUE/prom_core_all/train.csv"
  - split: test
    path: "GUE/prom_core_all/test.csv"
  - split: dev
    path: "GUE/prom_core_all/dev.csv"
- config_name: prom_core_notata
  data_files:
  - split: train
    path: "GUE/prom_core_notata/train.csv"
  - split: test
    path: "GUE/prom_core_notata/test.csv"
  - split: dev
    path: "GUE/prom_core_notata/dev.csv"
- config_name: prom_core_tata
  data_files:
  - split: train
    path: "GUE/prom_core_tata/train.csv"
  - split: test
    path: "GUE/prom_core_tata/test.csv"
  - split: dev
    path: "GUE/prom_core_tata/dev.csv"
- config_name: splice_reconstructed
  data_files:
  - split: train
    path: "GUE/splice_reconstructed/train.csv"
  - split: test
    path: "GUE/splice_reconstructed/test.csv"
  - split: dev
    path: "GUE/splice_reconstructed/dev.csv"
- config_name: virus_covid
  data_files:
  - split: train
    path: "GUE/virus_covid/train.csv"
  - split: test
    path: "GUE/virus_covid/test.csv"
  - split: dev
    path: "GUE/virus_covid/dev.csv"
---
This is a copy of the Genome Understanding Evaluation (GUE) benchmark, which was presented in
"DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome"
by Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu.
The original benchmark is available to download directly from
https://github.com/MAGICS-LAB/DNABERT_2
If you use this dataset, please cite:
@misc{zhou2023dnabert2,
      title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
      author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
      year={2023},
      eprint={2306.15006},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}
**Instructions to Load Dataset in Google Colab**
```
from datasets import load_dataset, get_dataset_config_names

# List every available configuration (one per GUE task)
config_names = get_dataset_config_names("leannmlindsey/GUE")
print(config_names)

# Choose the configuration that you wish to load, e.g. prom_core_all
prom_core_all = load_dataset("leannmlindsey/GUE", name="prom_core_all")

# Show the available splits, then the first training example
print(prom_core_all)
print(prom_core_all["train"][0])
```
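Each configuration in the card's metadata maps its three splits to files that follow one pattern, `GUE/<config_name>/<split>.csv`. As a minimal sketch, the helper below rebuilds that mapping for any configuration. The name `split_paths` and its `root` parameter are hypothetical and are not part of the dataset or of the `datasets` library.

```python
# Hypothetical helper: rebuilds the split -> file mapping that the card's
# `configs` metadata declares for each configuration.
SPLITS = ("train", "test", "dev")


def split_paths(config_name: str, root: str = "GUE") -> dict[str, str]:
    """Return {split: relative CSV path} for one GUE configuration."""
    return {split: f"{root}/{config_name}/{split}.csv" for split in SPLITS}


paths = split_paths("prom_core_all")
for split, path in paths.items():
    print(f"{split}: {path}")
```

A dictionary of this shape can be passed as `data_files` to `load_dataset("csv", data_files=...)`, assuming the CSV files have already been downloaded to matching local paths.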
Provided by: leannmlindsey
Summary of original information
Dataset overview
Dataset configuration list
Every configuration provides three CSV files: a training set at "GUE/<config_name>/train.csv", a test set at "GUE/<config_name>/test.csv", and a validation set at "GUE/<config_name>/dev.csv". The 28 configurations are:
- emp_H3
- emp_H3K14ac
- emp_H3K36me3
- emp_H3K4me1
- emp_H3K4me2
- emp_H3K4me3
- emp_H3K79me3
- emp_H3K9ac
- emp_H4
- emp_H4ac
- human_tf_0
- human_tf_1
- human_tf_2
- human_tf_3
- human_tf_4
- mouse_0
- mouse_1
- mouse_2
- mouse_3
- mouse_4
- prom_300_all
- prom_300_notata
- prom_300_tata
- prom_core_all
- prom_core_notata
- prom_core_tata
- splice_reconstructed
- virus_covid
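The configuration names carry a prefix that hints at the task family they belong to. The sketch below groups the names by that prefix. The grouping rule (everything before the first underscore) is an illustrative assumption, not something the dataset defines, and the function name `family` is hypothetical.

```python
from collections import defaultdict

# Configuration names exactly as listed in the dataset card.
CONFIG_NAMES = (
    [f"emp_{m}" for m in ("H3", "H3K14ac", "H3K36me3", "H3K4me1", "H3K4me2",
                          "H3K4me3", "H3K79me3", "H3K9ac", "H4", "H4ac")]
    + [f"human_tf_{i}" for i in range(5)]
    + [f"mouse_{i}" for i in range(5)]
    + [f"prom_300_{v}" for v in ("all", "notata", "tata")]
    + [f"prom_core_{v}" for v in ("all", "notata", "tata")]
    + ["splice_reconstructed", "virus_covid"]
)


def family(config_name: str) -> str:
    """Illustrative rule: the family is the text before the first underscore."""
    return config_name.split("_", 1)[0]


groups = defaultdict(list)
for name in CONFIG_NAMES:
    groups[family(name)].append(name)

for fam, names in groups.items():
    print(f"{fam}: {len(names)} configs")
```

Grouping like this can be convenient when looping over one family at a time, for example to evaluate a model on all `emp_` configurations before moving on to the promoter ones.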
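Collected summary
Dataset introduction

Construction
In genomics, building a benchmark dataset means balancing species diversity against task coverage. GUE brings together genomic sequences from multiple species, including human, mouse, viruses, and fungi, and organizes them around key biological tasks such as promoter detection, transcription factor binding, and histone modification prediction. Raw genomic data is converted into a standardized CSV format and split into training, validation (dev), and test sets, so the data can be used directly in machine learning workflows and evaluated consistently across models.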
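Usage
The dataset can be loaded with the Hugging Face datasets library. First call get_dataset_config_names to list the available configurations, then call load_dataset with a configuration name (for example, prom_core_all) to load the corresponding subset. Each configuration comes with standard train, test, and dev splits that can be used directly for model training and evaluation, and the tabular format supports straightforward iteration and batching.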
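To illustrate the iteration and batching mentioned here, the following is a minimal sketch that reads rows from a CSV split and yields fixed-size batches using only the standard library. The column names `sequence` and `label`, the toy rows, and the function name `iter_batches` are assumptions made for illustration; check the header of the actual files before relying on them.

```python
import csv
import io
from itertools import islice

# Toy stand-in for one split file; the header and rows are assumed, not real data.
TOY_CSV = """sequence,label
ACGTACGT,1
TTGACCAA,0
GGGCATAT,1
CATGCATG,0
ACACACAC,1
"""


def iter_batches(handle, batch_size):
    """Yield lists of up to `batch_size` row dicts from an open CSV handle."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(io.StringIO(TOY_CSV), batch_size=2))
for i, batch in enumerate(batches):
    seqs = [row["sequence"] for row in batch]
    labels = [int(row["label"]) for row in batch]
    print(i, seqs, labels)
```

With the `datasets` library, a comparable result is available by slicing a loaded split, for example `prom_core_all["train"][0:2]`, which returns a dictionary of column lists.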
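Background and challenges
Background overview
The Genome Understanding Evaluation (GUE) dataset was built in 2023 by Zhihan Zhou and colleagues as part of the DNABERT-2 project, with the goal of providing a standardized benchmark for multi-species genome sequence analysis. It covers sequence tasks for several kinds of organisms, including human, mouse, viruses, and fungi, and spans core bioinformatics problems such as transcription factor binding site prediction, histone modification identification, promoter classification, and species classification. The benchmark supports the field's move toward large-scale pretrained models by offering a common way to measure how well a model generalizes across genome function annotation tasks.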
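Current challenges
GUE targets several difficulties in genome sequence understanding: accurately identifying functional elements across species, detecting weak signal patterns within sequences, and modeling high-dimensional, sparse sequence data. Building the benchmark involves integrating heterogeneous data, for example reconciling histone modification and transcription factor binding data produced on different experimental platforms. Keeping sequence labels biologically accurate and consistent is a further challenge that requires careful data cleaning and standardization. Finally, the benchmark has to stay balanced and representative across species and tasks so that model evaluation remains robust.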
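Common scenarios
Typical use cases
GUE is typically used as a benchmark for training and validating deep learning models that predict the function of DNA sequences. By combining tasks such as histone modification prediction, transcription factor binding site prediction, and promoter detection for species including human and mouse, it offers a single test bed that spans species and functions. Researchers can use the labeled data to evaluate systematically how well a model generalizes across genomic element classification tasks.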
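Academic problems addressed
GUE addresses the lack of a unified evaluation standard for genomic models and provides a rigorous benchmark for multi-task and transfer learning. Because it covers tasks such as histone modifications (for example H3K4me3), transcription factor binding, and virus species classification, researchers can quantify how accurately a model annotates genome function across species. A shared benchmark of this kind makes genomic deep learning models easier to compare and improve, and provides data for studying how epigenetic regulation relates to sequence function.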
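As a sketch of how such a quantitative comparison might be scored, the block below computes accuracy and the Matthews correlation coefficient (MCC) for binary predictions in plain Python. The choice of MCC is an assumption made for illustration, so consult the DNABERT-2 paper for the metrics it actually reports. The labels and predictions are made up.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is taken to be 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"accuracy={accuracy(y_true, y_pred):.3f}  mcc={mcc(y_true, y_pred):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it less flattering to a model that simply predicts the majority class on an imbalanced split.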
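Practical applications
In practice, GUE serves as a data resource for precision medicine and biotechnology. For example, in COVID-19 sequence analysis the dataset can be used to train models that recognize viral features; in cancer research, histone modification data from human cell lines such as K562 and HeLa-S3 can help in analyzing tumor epigenetic mechanisms. Applications like these may support the development of diagnostic tools and targeted therapies.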
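Recent research on the dataset
Latest research directions
As a benchmark for multi-species genome understanding, GUE is associated with current research on cross-species prediction of genome function and on the mechanisms of epigenetic regulation. The dataset combines genomic sequences with epigenetic annotations, such as histone modifications and transcription factor binding sites, from human, mouse, viruses, and fungi, making it a resource for developing efficient genome foundation models. Current work focuses on Transformer-based pretrained models such as DNABERT-2, which aim to capture complex patterns in genomic sequences in order to predict gene expression, splicing events, and pathogen evolution. These advances are relevant to precision medicine and synthetic biology, and to public health problems such as tracking SARS-CoV-2 variants.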
The content above was collected and summarized by 遇见数据集.