bioinfoihb/Fish_GUE
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/bioinfoihb/Fish_GUE
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-classification
language:
- en
tags:
- genomics
- fish
- DNA
- benchmark
- promoter
- splice-site
- transcription-factor-binding-site
- histone-mark
size_categories:
- 100K<n<1M
pretty_name: FishGUE
---
# FishGUE
**FishGUE** is a unified fish genomics benchmark developed for evaluating DNA foundation models and other sequence models on fish genomic prediction tasks.
It was introduced in the FishNALM study, where it was used to systematically benchmark fish-specific and general DNA language models across diverse regulatory sequence prediction tasks.
## Dataset summary
FishGUE contains **17 supervised prediction tasks** spanning four task groups:
- **Histone mark prediction**
- **Transcription factor binding-site prediction**
- **Promoter prediction**
- **Splice-site prediction**
The benchmark covers input sequence lengths from **300 bp to 2500 bp** and includes both **zebrafish-based datasets** and **multi-species cyprinid splice datasets**.
## Why FishGUE?
Fish genomic benchmarks remain limited compared with those available for human, plant, or broad multi-species genomic modeling. FishGUE was constructed to provide a unified benchmark for evaluating model performance on representative fish genomic sequence prediction problems, especially those relevant to regulatory sequence recognition and gene structure annotation.
## Benchmark composition
The FishGUE benchmark includes the following 17 tasks.
| Category | Task | Species | Sequence length (bp) | Train / Val / Test |
|---|---|---|---:|---:|
| Histone mark prediction | H3K4me1 | *Danio rerio* | variable (≤2500) | 48815 / 6102 / 6102 |
| Histone mark prediction | H3K4me3 | *Danio rerio* | variable (≤2500) | 47482 / 5935 / 5936 |
| Histone mark prediction | H3K9me3 | *Danio rerio* | variable (≤2500) | 40125 / 5016 / 5016 |
| Histone mark prediction | H3K27ac | *Danio rerio* | variable (≤2500) | 27611 / 3451 / 3452 |
| Histone mark prediction | H3K27me3 | *Danio rerio* | variable (≤2500) | 27017 / 3377 / 3378 |
| TF binding-site prediction | CTCF | *Danio rerio* | 800 | 28526 / 3566 / 3566 |
| TF binding-site prediction | Pou5f1 | *Danio rerio* | 800 | 11128 / 1391 / 1391 |
| TF binding-site prediction | Sox2 | *Danio rerio* | 800 | 9717 / 1215 / 1215 |
| Promoter prediction | Core promoter | *Danio rerio* | 300 | 17222 / 2153 / 2153 |
| Promoter prediction | Core promoter (TATA) | *Danio rerio* | 300 | 3464 / 433 / 434 |
| Promoter prediction | Core promoter (non-TATA) | *Danio rerio* | 300 | 13757 / 1720 / 1720 |
| Promoter prediction | Promoter | *Danio rerio* | 1000 | 17222 / 2153 / 2153 |
| Promoter prediction | Promoter (TATA) | *Danio rerio* | 1000 | 3464 / 433 / 434 |
| Promoter prediction | Promoter (non-TATA) | *Danio rerio* | 1000 | 13757 / 1720 / 1720 |
| Splice-site prediction | Splicing (acceptor) | Cyprinidae (5 species) | 600 | 17200 / 2150 / 2150 |
| Splice-site prediction | Splicing (donor) | Cyprinidae (5 species) | 600 | 27600 / 3450 / 3450 |
| Splice-site prediction | Splicing (both) | Cyprinidae (5 species) | 600 | 1720 / 2150 / 2105 |
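
The sequence lengths in the table can serve as a quick sanity check when loading a task. The sketch below is illustrative only: the family keys and the `length_ok` helper are hypothetical names, and the lower bound for histone tasks is an assumption, since the table gives only the upper bound.

```python
# (min_bp, max_bp) per task family, transcribed from the table above.
# The family keys are hypothetical labels, not identifiers from the dataset.
LENGTH_BOUNDS = {
    "histone": (1, 2500),        # variable length; lower bound of 1 is an assumption
    "tfbs": (800, 800),
    "core_promoter": (300, 300),
    "promoter": (1000, 1000),
    "splice": (600, 600),
}


def length_ok(family, sequence):
    """Return True if the sequence length fits the bounds listed for its task family."""
    lo, hi = LENGTH_BOUNDS[family]
    return lo <= len(sequence) <= hi


print(length_ok("tfbs", "A" * 800))      # True
print(length_ok("histone", "A" * 2501))  # False
```

A check like this is cheap to run over each split before training and helps catch truncated or mis-assigned files early.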
## Data sources and construction overview
According to the FishNALM manuscript, FishGUE was assembled from the following sources:
- **Histone mark datasets** were derived from public zebrafish ChIP-seq peak datasets.
- **Transcription factor binding-site datasets** were built from public ChIP-seq peaks for **CTCF**, **Pou5f1**, and **Sox2**.
- **Promoter datasets** were constructed from zebrafish promoter annotations from the **Eukaryotic Promoter Database (EPD)**.
- **Splice-site datasets** were compiled from annotated genomes of **five cyprinid fish species**.
For all tasks, positive and negative examples were constructed through task-specific processing, then split into **training**, **validation**, and **test** sets using an **8:1:1** ratio.
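
As a rough illustration of an 8:1:1 split, the sketch below shuffles the examples, takes the first 80% (rounded down) for training, and divides the remainder evenly between validation and test. The card does not state the exact procedure, so the function name `split_8_1_1`, the seed, and the rounding rule are all assumptions. This rounding happens to reproduce the H3K4me1 counts in the table (48815 / 6102 / 6102 from 61019 examples).

```python
import random


def split_8_1_1(records, seed=0):
    """Shuffle records and split them into train/val/test at roughly 8:1:1."""
    items = list(records)
    random.Random(seed).shuffle(items)       # deterministic shuffle for a given seed
    n_train = len(items) * 8 // 10           # 80% of the data, rounded down
    n_val = (len(items) - n_train) // 2      # half of the remainder, rounded down
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # whatever is left
    return train, val, test


train, val, test = split_8_1_1(range(61019))
print(len(train), len(val), len(test))  # 48815 6102 6102
```

Fixing the seed keeps the partition reproducible, which matters when several models are compared on the same test set.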
## Suggested repository organization
If you plan to upload FishGUE as a single Hugging Face dataset repository, a clean structure is:
```text
FishGUE/
├── README.md
├── histone/
│   ├── H3K4me1_train.tsv
│   ├── H3K4me1_val.tsv
│   ├── H3K4me1_test.tsv
│   └── ...
├── tfbs/
│   ├── CTCF_train.tsv
│   ├── CTCF_val.tsv
│   ├── CTCF_test.tsv
│   └── ...
├── promoter/
│   ├── core_promoter_300_train.tsv
│   └── ...
└── splice/
    ├── splice_acceptor_train.tsv
    └── ...
```
If your files are already packaged in another layout, you can keep that layout and simply explain it in this README.
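
Under the suggested layout, each file path follows the pattern `<group>/<task>_<split>.tsv`. The sketch below builds such a path and parses a TSV into `(sequence, label)` pairs using only the standard library. The column names `sequence` and `label`, the header row, and both helper functions are assumptions made for illustration; adjust them to match the actual files.

```python
import csv
import io
from pathlib import PurePosixPath


def split_path(group, task, split):
    """Build a relative path such as 'tfbs/CTCF_train.tsv' for the suggested layout."""
    return str(PurePosixPath(group) / f"{task}_{split}.tsv")


def read_examples(handle, seq_col="sequence", label_col="label"):
    """Read (sequence, label) pairs from a TSV with a header row.

    The column names are assumed, not taken from the dataset itself.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row[seq_col].upper(), int(row[label_col])) for row in reader]


# In-memory stand-in for a real file, so the sketch runs without downloads.
demo = io.StringIO("sequence\tlabel\nACGTTGCA\t1\nttgacca\t0\n")
examples = read_examples(demo)

print(split_path("tfbs", "CTCF", "train"))  # tfbs/CTCF_train.tsv
print(examples)                             # [('ACGTTGCA', 1), ('TTGACCA', 0)]
```

For real files, pass `open(path, newline="")` as the handle instead of the in-memory buffer.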
## Intended uses
FishGUE is intended for:
- benchmarking DNA foundation models on fish genomic sequence prediction tasks
- evaluating transfer learning performance in fish genomics
- comparing task robustness across promoter, chromatin, TFBS, and splice-site prediction settings
- developing new fish-specific sequence models and downstream classifiers
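
For benchmarking binary classification tasks such as these, a class-balance-aware score is often useful. This card does not name an evaluation metric, so the sketch below uses the Matthews correlation coefficient (MCC) purely as an assumption; the function name `mcc` is hypothetical and labels are assumed to be 0 or 1.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Reporting the same metric on every task's test split keeps comparisons across the four task groups consistent.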
## Limitations
- All non-splice tasks (histone mark, TF binding-site, and promoter prediction) are based on **zebrafish** (*Danio rerio*) data.
- The splice-site benchmark uses **five cyprinid species**, so broader phylogenetic coverage remains limited.
- FishGUE is designed as a **research benchmark** and should not be interpreted as a clinical or diagnostic dataset.
- Performance on FishGUE does not guarantee performance on all fish species or all regulatory genomics tasks.
## Citation
If you use FishGUE in your work, please cite the FishNALM manuscript.
**Manuscript:**
> FishNALM: A Foundation DNA Language Model for Fish Genomes
## Project links
- **GitHub**: [bioinfoihb/FishNALM](https://github.com/bioinfoihb/FishNALM)
## Contact
**Xiao-Qin Xia**
Institute of Hydrobiology, Chinese Academy of Sciences
Email: xqxia@ihb.ac.cn
Email: bioinfoihb@ihb.ac.cn
Provider:
bioinfoihb