bioinfoihb/FishNALM-8-pretrain-corpus
Source: Hugging Face (updated 2026-04-15, indexed 2026-04-26)
Download link: https://hf-mirror.com/datasets/bioinfoihb/FishNALM-8-pretrain-corpus
Resource description:
---
pretty_name: FishNALM-8 Pretraining Corpus
language:
- en
license: cc-by-nc-4.0
tags:
- genomics
- fish
- DNA
- pretraining
- foundation-model
- FishNALM
- cyprinid
---
# FishNALM-8 Pretraining Corpus
The **FishNALM-8 Pretraining Corpus** is the sequence corpus used to pretrain the FishNALM-8 family of fish-specific foundation DNA language models.
This dataset contains processed genomic sequence windows derived from **8 representative cyprinid fish genomes**. It was constructed for large-scale self-supervised pretraining of DNA language models in the FishNALM project.
## Dataset summary
FishNALM is a family of DNA foundation models developed specifically for fish genomes. In the FishNALM study, the 8-species corpus was designed to emphasize **core cyprinid lineage sequence patterns** while maintaining a unified preprocessing workflow across genomes.
In this repository, the pretraining corpus is provided as plain text sequence data. Each record corresponds to one processed genomic sequence window used for model pretraining.
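A minimal sketch of streaming the records without loading the whole corpus into memory. It assumes one sequence window per line, which the description above suggests but does not state explicitly. The helper name `iter_windows` is hypothetical and is not part of any FishNALM tooling.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_windows(path: Union[str, Path]) -> Iterator[str]:
    """Yield one sequence window per non-empty line.

    Assumption: the corpus stores exactly one window per line.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            seq = line.strip()
            if seq:  # skip blank lines
                yield seq


# Example usage (the file name is taken from the repository layout in this card):
# for i, seq in enumerate(iter_windows("fishnalm8_genome3000.txt")):
#     if i == 3:
#         break
#     print(len(seq), seq[:60])
```

Because the function is a generator, it can feed a tokenizer or a batching step directly, which matters for a corpus of this size.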
## Dataset construction
The FishNALM-8 pretraining corpus was built from curated fish reference genomes downloaded from public resources and processed through a unified genome preparation workflow. According to the manuscript:
- the corpus was constructed from **8 representative cyprinid fish genomes**
- only **major chromosomes** were retained for downstream corpus construction
- non-ATCG characters were normalized to **N**
- genomes were segmented into **3,000 bp windows**
- windows were filtered and stratified according to repeat-content structure
- the retained pretraining sequence volume after filtering was approximately **3.13 Gb**
These design choices were intended to make the training corpus more balanced across species and better suited to fish genomic sequence modeling.
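Two of the steps listed above, character normalization and 3,000 bp segmentation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: uppercasing before normalization, non-overlapping windows, and dropping a trailing fragment shorter than one window are all guesses, and the function names are hypothetical.

```python
import re
from typing import Iterator

WINDOW_SIZE = 3000  # window length stated in this card

_NON_ACGT = re.compile(r"[^ACGT]")


def normalize(seq: str) -> str:
    """Uppercase the sequence (assumption) and map every non-ACGT character to N."""
    return _NON_ACGT.sub("N", seq.upper())


def windows(seq: str, size: int = WINDOW_SIZE) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of `size` bases.

    Assumption: a trailing fragment shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


# Example usage on a toy chromosome string:
# chrom = normalize("acgtnnRYacgt" * 1000)
# pieces = list(windows(chrom))
```

Major-chromosome selection and the repeat-content filtering and stratification are not sketched here, because this card does not give their criteria or thresholds.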
## Data characteristics
This corpus is intended for **DNA language model pretraining**. It provides no supervised labels and is not meant for benchmark evaluation.
Key characteristics include:
- fish-specific genomic pretraining data
- sequence windows at a uniform **3 kb** (3,000 bp) scale
- preprocessing designed to reduce the impact of assembly artifacts, non-primary sequences and highly imbalanced repeat composition
- compatibility with the FishNALM pretraining framework described in the manuscript
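A quick sanity check of these characteristics on a local copy might look like the sketch below. It tallies window lengths, the fraction of `N` bases, and windows containing characters outside `ACGTN`. The function name `summarize_windows` and the returned keys are hypothetical, and case-insensitive matching is an assumption.

```python
from collections import Counter
from typing import Any, Dict, Iterable

ALPHABET = set("ACGTN")


def summarize_windows(records: Iterable[str]) -> Dict[str, Any]:
    """Summarize window lengths and alphabet usage for an iterable of sequences."""
    lengths: Counter = Counter()
    total_bases = 0
    n_bases = 0
    off_alphabet = 0
    for seq in records:
        upper = seq.upper()  # assumption: case carries no meaning for this check
        lengths[len(upper)] += 1
        total_bases += len(upper)
        n_bases += upper.count("N")
        if not set(upper) <= ALPHABET:
            off_alphabet += 1  # window has at least one unexpected character
    return {
        "n_windows": sum(lengths.values()),
        "length_counts": dict(lengths),
        "total_bases": total_bases,
        "n_fraction": n_bases / total_bases if total_bases else 0.0,
        "off_alphabet_windows": off_alphabet,
    }
```

If "Gb" in this card means gigabases and every window is a full 3,000 bp, a 3.13 Gb corpus would correspond to roughly one million windows, so `n_windows` far from that order of magnitude would suggest a different record layout.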
## Recommended repository structure
A clean structure for this dataset repository is:
```text
FishNALM-8-pretrain-corpus/
├── fishnalm8_genome3000.txt
└── README.md
```
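Assuming the repository follows this layout, the corpus file could be fetched with the `huggingface_hub` client as sketched below. The file name comes from the recommended structure above and may not match what is actually published. The wrapper `download_corpus` is a hypothetical helper.

```python
REPO_ID = "bioinfoihb/FishNALM-8-pretrain-corpus"
FILENAME = "fishnalm8_genome3000.txt"  # assumed from the recommended layout


def download_corpus(repo_id: str = REPO_ID, filename: str = FILENAME) -> str:
    """Download one file from the dataset repository and return its local path.

    Requires the `huggingface_hub` package and network access.
    """
    # Imported lazily so the module can be loaded without the dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",  # this is a dataset repository, not a model
    )


# Example usage:
# local_path = download_corpus()
# print(local_path)
```

The returned path points into the local Hugging Face cache, so repeated calls reuse the downloaded file instead of fetching it again.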
## Recommended uses
This dataset is intended for:
- self-supervised pretraining of fish genomic language models
- methodological studies on fish genome representation learning
- reproducibility and documentation of the FishNALM pretraining setup
## Limitations
- This corpus was designed for **fish genome pretraining**, especially the cyprinid-focused FishNALM-8 setting.
- It is not a labeled downstream benchmark dataset.
- Performance and transferability outside fish genomic contexts may be limited.
- The dataset card summarizes the preprocessing strategy, but detailed species lists and assembly metadata should be reported alongside the manuscript's supplementary materials.
## Related resources
- **Project**: FishNALM
- **GitHub**: [bioinfoihb/FishNALM](https://github.com/bioinfoihb/FishNALM)
- **Manuscript**: *FishNALM: A Foundation DNA Language Model for Fish Genomes*
- **Organization**: Institute of Hydrobiology, Chinese Academy of Sciences
## Contact
**Xiao-Qin Xia**
Institute of Hydrobiology, Chinese Academy of Sciences
Email: xqxia@ihb.ac.cn
Email: bioinfoihb@ihb.ac.cn
Provider: bioinfoihb



