bioinfoihb/FishNALM-20-pretrain-corpus

Name: bioinfoihb/FishNALM-20-pretrain-corpus
Creator: bioinfoihb
Published: 2026-04-15 03:42:28
License: No description available

Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/bioinfoihb/FishNALM-20-pretrain-corpus

Description:

---
pretty_name: FishNALM-20 Pretraining Corpus
language:
- en
license: cc-by-nc-4.0
tags:
- genomics
- fish
- DNA
- pretraining
- foundation-model
- FishNALM
- vertebrate-genomics
---

# FishNALM-20 Pretraining Corpus

The **FishNALM-20 Pretraining Corpus** is the sequence corpus used to pretrain the FishNALM-20 family of fish-specific foundation DNA language models.

This dataset contains processed genomic sequence windows derived from **20 diverse fish genomes**. It was constructed to expand the phylogenetic breadth of fish genomic pretraining within the FishNALM project.

## Dataset summary

FishNALM is a fish-specific DNA foundation model family developed for fish genomes. In the FishNALM study, the 20-species corpus was designed to provide broader fish lineage coverage and improve representation learning across more diverse genomic backgrounds.

In this repository, the pretraining corpus is provided as plain text sequence data. Each record corresponds to one processed genomic sequence window used for model pretraining.

## Dataset construction

The FishNALM-20 pretraining corpus was built from curated fish reference genomes downloaded from public resources and processed through a unified genome preparation workflow.

According to the manuscript:

- the corpus was constructed from **20 diverse fish genomes**
- only **major chromosomes** were retained for downstream corpus construction
- non-ATCG characters were normalized to **N**
- genomes were segmented into **3,000 bp windows**
- windows were filtered and stratified according to repeat-content structure
- the total retained pretraining sequence volume after filtering was approximately **7.73 Gb**

Before final filtering, the retained major-chromosome sequence span across the 20 genomes was approximately **21.13 Gb**.

## Data characteristics

This corpus is intended for **DNA language model pretraining**, rather than for supervised labels or benchmark evaluation.

Key characteristics include:

- fish-specific genomic pretraining data with broader phylogenetic coverage
- sequence windows centered on a unified **3 kb** scale
- preprocessing designed to reduce the impact of assembly artifacts, non-primary sequences and highly imbalanced repeat composition
- compatibility with the FishNALM pretraining framework described in the manuscript

## Recommended repository structure

```text
FishNALM-20-pretrain-corpus/
├── fishnalm20_genome3000.txt
└── README.md
```

## Recommended uses

This dataset is intended for:

- self-supervised pretraining of fish genomic language models
- comparative studies of lineage breadth in genomic foundation model training
- reproducibility and documentation of the FishNALM pretraining setup

## Limitations

- This corpus was designed for **fish genome pretraining** and is not a labeled task dataset.
- It reflects the specific genome selection and preprocessing strategy used in the FishNALM manuscript.
- Transferability outside fish genomic contexts may be limited.
- Detailed species lists and assembly metadata should be read together with the manuscript supplementary materials.

## Related resources

- **Project**: FishNALM
- **GitHub**: [bioinfoihb/FishNALM](https://github.com/bioinfoihb/FishNALM)
- **Manuscript**: *FishNALM: A Foundation DNA Language Model for Fish Genomes*
- **Organization**: Institute of Hydrobiology, Chinese Academy of Sciences

## Contact

**Xiao-Qin Xia**
Institute of Hydrobiology, Chinese Academy of Sciences
Email: xqxia@ihb.ac.cn
Email: bioinfoihb@ihb.ac.cn
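
## Illustrative preprocessing sketch

Two of the construction steps listed under "Dataset construction" (normalizing non-ATCG characters to N and cutting genomes into 3,000 bp windows) can be illustrated with a short Python sketch. This is not the FishNALM pipeline itself: the function names `normalize` and `windows` are hypothetical, and uppercasing the input, using non-overlapping windows, and dropping a trailing partial window are all assumptions that the card does not confirm. Chromosome selection and repeat-content filtering are not covered.

```python
import re
from typing import Iterator

WINDOW_SIZE = 3000  # 3,000 bp windows, as stated in the dataset card

# Matches any character other than the four canonical bases.
_NON_ATCG = re.compile(r"[^ATCG]")


def normalize(seq: str) -> str:
    """Replace every non-ATCG character with N.

    Assumption: the sequence is uppercased first, so soft-masked
    (lowercase) bases are kept as bases rather than turned into N.
    """
    return _NON_ATCG.sub("N", seq.upper())


def windows(seq: str, size: int = WINDOW_SIZE) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of exactly `size` bases.

    Assumption: a trailing fragment shorter than `size` is dropped.
    """
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


if __name__ == "__main__":
    chrom = normalize("acgtRY-" * 1000)  # toy 7,000-character sequence
    chunks = list(windows(chrom))
    print(len(chunks), {len(c) for c in chunks})
```

In a real workflow these helpers would be applied per chromosome, with the resulting windows passed on to whatever repeat-content filter the manuscript describes.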

Provider:

bioinfoihb
