opengenome2
Source: ModelScope community (updated 2026-01-08, indexed 2025-09-27)
Download link: https://modelscope.cn/datasets/arcinstitute/opengenome2
Description:
# OpenGenome2
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/649aee789fc303937a045f6a/Ma6udvzIkeAypUXcH-FJR.jpeg" width="70%" />
</p>
OpenGenome2 is a database of nearly 9 trillion base pairs of curated DNA from across all domains of life. Collected from diverse species and public data sources, OpenGenome2 was used to train the Evo 2 models. Please refer to the [Evo 2 preprint](https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1) or the [GitHub repository](https://github.com/ArcInstitute/evo2) for further details and usage examples.
We provide OpenGenome2 in two formats, and the dataset is organized into two main directories to reflect this:
- **fasta**, which contains the DNA sequences
- **jsonl**, which contains the preprocessed sequences used for Evo 2 pretraining, with additions such as special tokens and phylogenetic tags
This dataset was curated and preprocessed specifically for training the Evo 2 family of genomic language models. It can be used for model training or for other bioinformatics work.
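
As a rough illustration of working with the two formats, the sketch below reads FASTA records and JSONL records using only the Python standard library. The in-memory samples and the `"text"` field name are assumptions made for this example; the actual file names and JSONL schema should be checked against the dataset files themselves.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in handle:
        if line.strip():
            yield json.loads(line)


# Hypothetical in-memory samples standing in for real files.
# The "text" key is an assumed field name, not a documented schema.
fasta_sample = io.StringIO(">seq1 example\nACGT\nACGT\n>seq2\nTTGA\n")
jsonl_sample = io.StringIO('{"text": "ACGTACGT"}\n{"text": "TTGA"}\n')

fasta_records = list(read_fasta(fasta_sample))
jsonl_records = list(read_jsonl(jsonl_sample))
total_bp = sum(len(seq) for _, seq in fasta_records)
print(f"{len(fasta_records)} FASTA records, {total_bp} bp")
print(f"{len(jsonl_records)} JSONL records")
```

For real files, the same functions can be given a handle from `open(path)`, or from `gzip.open(path, "rt")` if the files are compressed. Both readers stream one record at a time, which matters at this dataset's scale.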
## Dataset Statistics
- **Total size**: 8.8 trillion base pairs
- **Coverage**: All domains of life (Bacteria, Archaea, Eukaryota, Viruses)
- **Formats available**: FASTA, JSONL
## Data Sources
The dataset combines sequences from various public databases and repositories:
- **Prokaryotic genomes**: GTDBv220, IMG/PR
- **Metagenomics**: MGD DB
- **Viral sequences**: IMG/VR
- **Eukaryotic data**: NCBI, Ensembl (from which we identified mRNAs and genomic windows)
- **Eukaryotic elements**: Eukaryotic Promoter Database new (EPDnew)
- **RNA sequences**: RNAcentral, Rfam
- **Organellar genomes**: Various organelles
## Training Data Composition
Evo 2 is trained on OpenGenome2 in two stages: first pretraining on a focused dataset at a shorter sequence length, then context extension at a longer sequence length, with more full genomes and special tags.
### Phase 1: Pretraining
| Dataset | Number of Tokens (billions) | Composition | Evo 2 Dataloader Weight |
|---------|------------------------------|-------------|-----------------|
| GTDBv220 + IMG/PR | 351 | 18.93% | 18.00% |
| Metagenomics (MGD DB) | 854 | 46.06% | 24.00% |
| IMG/VR | 34 | 1.83% | 3.00% |
| Euk mRNA stitched | 99 | 5.34% | 9.00% |
| Eukaryotic mRNAs (Ensembl, NCBI) | 89 | 4.80% | 9.00% |
| Euk 5kb windows stitched | 405 | 21.84% | 35.00% |
| Organelles | 3 | 0.16% | 0.50% |
| ncRNA (RNAcentral, Rfam, Ensembl, NCBI) | 19 | 1.02% | 2.00% |
| Eukaryotic Promoter Database new (EPDnew) | 0.11 | 0.01% | 0.02% |
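
The Composition column appears to be each dataset's share of the total token count. A minimal sketch, assuming that interpretation, recomputes the Phase 1 shares from the token counts in the table. The dictionary keys are shortened labels chosen for this example.

```python
# Phase 1 token counts in billions, copied from the table above.
phase1_tokens_billions = {
    "GTDBv220 + IMG/PR": 351,
    "Metagenomics (MGD DB)": 854,
    "IMG/VR": 34,
    "Euk mRNA stitched": 99,
    "Eukaryotic mRNAs (Ensembl, NCBI)": 89,
    "Euk 5kb windows stitched": 405,
    "Organelles": 3,
    "ncRNA": 19,
    "EPDnew": 0.11,
}


def composition_percent(tokens):
    """Return each dataset's share of the total token count, in percent."""
    total = sum(tokens.values())
    return {name: round(100 * count / total, 2) for name, count in tokens.items()}


phase1_total = sum(phase1_tokens_billions.values())
phase1_shares = composition_percent(phase1_tokens_billions)

print(f"Total: {phase1_total:.2f} billion tokens")
for name, share in phase1_shares.items():
    print(f"{name}: {share:.2f}%")
```

The same function can be applied to the Phase 2 token counts to sanity-check that table's Composition column.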
### Phase 2: Context Extension
| Dataset | Number of Tokens (billions) | Composition | Evo 2 Dataloader Weight |
|---------|------------------------------|-------------|-----------------|
| TAGGED/Long: GTDBv220 + IMG/PR | 351 | 4.08% | 24.00% |
| Metagenomics (MGD DB) | 854 | 9.93% | 5.00% |
| TAGGED/Long: IMG/VR | 34 | 0.40% | 2.00% |
| ncRNA (RNAcentral, Rfam, Ensembl, NCBI) | 19 | 0.22% | 1.00% |
| Eukaryotic Promoter Database new (EPDnew) | 0.11 | 0.00% | 0.01% |
| Organelles | 3 | 0.03% | 0.25% |
| Euk mRNA stitched | 99 | 1.15% | 4.50% |
| Eukaryotic mRNAs (Ensembl, NCBI) | 89 | 1.04% | 4.50% |
| Euk 5kb windows stitched | 405 | 4.71% | 5.00% |
| Tagged/Long: NCBI Eukaryote: Animalia | 4,907 | 57.06% | 36.00% |
| Tagged/Long: NCBI Eukaryote: Plantae | 1,652 | 19.21% | 12.00% |
| Tagged/Long: NCBI Eukaryote: Fungi | 156 | 1.81% | 4.00% |
| Tagged/Long: NCBI Eukaryote: Protista | 17 | 0.20% | 0.80% |
| Tagged/Long: NCBI Eukaryote: Chromista | 13 | 0.15% | 0.80% |
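
The dataloader weights differ from the raw token counts, which suggests that sources are up- or down-weighted at sampling time. The sketch below shows one way to draw source names in proportion to the Phase 2 weights using Python's `random.choices`. It is an illustration only, not the Evo 2 dataloader. The listed weights sum to 99.86%, so the sketch normalizes them first. The seed and the number of draws are arbitrary choices for this example.

```python
import random
from collections import Counter

# Phase 2 dataloader weights in percent, copied from the table above.
phase2_weights = {
    "TAGGED/Long: GTDBv220 + IMG/PR": 24.00,
    "Metagenomics (MGD DB)": 5.00,
    "TAGGED/Long: IMG/VR": 2.00,
    "ncRNA (RNAcentral, Rfam, Ensembl, NCBI)": 1.00,
    "Eukaryotic Promoter Database new (EPDnew)": 0.01,
    "Organelles": 0.25,
    "Euk mRNA stitched": 4.50,
    "Eukaryotic mRNAs (Ensembl, NCBI)": 4.50,
    "Euk 5kb windows stitched": 5.00,
    "Tagged/Long: NCBI Eukaryote: Animalia": 36.00,
    "Tagged/Long: NCBI Eukaryote: Plantae": 12.00,
    "Tagged/Long: NCBI Eukaryote: Fungi": 4.00,
    "Tagged/Long: NCBI Eukaryote: Protista": 0.80,
    "Tagged/Long: NCBI Eukaryote: Chromista": 0.80,
}


def normalize(weights):
    """Rescale weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n_draws, seed=0):
    """Draw source names with probability proportional to their weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    probs = normalize(weights)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=n_draws)


draws = sample_sources(phase2_weights, n_draws=10_000)
counts = Counter(draws)
for name, count in counts.most_common(3):
    print(f"{name}: {count / len(draws):.3f}")
```

With enough draws, the empirical frequencies approach the normalized weights, so a small source such as Fungi is sampled at about 4% even though it holds a much smaller share of the tokens.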
## Citation
If you use OpenGenome2 in your research, please cite:
```bibtex
@article{Brixi2025.02.18.638918,
author = {Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and Brockman, Greg and Chang, Daniel and Gonzalez, Gabriel A and King, Samuel H and Li, David B and Merchant, Aditi T and Naghipourfar, Mohsen and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W and Sun, Gwanggyu and Taghibakshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K and Adams, Etowah and Baccus, Stephen A and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Ilango, Rajesh and Janik, Ken and Lu, Amy X and Mehta, Reshma and Mofrad, Mohammad R.K. and Ng, Madelena Y and Pannu, Jaspreet and Re, Christopher and Schmok, Jonathan C and St. John, John and Sullivan, Jeremy and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Tom and Powell, Kimberly and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D and Hie, Brian},
title = {Genome modeling and design across all domains of life with Evo 2},
elocation-id = {2025.02.18.638918},
year = {2025},
doi = {10.1101/2025.02.18.638918},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918},
eprint = {https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918.full.pdf},
journal = {bioRxiv}
}
```
OpenGenome2 incorporates data from multiple public databases. Please also cite the original data sources as appropriate, and refer to the [Evo 2 preprint](https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1) for further details.
**GTDB (Genome Taxonomy Database):**
Parks, D. H., Chuvochina, M., Rinke, C., Mussig, A. J., Chaumeil, P.-A., & Hugenholtz, P. (2022). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. *Nucleic Acids Research*, 50(D1), D785–D794.
**Metagenomics (MGD DB):**
Durrant, M. G., Perry, N. T., Pai, J. J., Jangid, A. R., Athukoralage, J. S., Hiraizumi, M., McSpedon, J. P., Pawluk, A., Nishimura, H., Konermann, S., & Hsu, P. D. (2024). Bridge RNAs direct programmable recombination of target and donor DNA. *Nature*, 630(8018), 984–993.
Additional data sources include NCBI, Ensembl, IMG/VR, RNAcentral, Rfam, and EPDnew databases.
## License
Apache 2.0
Provider: maas
Created: 2025-02-22