andrewdalpino/Tiny-OpenGenome2
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/andrewdalpino/Tiny-OpenGenome2
---
license: apache-2.0
tags:
- genomics
- dna
- opengenome2
- genetics
size_categories:
- 1M<n<10M
dataset_info:
- config_name: midtrain
  features:
  - name: sequence
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: train
    num_bytes: 79381656661
    num_examples: 4996
  download_size: 38542629546
  dataset_size: 79381656661
- config_name: pretrain
  features:
  - name: sequence
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: train
    num_bytes: 106110166474
    num_examples: 999995
  download_size: 51542776555
  dataset_size: 106110166474
configs:
- config_name: midtrain
  data_files:
  - split: train
    path: midtrain/train-*
- config_name: pretrain
  data_files:
  - split: train
    path: pretrain/train-*
task_categories:
- text-generation
- fill-mask
pretty_name: Tiny OG2
---
# Tiny OG2 Dataset

This is a curated subset of the [OpenGenome2](https://huggingface.co/datasets/arcinstitute/opengenome2) dataset consisting of over 1 million DNA sequences totaling more than 185 billion base pair (BP) tokens across 16 categories that cover a broad spectrum of biological life. It is designed to replicate the distribution of samples used to train the [Evo 2](https://huggingface.co/arcinstitute/evo2_40b) model, but with substantially fewer training examples, which makes it well suited to knowledge distillation, rapid iteration, and academic use. It is divided into `pretrain` and `midtrain` subsets, which are suited to short- and long-context training, respectively.
## Categories
The `pretrain` and `midtrain` subsets each have their own set of categories.
### Pretrain
The `pretrain` subset contains about 106B BP tokens divided among the following categories.
| Category | Num Tokens | Sample Weight | Comment |
| --- | --- | --- | --- |
| eukaryotic_genic_windows | 90B | 35% | 5K BP stitched token windows. |
| gtdb_v220_imgpr | 3.5B | 18% | Genome Taxonomy Database + IMG/PR. |
| imgvr_untagged | 468M | 3% | IMG/VR viral sequences. |
| metagenomes | 11B | 24% | MGD database. |
| mrna | 196M | 9% | Eukaryotic mRNAs (Ensembl, NCBI). |
| mrna_splice_promoter | 312M | 9% | Stitched. |
| ncrna | 17M | 2% | RNAcentral, Rfam, Ensembl, NCBI. |
| organelle | 422M | 0.5% | Various organelles. |
| promoters | 119K | 0.02% | Eukaryotic Promoter Database new (EPDnew). |
### Midtrain
The `midtrain` subset contains roughly 80B BP tokens in long-context samples, divided among the following categories.
| Category | Num Tokens | Sample Weight | Comment |
| --- | --- | --- | --- |
| gtdb_v220_stitched | 2B | 13% | GTDB tagged as long. |
| imgpr_long | 18M | 13% | IMG/PR samples tagged as long. |
| ncbi_genomes_animalia | 43B | 40% | Full genomes. |
| ncbi_genomes_chromista | 630M | 0.9% | Full genomes. |
| ncbi_genomes_fungi | 3.6B | 4% | Full genomes. |
| ncbi_genomes_plantae | 29B | 27% | Full genomes. |
| ncbi_genomes_protista | 567M | 0.9% | Full genomes. |
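
To check how closely a loaded subset follows the Sample Weight column, one option is to tally the `category` field and compare the resulting shares against the tables above. The sketch below assumes the weights describe the fraction of samples (rows) per category, which this card does not state explicitly. `category_shares` is a hypothetical helper, not part of the dataset's code, and the toy labels stand in for a real `category` column.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(categories: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of samples that belong to each category."""
    counts = Counter(categories)
    total = sum(counts.values())

    # An empty input has no shares to report.
    if total == 0:
        return {}

    return {name: count / total for name, count in counts.items()}


# Toy labels standing in for a real `category` column (assumption).
toy_labels = ["mrna", "metagenomes", "metagenomes", "ncrna"]

shares = category_shares(toy_labels)

print(shares)
```

Once a subset has been loaded with `load_dataset()`, passing its `category` column to this helper should give the measured shares, at the cost of reading the whole column.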
## Example Usage
### Loading
To load the Tiny OpenGenome2 dataset using the [Hugging Face Datasets](https://huggingface.co/docs/datasets/index) library, refer to the examples below.
First, install the `datasets` library using your preferred package manager.
```sh
pip install datasets
```
Then call the `load_dataset()` function, specifying the subset as in the examples below.
```python
from datasets import load_dataset

# Load the pretrain subset (short-context samples).
dataset = load_dataset("andrewdalpino/Tiny-OpenGenome2", "pretrain")

# Or load the midtrain subset (long-context samples) instead.
dataset = load_dataset("andrewdalpino/Tiny-OpenGenome2", "midtrain")
```
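Both subsets are large: the dataset metadata lists download sizes of roughly 51.5 GB for `pretrain` and 38.5 GB for `midtrain`. If you only want to inspect a few samples, streaming may be more practical than a full download. The following is a minimal sketch that uses the `streaming=True` option of `load_dataset()`. The helper names `stream_subset` and `preview` are hypothetical, and the `datasets` import is deferred so that `preview` can be tried on plain dictionaries without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Tuple


def stream_subset(config_name: str):
    """Return the train split of a subset as a lazily streamed dataset."""
    # Imported here so the rest of the sketch runs without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "andrewdalpino/Tiny-OpenGenome2",
        config_name,
        split="train",
        streaming=True,
    )


def preview(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Tuple[str, int]]:
    """Summarize the first n samples as (category, sequence length) pairs."""
    return [
        (sample["category"], len(sample["sequence"]))
        for sample in islice(samples, n)
    ]


# Usage (requires network access):
# print(preview(stream_subset("pretrain")))
```

Because the stream is consumed lazily, only the data needed for the first few samples should be fetched.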
### Filtering
You can also filter the dataset's samples by sequence length or by category, as in the examples below.
```python
# Keep only sequences that are at most 8,192 base pairs long.
dataset = dataset.filter(lambda sample: len(sample["sequence"]) <= 8192)
```
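The length filter above discards longer sequences entirely. If you would rather keep that data for short-context training, one alternative (not something this card prescribes) is to cut each sequence into fixed-size windows. A minimal sketch, with `split_into_windows` as a hypothetical helper and a short toy sequence for illustration:

```python
from typing import List


def split_into_windows(sequence: str, window_size: int = 8192) -> List[str]:
    """Split a sequence into consecutive, non-overlapping windows.

    The last window may be shorter than `window_size`.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive.")

    return [
        sequence[start : start + window_size]
        for start in range(0, len(sequence), window_size)
    ]


# Toy sequence with a small window so the result is easy to inspect.
windows = split_into_windows("ACGTACGTAC", window_size=4)

print(windows)
```

Applying this across a whole dataset changes the number of rows, so it would need a batched `map()` call rather than `filter()`.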
```python
# Keep only samples that belong to the selected categories.
SELECTED_CATEGORIES = {
    "eukaryotic_genic_windows",
    "gtdb_v220_imgpr",
    "metagenomes",
}

dataset = dataset.filter(lambda sample: sample["category"] in SELECTED_CATEGORIES)
```
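The two filters can also be combined into a single pass over the data. The sketch below uses a named predicate, `keep_sample` (a hypothetical helper), so that the length limit and the category set live in one place. The values are the same ones used in the examples above.

```python
from typing import Any, Dict

MAX_LENGTH = 8192

SELECTED_CATEGORIES = {
    "eukaryotic_genic_windows",
    "gtdb_v220_imgpr",
    "metagenomes",
}


def keep_sample(sample: Dict[str, Any]) -> bool:
    """Keep samples from a selected category that are at most MAX_LENGTH long."""
    return (
        sample["category"] in SELECTED_CATEGORIES
        and len(sample["sequence"]) <= MAX_LENGTH
    )


# With a loaded subset:
# dataset = dataset.filter(keep_sample)
```

A named function is also easier to unit test on plain dictionaries than an inline lambda.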
## Code Repository
The code for this dataset can be found at [https://github.com/andrewdalpino/TinyOG2](https://github.com/andrewdalpino/TinyOG2).
## References
>- Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and Brockman, Greg and Chang, Daniel and Gonzalez, Gabriel A and King, Samuel H and Li, David B and Merchant, Aditi T and Naghipourfar, Mohsen and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W and Sun, Gwanggyu and Taghibakshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K and Adams, Etowah and Baccus, Stephen A and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Ilango, Rajesh and Janik, Ken and Lu, Amy X and Mehta, Reshma and Mofrad, Mohammad R.K. and Ng, Madelena Y and Pannu, Jaspreet and Re, Christopher and Schmok, Jonathan C and St. John, John and Sullivan, Jeremy and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Tom and Powell, Kimberly and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D and Hie, Brian, Genome modeling and design across all domains of life with Evo 2, https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918, 2025.
>- GTDB (Genome Taxonomy Database): Parks, D. H., Chuvochina, M., Rinke, C., Mussig, A. J., Chaumeil, P.-A., & Hugenholtz, P. (2022). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50(D1), D785–D794.
>- Metagenomics (MGD DB): Durrant, M. G., Perry, N. T., Pai, J. J., Jangid, A. R., Athukoralage, J. S., Hiraizumi, M., McSpedon, J. P., Pawluk, A., Nishimura, H., Konermann, S., & Hsu, P. D. (2024). Bridge RNAs direct programmable recombination of target and donor DNA. Nature, 630(8018), 984–993.

Additional data sources include the NCBI, Ensembl, IMG/VR, RNAcentral, Rfam, and EPDnew databases.

Provided by: andrewdalpino



