
andrewdalpino/Tiny-OpenGenome2

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/andrewdalpino/Tiny-OpenGenome2
---
license: apache-2.0
tags:
- genomics
- dna
- opengenome2
- genetics
size_categories:
- 1M<n<10M
dataset_info:
- config_name: midtrain
  features:
  - name: sequence
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: train
    num_bytes: 79381656661
    num_examples: 4996
  download_size: 38542629546
  dataset_size: 79381656661
- config_name: pretrain
  features:
  - name: sequence
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: train
    num_bytes: 106110166474
    num_examples: 999995
  download_size: 51542776555
  dataset_size: 106110166474
configs:
- config_name: midtrain
  data_files:
  - split: train
    path: midtrain/train-*
- config_name: pretrain
  data_files:
  - split: train
    path: pretrain/train-*
task_categories:
- text-generation
- fill-mask
pretty_name: Tiny OG2
---

# Tiny OG2 Dataset

![Tiny OG2 Banner](https://raw.githubusercontent.com/andrewdalpino/TinyOG2/master/docs/images/tiny-og2-banner.png)

This is a curated subset of the [OpenGenome2](https://huggingface.co/datasets/arcinstitute/opengenome2) dataset consisting of over 1 million DNA sequences with over 185 billion base pair (BP) tokens across 16 categories covering a broad spectrum of biological life. It is designed to replicate the distribution of samples used to train the [Evo2](https://huggingface.co/arcinstitute/evo2_40b) model but with substantially fewer training examples, which makes it well suited to knowledge distillation, rapid iteration, and academic use. It is divided into `pretrain` and `midtrain` subsets, which are suited to short-context and long-context training respectively.

## Categories

The `pretrain` and `midtrain` subsets each have their own set of categories.

### Pretrain

The pretrain subset contains about 106B BP tokens divided over the following categories.

| Category | Num Tokens | Sample Weight | Comment |
| --- | --- | --- | --- |
| eukaryotic_genic_windows | 90B | 35% | 5K BP stitched token windows. |
| gtdb_v220_imgpr | 3.5B | 18% | Genome Taxonomy Database + IMG/PR. |
| imgvr_untagged | 468M | 3% | IMG/VR viral sequences. |
| metagenomes | 11B | 24% | MGD database. |
| mrna | 196M | 9% | Eukaryotic mRNAs (Ensembl, NCBI). |
| mrna_splice_promoter | 312M | 9% | Stitched. |
| ncrna | 17M | 2% | RNAcentral, Rfam, Ensembl, NCBI. |
| organelle | 422M | 0.5% | Various organelles. |
| promoters | 119K | 0.02% | Eukaryotic Promoter Database new (EPDnew). |

### Midtrain

The midtrain subset contains roughly 80B BP tokens in long-context samples.

| Category | Num Tokens | Sample Weight | Comment |
| --- | --- | --- | --- |
| gtdb_v220_stitched | 2B | 13% | GTDB tagged as long. |
| imgpr_long | 18M | 13% | IMG/PR samples tagged as long. |
| ncbi_genomes_animalia | 43B | 40% | Full genomes. |
| ncbi_genomes_chromista | 630M | 0.9% | Full genomes. |
| ncbi_genomes_fungi | 3.6B | 4% | Full genomes. |
| ncbi_genomes_plantae | 29B | 27% | Full genomes. |
| ncbi_genomes_protista | 567M | 0.9% | Full genomes. |

## Example Usage

### Loading

To load the Tiny OpenGenome2 dataset with the [Hugging Face Datasets](https://huggingface.co/docs/datasets/index) library, refer to the examples below.

First, install the `datasets` library using your favorite package manager.

```sh
pip install datasets
```

Then call the `load_dataset()` function, passing the name of the subset as the second argument.

```python
from datasets import load_dataset

# Load the pretrain subset.
dataset = load_dataset("andrewdalpino/Tiny-OpenGenome2", "pretrain")

# Load the midtrain subset.
dataset = load_dataset("andrewdalpino/Tiny-OpenGenome2", "midtrain")
```

### Filtering

You can also filter the samples of the dataset, either by sequence length or by category, as in the examples below.

```python
# Keep only sequences of at most 8,192 characters.
dataset = dataset.filter(lambda sample: len(sample["sequence"]) <= 8192)
```

```python
SELECTED_CATEGORIES = {
    "eukaryotic_genic_windows",
    "gtdb_v220_imgpr",
    "metagenomes",
}

# Keep only samples whose category is in the selected set.
dataset = dataset.filter(lambda sample: sample["category"] in SELECTED_CATEGORIES)
```

## Code Repository

The code for this dataset can be found at [https://github.com/andrewdalpino/TinyOG2](https://github.com/andrewdalpino/TinyOG2).

## References

>- Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and Poli, Michael and Brockman, Greg and Chang, Daniel and Gonzalez, Gabriel A and King, Samuel H and Li, David B and Merchant, Aditi T and Naghipourfar, Mohsen and Nguyen, Eric and Ricci-Tam, Chiara and Romero, David W and Sun, Gwanggyu and Taghibakshi, Ali and Vorontsov, Anton and Yang, Brandon and Deng, Myra and Gorton, Liv and Nguyen, Nam and Wang, Nicholas K and Adams, Etowah and Baccus, Stephen A and Dillmann, Steven and Ermon, Stefano and Guo, Daniel and Ilango, Rajesh and Janik, Ken and Lu, Amy X and Mehta, Reshma and Mofrad, Mohammad R.K. and Ng, Madelena Y and Pannu, Jaspreet and Re, Christopher and Schmok, Jonathan C and St. John, John and Sullivan, Jeremy and Zhu, Kevin and Zynda, Greg and Balsam, Daniel and Collison, Patrick and Costa, Anthony B. and Hernandez-Boussard, Tina and Ho, Eric and Liu, Ming-Yu and McGrath, Tom and Powell, Kimberly and Burke, Dave P. and Goodarzi, Hani and Hsu, Patrick D and Hie, Brian, Genome modeling and design across all domains of life with Evo 2, https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918, 2025.
>- GTDB (Genome Taxonomy Database): Parks, D. H., Chuvochina, M., Rinke, C., Mussig, A. J., Chaumeil, P.-A., & Hugenholtz, P. (2022). GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research, 50(D1), D785–D794.
>- Metagenomics (MGD DB): Durrant, M. G., Perry, N. T., Pai, J. J., Jangid, A. R., Athukoralage, J. S., Hiraizumi, M., McSpedon, J. P., Pawluk, A., Nishimura, H., Konermann, S., & Hsu, P. D. (2024). Bridge RNAs direct programmable recombination of target and donor DNA. Nature, 630(8018), 984–993.

Additional data sources include the NCBI, Ensembl, IMG/VR, RNAcentral, Rfam, and EPDnew databases.
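
As a rough way to compare a loaded split against the per-category token counts in the tables above, the sketch below tallies base pairs per category over any iterable of samples. It assumes each sample exposes the `sequence` and `category` fields listed in the dataset metadata, and that one character of `sequence` corresponds to one base pair token. The function name `base_pairs_per_category` and the toy rows are illustrative only; they are not part of the dataset or its code repository.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def base_pairs_per_category(samples: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Sum sequence lengths (in characters) for each category."""
    totals: Counter = Counter()

    for sample in samples:
        # Assumption: one character of `sequence` equals one base pair.
        totals[sample["category"]] += len(sample["sequence"])

    return dict(totals)


# Toy stand-in rows with the same two fields as the dataset (hypothetical data).
toy_samples = [
    {"sequence": "ACGTACGT", "category": "mrna"},
    {"sequence": "GGCC", "category": "ncrna"},
    {"sequence": "TTAA", "category": "mrna"},
]

totals = base_pairs_per_category(toy_samples)
print(totals)
```

Because the function only iterates over its input, it should also accept a split loaded with `load_dataset("andrewdalpino/Tiny-OpenGenome2", "pretrain", split="train", streaming=True)`, which iterates over samples without first downloading the whole subset.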
Provider:
andrewdalpino