
plantcad/opengenome2-metagenomes-plantcad2-c4096

Hugging Face | Updated 2026-01-08 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/plantcad/opengenome2-metagenomes-plantcad2-c4096
Resource description:
---
license: apache-2.0
tags:
- biology
- DNA
- genomics
- genetics
- metagenomics
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2638656
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
---

# OpenGenome2 Metagenomes PlantCAD2 Subset (4096bp)

This dataset is a curated subset of [arcinstitute/opengenome2](https://huggingface.co/datasets/arcinstitute/opengenome2) designed for comparative spectral analysis with plant genomic data.

## Dataset Description

Sequences were randomly sampled from OpenGenome2, filtered and truncated to match the sample sizes per split of the [plantcad/Angiosperm_65_genomes_8192bp](https://huggingface.co/datasets/plantcad/Angiosperm_65_genomes_8192bp) dataset.

### Processing Steps

1. **Streaming**: Records were streamed from the metagenomes subfolder (`json/pretraining_or_both_phases/metagenomes`) of OpenGenome2
2. **Shuffling**: Applied shuffle with buffer size of 10,000 for random sampling
3. **Filtering**: Sequences shorter than 4096bp were excluded
4. **Truncation**: Sequences ≥4096bp were truncated to exactly 4096bp
5. **Sampling**: Collected samples to match PlantCAD split sizes

### Split Sizes

| Split | Number of Examples |
|-------|-------------------|
| train | 2,638,656 |
| validation | 1,000 |
| test | 1,000 |

Note: there are only 1,000 samples in the validation and test splits of the OpenGenome2 Metagenomes data as opposed to 329,832 for those same splits in the PlantCAD2 data.

### Sequence Length

All sequences are exactly **4096 base pairs**.

## Source Dataset

OpenGenome2 is a database of nearly 9 trillion base pairs of curated DNA from across all domains of life, used to train Evo 2 models. Please refer to the [Evo 2 preprint](https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918) for further details.
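
The processing steps listed in this card (shuffle, filter, truncate, sample) can be illustrated with a small, self-contained sketch. This is not the authors' pipeline: the function names `buffered_shuffle` and `build_split`, the seed handling, and the in-memory toy input are all assumptions made for illustration. Only the 4096bp target length and the 10,000-item shuffle buffer come from the card.

```python
import random
from typing import Iterable, Iterator, List

TARGET_LEN = 4096      # sequence length stated in the dataset card
BUFFER_SIZE = 10_000   # shuffle buffer size stated in the dataset card


def buffered_shuffle(items: Iterable[str], buffer_size: int, seed: int = 0) -> Iterator[str]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    rng = random.Random(seed)
    buffer: List[str] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot for the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


def build_split(
    sequences: Iterable[str],
    n_examples: int,
    target_len: int = TARGET_LEN,
    buffer_size: int = BUFFER_SIZE,
    seed: int = 0,
) -> List[str]:
    """Collect up to n_examples sequences, each truncated to exactly target_len."""
    out: List[str] = []
    for seq in buffered_shuffle(sequences, buffer_size, seed):
        if len(out) >= n_examples:
            break
        if len(seq) < target_len:
            continue  # filtering: drop sequences shorter than the target length
        out.append(seq[:target_len])  # truncation: keep the first target_len bases
    return out
```

With a real stream, `sequences` would be an iterator over the source records rather than a list. A buffered shuffle only approximates a uniform random sample, which is the usual trade-off when the source is too large to hold in memory.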

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("plantcad/opengenome2-metagenomes-plantcad2-c4096")
```

## Citation

If you use this dataset, please cite the original OpenGenome2:

```bibtex
@article{Brixi2025.02.18.638918,
  author = {Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and others},
  title = {Genome modeling and design across all domains of life with Evo 2},
  year = {2025},
  doi = {10.1101/2025.02.18.638918},
  journal = {bioRxiv}
}
```

## License

Apache 2.0 (inherited from OpenGenome2)
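
As a follow-up to the usage snippet, the sketch below shows one way to spot-check the card's claim that every sequence is exactly 4096 base pairs. The helper names `count_length_violations` and `check_hub_split` and the `limit` parameter are hypothetical. The `text` column name comes from the card's metadata, and streaming mode is assumed so that the full train split is not downloaded.

```python
from itertools import islice
from typing import Dict, Iterable

EXPECTED_LEN = 4096  # sequence length stated in the dataset card


def count_length_violations(
    rows: Iterable[Dict[str, str]],
    expected_len: int = EXPECTED_LEN,
    column: str = "text",
) -> int:
    """Return how many rows have a sequence whose length differs from expected_len."""
    return sum(1 for row in rows if len(row[column]) != expected_len)


def check_hub_split(split: str = "validation", limit: int = 100) -> int:
    """Stream the first `limit` rows of a split from the Hub and count violations."""
    # Imported here so the pure helper above works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(
        "plantcad/opengenome2-metagenomes-plantcad2-c4096",
        split=split,
        streaming=True,
    )
    return count_length_violations(islice(ds, limit))
```

Calling `check_hub_split("validation")` requires network access. If the card's description holds, it should return 0.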
Provided by:
plantcad