plantcad/opengenome2-metagenomes-plantcad2-c4096
Source: Hugging Face. Updated: 2026-01-08. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/plantcad/opengenome2-metagenomes-plantcad2-c4096
Resource description:
---
license: apache-2.0
tags:
- biology
- DNA
- genomics
- genetics
- metagenomics
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2638656
  - name: validation
    num_examples: 1000
  - name: test
    num_examples: 1000
---
# OpenGenome2 Metagenomes PlantCAD2 Subset (4096bp)
This dataset is a curated subset of [arcinstitute/opengenome2](https://huggingface.co/datasets/arcinstitute/opengenome2)
designed for comparative spectral analysis with plant genomic data.
## Dataset Description
Sequences were randomly sampled from OpenGenome2, filtered by length, truncated to 4096bp, and collected to match the per-split sample sizes
of the [plantcad/Angiosperm_65_genomes_8192bp](https://huggingface.co/datasets/plantcad/Angiosperm_65_genomes_8192bp) dataset (with the exception noted under Split Sizes).
### Processing Steps
1. **Streaming**: Records were streamed from the metagenomes subfolder (`json/pretraining_or_both_phases/metagenomes`) of OpenGenome2
2. **Shuffling**: Applied shuffle with buffer size of 10,000 for random sampling
3. **Filtering**: Sequences shorter than 4096bp were excluded
4. **Truncation**: Sequences ≥4096bp were truncated to exactly 4096bp
5. **Sampling**: Collected samples to match PlantCAD split sizes
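The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the script used to build the dataset: the function names (`buffered_shuffle`, `filter_and_truncate`, `collect`) and the toy records are hypothetical, and records are assumed to be dictionaries with a `text` field, as in the dataset's feature schema. With the `datasets` library, the shuffle step would more likely be done by calling `shuffle(buffer_size=10_000)` on a streamed dataset rather than by a hand-written buffer.

```python
import random
from itertools import islice
from typing import Dict, Iterable, Iterator, List

SEQ_LEN = 4096  # target sequence length in base pairs


def buffered_shuffle(
    records: Iterable[Dict[str, str]], buffer_size: int = 10_000, seed: int = 0
) -> Iterator[Dict[str, str]]:
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[Dict[str, str]] = []
    for rec in records:
        if len(buffer) < buffer_size:
            buffer.append(rec)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]  # emit a random buffered record
        buffer[idx] = rec  # and put the new record in its slot
    rng.shuffle(buffer)  # drain whatever is left, in random order
    yield from buffer


def filter_and_truncate(
    records: Iterable[Dict[str, str]], seq_len: int = SEQ_LEN
) -> Iterator[Dict[str, str]]:
    """Drop sequences shorter than seq_len and cut the rest to exactly seq_len."""
    for rec in records:
        text = rec["text"]
        if len(text) >= seq_len:
            yield {"text": text[:seq_len]}


def collect(records: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Take at most n records from the stream."""
    return list(islice(records, n))


# Toy stand-in for the streamed metagenome records (hypothetical data).
toy_records = [{"text": "ACGT" * n} for n in (512, 1024, 2048, 4096)]
sample = collect(filter_and_truncate(buffered_shuffle(toy_records, buffer_size=2)), 2)
print([len(rec["text"]) for rec in sample])
```

Filtering before truncation means short sequences are discarded rather than padded, which is what keeps every retained record at the same length.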
### Split Sizes
| Split | Number of Examples |
|-------|-------------------|
| train | 2,638,656 |
| validation | 1,000 |
| test | 1,000 |
Note: the validation and test splits of this dataset contain only 1,000 samples each, as opposed to 329,832 for those same splits in the PlantCAD2 data.
### Sequence Length
All sequences are exactly **4096 base pairs**.
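A quick way to confirm this length invariant on any collection of records is sketched below. The helper name `count_length_violations` and the toy records are hypothetical; the only assumption taken from the dataset card is that each record stores its sequence in a `text` field. The same function could be applied to a loaded split, for example the validation split.

```python
from typing import Dict, Iterable


def count_length_violations(
    records: Iterable[Dict[str, str]], expected_len: int = 4096
) -> int:
    """Count records whose sequence length differs from expected_len."""
    return sum(1 for rec in records if len(rec["text"]) != expected_len)


# Toy records (hypothetical data): one valid sequence, one too short.
toy_records = [{"text": "A" * 4096}, {"text": "ACGT"}]
print(count_length_violations(toy_records))
```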
## Source Dataset
OpenGenome2 is a database of nearly 9 trillion base pairs of curated DNA from across all domains of life,
used to train Evo 2 models. Please refer to the [Evo 2 preprint](https://www.biorxiv.org/content/early/2025/02/21/2025.02.18.638918)
for further details.
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("plantcad/opengenome2-metagenomes-plantcad2-c4096")
```
## Citation
If you use this dataset, please cite the original OpenGenome2:
```bibtex
@article{Brixi2025.02.18.638918,
author = {Brixi, Garyk and Durrant, Matthew G and Ku, Jerome and others},
title = {Genome modeling and design across all domains of life with Evo 2},
year = {2025},
doi = {10.1101/2025.02.18.638918},
journal = {bioRxiv}
}
```
## License
Apache 2.0 (inherited from OpenGenome2)
Provider:
plantcad



