plantcad/vertebrate-genomes-plantcad2-c4096
Source: Hugging Face. Updated 2026-01-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/plantcad/vertebrate-genomes-plantcad2-c4096
Resource description:
---
license: apache-2.0
tags:
- biology
- DNA
- genomics
- genetics
- vertebrates
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2638656
  - name: validation
    num_examples: 329832
  - name: test
    num_examples: 329832
---
# Vertebrate Genomes PlantCAD2 Subset (4096bp)
This dataset is a curated subset of [emarro/vertebrate_genomes](https://huggingface.co/datasets/emarro/vertebrate_genomes)
designed for comparative spectral analysis with plant genomic data.
## Dataset Description
Sequences were randomly sampled from Vertebrate Genomes (revision 9703952e2c90c822ea8a96c9638b584ccaf36d4e),
with the number of examples in each split capped to match the per-split sample counts of the
[plantcad/Angiosperm_65_genomes_8192bp](https://huggingface.co/datasets/plantcad/Angiosperm_65_genomes_8192bp) dataset.
### Processing Steps
1. **Streaming**: Records were streamed from the standard train/validation/test splits.
2. **Shuffling**: A shuffle with a buffer size of 10,000 was applied for random sampling.
3. **Truncation**: All sequences (originally 12,000 bp) were truncated to exactly 4096 bp.
4. **Sampling**: Samples were collected until each split matched the corresponding PlantCAD split size.
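
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the script used to build the dataset: `buffered_shuffle` and `build_split` are hypothetical helper names, the seed is arbitrary, and keeping the *first* 4096 bases of each record is an assumption, since this card does not say which window was kept. With the `datasets` library, the streaming counterparts of the shuffle and sampling steps would be `IterableDataset.shuffle(buffer_size=...)` and `IterableDataset.take(n)`.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List

TARGET_LEN = 4096      # truncation length stated in this card
BUFFER_SIZE = 10_000   # shuffle buffer size stated in this card


def buffered_shuffle(records: Iterable, buffer_size: int, seed: int = 0) -> Iterator:
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)  # the seed is an assumption, chosen for reproducibility
    buffer = []
    for rec in records:
        if len(buffer) < buffer_size:
            buffer.append(rec)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]       # emit a random buffered record...
        buffer[idx] = rec       # ...and put the incoming record in its slot
    rng.shuffle(buffer)         # drain whatever is left once the stream ends
    yield from buffer


def build_split(
    records: Iterable[str],
    n_examples: int,
    target_len: int = TARGET_LEN,
    buffer_size: int = BUFFER_SIZE,
    seed: int = 0,
) -> List[str]:
    """Shuffle, truncate, and collect up to n_examples sequences."""
    shuffled = buffered_shuffle(records, buffer_size, seed)
    # Assumption: truncation keeps the first target_len bases of each record.
    truncated = (seq[:target_len] for seq in shuffled)
    return list(islice(truncated, n_examples))


# Toy stand-in for a stream of 12,000 bp source records.
toy_records = ("ACGT" * 3000 for _ in range(50))
toy_split = build_split(toy_records, n_examples=10, buffer_size=8)
print(len(toy_split), len(toy_split[0]))
```

A buffered shuffle only mixes records that sit within roughly one buffer of each other in the stream, so the result is an approximation of a uniform random sample rather than an exact one.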
### Split Sizes
| Split | Number of Examples |
|-------|-------------------|
| train | 2,638,656 |
| validation | 329,832 |
| test | 329,832 |
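
As a quick sanity check on the table above, the following sketch computes each split's share of the data and the total number of base pairs, assuming every example is exactly 4096 bp as stated in this card. The variable names are illustrative.

```python
# Split sizes copied from the table in this card.
SPLIT_SIZES = {"train": 2_638_656, "validation": 329_832, "test": 329_832}
SEQ_LEN_BP = 4096  # sequence length stated in this card

total_examples = sum(SPLIT_SIZES.values())
total_bp = total_examples * SEQ_LEN_BP
fractions = {name: n / total_examples for name, n in SPLIT_SIZES.items()}

for name, frac in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]:,} examples ({frac:.1%})")
print(f"total: {total_examples:,} examples, {total_bp:,} bp")
```

The counts work out to an 80/10/10 train/validation/test split, about 13.5 billion base pairs in total.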
### Sequence Length
All sequences are exactly **4096 base pairs**.
## Source Dataset
Vertebrate Genomes contains DNA sequences from vertebrate species, with all sequences being 12,000 base pairs in length.
This subset uses revision `9703952e2c90c822ea8a96c9638b584ccaf36d4e`.
## Usage
```python
from datasets import load_dataset

# Loads all three splits (train, validation, test); each record has a single "text" field.
dataset = load_dataset("plantcad/vertebrate-genomes-plantcad2-c4096")
```
## License
Apache 2.0
Provider:
plantcad



