
plantcad/vertebrate-genomes-plantcad2-c4096

Source: Hugging Face. Updated 2026-01-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/plantcad/vertebrate-genomes-plantcad2-c4096
Resource description:
---
license: apache-2.0
tags:
- biology
- DNA
- genomics
- genetics
- vertebrates
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2638656
  - name: validation
    num_examples: 329832
  - name: test
    num_examples: 329832
---

# Vertebrate Genomes PlantCAD2 Subset (4096bp)

This dataset is a curated subset of [emarro/vertebrate_genomes](https://huggingface.co/datasets/emarro/vertebrate_genomes) designed for comparative spectral analysis with plant genomic data.

## Dataset Description

Sequences were randomly sampled from Vertebrate Genomes (revision 9703952e2c90c822ea8a96c9638b584ccaf36d4e), truncated to match the sample sizes per split of the [plantcad/Angiosperm_65_genomes_8192bp](https://huggingface.co/datasets/plantcad/Angiosperm_65_genomes_8192bp) dataset.

### Processing Steps

1. **Streaming**: Records were streamed from the standard train/validation/test splits
2. **Shuffling**: Applied shuffle with buffer size of 10,000 for random sampling
3. **Truncation**: All sequences (originally 12kbp) were truncated to exactly 4096bp
4. **Sampling**: Collected samples to match PlantCAD split sizes

### Split Sizes

| Split | Number of Examples |
|-------|-------------------|
| train | 2,638,656 |
| validation | 329,832 |
| test | 329,832 |

### Sequence Length

All sequences are exactly **4096 base pairs**.

## Source Dataset

Vertebrate Genomes contains DNA sequences from vertebrate species, with all sequences being 12,000 base pairs in length. This subset uses revision `9703952e2c90c822ea8a96c9638b584ccaf36d4e`.

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("plantcad/vertebrate-genomes-plantcad2-c4096")
```

## License

Apache 2.0
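
The processing steps described above (buffered shuffle, truncation, then collecting a fixed number of samples) can be illustrated with a minimal pure-Python sketch. This is not the authors' script: the README does not say which library calls were used or which 4096 bp window of each 12,000 bp sequence was kept. The function names `buffered_shuffle` and `build_split`, the `seed` parameter, and the choice to keep the first 4096 bases are all assumptions made for illustration.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List


def buffered_shuffle(records: Iterable[str], buffer_size: int, seed: int = 0) -> Iterator[str]:
    """Approximately shuffle a stream using a fixed-size buffer (hypothetical helper)."""
    rng = random.Random(seed)
    buffer: List[str] = []
    for record in records:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(record)
            continue
        # Buffer is full: emit a random element and put the new record in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = record
    # Stream exhausted: emit the remaining records in random order.
    rng.shuffle(buffer)
    yield from buffer


def build_split(
    records: Iterable[str],
    n_examples: int,
    seq_len: int = 4096,
    buffer_size: int = 10_000,
    seed: int = 0,
) -> List[str]:
    """Shuffle, truncate, and collect n_examples sequences (hypothetical helper).

    Assumption: truncation keeps the first seq_len bases of each sequence.
    """
    shuffled = buffered_shuffle(records, buffer_size, seed)
    truncated = (seq[:seq_len] for seq in shuffled)
    return list(islice(truncated, n_examples))


if __name__ == "__main__":
    # Toy stand-in for the streamed 12,000 bp source records.
    rng = random.Random(1)
    toy_stream = ("".join(rng.choice("ACGT") for _ in range(12_000)) for _ in range(50))
    sample = build_split(toy_stream, n_examples=20, buffer_size=8)
    print(len(sample), {len(s) for s in sample})
```

With the real data, `records` would be the `text` field streamed from one split of the source dataset, and `n_examples` would be the matching split size from the table above (for example 2,638,656 for `train`).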
Provider:
plantcad