Yangximiao/PlantCAD2_zero_shot_tasks
Source: Hugging Face. Updated 2025-12-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Yangximiao/PlantCAD2_zero_shot_tasks
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
- sequence-modeling
language:
- en
tags:
- biology
- genomics
- dna
- plants
- zero-shot
- conservation
configs:
- config_name: acceptor_core_noncore_classification
  data_files:
  - split: test_maize
    path: acceptor_core_noncore_classification/test_maize-*
  - split: test_tomato
    path: acceptor_core_noncore_classification/test_tomato-*
- config_name: acceptor_recovery
  data_files:
  - split: test_maize
    path: acceptor_recovery/test_maize-*
  - split: test_tomato
    path: acceptor_recovery/test_tomato-*
- config_name: conservation_within_poaceae_non_tis
  data_files:
  - split: test
    path: conservation_within_poaceae_non_tis/test-*
- config_name: conservation_within_andropogoneae
  data_files:
  - split: test
    path: conservation_within_andropogoneae/test-*
- config_name: conservation_within_poaceae_tis
  data_files:
  - split: test
    path: conservation_within_poaceae_tis/test-*
- config_name: structural_variant_effect_prediction
  data_files:
  - split: test
    path: structural_variant_effect_prediction/test-*
- config_name: donor_core_noncore_classification
  data_files:
  - split: test_maize
    path: donor_core_noncore_classification/test_maize-*
  - split: test_tomato
    path: donor_core_noncore_classification/test_tomato-*
- config_name: donor_recovery
  data_files:
  - split: test_maize
    path: donor_recovery/test_maize-*
  - split: test_tomato
    path: donor_recovery/test_tomato-*
- config_name: tis_core_noncore_classification
  data_files:
  - split: test_maize
    path: tis_core_noncore_classification/test_maize-*
  - split: test_tomato
    path: tis_core_noncore_classification/test_tomato-*
- config_name: tis_recovery
  data_files:
  - split: test_maize
    path: tis_recovery/test_maize-*
  - split: test_tomato
    path: tis_recovery/test_tomato-*
- config_name: tts_core_noncore_classification
  data_files:
  - split: test_maize
    path: tts_core_noncore_classification/test_maize-*
  - split: test_tomato
    path: tts_core_noncore_classification/test_tomato-*
- config_name: tts_recovery
  data_files:
  - split: test_maize
    path: tts_recovery/test_maize-*
  - split: test_tomato
    path: tts_recovery/test_tomato-*
dataset_info:
- config_name: acceptor_core_noncore_classification
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test_maize
    num_bytes: 1185888200
    num_examples: 144550
  - name: test_tomato
    num_bytes: 1152260004
    num_examples: 140451
  download_size: 778025325
  dataset_size: 2338148204
- config_name: acceptor_recovery
  features:
  - name: sequence
    dtype: string
  splits:
  - name: test_maize
    num_bytes: 1261110324
    num_examples: 153869
  - name: test_tomato
    num_bytes: 1151169180
    num_examples: 140455
  download_size: 810355602
  dataset_size: 2412279504
- config_name: conservation_within_poaceae_non_tis
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test
    num_bytes: 1506951740
    num_examples: 183685
  download_size: 575528297
  dataset_size: 1506951740
- config_name: conservation_within_poaceae_tis
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test
    num_bytes: 300775048
    num_examples: 36662
  download_size: 139559957
  dataset_size: 300775048
- config_name: donor_core_noncore_classification
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test_maize
    num_bytes: 1185888200
    num_examples: 144550
  - name: test_tomato
    num_bytes: 1152260004
    num_examples: 140451
  download_size: 780485274
  dataset_size: 2338148204
- config_name: donor_recovery
  features:
  - name: sequence
    dtype: string
  splits:
  - name: test_maize
    num_bytes: 1261110324
    num_examples: 153869
  - name: test_tomato
    num_bytes: 1151177376
    num_examples: 140456
  download_size: 812414184
  dataset_size: 2412287700
- config_name: tis_core_noncore_classification
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test_maize
    num_bytes: 298699436
    num_examples: 36409
  - name: test_tomato
    num_bytes: 291061512
    num_examples: 35478
  download_size: 259785500
  dataset_size: 589760948
- config_name: tis_recovery
  features:
  - name: sequence
    dtype: string
  splits:
  - name: test_maize
    num_bytes: 319930860
    num_examples: 39035
  - name: test_tomato
    num_bytes: 290826864
    num_examples: 35484
  download_size: 269457351
  dataset_size: 610757724
- config_name: tts_core_noncore_classification
  features:
  - name: sequence
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: test_maize
    num_bytes: 298699436
    num_examples: 36409
  - name: test_tomato
    num_bytes: 291053308
    num_examples: 35477
  download_size: 260010533
  dataset_size: 589752744
- config_name: tts_recovery
  features:
  - name: sequence
    dtype: string
  splits:
  - name: test_maize
    num_bytes: 319930860
    num_examples: 39035
  - name: test_tomato
    num_bytes: 290818668
    num_examples: 35483
  download_size: 269872033
  dataset_size: 610749528
---
# 🌱 PlantCAD2 Zero-Shot Tasks
Zero-shot evaluation tasks for **plant genomics** using **PlantCAD2**.
This dataset contains tasks designed to evaluate model performance *without task-specific training*.

---
## 📂 Available Tasks
### 🔬 Cross-species Evolutionary Conservation
| Task Name | Description | Samples | Metric |
|-----------|-------------|---------|--------|
| `conservation_within_andropogoneae` | Predict conserved vs non-conserved sites using alignments within 35 Andropogoneae genomes | 19,030 vs 19,030 | AUROC |
| `conservation_within_poaceae_non_tis` | Predict conserved vs non-conserved coding sites (excluding TIS) within Poaceae | 103,368 vs 80,317 | AUROC |
| `conservation_within_poaceae_tis` | Predict conserved vs non-conserved TIS sites | 26,650 vs 10,012 | AUROC |
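
The conservation tasks above are scored with AUROC. As a minimal sketch of how that metric could be computed from zero-shot scores, the pure-Python function below uses the rank-based (Mann-Whitney U) formulation. The `toy_scores` values are hypothetical model outputs; this card does not specify how PlantCAD2 derives a per-sequence score, so that step is left out.

```python
def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U); tied scores share their average rank."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    # Indices sorted by ascending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    # Assign 1-based ranks, averaging over each run of tied scores.
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Hypothetical toy data: label 1 = positive, 0 = negative.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
print(auroc(toy_labels, toy_scores))  # 3 of 4 positive/negative pairs are ordered correctly
```

Because AUROC depends only on the ranking of scores, any monotonic transformation of the model output gives the same value.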
### 🧬 Key Junction Recovery
| Task Name | Description | Samples | Metric |
|-----------|-------------|---------|--------|
| `tis_recovery` | Recover masked **ATG start codon** (maize) | 39,035 | Accuracy |
| `tts_recovery` | Recover masked **TAG/TAA/TGA stop codon** (maize) | 39,035 | Accuracy |
| `donor_recovery` | Recover masked **GT splice donor motif** (maize) | 153,869 | Accuracy |
| `acceptor_recovery` | Recover masked **AG splice acceptor motif** (maize) | 153,869 | Accuracy |
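
The recovery tasks are scored by accuracy: a prediction is correct when the model restores the masked motif. The sketch below only covers the scoring step. It assumes that any motif listed for a task counts as a correct recovery (for example, any of the three stop codons for `tts_recovery`), which this card does not confirm; an evaluation could instead require the exact motif at each site. The `EXPECTED_MOTIFS` mapping and `recovery_accuracy` helper are illustrative names, not part of the dataset or of PlantCAD2.

```python
# Motifs named in the task table above.
EXPECTED_MOTIFS = {
    "tis_recovery": {"ATG"},
    "tts_recovery": {"TAG", "TAA", "TGA"},
    "donor_recovery": {"GT"},
    "acceptor_recovery": {"AG"},
}


def recovery_accuracy(task, predicted_motifs):
    """Fraction of predictions that equal an expected motif for `task`."""
    if not predicted_motifs:
        raise ValueError("no predictions to score")
    allowed = EXPECTED_MOTIFS[task]
    hits = sum(1 for motif in predicted_motifs if motif.upper() in allowed)
    return hits / len(predicted_motifs)


# Hypothetical model outputs for four masked stop codons; "TGG" is not a stop codon.
print(recovery_accuracy("tts_recovery", ["TAG", "TGA", "TGG", "taa"]))
```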
### 🌽 Within-species Conservation (Maize)
| Task Name | Description | Samples | Metric |
|-----------|-------------|---------|--------|
| `tis_core_noncore_classification` | Predict **core TIS vs non-core TIS** | 28,291 vs 8,118 | AUROC |
| `tts_core_noncore_classification` | Predict **core TTS vs non-core TTS** | 28,291 vs 8,118 | AUROC |
| `donor_core_noncore_classification` | Predict **core splice donor vs non-core splice donor** | 123,183 vs 21,367 | AUROC |
| `acceptor_core_noncore_classification` | Predict **core splice acceptor vs non-core splice acceptor** | 123,183 vs 21,367 | AUROC |
### 🧩 Structural Variant Effect
| Task Name | Description | Samples | Metric |
|-----------|-------------|---------|--------|
| `structural_variant_effect_prediction` | Predict **conserved deletions vs non-conserved deletions** | 7,662 vs 10,413 | AUPRC |
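
This task is scored with AUPRC rather than AUROC. Unlike AUROC, the score expected from a random ranking is not 0.5 but the fraction of positive examples. The sketch below computes average precision, a common estimator of AUPRC, together with that random baseline. It assumes conserved deletions are the positive class, which the card does not state, and it does not treat tied scores specially.

```python
def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranked by descending score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_pos


# Sample counts from the table above (assumption: conserved = positive).
n_conserved, n_non_conserved = 7662, 10413
random_baseline = n_conserved / (n_conserved + n_non_conserved)
print(f"Random-ranking AUPRC baseline: {random_baseline:.4f}")

# Hypothetical toy scores to show the function in use.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

Reported AUPRC values for this task are best read against the baseline of roughly 0.42 rather than against 0.5.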
---
## 📑 Data Format
| Task Type | Fields | Description |
|-----------|--------|-------------|
| **Classification** | `sequence` | DNA sequence (string) |
| | `label` | Binary label: `0 = negative`, `1 = positive` |
| **Recovery** | `sequence` | DNA sequence (string) |
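
The card does not state how long each `sequence` is. As a rough back-of-the-envelope check, dividing `num_bytes` by `num_examples` from the `dataset_info` metadata gives the average stored size of one example. The figures below are copied from that metadata; reading the result as a sequence length plus a few bytes of storage overhead is an assumption, not something the card confirms.

```python
# (num_bytes, num_examples) copied from the dataset_info metadata of this card.
SPLIT_SIZES = {
    ("acceptor_core_noncore_classification", "test_maize"): (1185888200, 144550),
    ("conservation_within_poaceae_tis", "test"): (300775048, 36662),
    ("acceptor_recovery", "test_maize"): (1261110324, 153869),
    ("tis_recovery", "test_tomato"): (290826864, 35484),
}

# Average stored size of one example in each split.
bytes_per_example = {
    key: num_bytes / num_examples
    for key, (num_bytes, num_examples) in SPLIT_SIZES.items()
}

for (config, split), size in bytes_per_example.items():
    print(f"{config}/{split}: {size:.1f} bytes per example")
```

For these splits the classification configs come out at 8,204 bytes per example and the recovery configs at 8,196. That would be consistent with fixed-length sequences of roughly 8 kb, with the `label` field accounting for the difference, but it should be verified against the data itself.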
---
## 📊 Data Splits
| Split | Description |
|-------|-------------|
| `test` | Single test split (conservation and structural variant configs) |
| `test_maize` | Test data for maize (*Zea mays*) |
| `test_tomato` | Test data for tomato (*Solanum lycopersicum*) |
---
## 🚀 Usage Example
```python
from datasets import load_dataset, get_dataset_config_names

# List all available tasks
tasks = get_dataset_config_names("plantcad/PlantCAD2_zero_shot_tasks")
print("Available tasks:", tasks)

# Example: Classification task
classification_data = load_dataset("plantcad/PlantCAD2_zero_shot_tasks", "conservation_within_poaceae_tis")
test_split = classification_data['test']
print(f"Test samples: {len(test_split)}")
print(f"Sample: {test_split[0]}")

# Example: TIS recovery task
recovery_data = load_dataset("plantcad/PlantCAD2_zero_shot_tasks", "tis_recovery")
if 'test_maize' in recovery_data:
    maize_data = recovery_data['test_maize']
    print(f"Maize recovery samples: {len(maize_data)}")
```
Provided by:
Yangximiao



