chandar-lab/CoPeP
Source: Hugging Face | Updated: 2026-03-02 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/chandar-lab/CoPeP
Resource description:
---
license: cc-by-4.0
pretty_name: CoPeP Continual Protein Dataset
task_categories:
- other
tags:
- protein
- biology
- continual-learning
- parquet
configs:
- config_name: default
  data_files:
  - split: train
    path: train/data-*.parquet
  - split: validation
    path: val/val.parquet
- config_name: task_splits
  data_files:
  - split: task_0
    path: splits/task_0.parquet
  - split: task_1
    path: splits/task_1.parquet
  - split: task_2
    path: splits/task_2.parquet
  - split: task_3
    path: splits/task_3.parquet
  - split: task_4
    path: splits/task_4.parquet
  - split: task_5
    path: splits/task_5.parquet
  - split: task_6
    path: splits/task_6.parquet
  - split: task_7
    path: splits/task_7.parquet
  - split: task_8
    path: splits/task_8.parquet
  - split: task_9
    path: splits/task_9.parquet
---
# CoPeP Continual Protein Dataset
This dataset is organized for continual-learning experiments on protein sequences.
## Repository
- Dataset repo id: `chandar-lab/CoPeP`
## File layout
- `train/`: 252 parquet shards (`data-00000-of-00252.parquet` ... `data-00251-of-00252.parquet`)
- `splits/`: 10 task index parquet files (`task_0.parquet` ... `task_9.parquet`)
- `val/`: validation parquet (`val.parquet`)
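
The shard naming pattern above can be reproduced programmatically, which may be handy when fetching individual files instead of the whole split. This is a minimal sketch: the helper name `train_shard_paths` is hypothetical, and the zero-padded five-digit format is inferred from the two file names listed above.

```python
NUM_TRAIN_SHARDS = 252  # shard count stated in the file layout


def train_shard_paths(num_shards: int = NUM_TRAIN_SHARDS) -> list[str]:
    """Build the relative paths of all train shards (hypothetical helper)."""
    return [
        f"train/data-{i:05d}-of-{num_shards:05d}.parquet"
        for i in range(num_shards)
    ]


shard_paths = train_shard_paths()
print(shard_paths[0])
print(shard_paths[-1])
```
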
## Task file mapping
- `task_0.parquet` -> task_0 (2015)
- `task_1.parquet` -> task_1 (2016)
- `task_2.parquet` -> task_2 (2017)
- `task_3.parquet` -> task_3 (2018)
- `task_4.parquet` -> task_4 (2019)
- `task_5.parquet` -> task_5 (2020)
- `task_6.parquet` -> task_6 (2021)
- `task_7.parquet` -> task_7 (2022)
- `task_8.parquet` -> task_8 (2023)
- `task_9.parquet` -> task_9 (2024)
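
The mapping above follows a simple rule (task `i` corresponds to year `2015 + i`), so it can be expressed as a small lookup. The names `TASK_YEARS` and `split_file_for_year` are illustrative and not part of the dataset or any library.

```python
# Task name -> year, as listed in the task file mapping.
TASK_YEARS = {f"task_{i}": 2015 + i for i in range(10)}


def split_file_for_year(year: int) -> str:
    """Return the index file path for a given year (hypothetical helper)."""
    for task_name, task_year in TASK_YEARS.items():
        if task_year == year:
            return f"splits/{task_name}.parquet"
    raise KeyError(f"no task split for year {year}")


print(split_file_for_year(2020))
```
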
## Important note on `splits/task_*.parquet`
The `splits/task_*.parquet` files are index-style split definitions keyed by `row_idx`.
They are intended to be joined with records from `train/` (or other source files)
using `row_idx`, rather than treated as standalone full-example datasets.
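
To make the join concrete, here is a toy sketch on in-memory data. It assumes `row_idx` is a positional index into the train records, which is how the card's own loading example uses it; the `sequence` field and the helper `materialize_task` are hypothetical stand-ins, not the dataset's actual schema.

```python
def materialize_task(train_records: list[dict], index_rows: list[dict]) -> list[dict]:
    """Return the train records referenced by an index split, in index order."""
    return [train_records[row["row_idx"]] for row in index_rows]


# Toy stand-ins for train/ records and one splits/task_*.parquet file.
train_records = [
    {"sequence": "MKT"},  # hypothetical field name
    {"sequence": "GAV"},
    {"sequence": "LLP"},
]
task_index = [{"row_idx": 2}, {"row_idx": 0}]

task_examples = materialize_task(train_records, task_index)
print(task_examples)
```
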
## Migration note
Task index files are now exposed through `name="task_splits"`.
The old `load_dataset(repo_id, split="task_0")` pattern is deprecated.
## Basic code: load one task index split
```python
from datasets import load_dataset

repo_id = "chandar-lab/CoPeP"

# 1) Load the train split directly (map-style dataset)
train_ds = load_dataset(repo_id, split="train")

# 2) Load one task index split via the task_splits config.
#    Use streaming=True to avoid Arrow cache materialization for index files.
task0_idx = load_dataset(
    repo_id,
    name="task_splits",
    split="task_0",
    streaming=True,
)

# 3) Materialize examples by selecting train rows using row_idx
task0_rows = [example["row_idx"] for example in task0_idx]
task0_examples = train_ds.select(task0_rows)

print(task0_examples)
print(task0_examples[0])
```
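
For continual-learning runs, the same pattern can be repeated over all ten tasks in chronological order. The sketch below is one possible way to structure that loop, not the authors' training pipeline: `iter_task_row_indices` and `fake_loader` are hypothetical names, and the loader is injected so the example runs offline.

```python
from typing import Callable, Iterable, Iterator

NUM_TASKS = 10  # task_0 ... task_9, per the task file mapping


def iter_task_row_indices(
    load_index_split: Callable[[str], Iterable[dict]],
    num_tasks: int = NUM_TASKS,
) -> Iterator[tuple[str, list[int]]]:
    """Yield (split_name, row indices) for each task in chronological order."""
    for task_id in range(num_tasks):
        split_name = f"task_{task_id}"
        rows = [example["row_idx"] for example in load_index_split(split_name)]
        yield split_name, rows


# Offline stand-in for the real loader so the sketch runs without network access.
def fake_loader(split_name: str) -> list[dict]:
    task_id = int(split_name.split("_")[1])
    return [{"row_idx": 2 * task_id}, {"row_idx": 2 * task_id + 1}]


schedule = list(iter_task_row_indices(fake_loader, num_tasks=3))
print(schedule)
```

With the real dataset, `load_index_split` could be a function that calls `load_dataset(repo_id, name="task_splits", split=split_name, streaming=True)`, and each `rows` list could then be passed to `train_ds.select(rows)`.
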
Provided by:
chandar-lab



