chandar-lab/CoPeP

Hugging Face · Updated 2026-03-02 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/chandar-lab/CoPeP
Resource description:
---
license: cc-by-4.0
pretty_name: CoPeP Continual Protein Dataset
task_categories:
- other
tags:
- protein
- biology
- continual-learning
- parquet
configs:
- config_name: default
  data_files:
  - split: train
    path: train/data-*.parquet
  - split: validation
    path: val/val.parquet
- config_name: task_splits
  data_files:
  - split: task_0
    path: splits/task_0.parquet
  - split: task_1
    path: splits/task_1.parquet
  - split: task_2
    path: splits/task_2.parquet
  - split: task_3
    path: splits/task_3.parquet
  - split: task_4
    path: splits/task_4.parquet
  - split: task_5
    path: splits/task_5.parquet
  - split: task_6
    path: splits/task_6.parquet
  - split: task_7
    path: splits/task_7.parquet
  - split: task_8
    path: splits/task_8.parquet
  - split: task_9
    path: splits/task_9.parquet
---

# CoPeP Continual Protein Dataset

This dataset is organized for continual-learning experiments on protein sequences.

## Repository

- Dataset repo id: `chandar-lab/CoPeP`

## File layout

- `train/`: 252 parquet shards (`data-00000-of-00252.parquet` ... `data-00251-of-00252.parquet`)
- `splits/`: 10 task index parquet files (`task_0.parquet` ... `task_9.parquet`)
- `val/`: validation parquet (`val.parquet`)

## Task file mapping

- `task_0.parquet` -> task_0 (2015)
- `task_1.parquet` -> task_1 (2016)
- `task_2.parquet` -> task_2 (2017)
- `task_3.parquet` -> task_3 (2018)
- `task_4.parquet` -> task_4 (2019)
- `task_5.parquet` -> task_5 (2020)
- `task_6.parquet` -> task_6 (2021)
- `task_7.parquet` -> task_7 (2022)
- `task_8.parquet` -> task_8 (2023)
- `task_9.parquet` -> task_9 (2024)

## Important note on `splits/task_*.parquet`

The `splits/task_*.parquet` files are index-style split definitions keyed by `row_idx`. They are intended to be joined with records from `train/` (or other source files) using `row_idx`, rather than treated as standalone full-example datasets.

## Migration note

Task index files are now exposed through `name="task_splits"`. The old `load_dataset(repo_id, split="task_0")` pattern is deprecated.
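The task-to-year mapping listed above follows a regular pattern (task_0 is 2015, task_9 is 2024), so it can be captured as a small lookup. The sketch below is illustrative only: `TASK_YEARS` and `split_path` are hypothetical names introduced here, not part of the dataset or of the `datasets` library.

```python
# Hypothetical helper names; the mapping itself mirrors the "Task file mapping" list.
TASK_YEARS = {f"task_{i}": 2015 + i for i in range(10)}


def split_path(task_name: str) -> str:
    """Return the repository-relative index file for a task, e.g. splits/task_0.parquet."""
    if task_name not in TASK_YEARS:
        raise ValueError(f"unknown task: {task_name!r}")
    return f"splits/{task_name}.parquet"


# Print the mapping in the same form as the list in the README.
for name, year in TASK_YEARS.items():
    print(f"{split_path(name)} -> {name} ({year})")
```

Keeping the mapping in one place can help when labelling per-task results by year in a continual-learning run.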
## Basic code: load one task index split

```python
from datasets import load_dataset

# Repo id as given in the "Repository" section.
repo_id = "chandar-lab/CoPeP"

# 1) Load train split directly (map-style dataset)
train_ds = load_dataset(repo_id, split="train")

# 2) Load one task index split via the task_splits config.
# Use streaming=True to avoid Arrow cache materialization for index files.
task0_idx = load_dataset(
    repo_id,
    name="task_splits",
    split="task_0",
    streaming=True,
)

# 3) Materialize examples by selecting train rows using row_idx.
# Dataset.select takes row positions, so this step treats row_idx
# as a position in the train split.
task0_rows = [example["row_idx"] for example in task0_idx]
task0_examples = train_ds.select(task0_rows)

print(task0_examples)
print(task0_examples[0])
```
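The README describes the task files as index definitions to be joined with `train/` records on `row_idx`. The README does not state whether `row_idx` always coincides with a row's position in the train split, so a key-based join may be the safer reading. The following is a minimal sketch on toy data, assuming each record carries a `row_idx` field; the `sequence` field and the `join_on_row_idx` helper are hypothetical and used only for illustration.

```python
# Toy stand-ins: real records would come from train/ and splits/task_*.parquet.
# The "sequence" field name is an assumption, not taken from the dataset card.
train_records = [
    {"row_idx": 10, "sequence": "MKT"},
    {"row_idx": 20, "sequence": "GAV"},
    {"row_idx": 30, "sequence": "LLP"},
    {"row_idx": 40, "sequence": "QRS"},
]
task_index = [{"row_idx": 30}, {"row_idx": 10}]


def join_on_row_idx(index_rows, records):
    """Return the records whose row_idx appears in index_rows, in index order."""
    by_key = {rec["row_idx"]: rec for rec in records}
    missing = [row["row_idx"] for row in index_rows if row["row_idx"] not in by_key]
    if missing:
        # Fail loudly rather than silently dropping unmatched index entries.
        raise KeyError(f"row_idx values not found in records: {missing}")
    return [by_key[row["row_idx"]] for row in index_rows]


task_examples = join_on_row_idx(task_index, train_records)
print(task_examples)
```

A dictionary lookup keeps the join independent of row order, at the cost of holding the key map in memory; for the full 252-shard train split, a columnar join would likely be more practical.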
Provided by:
chandar-lab