Huggggooo/ProtoCycle-Data

Name: Huggggooo/ProtoCycle-Data
Creator: Huggggooo
Published: 2026-04-18 23:56:20
License: 暂无描述

Hugging Face2026-04-18 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/Huggggooo/ProtoCycle-Data

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation tags: - protein-design - agentic - tool-use - reinforcement-learning language: - en size_categories: - 1K<n<10K --- # ProtoCycle-Data Training data for **ProtoCycle** — an agentic protein design model that performs multi-step, tool-augmented sequence design via reinforcement learning. See the [ProtoCycle](https://github.com/huggggoooooo/ProtoCycle) repository for code, training recipes, and evaluation. ## Dataset Structure ### SFT Data (`sft/desc2seq_agentic_sft_2000.parquet`) **2,000 multi-turn agentic trajectories** for cold-start supervised fine-tuning. | Column | Type | Description | |--------|------|-------------| | `messages` | list[dict] | Multi-turn conversation with `user`, `assistant`, and `tool` roles. The assistant uses `<think>`, `<plan>`, `<tool_call>`, and `<answer>` tags. | | `tools` | list[dict] | Tool schemas (10 biology tools: scaffold retrieval, constraint building, ESM inpainting, ProTrek scoring). | Each trajectory demonstrates the full agent protocol: the model receives a natural-language protein design requirement, reasons step-by-step, invokes biology tools across three stages (scaffold retrieval → constraint injection → refinement & scoring), and outputs a final amino-acid sequence. ### RL Data (`rl/desc2seq_agent_grpo_10000.parquet`) **10,000 prompts** for GRPO-TCR (Group Relative Policy Optimization with Tool-Call Reward) training. | Column | Type | Description | |--------|------|-------------| | `data_source` | str | Data source identifier (`ProteinDesignEval`) | | `prompt` | list[dict] | System + user prompt messages for the agent | | `ability` | str | Task type (`PROTEIN`) | | `reward_model` | dict | Ground truth and metadata for reward computation | | `agent_name` | str | Agent type (`tool_agent`) | | `requirement` | str | Natural-language protein design requirement | | `requirement_id` | int | Unique requirement identifier | ## Usage ```python from datasets import load_dataset # Load SFT data sft_data = load_dataset("Huggggooo/ProtoCycle-Data", data_dir="sft", split="train") # Load RL data rl_data = load_dataset("Huggggooo/ProtoCycle-Data", data_dir="rl", split="train") ``` Or directly with pandas: ```python import pandas as pd sft = pd.read_parquet("hf://datasets/Huggggooo/ProtoCycle-Data/sft/desc2seq_agentic_sft_2000.parquet") rl = pd.read_parquet("hf://datasets/Huggggooo/ProtoCycle-Data/rl/desc2seq_agent_grpo_10000.parquet") ``` ## Related Resources | Resource | Link | |----------|------| | ProtoCycle-7B (RL checkpoint) | [Huggggooo/ProtoCycle-7B](https://huggingface.co/Huggggooo/ProtoCycle-7B) | | ProtoCycle-7B-SFT (SFT checkpoint) | [Huggggooo/ProtoCycle-7B-SFT](https://huggingface.co/Huggggooo/ProtoCycle-7B-SFT) | | Code & Recipes | [ProtoCycle GitHub](https://github.com/huggggoooooo/ProtoCycle) | ## License Apache-2.0, consistent with the upstream [VeRL](https://github.com/volcengine/verl) / [Open-AgentRL](https://github.com/Gen-Verse/Open-AgentRL) projects.

提供机构：

Huggggooo

5,000+

优质数据集

54 个

任务类型

进入经典数据集