Accio-Lab/Metis-RL

Name: Accio-Lab/Metis-RL
Creator: Accio-Lab
Published: 2026-04-10 09:43:27
License: Apache-2.0 (as stated in the dataset card)

Source: Hugging Face | Updated: 2026-04-10 | Indexed: 2026-04-12

Download link:

https://hf-mirror.com/datasets/Accio-Lab/Metis-RL

Description:

---
license: apache-2.0
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- multimodal
- tool-use
- agentic
- reinforcement-learning
- vision-language
- HDPO
- meta-cognitive
size_categories:
- 1K<n<10K
---

# Metis-RL

**Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models**

Metis-RL is the **reinforcement learning training dataset** used to train the [Metis-8B-RL](https://huggingface.co/Accio-Lab/Metis-8B-RL) model via **Hierarchical Decoupled Policy Optimization (HDPO)**. It contains ~5.2K multimodal prompts spanning perception, search, and mathematical/logical reasoning tasks.

[[Paper (arXiv)]](https://arxiv.org/abs/2604.08545) | [[GitHub]](https://github.com/Accio-Lab/Metis) | [[RL Model]](https://huggingface.co/Accio-Lab/Metis-8B-RL) | [[ColdStart Model]](https://huggingface.co/Accio-Lab/Metis-8B-ColdStart) | [[ColdStart Data]](https://huggingface.co/datasets/Accio-Lab/Metis-ColdStart)

## Dataset Details

| Attribute | Value |
|---|---|
| Size | ~5.2K prompts |
| Format | Parquet |
| Modalities | Text + Image |
| Purpose | HDPO reinforcement learning for meta-cognitive tool-use optimization |
| License | Apache-2.0 |

## Dataset Composition

The RL training prompts are balanced across three task categories to cultivate diverse meta-cognitive tool-use behaviors:

| Task Category | Proportion | Description |
|---|---|---|
| Perception | 45% | Visual understanding tasks (document, chart, high-resolution image analysis) |
| Search | 36% | Tasks requiring text/image search for external knowledge |
| Math / Reasoning | 19% | Mathematical and logical reasoning with visual context |

## Data Schema

Each sample contains:

| Field | Type | Description |
|---|---|---|
| `data_source` | string | Source identifier for the training sample |
| `prompt` | list | Conversation-format prompt (system + user messages) |
| `images` | list | Associated image(s) for the multimodal query |
| `ability` | string | Task category (e.g., `math`, `perception`, `search`) |
| `reward_model` | dict | Contains `ground_truth` answer and reward `style` |
| `extra_info` | dict | Additional metadata including the original question |

## How It's Used in HDPO Training

During HDPO training, each prompt is rolled out *G* = 16 times. The dual reward system evaluates:

1. **Accuracy reward** (r_acc): Whether the agent's final answer matches the ground truth.
2. **Tool efficiency reward** (r_tool): Inverse of tool invocation count, *conditioned on correctness* (r_tool = 1/(T+1) if correct, else 0).

Advantages are estimated independently for each reward channel, enabling the model to first learn correctness, then learn efficiency.

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Accio-Lab/Metis-RL", split="train")
print(f"Number of prompts: {len(dataset)}")
print(dataset[0].keys())
```

## Training Pipeline

```
Metis-8B-ColdStart (SFT checkpoint)
    │
    ▼  HDPO with Metis-RL (~5K prompts)  ← (this dataset)
Metis-8B-RL (final model)
```

### HDPO Hyperparameters

| Hyperparameter | Value |
|---|---|
| Backbone | Qwen3-VL-8B-Instruct (via Metis-8B-ColdStart) |
| Batch size | 128 |
| Rollouts per prompt (*G*) | 16 |
| Learning rate | 1e-6 |
| KL coefficient | 0 |
| Loss weights | w_acc = 1.0, w_tool = 0.15 |
| Max response length | 16,384 tokens |

## Citation

```bibtex
@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}
```

## Acknowledgments

Metis is built upon [verl](https://github.com/volcengine/verl), [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool), and [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL).
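
## Illustrative Sketch: Dual Rewards and Per-Channel Advantages

The dual reward described under "How It's Used in HDPO Training" can be made concrete with a small, standard-library-only sketch. The two reward functions follow the formulas stated in this card (r_acc is 1 for a correct answer, r_tool = 1/(T+1) if correct, else 0). Everything else is an assumption made for illustration: the `Rollout` class, the function names, and in particular the group normalization (subtract the group mean, divide by the group standard deviation), which is a common GRPO-style choice and is not confirmed by this card as the exact HDPO estimator. The card also does not say how the loss weights w_acc and w_tool are applied, so the sketch returns the two advantage channels separately rather than combining them.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """Hypothetical container for one of the G rollouts of a prompt."""
    correct: bool    # did the final answer match the ground truth?
    tool_calls: int  # number of tool invocations T in this rollout


def accuracy_reward(r: Rollout) -> float:
    """r_acc: 1.0 if the final answer is correct, else 0.0."""
    return 1.0 if r.correct else 0.0


def tool_reward(r: Rollout) -> float:
    """r_tool: 1/(T+1) if the answer is correct, else 0.0."""
    return 1.0 / (r.tool_calls + 1) if r.correct else 0.0


def group_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style normalization within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


def channel_advantages(rollouts):
    """Estimate advantages independently for the accuracy and tool channels."""
    a_acc = group_advantages([accuracy_reward(r) for r in rollouts])
    a_tool = group_advantages([tool_reward(r) for r in rollouts])
    return a_acc, a_tool


if __name__ == "__main__":
    # A toy group of 4 rollouts (the card uses G = 16 in training).
    group = [
        Rollout(correct=True, tool_calls=0),
        Rollout(correct=True, tool_calls=3),
        Rollout(correct=False, tool_calls=1),
        Rollout(correct=False, tool_calls=0),
    ]
    a_acc, a_tool = channel_advantages(group)
    for r, a, t in zip(group, a_acc, a_tool):
        print(r, round(a, 3), round(t, 3))
```

In this toy group, the two correct rollouts receive the same accuracy advantage, while the tool channel ranks the correct rollout with no tool calls above the correct rollout with three calls. Incorrect rollouts earn no tool reward regardless of how few tools they used, which is what "conditioned on correctness" means in the reward definition.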

Provider: Accio-Lab