CapRL-2M

Name: CapRL-2M
Creator: maas
Published: 2026-05-09 22:00:58
License: No description available

ModelScope community: updated 2026-05-09, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/Shanghai_AI_Laboratory/CapRL-2M

Resource description:

# CapRL

📖<a href="https://arxiv.org/abs/2509.22647">Paper</a> | 🏠<a href="https://github.com/InternLM/CapRL">Github</a> | 🤗<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | 🤗<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a>

### CapRL Series Model & Dataset

| Series | Models & Resources |
| :--- | :--- |
| **CapRL 2.0 Series** | [🤗 CapRL-Qwen3VL-2B](https://huggingface.co/internlm/CapRL-Qwen3VL-2B) \| [🤗 CapRL-Qwen3VL-4B](https://huggingface.co/internlm/CapRL-Qwen3VL-4B) \| [📦 CapRL-Qwen3VL-2B-GGUF](https://huggingface.co/internlm/CapRL-Qwen3VL-2B-GGUF) \| [📦 CapRL-Qwen3VL-4B-GGUF](https://huggingface.co/internlm/CapRL-Qwen3VL-4B-GGUF) \| [🌈CapRL-Qwen3VL-4B Space](https://huggingface.co/spaces/yuhangzang/CapRL-Qwen3VL-4B) |
| **CapRL 1.0 Series** | [🤗 CapRL-Qwen2.5VL-3B](https://huggingface.co/internlm/CapRL-3B) \| [🤗 CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B) \| [📊 CapRL-QA-75K Dataset](https://huggingface.co/datasets/internlm/CapRL-QA-75K) \| [📊 CapRL-2M Dataset](https://huggingface.co/datasets/internlm/CapRL-2M) \| [📦 CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) \| [📦 CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) \| [🌈CapRL-Qwen2.5VL-3B Space](https://huggingface.co/spaces/yuhangzang/caprl) |

Now you can try out CapRL-Qwen2.5VL-3B with your own images🎨! ➡️ [🌈CapRL Space](https://huggingface.co/spaces/yuhangzang/caprl)

## CapRL-2M

Our CapRL-2M dataset includes images from [ShareGPT-1M](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and [DenseFusion-1M](https://huggingface.co/datasets/BAAI/DenseFusion-1M), with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.

In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
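As a rough illustration of how the JSONL annotations might be consumed, the sketch below reads one record per line and joins each image path onto a local directory holding the downloaded ShareGPT-1M or DenseFusion-1M images. The function name `iter_captions` and the field names `image` and `caption` are assumptions made for illustration; check the actual keys in the released files before relying on them.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_captions(
    jsonl_path: str,
    image_root: str,
    image_key: str = "image",      # assumed field name, verify against the files
    caption_key: str = "caption",  # assumed field name, verify against the files
) -> Iterator[Tuple[Path, str]]:
    """Yield (local image path, caption) pairs from a JSONL annotation file."""
    root = Path(image_root)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            record = json.loads(line)
            # Image paths in the annotations are treated as relative to image_root.
            yield root / record[image_key], record[caption_key]
```

Because the function is a generator, a loop such as `for path, caption in iter_captions(annotation_file, image_dir): ...` streams the records one at a time instead of loading all 2M into memory.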
## 📢 News

We are working on even stronger base models and upgrading our training recipe; stay tuned!

- 🔥 [04/16/2026] We have released the **[CapRL-QA-75K](https://huggingface.co/datasets/internlm/CapRL-QA-75K)** training dataset!
- 🔥 [12/24/2025] We are excited to release the CapRL 2.0 series: **[CapRL-Qwen3VL-2B](https://huggingface.co/internlm/CapRL-Qwen3VL-2B)** and **[CapRL-Qwen3VL-4B](https://huggingface.co/internlm/CapRL-Qwen3VL-4B)**!
- 🔥 [12/24/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 17,000!
- 🔥 [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
- 🚀 [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
- 🚀 [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) is the static quants version, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) is the weighted/imatrix quants version.
- 🚀 [10/15/2025] We released the [QA curation code](https://github.com/InternLM/CapRL).
- 🚀 [09/25/2025] We released the **CapRL** repository, the [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), the [evaluation code](https://github.com/InternLM/CapRL) and the [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).

## Introduction of CapRL

We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. This is the first study to apply Reinforcement Learning with Verifiable Rewards to the open-ended and subjective image captioning task.
Unlike traditional Supervised Fine-Tuning, which can lead models to memorize a limited set of annotated captions, our method allows the model to explore and generate a broader range of creative and general descriptions.

CapRL is a new training paradigm featuring a decoupled two-stage pipeline. In the first stage, an LVLM generates rich and accurate captions. In the second stage, caption quality is evaluated by having a vision-free LLM, which sees only the caption and not the image, perform the QA task. We also created a dedicated QA curation pipeline to ensure the quality of the questions and answers used in the second stage.

By applying the CapRL training framework, initializing from the Qwen2.5-VL-3B model, and training on a carefully filtered 75K QA dataset, we obtained a highly capable captioner, CapRL-3B.

<img src="./assets/teaser.png" alt="Main Results on GPT2" width="750"/>
<img src="./assets/performance.png" alt="Main Results on GPT2" width="750"/>

## Key Features

* **Remarkable visual understanding for Chart, Infographics and Document**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
* **Detailed description for natural images**: The outputs of CapRL-3B cover the valid visual information thoroughly while containing fewer hallucinations.

## Usage

If you want to use **CapRL-3B** for captioning, you can follow exactly the same inference approach as in the [Qwen2.5-VL-series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).

We recommend using **vLLM** to speed up inference.
### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-3B" \
    --trust-remote-code \
    --tensor-parallel-size=1 \
    --pipeline-parallel-size=1 \
    --gpu_memory_utilization=0.95 \
    --served-model-name=caprl \
    --port 8000 \
    --host 0.0.0.0
```

Then you can use the chat API as below (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):

```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Maximum number of tokens to generate; 1024 is an assumed value, adjust as needed.
max_tokens = 1024

# Read the local image and embed it in the request as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",  # must match --served-model-name above
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_qwen
                    },
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=max_tokens,
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```

## Cases

<img src="./assets/comparison.png" alt="Main Results on GPT2" width="750"/>
<img src="./assets/info_caprl.png" alt="Main Results on GPT2" width="750"/>
<img src="./assets/info_caprl2.png" alt="Main Results on GPT2" width="750"/>
<img src="./assets/natural_caprl.png" alt="Main Results on GPT2" width="750"/>
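The request in the Usage section builds its data URL with the generic prefix `data:image;base64,`. As a possible refinement, the sketch below wraps the same message layout in a helper that guesses the MIME type from the file name using the standard-library `mimetypes` module. The function name `build_caption_messages`, the default prompt, and the PNG fallback are assumptions for illustration and are not part of the CapRL release.

```python
import base64
import mimetypes


def build_caption_messages(image_path: str,
                           prompt: str = "Describe the image in detail."):
    """Return OpenAI-style chat messages that embed a local image as a data URL."""
    # Guess the MIME type from the file extension; fall back to PNG (an assumption).
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"

    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        },
    ]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create(model="caprl", ...)`, with the other sampling parameters unchanged from the Usage example.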


Provider: maas

Created: 2025-10-14
