CapRL-2M
Source: ModelScope community (魔搭社区) | Updated: 2026-05-09 | Indexed: 2025-11-03
Download link: https://modelscope.cn/datasets/Shanghai_AI_Laboratory/CapRL-2M
# CapRL
📖<a href="https://arxiv.org/abs/2509.22647">Paper</a> | 🏠<a href="https://github.com/InternLM/CapRL">Github</a> | 🤗<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | 🤗<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a>
### CapRL Series Model & Dataset
| Series | Models & Resources |
| :--- | :--- |
| **CapRL 2.0 Series** | [🤗 CapRL-Qwen3VL-2B](https://huggingface.co/internlm/CapRL-Qwen3VL-2B) \| [🤗 CapRL-Qwen3VL-4B](https://huggingface.co/internlm/CapRL-Qwen3VL-4B) \| [📦 CapRL-Qwen3VL-2B-GGUF](https://huggingface.co/internlm/CapRL-Qwen3VL-2B-GGUF) \| [📦 CapRL-Qwen3VL-4B-GGUF](https://huggingface.co/internlm/CapRL-Qwen3VL-4B-GGUF) \| [🌈CapRL-Qwen3VL-4B Space](https://huggingface.co/spaces/yuhangzang/CapRL-Qwen3VL-4B) |
| **CapRL 1.0 Series** | [🤗 CapRL-Qwen2.5VL-3B](https://huggingface.co/internlm/CapRL-3B) \| [🤗 CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B) \| [📊 CapRL-QA-75K Dataset](https://huggingface.co/datasets/internlm/CapRL-QA-75K) \| [📊 CapRL-2M Dataset](https://huggingface.co/datasets/internlm/CapRL-2M) \| [📦 CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) \| [📦 CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) \| [🌈CapRL-Qwen2.5VL-3B Space](https://huggingface.co/spaces/yuhangzang/caprl) |
Now you can try out CapRL-Qwen2.5VL-3B with your own images🎨! ➡️ [🌈CapRL Space](https://huggingface.co/spaces/yuhangzang/caprl)
## CapRL-2M
Our CapRL-2M dataset includes images from [ShareGPT-1M](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and [DenseFusion-1M](https://huggingface.co/datasets/BAAI/DenseFusion-1M), with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.
In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
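
As a minimal sketch of how the JSONL files might be consumed, the snippet below pairs each caption with its image path under a local image directory. The field names `image` and `caption` and the `image_root` argument are assumptions for illustration, not the documented schema; inspect one line of the real files and adjust the keys accordingly.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical field names: check one line of the real JSONL and adjust.
IMAGE_KEY = "image"
CAPTION_KEY = "caption"


def iter_caption_pairs(jsonl_path: str, image_root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (image path under image_root, caption) for each non-empty JSONL line."""
    root = Path(image_root)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield root / record[IMAGE_KEY], record[CAPTION_KEY]
```

Here `image_root` would be the directory where the ShareGPT-1M or DenseFusion-1M images were downloaded. Streaming line by line avoids loading the whole file into memory.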
## 📢 News
We are working on even stronger base models and upgrading our training recipe — stay tuned!
- 🔥 [04/16/2026] We have released the **[CapRL-QA-75K](https://huggingface.co/datasets/internlm/CapRL-QA-75K)** training dataset!
- 🔥 [12/24/2025] We are excited to release the CapRL 2.0 series: **[CapRL-Qwen3VL-2B](https://huggingface.co/internlm/CapRL-Qwen3VL-2B)** and **[CapRL-Qwen3VL-4B](https://huggingface.co/internlm/CapRL-Qwen3VL-4B)**!
- 🔥 [12/24/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 17,000!
- 🔥 [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
- 🚀 [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
- 🚀 [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) is the static quants version, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) is the weighted/imatrix quants version.
- 🚀 [10/15/2025] We released the [QA curation code](https://github.com/InternLM/CapRL).
- 🚀 [09/25/2025] We released the **CapRL** repository, the [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), the [evaluation code](https://github.com/InternLM/CapRL), and the [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).
## Introduction of CapRL
We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.
This is the first study to apply Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which can lead models to memorize a limited set of annotated captions, our method allows the model to explore and generate a broader range of creative and general descriptions.
CapRL is a new training paradigm featuring a decoupled two-stage pipeline. In the first stage, an LVLM generates rich and accurate captions. In the second stage, caption quality is evaluated by having a vision-free LLM (one that sees only the caption, not the image) perform the QA task. We also built a dedicated QA curation pipeline to ensure the quality of the questions and answers used in the second stage.
By employing the CapRL training framework, initializing with the Qwen2.5-VL-3B model, and using a carefully filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-3B.
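
To illustrate the second stage, here is a rough sketch of a QA-based caption reward: the score is the fraction of multiple-choice questions answered correctly when the answerer is given only the caption. Scoring by plain accuracy, the `(question, options, answer letter)` record layout, and the names `caption_reward` and `keyword_answerer` are all assumptions made for this example; the released training code may differ.

```python
from typing import Callable, Sequence, Tuple

# (question, options, correct option letter): a hypothetical record layout.
MCQ = Tuple[str, Sequence[str], str]


def caption_reward(
    caption: str,
    questions: Sequence[MCQ],
    answer_fn: Callable[[str, str, Sequence[str]], str],
) -> float:
    """Fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = 0
    for question, options, gold in questions:
        predicted = answer_fn(caption, question, options)
        if predicted.strip().upper() == gold.strip().upper():
            correct += 1
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: Sequence[str]) -> str:
    """Toy stand-in for the LLM: pick the first option mentioned in the caption."""
    for letter, option in zip("ABCD", options):
        if option.lower() in caption.lower():
            return letter
    return "A"  # arbitrary fallback when no option is mentioned
```

In practice `answer_fn` would wrap a call to a language model; the toy answerer only exists so the sketch runs without one. A caption that mentions more of the image's verifiable content lets the answerer get more questions right, which is what makes the reward checkable.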
<p align="center">
<img src="./assets/teaser.png" alt="Main Results on GPT2" width="750"/>
</p>
<p align="center">
<img src="./assets/performance.png" alt="Main Results on GPT2" width="750"/>
</p>
## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
* **Detailed descriptions of natural images**: The outputs of CapRL-3B cover valid visual information comprehensively while containing fewer hallucinations.
## Usage
To use **CapRL-3B** for captioning, follow the same inference approach as the [Qwen2.5-VL-series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1).
We recommend using **vLLM** to speed up inference.
### Start an OpenAI API Service
Run the command below to start an OpenAI-compatible API service:
```bash
vllm serve "/PATH/CapRL-3B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu_memory_utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
```
Then call the chat API as shown below (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64

from openai import OpenAI

# Point the OpenAI client at the local vLLM API server started above.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Upper bound on generated tokens; 2048 is an assumed value, adjust as needed.
max_tokens = 2048

# Read the local image and wrap it in a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_qwen
                    },
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=max_tokens,
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
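
When captioning many images, it can help to factor the payload construction into small helpers. The sketch below builds the same message structure as the example above, but derives the MIME type from the file extension rather than using the bare `data:image` prefix. The helper names `image_to_data_url` and `build_caption_messages` and the PNG fallback are assumptions for illustration.

```python
import base64
import mimetypes
from typing import Dict, List


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"  # fallback assumption for unknown extensions
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"


def build_caption_messages(path: str, prompt: str) -> List[Dict]:
    """Build the chat messages for one image plus a text prompt."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": prompt},
            ],
        },
    ]
```

The returned list can be passed as `messages=` to `client.chat.completions.create`, and the generated text read from `chat_response.choices[0].message.content`.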
## Cases
<p align="center">
<img src="./assets/comparison.png" alt="Main Results on GPT2" width="750"/>
</p>
<p align="center">
<img src="./assets/info_caprl.png" alt="Main Results on GPT2" width="750"/>
</p>
<p align="center">
<img src="./assets/info_caprl2.png" alt="Main Results on GPT2" width="750"/>
</p>
<p align="center">
<img src="./assets/natural_caprl.png" alt="Main Results on GPT2" width="750"/>
</p>
Provider: maas
Created: 2025-10-14



