LockOnN/chart-think-with-images-sft
Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.
Download link: https://hf-mirror.com/datasets/LockOnN/chart-think-with-images-sft
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- n<1K
task_categories:
- visual-question-answering
- image-to-text
tags:
- chart-understanding
- tool-use
- think-with-images
- sft
- crop
- code-interpreter
- multimodal-reasoning
- chart-qa
dataset_info:
  features:
  - name: messages
    dtype: string
  - name: image
    dtype: string
  - name: image_width
    dtype: int64
  - name: image_height
    dtype: int64
  - name: category
    dtype: string
  - name: source_dataset
    dtype: string
  - name: tools_used
    dtype: string
  splits:
  - name: train
    num_examples: 643
---
# 📊 Chart Think-with-Images SFT Dataset
A synthesized SFT dataset for training multimodal LLMs to perform **tool-augmented reasoning** on chart images. The model learns to use `crop` (image region zooming) and `code_interpreter` (Python code execution) tools during chain-of-thought reasoning about charts.
## Overview
| Property | Value |
|----------|-------|
| **Total Examples** | 643 |
| **Source Datasets** | [ChartQA](https://huggingface.co/datasets/ahmed-masry/ChartQA) (76.4%), [CharXiv](https://huggingface.co/datasets/princeton-nlp/CharXiv) (23.6%) |
| **Format** | ChatML messages (system + user + assistant) |
| **Tools** | `crop`, `code_interpreter` |
| **Language** | English |
| **License** | Apache 2.0 |
## Motivation
Recent research on **"Think with Images"** (e.g., [Thyme](https://arxiv.org/abs/2508.11630), [CodeVision](https://arxiv.org/abs/2512.03746)) suggests that letting VLMs call tools during reasoning can substantially improve chart understanding. This dataset provides SFT training trajectories specifically for the chart domain, teaching models:
1. **When to crop**: Zoom into legends, axes, or specific data regions to read them more reliably
2. **When to compute**: Use Python for arithmetic (percentages, sums, comparisons)
3. **When to combine tools**: Crop to extract values → Code to compute answers
4. **Error recovery**: Recognize and fix incorrect initial crops
## Category Distribution
| Category | Count | % | Tools Used | Description |
|----------|-------|---|------------|-------------|
| `spatial_detail` | 216 | 33.6% | `crop` | Questions requiring zooming into chart regions |
| `visual_lookup` | 151 | 23.5% | none | Simple visual identification (no tools needed) |
| `calculation` | 138 | 21.5% | `code_interpreter` | Arithmetic computation on chart values |
| `multi_tool` | 75 | 11.7% | `crop` + `code_interpreter` | Complex questions needing both tools |
| `error_correction` | 63 | 9.8% | `crop` | Multi-turn self-correction trajectories |
## Tool Usage
| Tool | Examples | % |
|------|----------|---|
| `crop` only | 279 | 43.4% |
| `code_interpreter` only | 138 | 21.5% |
| Both tools | 75 | 11.7% |
| No tools | 151 | 23.5% |
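The table above can be recomputed from the `tools_used` column. The following is a minimal sketch, assuming each value is a JSON-encoded list of tool names (as in the card's example record) and that examples without tools store an empty list; the `tool_bucket` helper and the sample values are hypothetical.

```python
import json
from collections import Counter


def tool_bucket(tools_used: str) -> str:
    """Map a JSON-encoded `tools_used` string to one of the table's row labels."""
    tools = set(json.loads(tools_used))
    if tools == {"crop", "code_interpreter"}:
        return "both"
    if tools == {"crop"}:
        return "crop only"
    if tools == {"code_interpreter"}:
        return "code_interpreter only"
    if not tools:
        return "no tools"
    return "other"


# Hypothetical `tools_used` values, shaped like the field in the card's example record.
sample = ['["crop"]', '["crop", "code_interpreter"]', "[]", '["code_interpreter"]', '["crop"]']
counts = Counter(tool_bucket(value) for value in sample)

for label, count in counts.items():
    print(f"{label}: {count} ({count / len(sample):.1%})")
```

Applied to the full `train` split, the same bucketing should reproduce the counts in the table if the assumptions about the field hold.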
## Data Format
Each example contains ChatML-format messages with tool-augmented reasoning:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a chart analysis assistant with visual reasoning capabilities..."
    },
    {
      "role": "user",
      "content": "<image>\nWhat is the sum of values in 2020 and 2021?\n\n### User Image Path: chart.png\n### User Image Size: 800x557\n\nOutput Format:\n<think>reasoning + optional tool code</think>\n<answer>final answer</answer>"
    },
    {
      "role": "assistant",
      "content": "<think>\nTo answer this question, I need to first zoom into the chart to read specific values.\n\n<code>\n```python\n# Crop the data area for value extraction\ncropped = crop(\"chart.png\", 120, 80, 680, 470)\n```\n</code>\n<sandbox_output>\n[Cropped image showing the data area: Values are now clearly readable.]\n</sandbox_output>\n\nFrom the cropped view, I can read: 2020 = 45, 2021 = 62.\n\n<code>\n```python\nvalue_2020 = 45\nvalue_2021 = 62\ntotal = value_2020 + value_2021\nprint(f\"Total: {total}\")\n```\n</code>\n<sandbox_output>\nTotal: 107\n</sandbox_output>\n\nThe sum is 107.\n</think>\n<answer>107</answer>"
    }
  ],
  "image": "<base64-encoded PNG>",
  "image_width": 800,
  "image_height": 557,
  "category": "multi_tool",
  "source_dataset": "ChartQA",
  "tools_used": "[\"crop\", \"code_interpreter\"]"
}
```
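Because `messages` and `image` are stored as strings, a record has to be decoded before use. The sketch below assumes `messages` is a JSON string and `image` is a base64-encoded PNG, as the schema and example suggest. The helpers `png_size` and `decode_record` are hypothetical, and the sample record contains only a PNG header rather than a real chart.

```python
import base64
import json
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def decode_record(record: dict) -> tuple[list, bytes]:
    """Parse the JSON-encoded messages and decode the base64 image bytes."""
    messages = json.loads(record["messages"])
    image_bytes = base64.b64decode(record["image"])
    if png_size(image_bytes) != (record["image_width"], record["image_height"]):
        raise ValueError("image size does not match image_width/image_height")
    return messages, image_bytes


# Hypothetical record: only a PNG header (signature + IHDR fields), not a viewable image.
header = PNG_SIGNATURE + struct.pack(">I4sII", 13, b"IHDR", 800, 557)
record = {
    "messages": json.dumps([{"role": "user", "content": "<image>\nWhat is the sum?"}]),
    "image": base64.b64encode(header).decode("ascii"),
    "image_width": 800,
    "image_height": 557,
}
messages, image_bytes = decode_record(record)
print(messages[0]["role"], png_size(image_bytes))
```

Cross-checking the decoded header against `image_width` and `image_height` is an optional sanity check, not something the card prescribes.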
### Tool Format
**crop tool:**
````text
<code>
```python
cropped = crop("chart.png", x1, y1, x2, y2)  # pixel coordinates
```
</code>
<sandbox_output>
[Cropped image description]
</sandbox_output>
````
**code_interpreter tool:**
````text
<code>
```python
value_a = 45
value_b = 62
result = (value_b - value_a) / value_a * 100
print(f"{result:.1f}%")
```
</code>
<sandbox_output>
37.8%
</sandbox_output>
````
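The tool format above is regular enough to parse with a regular expression. The following is a rough sketch, assuming every `<code>` block wraps a fenced Python snippet and is immediately followed by a `<sandbox_output>` block. The `extract_tool_calls` helper is hypothetical, and labeling a call as `crop` when the code contains `crop(` is a heuristic, not part of the dataset.

```python
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block

# One tool call: <code> wrapping a fenced Python snippet, followed by <sandbox_output>.
TOOL_CALL = re.compile(
    r"<code>\s*" + FENCE + r"python\n(?P<code>.*?)" + FENCE + r"\s*</code>\s*"
    r"<sandbox_output>\n?(?P<output>.*?)\n?</sandbox_output>",
    re.DOTALL,
)


def extract_tool_calls(content: str) -> list[dict]:
    """Return each tool call's code and sandbox output, in order of appearance."""
    calls = []
    for match in TOOL_CALL.finditer(content):
        code = match.group("code").strip()
        # Heuristic: a call to crop(...) marks the crop tool, anything else is code.
        tool = "crop" if re.search(r"\bcrop\(", code) else "code_interpreter"
        calls.append({"tool": tool, "code": code, "output": match.group("output").strip()})
    return calls


content = (
    "<think>\n<code>\n" + FENCE + "python\n"
    'cropped = crop("chart.png", 120, 80, 680, 470)\n' + FENCE + "\n</code>\n"
    "<sandbox_output>\n[Cropped image description]\n</sandbox_output>\n"
    "<code>\n" + FENCE + "python\nprint(45 + 62)\n" + FENCE + "\n</code>\n"
    "<sandbox_output>\n107\n</sandbox_output>\n</think>\n<answer>107</answer>"
)
for call in extract_tool_calls(content):
    print(call["tool"], "->", call["output"])
```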
## Training Usage
### With TRL SFTTrainer
```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
import json

# Load dataset
dataset = load_dataset("LockOnN/chart-think-with-images-sft", split="train")


# Parse messages from JSON string
def parse_messages(example):
    example["messages"] = json.loads(example["messages"])
    return example


dataset = dataset.map(parse_messages)

# Configure SFT (following Thyme recipe)
config = SFTConfig(
    output_dir="./chart-think-sft",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    warmup_ratio=0.05,
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-org/chart-think-model",
)
```
### Key Training Notes (from Thyme paper)
- **Mask sandbox outputs**: Do NOT train the model to predict `<sandbox_output>` content
- **Math annealing**: Train image operations first, then fine-tune on math code at a 10× lower learning rate
- **Recommended base model**: Qwen2.5-VL-7B-Instruct (strong spatial and chart understanding)
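To illustrate the first note, the sketch below masks sandbox outputs at the label level. It assumes a tokenizer that reports character offsets per token; a toy whitespace splitter stands in for it here. The helper names are hypothetical, and masking the `<sandbox_output>` tags along with their content is a design choice rather than something the card specifies.

```python
import re

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

SANDBOX = re.compile(r"<sandbox_output>.*?</sandbox_output>", re.DOTALL)


def sandbox_spans(text: str) -> list[tuple[int, int]]:
    """Character spans (start, end) covering each sandbox output block, tags included."""
    return [match.span() for match in SANDBOX.finditer(text)]


def mask_labels(input_ids, offsets, spans):
    """Copy input_ids to labels, ignoring tokens that overlap any masked span."""
    labels = []
    for token_id, (start, end) in zip(input_ids, offsets):
        overlaps = any(start < s_end and end > s_start for s_start, s_end in spans)
        labels.append(IGNORE_INDEX if overlaps else token_id)
    return labels


# Toy whitespace "tokenizer" standing in for a real one that returns character offsets.
text = "total is <sandbox_output> 107 </sandbox_output> so 107"
offsets = [match.span() for match in re.finditer(r"\S+", text)]
input_ids = list(range(len(offsets)))  # placeholder ids

labels = mask_labels(input_ids, offsets, sandbox_spans(text))
print(labels)
```

With a real tokenizer, the same overlap test can be applied to its offset mapping so that only the model's own reasoning and code contribute to the loss.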
## Synthesis Pipeline
Based on the methodology from:
- **[Thyme](https://arxiv.org/abs/2508.11630)** (Kwai, 2025): Tool-use SFT format, 3-category curriculum
- **[ChartM³](https://arxiv.org/abs/2511.02415)** (Nov 2025): Chart-specific 4-dimension QA taxonomy
- **[CodeVision](https://arxiv.org/abs/2512.03746)** (ByteDance, 2025): Multi-tool composition patterns
### Pipeline Steps:
1. Load chart images + QA pairs from ChartQA and CharXiv
2. Classify questions by difficulty → assign tool categories
3. Generate tool-augmented reasoning trajectories (template-based or VLM-assisted)
4. Validate that all trajectories have the correct `<think>`/`<answer>` structure
5. Format as ChatML messages and export
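The validation in step 4 could look roughly like the sketch below. It is not the pipeline's actual check: `is_valid_trajectory` is a hypothetical helper that assumes one `<think>` block followed by one non-empty `<answer>` block, and that `<code>` and `<sandbox_output>` tags come in matching numbers.

```python
import re

# Exactly one <think>...</think> block followed by one <answer>...</answer> block.
TRAJECTORY = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)
STRUCTURE_TAGS = ("<think>", "</think>", "<answer>", "</answer>")
TOOL_TAGS = ("<code>", "</code>", "<sandbox_output>", "</sandbox_output>")


def is_valid_trajectory(content: str) -> bool:
    """Check the <think>/<answer> layout and that tool tags appear in equal numbers."""
    match = TRAJECTORY.fullmatch(content)
    if match is None:
        return False
    think, answer = match.group("think"), match.group("answer")
    if not answer.strip():
        return False
    # Reject nested or repeated structural tags inside either block.
    if any(tag in think + answer for tag in STRUCTURE_TAGS):
        return False
    # Assumption: every <code> block is paired with one <sandbox_output> block.
    return len({think.count(tag) for tag in TOOL_TAGS}) == 1


good = "<think>\nThe bar for 2020 reads 45.\n</think>\n<answer>45</answer>"
bad = "<think>no closing tag<answer>45</answer>"
print(is_valid_trajectory(good), is_valid_trajectory(bad))
```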
### Scaling with VLM API
For higher-quality trajectories, set the following environment variables to use a VLM teacher:
```bash
export VLM_API_BASE="https://api.together.xyz/v1"
export VLM_API_KEY="your-key"
export VLM_MODEL="Qwen/Qwen2.5-VL-72B-Instruct"
python synthesize_chart_think_with_images.py --use-vlm-api --push-to-hub
```
## Limitations
- **Template-based generation**: The current version uses template-based trajectory synthesis. VLM-assisted generation (with `--use-vlm-api`) produces higher-quality reasoning chains.
- **Crop coordinates are heuristic**: Without actual OCR/detection, crop coordinates are estimated based on common chart layouts.
- **Code in calculation trajectories**: The Python code uses plausible placeholder values rather than actual extracted chart values.
- **Scale**: 643 examples is a cold-start set. Production training typically needs far more data; the cited works report roughly 5K (CodeVision) to 500K (Thyme) examples.
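Since the crop coordinates are heuristic, it can help to sanity-check each box against `image_width` and `image_height` before training on it. The sketch below is one way to do that; `clamp_crop_box` is a hypothetical helper that assumes pixel coordinates with the origin at the top-left corner.

```python
def clamp_crop_box(box, image_width, image_height):
    """Clamp (x1, y1, x2, y2) to the image bounds; return None if the box is empty."""
    x1, y1, x2, y2 = box
    x1, x2 = max(0, min(x1, image_width)), max(0, min(x2, image_width))
    y1, y2 = max(0, min(y1, image_height)), max(0, min(y2, image_height))
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)


# Box and image size taken from the card's example record.
print(clamp_crop_box((120, 80, 680, 470), 800, 557))
# Hypothetical box that extends past the right and bottom edges.
print(clamp_crop_box((700, 500, 900, 600), 800, 557))
```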
## Citation
If you use this dataset, please cite the source datasets and methodological papers:
```bibtex
@article{thyme2025,
  title={Thyme: Think Beyond Images},
  author={Zhang et al.},
  journal={arXiv preprint arXiv:2508.11630},
  year={2025}
}
@article{chartqa2022,
  title={ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning},
  author={Masry et al.},
  journal={ACL Findings},
  year={2022}
}
@article{charxiv2024,
  title={CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs},
  author={Wang et al.},
  journal={NeurIPS},
  year={2024}
}
```
Provider: LockOnN



