LockOnN/chart-think-with-images-sft

Name: LockOnN/chart-think-with-images-sft
Creator: LockOnN
Published: 2026-04-28 05:14:31
License: no description available

Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/LockOnN/chart-think-with-images-sft

Description:

---
language:
- en
license: apache-2.0
size_categories:
- n<1K
task_categories:
- visual-question-answering
- image-to-text
tags:
- chart-understanding
- tool-use
- think-with-images
- sft
- crop
- code-interpreter
- multimodal-reasoning
- chart-qa
dataset_info:
  features:
  - name: messages
    dtype: string
  - name: image
    dtype: string
  - name: image_width
    dtype: int64
  - name: image_height
    dtype: int64
  - name: category
    dtype: string
  - name: source_dataset
    dtype: string
  - name: tools_used
    dtype: string
  splits:
  - name: train
    num_examples: 643
---

# 📊 Chart Think-with-Images SFT Dataset

A synthesized SFT dataset for training multimodal LLMs to perform **tool-augmented reasoning** on chart images. The model learns to use the `crop` (image region zooming) and `code_interpreter` (Python code execution) tools during chain-of-thought reasoning about charts.

## Overview

| Property | Value |
|----------|-------|
| **Total Examples** | 643 |
| **Source Datasets** | [ChartQA](https://huggingface.co/datasets/ahmed-masry/ChartQA) (76.4%), [CharXiv](https://huggingface.co/datasets/princeton-nlp/CharXiv) (23.6%) |
| **Format** | ChatML messages (system + user + assistant) |
| **Tools** | `crop`, `code_interpreter` |
| **Language** | English |
| **License** | Apache 2.0 |

## Motivation

Recent research on **"Think with Images"** (e.g., [Thyme](https://arxiv.org/abs/2508.11630), [CodeVision](https://arxiv.org/abs/2512.03746)) shows that enabling VLMs to use tools during reasoning dramatically improves chart understanding. This dataset provides SFT training trajectories specifically for the chart domain, teaching models:

1. **When to crop**: Zoom into legends, axes, or specific data regions for better reading
2. **When to compute**: Use Python for arithmetic (percentages, sums, comparisons)
3. **When to combine tools**: Crop to extract values → Code to compute answers
4. **Error recovery**: Recognize and fix incorrect initial crops

## Category Distribution

| Category | Count | % | Tools Used | Description |
|----------|-------|---|------------|-------------|
| `spatial_detail` | 216 | 33.6% | `crop` | Questions requiring zooming into chart regions |
| `visual_lookup` | 151 | 23.5% | none | Simple visual identification (no tools needed) |
| `calculation` | 138 | 21.5% | `code_interpreter` | Arithmetic computation on chart values |
| `multi_tool` | 75 | 11.7% | `crop` + `code_interpreter` | Complex questions needing both tools |
| `error_correction` | 63 | 9.8% | `crop` + `code_interpreter` | Multi-turn self-correction trajectories |

## Tool Usage

| Tool | Examples | % |
|------|----------|---|
| `crop` only | 279 | 43.4% |
| `code_interpreter` only | 138 | 21.5% |
| Both tools | 75 | 11.7% |
| No tools | 151 | 23.5% |

## Data Format

Each example contains ChatML-format messages with tool-augmented reasoning. Note that `messages` and `tools_used` are stored as JSON-encoded strings (see the `dtype: string` entries in the metadata), so they must be decoded before use:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a chart analysis assistant with visual reasoning capabilities..."
    },
    {
      "role": "user",
      "content": "<image>\nWhat is the sum of values in 2020 and 2021?\n\n### User Image Path: chart.png\n### User Image Size: 800x557\n\nOutput Format:\n<think>reasoning + optional tool code</think>\n<answer>final answer</answer>"
    },
    {
      "role": "assistant",
      "content": "<think>\nTo answer this question, I need to first zoom into the chart to read specific values.\n\n<code>\n```python\n# Crop the data area for value extraction\ncropped = crop(\"chart.png\", 120, 80, 680, 470)\n```\n</code>\n<sandbox_output>\n[Cropped image showing the data area: Values are now clearly readable.]\n</sandbox_output>\n\nFrom the cropped view, I can read: 2020 = 45, 2021 = 62.\n\n<code>\n```python\nvalue_2020 = 45\nvalue_2021 = 62\ntotal = value_2020 + value_2021\nprint(f\"Total: {total}\")\n```\n</code>\n<sandbox_output>\nTotal: 107\n</sandbox_output>\n\nThe sum is 107.\n</think>\n<answer>107</answer>"
    }
  ],
  "image": "<base64-encoded PNG>",
  "image_width": 800,
  "image_height": 557,
  "category": "multi_tool",
  "source_dataset": "ChartQA",
  "tools_used": "[\"crop\", \"code_interpreter\"]"
}
```

### Tool Format

**crop tool:**

````text
<code>
```python
cropped = crop("chart.png", x1, y1, x2, y2)  # pixel coordinates
```
</code>
<sandbox_output>
[Cropped image description]
</sandbox_output>
````

**code_interpreter tool:**

````text
<code>
```python
value_a = 45
value_b = 62
result = (value_b - value_a) / value_a * 100
print(f"{result:.1f}%")
```
</code>
<sandbox_output>
37.8%
</sandbox_output>
````

## Training Usage

### With TRL SFTTrainer

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
import json

# Load dataset
dataset = load_dataset("LockOnN/chart-think-with-images-sft", split="train")

# Parse messages from JSON string
def parse_messages(example):
    example["messages"] = json.loads(example["messages"])
    return example

dataset = dataset.map(parse_messages)

# Configure SFT (following Thyme recipe)
config = SFTConfig(
    output_dir="./chart-think-sft",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    warmup_ratio=0.05,
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-org/chart-think-model",
)
```

### Key Training Notes (from Thyme paper)

- **Mask sandbox outputs**: Do NOT train the model to predict `<sandbox_output>` content
- **Math annealing**: Train image ops first, then fine-tune math code at 10× lower lr
- **Recommended base model**: Qwen2.5-VL-7B-Instruct (best spatial/chart understanding)

## Synthesis Pipeline

Based on the methodology from:

- **[Thyme](https://arxiv.org/abs/2508.11630)** (Kwai, 2025): Tool-use SFT format, 3-category curriculum
- **[ChartM³](https://arxiv.org/abs/2511.02415)** (Nov 2025): Chart-specific 4-dimension QA taxonomy
- **[CodeVision](https://arxiv.org/abs/2512.03746)** (ByteDance, 2025): Multi-tool composition patterns

### Pipeline Steps:

1. Load chart images + QA pairs from ChartQA and CharXiv
2. Classify questions by difficulty → assign tool categories
3. Generate tool-augmented reasoning trajectories (template-based or VLM-assisted)
4. Validate that all trajectories have the correct `<think>/<answer>` structure
5. Format as ChatML messages and export

### Scaling with VLM API

For higher quality trajectories, set environment variables to use a VLM teacher:

```bash
export VLM_API_BASE="https://api.together.xyz/v1"
export VLM_API_KEY="your-key"
export VLM_MODEL="Qwen/Qwen2.5-VL-72B-Instruct"
python synthesize_chart_think_with_images.py --use-vlm-api --push-to-hub
```

## Limitations

- **Template-based generation**: The current version uses template-based trajectory synthesis. VLM-assisted generation (with `--use-vlm-api`) produces higher quality reasoning chains.
- **Crop coordinates are heuristic**: Without actual OCR/detection, crop coordinates are estimated from common chart layouts.
- **Code in calculation trajectories**: The Python code uses plausible placeholder values rather than actual extracted chart values.
- **Scale**: 643 examples is a cold-start set. Production training typically needs 5K-50K examples (see Thyme: 500K, CodeVision: 5K).

## Citation

If you use this dataset, please cite the source datasets and methodological papers:

```bibtex
@article{thyme2025,
  title={Thyme: Think Beyond Images},
  author={Zhang et al.},
  journal={arXiv preprint arXiv:2508.11630},
  year={2025}
}

@article{chartqa2022,
  title={ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning},
  author={Masry et al.},
  journal={ACL Findings},
  year={2022}
}

@article{charxiv2024,
  title={CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs},
  author={Wang et al.},
  journal={NeurIPS},
  year={2024}
}
```
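The structure check in step 4 of the pipeline (every trajectory has a well-formed `<think>/<answer>` layout) could look roughly like the sketch below. This is not the card author's validation code: the function name `parse_assistant_turn`, the rule that each `<code>` block must be paired with one `<sandbox_output>`, and the heuristic that a block calling `crop(` counts as a `crop` tool call are all assumptions made for illustration. The fence marker is built at runtime only to keep literal backticks out of the listing.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime

# Tag layout taken from the card's Data Format section.
THINK_ANSWER = re.compile(
    r"\A\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*\Z", re.DOTALL
)
CODE_BLOCK = re.compile(
    rf"<code>\s*{FENCE}python\n(.*?){FENCE}\s*</code>", re.DOTALL
)
SANDBOX = re.compile(r"<sandbox_output>(.*?)</sandbox_output>", re.DOTALL)


def parse_assistant_turn(content: str) -> dict:
    """Validate one assistant message and pull out its tool calls.

    Hypothetical helper: raises ValueError when the structure is malformed.
    """
    match = THINK_ANSWER.match(content)
    if match is None:
        raise ValueError("expected <think>...</think><answer>...</answer>")
    think, answer = match.group(1), match.group(2).strip()

    code_blocks = [c.strip() for c in CODE_BLOCK.findall(think)]
    outputs = [o.strip() for o in SANDBOX.findall(think)]
    # Assumption: every tool call is followed by exactly one sandbox output.
    if len(code_blocks) != len(outputs):
        raise ValueError("each <code> block needs one <sandbox_output>")

    # Assumption: a block that calls crop( is a crop call, else code_interpreter.
    tools = set()
    for code in code_blocks:
        tools.add("crop" if re.search(r"\bcrop\(", code) else "code_interpreter")

    return {
        "answer": answer,
        "code_blocks": code_blocks,
        "outputs": outputs,
        "tools": sorted(tools),
    }


# A shortened version of the example trajectory from the Data Format section.
sample = "\n".join([
    "<think>",
    "Zoom in to read the values.",
    "<code>",
    FENCE + "python",
    'cropped = crop("chart.png", 120, 80, 680, 470)',
    FENCE,
    "</code>",
    "<sandbox_output>",
    "[Cropped image showing the data area]",
    "</sandbox_output>",
    "<code>",
    FENCE + "python",
    "total = 45 + 62",
    'print(f"Total: {total}")',
    FENCE,
    "</code>",
    "<sandbox_output>",
    "Total: 107",
    "</sandbox_output>",
    "The sum is 107.",
    "</think>",
    "<answer>107</answer>",
])

result = parse_assistant_turn(sample)
print(result["answer"], result["tools"])
```

Comparing `result["tools"]` against the decoded `tools_used` field is one cheap way to spot rows where the declared tools and the actual trajectory disagree.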
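The "mask sandbox outputs" training note can be illustrated with a small, tokenizer-agnostic sketch. It finds the character spans of `<sandbox_output>` sections and sets the label of every token overlapping such a span to `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. Several things here are assumptions rather than the Thyme recipe itself: the helper names, the choice to mask the tags along with their content, and the whitespace `toy_tokenize` function, which only stands in for a real tokenizer that reports character offsets (for example a Hugging Face fast tokenizer called with `return_offsets_mapping=True`).

```python
import re

SANDBOX_SPAN = re.compile(r"<sandbox_output>.*?</sandbox_output>", re.DOTALL)
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def sandbox_spans(text: str) -> list:
    """Character (start, end) spans of every sandbox output, tags included."""
    return [m.span() for m in SANDBOX_SPAN.finditer(text)]


def mask_labels(token_ids: list, offsets: list, spans: list) -> list:
    """Copy token ids to labels, ignoring tokens that overlap a masked span."""
    labels = []
    for token_id, (start, end) in zip(token_ids, offsets):
        overlaps = any(start < s_end and end > s_start for s_start, s_end in spans)
        labels.append(IGNORE_INDEX if overlaps else token_id)
    return labels


def toy_tokenize(text: str):
    """Stand-in tokenizer (assumption): one token per whitespace-separated chunk."""
    offsets = [m.span() for m in re.finditer(r"\S+", text)]
    token_ids = list(range(len(offsets)))  # fake ids, just the token position
    return token_ids, offsets


text = (
    "I compute the sum. "
    "<sandbox_output> Total: 107 </sandbox_output> "
    "The sum is 107."
)
token_ids, offsets = toy_tokenize(text)
labels = mask_labels(token_ids, offsets, sandbox_spans(text))
print(labels)
```

With real subword tokens the same overlap test applies unchanged; only `toy_tokenize` would be swapped for the tokenizer's own ids and offsets.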
