opendatalab/ChartVerse-SFT-1.8M

Name: opendatalab/ChartVerse-SFT-1.8M
Creator: opendatalab
Published: 2026-02-09 15:20:46
License: Apache 2.0

Source: Hugging Face
Updated: 2026-02-09
Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/opendatalab/ChartVerse-SFT-1.8M


Dataset overview:

---
license: apache-2.0
language:
- en
task_categories:
- visual-question-answering
- image-text-to-text
tags:
- chart
- reasoning
- vision-language
- multimodal
- chart-understanding
- CoT
- SFT
- large-scale
size_categories:
- 1M<n<10M
---

**ChartVerse-SFT-1800K** is an extended large-scale chart reasoning dataset with Chain-of-Thought (CoT) annotations, developed as part of the **[opendatalab/ChartVerse](https://huggingface.co/collections/opendatalab/chartverse)** project. For more details about our method, datasets, and full model series, please visit our [Project Page](https://chartverse.github.io).

This dataset contains **all verified correct samples** without failure rate filtering. Unlike SFT-600K which excludes easy samples (r=0), SFT-1800K includes the complete set of truth-anchored QA pairs for maximum coverage and scale.

Since we generate multiple questions for a single image and multiple CoT answers for a single question, **duplicate images may appear within the dataset**. This is a normal and expected characteristic of the data structure.

## 📰 News

* **2026-01-30**: 🚀 Our ChartVerse-SFT-1800K dataset ranked **Top 1 on Hugging Face Datasets Trending**.
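Because one image can carry several questions, and one question several CoT answers, it can help to regroup rows by image before counting or splitting the data. The sketch below illustrates that grouping on a few in-memory rows. The field names `image_id`, `question`, and `cot` are hypothetical placeholders for illustration and are not taken from the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical rows: the field names are illustrative assumptions,
# not the dataset's real column names.
records = [
    {"image_id": "chart_001", "question": "Which year has the highest value?", "cot": "trace A"},
    {"image_id": "chart_001", "question": "Which year has the highest value?", "cot": "trace B"},
    {"image_id": "chart_001", "question": "What is the total across all bars?", "cot": "trace C"},
    {"image_id": "chart_002", "question": "How many series are plotted?", "cot": "trace D"},
]


def group_by_image(rows):
    """Map each image id to its questions, and each question to its CoT answers."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row["image_id"]][row["question"]].append(row["cot"])
    # Convert the nested defaultdicts into plain dicts for easier inspection.
    return {image_id: dict(questions) for image_id, questions in grouped.items()}


grouped = group_by_image(records)
unique_images = len(grouped)
unique_questions = sum(len(questions) for questions in grouped.values())

print(f"{len(records)} rows, {unique_questions} questions, {unique_images} images")
```

Grouping like this keeps every row while making the image-level structure explicit, which is useful if a train/validation split should not place the same chart on both sides.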
## 🔥 Highlights

- **Complete Coverage**: **All** verified correct samples, no failure rate filtering
- **Massive Scale**: **1.8M** QA pairs, about 3× the size of SFT-600K
- **Maximum Diversity**: Includes both easy and hard samples for comprehensive learning
- **Truth-Anchored**: All answers verified via Python code execution
- **Rich Reasoning**: **~9B** tokens of high-quality Chain-of-Thought reasoning traces

## 📊 Dataset Statistics

| Property | Value |
|:---|:---:|
| **Unique Charts** | ~800k |
| **QA Pairs** | 1.8M |
| **Total Tokens** | ~9B |
| **Avg CoT Length** | ~6,500 tokens |
| **Failure Rate Filter** | ❌ **None** (all correct samples) |
| **Answer Accuracy** | ✅ Verified |

### Chart Examples

<div align="center">
<img src="https://raw.githubusercontent.com/starriver030515/ChartVerse/main/assets/complex_images.png" width="100%" alt="Complex Chart Examples">
</div>

Our dataset covers a diverse range of chart types:

- **3D Visualizations**: Surface plots, 3D bar charts, scatter plots
- **Hierarchical Structures**: Treemaps, sunburst charts, dendrograms
- **Statistical Plots**: Violin plots, radar charts, box plots with annotations
- **Multi-Subplot Layouts**: Complex dashboards with mixed chart types
- **Specialized Charts**: Sankey diagrams, chord diagrams, heatmaps with clustering

### Dataset Variants Comparison

| Property | SFT-600K | RL-40K | **SFT-1.8M** |
|:---|:---:|:---:|:---:|
| **QA Pairs** | 603K | 40K | **1.8M** |
| **Failure Rate Filter** | r > 0 | Highest r | **None** |
| **Sample Type** | Non-trivial | Hardest | **All correct** |
| **Use Case** | Standard SFT | RL training | **Maximum scale SFT** |

## 🔬 Data Generation Pipeline

### Step 1: Rollout Posterior Entropy (RPE) for Chart Complexity

<div align="center">
<img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/rpe_illustration.png" width="100%" alt="RPE Illustration">
</div>

We quantify chart complexity using RPE:

- Simple charts → consistent VLM reconstructions (low RPE)
- Complex charts → divergent reconstructions (high RPE)
- **Threshold**: keeping only charts with RPE ≥ 0.4 ensures high complexity

### Step 2: Truth-Anchored Inverse QA Synthesis

<div align="center">
<img src="https://raw.githubusercontent.com/chartverse/chartverse.github.io/main/static/images/pipeline.png" width="100%" alt="ChartVerse Pipeline">
</div>

Our Answer-First paradigm ensures answer correctness:

1. **Script Generation**: LLM analyzes chart code → Python script → deterministic answer A_py
2. **Reverse Question Synthesis**: Generate question Q conditioned on the script logic
3. **Consistency Verification**: LLM infers answer Â from (code, Q); retain only if Â = A_py
4. **CoT Distillation**: Qwen3-VL-30B-A3B-Thinking generates reasoning traces

### Step 3: No Failure Rate Filtering (Complete Inclusion)

Unlike SFT-600K and RL-40K, **SFT-1800K includes ALL verified samples**:

| Dataset | Filtering Strategy | Result |
|:---|:---|:---|
| SFT-600K | Exclude r(Q) = 0 | Non-trivial samples only |
| RL-40K | Select highest r(Q) | Hardest samples only |
| **SFT-1800K** | **No filtering** | **All correct samples** |

## 📖 Citation

```bibtex
@misc{liu2026chartversescalingchartreasoning,
  title={ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch},
  author={Zheng Liu and Honglin Lin and Chonghan Qin and Xiaoyang Wang and Xin Gao and Yu Li and Mengzhang Cai and Yun Zhu and Zhanping Zhong and Qizhi Pei and Zhuoshi Pan and Xiaoran Shang and Bin Cui and Conghui He and Wentao Zhang and Lijun Wu},
  year={2026},
  eprint={2601.13606},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.13606},
}
```

## 📄 License

This dataset is released under the Apache 2.0 License.

## 🙏 Acknowledgements

- Chart synthesis: [ChartVerse-Coder](https://huggingface.co/opendatalab/ChartVerse-Coder)
- CoT distillation: Qwen3-VL-30B-A3B-Thinking
- QA synthesis: Qwen3-30B-A3B-Thinking
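As a rough illustration of the consistency check (Â = A_py) and the three failure-rate strategies described in the data generation pipeline, the sketch below applies them to a handful of toy samples. It is not the authors' implementation. The `Sample` class, its field names, the definition of r(Q) as the fraction of incorrect rollouts, and the `rl_size` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # All fields are hypothetical and exist only for this illustration.
    question: str
    script_answer: str            # A_py: answer produced by the Python script
    inferred_answer: str          # Â: answer an LLM inferred from (code, Q)
    rollouts_correct: List[bool]  # whether each model rollout answered correctly


def is_consistent(sample: Sample) -> bool:
    """Consistency verification: keep a sample only if Â matches A_py."""
    return sample.inferred_answer.strip() == sample.script_answer.strip()


def failure_rate(sample: Sample) -> float:
    """Assumed definition of r(Q): fraction of rollouts that were wrong."""
    wrong = sum(1 for ok in sample.rollouts_correct if not ok)
    return wrong / len(sample.rollouts_correct)


def build_variants(samples: List[Sample], rl_size: int):
    """Return (all verified, non-trivial, hardest) subsets of the samples."""
    verified = [s for s in samples if is_consistent(s)]
    sft_all = verified                                            # no filtering
    sft_nontrivial = [s for s in verified if failure_rate(s) > 0]  # exclude r = 0
    rl_hardest = sorted(verified, key=failure_rate, reverse=True)[:rl_size]
    return sft_all, sft_nontrivial, rl_hardest


samples = [
    Sample("q1", "42", "42", [True, True, True, True]),       # r = 0.0 (easy)
    Sample("q2", "7", "7", [True, False, True, False]),       # r = 0.5
    Sample("q3", "3.5", "3.5", [False, False, False, True]),  # r = 0.75
    Sample("q4", "10", "12", [True, True, False, False]),     # Â != A_py, dropped
]

sft_all, sft_nontrivial, rl_hardest = build_variants(samples, rl_size=1)
print([s.question for s in sft_all])
print([s.question for s in sft_nontrivial])
print([s.question for s in rl_hardest])
```

In this toy setup the unfiltered set is a superset of the other two, which mirrors the relationship between the three variants: the same verified pool, cut at different failure-rate thresholds.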

Provider:

opendatalab
