opendatalab/ChartVerse-SFT-600K
Source: Hugging Face · Updated 2026-01-23 · Indexed 2026-02-07
Download link:
https://hf-mirror.com/datasets/opendatalab/ChartVerse-SFT-600K
Description:
ChartVerse-SFT-600K is a large-scale, high-quality chart reasoning dataset with Chain-of-Thought (CoT) annotations, developed as part of the [opendatalab/ChartVerse](https://huggingface.co/collections/opendatalab/chartverse) project. It contains only non-trivial samples, selected by failure rate (r > 0) so that every sample provides a meaningful learning signal; samples that are too easy (r = 0, where the model always answers correctly) are excluded. Because multiple questions are generated for each image and multiple CoT answers for each question, the same image may appear more than once in the dataset. This is a normal and expected characteristic of the data structure.
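
The failure-rate filter described above can be illustrated with a minimal sketch. It assumes that r is the fraction of model rollouts whose final answer does not match the reference, and that matching is an exact string comparison; the card does not specify either detail. The field names (`question`, `answer`, `rollouts`) and both helper functions are hypothetical and are not taken from the dataset schema.

```python
from typing import Dict, Iterable, List


def failure_rate(predictions: Iterable[str], reference: str) -> float:
    """Fraction of rollouts whose answer differs from the reference.

    Assumption: exact string match after stripping whitespace.
    """
    preds = list(predictions)
    if not preds:
        raise ValueError("need at least one rollout to estimate a failure rate")
    wrong = sum(1 for p in preds if p.strip() != reference.strip())
    return wrong / len(preds)


def keep_non_trivial(samples: List[Dict]) -> List[Dict]:
    """Keep samples with r > 0, i.e. the model failed at least once."""
    return [s for s in samples if failure_rate(s["rollouts"], s["answer"]) > 0]


# Toy data: q1 is always answered correctly (r = 0), q2 fails half the time.
samples = [
    {"question": "q1", "answer": "42", "rollouts": ["42", "42", "42", "42"]},
    {"question": "q2", "answer": "7.5", "rollouts": ["7.5", "8", "7.5", "6"]},
]
kept = keep_non_trivial(samples)
print([s["question"] for s in kept])
```

In this toy run only `q2` survives, since `q1` has a failure rate of zero and is treated as too easy.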
🔥 Highlights
- **Non-Trivial Samples**: Filtered by failure rate > 0, excluding samples that are too easy
- **High Complexity**: Rollout Posterior Entropy (RPE) of **0.44**, the highest among all chart datasets
- **Truth-Anchored**: All answers verified via Python code execution, eliminating hallucinations
- **Rich Reasoning**: **3.9B** tokens of high-quality Chain-of-Thought reasoning traces
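
As an illustration of the "Truth-Anchored" point, the sketch below checks a candidate answer against the value produced by running a small Python program. This is only one plausible reading of "verified via Python code execution"; the actual ChartVerse pipeline is not described in this card. The helper names, the convention that the program binds its result to a variable called `answer`, and the numeric tolerance are all assumptions made for this example.

```python
import math


def run_program(source: str) -> object:
    """Execute a snippet and return the value it binds to `answer`.

    Assumption: the snippet is trusted, self-generated code; exec() must
    not be used on untrusted input.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["answer"]


def is_verified(candidate: str, source: str, rel_tol: float = 1e-6) -> bool:
    """Compare a candidate answer with the executed ground truth."""
    expected = run_program(source)
    try:
        # Numeric answers are compared with a relative tolerance.
        return math.isclose(float(candidate), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        # Non-numeric answers fall back to exact string comparison.
        return str(candidate).strip() == str(expected).strip()


# Hypothetical chart data: the question asks for the range of the bar values.
program = """
values = {"A": 12.0, "B": 30.0, "C": 18.0}
answer = max(values.values()) - min(values.values())
"""

print(is_verified("18", program))
print(is_verified("20", program))
```

Anchoring the reference answer to executed code rather than to a model's free-text output is what lets a mismatch be detected mechanically.
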
📊 Dataset Statistics
| Property | Value |
|:---|:---:|
| **Unique Charts** | 412K |
| **QA Pairs** | 603K |
| **Total Tokens** | 3.9B |
| **Avg CoT Length** | ~6,500 tokens |
| **Failure Rate** | r(Q) > 0 |
| **Answer Accuracy** | ✅ Verified |
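
The figures in the table can be cross-checked with simple arithmetic. The sketch below assumes that the 3.9B total consists mostly of CoT tokens and that the rounded values in the table are close to the exact counts; both are assumptions, so the results are approximate.

```python
# Rounded figures taken from the statistics table.
total_tokens = 3.9e9
qa_pairs = 603_000
unique_charts = 412_000

# Average tokens per QA pair, to compare with the stated ~6,500 average.
avg_tokens_per_pair = total_tokens / qa_pairs

# Average number of QA pairs per chart, which is why images repeat.
pairs_per_chart = qa_pairs / unique_charts

print(f"{avg_tokens_per_pair:,.0f} tokens per QA pair")
print(f"{pairs_per_chart:.2f} QA pairs per chart")
```

The result is roughly 6,468 tokens per pair, consistent with the stated average of about 6,500, and about 1.46 QA pairs per chart, consistent with the note that the same image can appear more than once.
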
ChartVerse-SFT-600K features charts with significantly higher complexity and diversity than existing chart reasoning datasets. It covers a wide range of chart types, including 3D visualizations, hierarchical structures, statistical plots, multi-subplot layouts, and specialized charts.
Provider:
opendatalab



