ReachQA
ReachQA: A Reasoning-Intensive Chart Q&A Dataset
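Dataset Overview
ReachQA is a multimodal instruction dataset synthesized primarily with large language models (LLMs). The training set contains roughly 3,000 reasoning-intensive charts and 20,000 question-answer pairs (3,249 charts and 19,963 pairs according to the statistics below) and is designed to strengthen both recognition and reasoning abilities. A manually curated test set is also provided for evaluating these abilities.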
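Dataset Features
- image: the chart image
- chart_type: the chart type
- qa_type: the question-answer type (for example, Reasoning)
- question: the question text
- answer: the answer text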
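As a rough illustration of how the five features fit together, the sketch below models one record and groups records by `qa_type`. The `ReachQARecord` type and `group_by_qa_type` helper are hypothetical names introduced here, the image is represented by a file path rather than decoded pixels, and the `"Recognition"` label is an assumption (only `"Reasoning"` appears in the data format preview).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReachQARecord:
    """Hypothetical container mirroring the five dataset features."""
    image: str       # path to the chart image (an assumption for this sketch)
    chart_type: str
    qa_type: str
    question: str
    answer: str


def group_by_qa_type(records):
    """Return a dict mapping each qa_type to the list of records with that type."""
    groups = defaultdict(list)
    for record in records:
        groups[record.qa_type].append(record)
    return dict(groups)


# Illustrative records; the values are made up for this sketch.
records = [
    ReachQARecord("images/00001.jpg", "line chart", "Reasoning",
                  "In which decade do two currents have the same intensity?", "1980"),
    ReachQARecord("images/00001.jpg", "line chart", "Recognition",
                  "What is the title of the chart?", "(placeholder answer)"),
]

groups = group_by_qa_type(records)
print({qa_type: len(items) for qa_type, items in groups.items()})
```

Grouping by `qa_type` is a convenient first step when recognition and reasoning questions need to be evaluated separately.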
Dataset Splits
- train: 19,963 samples, 4,035,840,619.625 bytes (about 4.04 GB)
- test: 2,000 samples, 480,179,508 bytes (about 480 MB)
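Dataset Statistics
| Statistic | Train set | Test set |
|---|---|---|
| Total charts | 3,249 | 500 |
| - # Chart types (major / minor) | 10 / 32 | 10 / 32 |
| - # Overlay plots | 1,030 | 220 |
| - # Multiple plots | 593 | 251 |
| - Average size (px) | 2480×1571 | 2798×1601 |
| Unique questions | 19,963 | 2,000 |
| - # Recognition questions per chart | 2.53 | 2 |
| - # Reasoning questions per chart | 3.62 | 2 |
| Average length | | |
| - Avg. recognition question length | 22.1 | 21.0 |
| - Avg. recognition answer length | 38.3 | 7.0 |
| - Avg. reasoning question length | 38.2 | 35.4 |
| - Avg. reasoning answer length | 68.4 | 24.9 |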
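The per-chart averages in the table can be cross-checked against the reported question totals with a few lines of arithmetic. The sketch below only uses figures copied from the table; the function names are hypothetical and introduced for illustration.

```python
# Figures copied from the statistics table.
train = {"charts": 3249, "questions": 19963, "reco_per_chart": 2.53, "reas_per_chart": 3.62}
test = {"charts": 500, "questions": 2000, "reco_per_chart": 2, "reas_per_chart": 2}


def estimated_questions(stats):
    """Charts multiplied by the average number of questions per chart."""
    return stats["charts"] * (stats["reco_per_chart"] + stats["reas_per_chart"])


def relative_gap(stats):
    """Relative difference between the estimate and the reported total."""
    return abs(estimated_questions(stats) - stats["questions"]) / stats["questions"]


for name, stats in (("train", train), ("test", test)):
    print(f"{name}: estimated {estimated_questions(stats):.0f}, "
          f"reported {stats['questions']}, gap {relative_gap(stats):.2%}")
```

The training estimate comes out near 19,981, which differs from the reported 19,963 only because the per-chart averages are rounded to two decimals; the test figures match exactly (500 × 4 = 2,000).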
Data Format Preview
instruction_data_20k.json

    [
      {
        "data_id": "reachqa-train-00001",
        "plot_id": "reachqa-train-plot-00001",
        "image": "images/00001.jpg",
        "code": "code/00001.py",
        "plot_level": "Easy",
        "plot_model": "gpt-4o-2024-08-06",
        "major_chart_type": "Line Charts",
        "minor_chart_type": "line chart",
        "qa_type": "Reasoning",
        "qa_model": "gpt-4o-2024-08-06",
        "question": "Based on the observed trends in ocean current intensities over the decades, determine in which decade two of the currents have exactly the same intensity.",
        "answer": "Examine the data for ocean current intensities over each decade. In 1980, the Kuroshio Current and Antarctic Circumpolar Current both have an intensity of 22 units. Therefore, the decade when these two currents have exactly the same intensity is 1980."
      },
      ...
    ]

plot_info.jsonl

    {"id": "reachqa-train-plot-00001", "code": "code/00001.py", "image": "images/00001.jpg", "level": "Easy", "plot_model": "gpt-4o-2024-08-06", "major_chart_type": "Line Charts", "minor_chart_type": "line chart"}
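The two files share a plot identifier (`plot_id` in the instruction data, `id` in the plot info), so a Q&A record can be matched to its plot metadata. The following is a minimal sketch under that assumption: it uses inline strings as stand-ins for the files, copies only a few fields from the preview, and the helper names `parse_jsonl` and `attach_plot_info` are hypothetical.

```python
import json

# Inline stand-ins for the two files; the long question/answer text is omitted.
instruction_json = """[
  {"data_id": "reachqa-train-00001", "plot_id": "reachqa-train-plot-00001",
   "image": "images/00001.jpg", "qa_type": "Reasoning"}
]"""
plot_info_jsonl = (
    '{"id": "reachqa-train-plot-00001", "code": "code/00001.py", '
    '"image": "images/00001.jpg", "level": "Easy", '
    '"plot_model": "gpt-4o-2024-08-06", "major_chart_type": "Line Charts", '
    '"minor_chart_type": "line chart"}\n'
)


def parse_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def attach_plot_info(qa_records, plots):
    """Add a "plot" entry to each Q&A record, looked up by its plot_id."""
    plots_by_id = {plot["id"]: plot for plot in plots}
    return [{**qa, "plot": plots_by_id.get(qa["plot_id"])} for qa in qa_records]


joined = attach_plot_info(json.loads(instruction_json), parse_jsonl(plot_info_jsonl))
print(joined[0]["plot"]["minor_chart_type"])  # line chart
```

Building the lookup dictionary once keeps the join linear in the number of records, which matters when several Q&A pairs point at the same plot.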
Loading the Dataset
Loading with Hugging Face

    from datasets import load_dataset

    # Load the dataset from the Hugging Face Hub
    dataset = load_dataset("hewei2001/ReachQA")
    print(dataset)
Loading local .parquet files

    from datasets import load_dataset

    # Assume the Parquet files are stored under /path/to/local/data/
    # Directory structure:
    # /path/to/local/data/
    # ├── test-00000-of-00001.parquet
    # ├── train-00000-of-00009.parquet
    # ├── train-00001-of-00009.parquet
    # ...

    # List the local Parquet shards for each split
    data_files = {
        "train": [f"/path/to/local/data/train-{i:05d}-of-00009.parquet" for i in range(9)],
        "test": ["/path/to/local/data/test-00000-of-00001.parquet"],
    }

    # Load the local Parquet files with load_dataset
    dataset = load_dataset("parquet", data_files=data_files)
    print(dataset)
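If the number of shards is not known in advance, the file list can be discovered from the directory instead of being hard-coded. This is only a sketch: `build_data_files` is a hypothetical helper, and it assumes the shards follow the `<split>-XXXXX-of-XXXXX.parquet` naming shown in the directory structure.

```python
from pathlib import Path


def build_data_files(data_dir):
    """Map each split to the sorted list of Parquet shards found in data_dir."""
    root = Path(data_dir)
    data_files = {}
    for split in ("train", "test"):
        # Zero-padded shard indices make lexicographic order match shard order.
        shards = sorted(str(path) for path in root.glob(f"{split}-*-of-*.parquet"))
        if shards:
            data_files[split] = shards
    return data_files


# Possible usage with the loader shown earlier (not executed here):
# from datasets import load_dataset
# dataset = load_dataset("parquet", data_files=build_data_files("/path/to/local/data/"))
```

Splits with no matching files are left out of the mapping, so a directory that holds only the test shard still produces a usable `data_files` argument.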
Dataset Generation

    cd ReachQA
    conda activate ReachQA_data

    python ./data/reachqa_train/execute_code.py --code_dir ./data/reachqa_train/code/ --image_dir ./data/reachqa_train/images/
    python ./data/reachqa_test/execute_code.py --code_dir ./data/reachqa_test/code/ --image_dir ./data/reachqa_test/images/
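After running the commands above, it may be useful to confirm that every plotting script has a corresponding image. The sketch below is not part of the ReachQA repository: `scripts_missing_images` is a hypothetical helper, and it assumes that images are `.jpg` files sharing the script's file stem, as the `code/00001.py` and `images/00001.jpg` paths in the data format preview suggest.

```python
from pathlib import Path


def scripts_missing_images(code_dir, image_dir, image_suffix=".jpg"):
    """Return the stems of .py scripts in code_dir with no matching image in image_dir."""
    image_stems = {path.stem for path in Path(image_dir).glob(f"*{image_suffix}")}
    return sorted(
        path.stem for path in Path(code_dir).glob("*.py") if path.stem not in image_stems
    )


# Possible usage for the training split (not executed here):
# missing = scripts_missing_images("./data/reachqa_train/code/", "./data/reachqa_train/images/")
# print(f"{len(missing)} script(s) without an image")
```

An empty result suggests that image generation covered every script; any listed stems point to scripts worth re-running or inspecting.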
Contact
If you have any questions, please contact whe23@m.fudan.edu.cn.
Citation
If this dataset is helpful for your research, please cite our paper:

    @article{he2024distill,
      title={Distill Visual Chart Reasoning Ability from LLMs to MLLMs},
      author={He, Wei and Xi, Zhiheng and Zhao, Wanxu and Fan, Xiaoran and Ding, Yiwen and Shan, Zifei and Gui, Tao and Zhang, Qi and Huang, Xuan-Jing},
      journal={arXiv preprint arXiv:2410.18798},
      year={2024}
    }




