VisRAG-Ret-Train-In-domain-data
Source: ModelScope community (updated 2025-12-05, indexed 2025-05-17)
Download link:
https://modelscope.cn/datasets/OpenBMB/VisRAG-Ret-Train-In-domain-data
Resource description:
## Dataset Description
This dataset is the in-domain part of the training set of [VisRAG](https://huggingface.co/openbmb/VisRAG). It includes 122,752 Query-Document (Q-D) pairs from openly available academic datasets.
The training data is organized in batches of 128, and all examples within a batch come from the same source dataset.
| Dataset | # Q-D Pairs |
|------------------------------------------|-------------------------------|
| [ArXivQA](https://arxiv.org/abs/2403.00231) | 25,856 |
| [ChartQA](https://arxiv.org/abs/2203.10244) | 4,224 |
| [MP-DocVQA](https://www.docvqa.org/datasets/docvqa) | 10,624 |
| [InfoVQA](https://www.docvqa.org/datasets/infographicvqa) | 17,664 |
| [PlotQA](https://arxiv.org/abs/1909.00997) | 56,192 |
| [SlideVQA](https://arxiv.org/abs/2301.04883) | 8,192 |
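The counts in the table can be checked against the stated total and batch size. The sketch below is only an arithmetic check on the numbers copied from the table; the names `PAIR_COUNTS`, `BATCH_SIZE`, and `batches_per_dataset` are illustrative and not part of the dataset or any library.

```python
# Q-D pair counts copied from the table above.
PAIR_COUNTS = {
    "ArXivQA": 25_856,
    "ChartQA": 4_224,
    "MP-DocVQA": 10_624,
    "InfoVQA": 17_664,
    "PlotQA": 56_192,
    "SlideVQA": 8_192,
}
BATCH_SIZE = 128  # batch size stated in the dataset description


def batches_per_dataset(counts, batch_size):
    """Return {name: number of full batches}; fail if a count is not a multiple."""
    result = {}
    for name, n in counts.items():
        full, remainder = divmod(n, batch_size)
        if remainder:
            raise ValueError(f"{name}: {n} is not a multiple of {batch_size}")
        result[name] = full
    return result


total_pairs = sum(PAIR_COUNTS.values())
batch_counts = batches_per_dataset(PAIR_COUNTS, BATCH_SIZE)
total_batches = sum(batch_counts.values())

print(f"total pairs:   {total_pairs}")
print(f"total batches: {total_batches}")
for name, n_batches in batch_counts.items():
    print(f"{name:10s} {n_batches} batches")
```

Every per-dataset count is an exact multiple of 128, which is consistent with each batch being drawn from a single source dataset.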
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", split="train")
```
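Once loaded, the data can be walked in consecutive chunks of 128 rows. The following is a minimal sketch, not the authors' training loop: `iter_batches` is a hypothetical helper, and it assumes that the stored row order already reflects the same-source batch grouping, which the description does not state explicitly. The demo runs on a plain list so that it needs no download.

```python
from typing import Any, Iterator


def iter_batches(rows: Any, batch_size: int = 128) -> Iterator[Any]:
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items.

    `rows` can be any object that supports len() and slicing. For a Hugging Face
    `Dataset`, a slice such as ds[0:128] is a dict mapping column names to lists.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Stand-in data for illustration; replace `demo` with the loaded `ds`.
demo = list(range(300))
sizes = [len(batch) for batch in iter_batches(demo)]
print(sizes)
```

If the assumption about row order does not hold, shuffling or resharding the rows would mix source datasets within a batch, so the grouping should be verified before relying on it.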
Provider:
maas
Created:
2025-05-15



