kk456123/VisRAG-Ret-Train-Synthetic-data
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/kk456123/VisRAG-Ret-Train-Synthetic-data
Resource description:
---
dataset_info:
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 162661189879.306
    num_examples: 239358
  download_size: 160347370819
  dataset_size: 162661189879.306
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
---
## Dataset Description
This dataset is the synthetic part of the training set of [VisRAG](https://huggingface.co/openbmb/VisRAG). It contains 239,358 query-document (Q-D) pairs built from pages of web-crawled PDF documents, each augmented with pseudo-queries generated by a VLM (GPT-4o).
Our training data is organized in batches of 128, such that all examples within a batch come from the same source dataset.
| Name | Source | Description | # Pages |
|----------------|----------------------------------------|------------------------------------------------------|---------|
| Textbooks | [https://openstax.org/](https://openstax.org/) | College-level textbooks including various subjects | 10,000 |
| ICML Papers | ICML 2023 | ICML papers on various topics | 5,000 |
| NeurIPS Papers | NeurIPS 2023 | NeurIPS papers on various topics | 5,000 |
| Manuallib | [https://www.manualslib.com/](https://www.manualslib.com/) | Manuals of various kinds of products | 20,000 |
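
The same-source batching constraint described above can be illustrated with a small sketch. This is not the VisRAG training code: the function name `source_homogeneous_batches`, the toy source labels, and the choice to drop incomplete trailing batches are all assumptions made for illustration. It only shows one way to group example indices by the `source` field so that every batch of 128 draws from a single source.

```python
import random
from collections import defaultdict

def source_homogeneous_batches(sources, batch_size=128, seed=0):
    """Return lists of example indices; each list draws from one source only.

    Hypothetical helper for illustration, not part of VisRAG.
    """
    rng = random.Random(seed)

    # Bucket example indices by their source label.
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)

    batches = []
    for indices in by_source.values():
        rng.shuffle(indices)
        # Keep only full batches; dropping the remainder is an assumption.
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle batch order so sources are interleaved across training steps.
    rng.shuffle(batches)
    return batches

# Toy example with made-up source labels (the real `source` values may differ).
toy_sources = ["textbooks"] * 300 + ["manuals"] * 200
batches = source_homogeneous_batches(toy_sources, batch_size=128)
```

With real data, `sources` could be taken from the dataset's `source` column, and each returned index list would select the rows of one batch.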
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", split="train")
```
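
The metadata above lists a download size of roughly 160 GB, so it may be convenient to inspect a few records without downloading everything. The sketch below uses the `streaming=True` option of `load_dataset`; the helper names `stream_train` and `count_sources` are hypothetical, and the repository id is simply the one used in the snippet above.

```python
from collections import Counter
from itertools import islice

def stream_train(repo_id="openbmb/VisRAG-Ret-Train-Synthetic-data"):
    """Open the train split as a stream instead of downloading it in full."""
    # Imported lazily so count_sources can be used without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split="train", streaming=True)

def count_sources(records, limit=1000):
    """Count `source` values over the first `limit` records of an iterable of dicts."""
    return Counter(rec["source"] for rec in islice(records, limit))

# Usage (requires network access and the `datasets` package):
# print(count_sources(stream_train(), limit=256))
```

Because records arrive in stored order, counts over a short prefix are only a quick sanity check, not an estimate of the overall source distribution.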
Provider: kk456123



