Name: kk456123/VisRAG-Ret-Train-Synthetic-data
Creator: kk456123
Published: 2026-03-29 14:13:06
License: No description available

Download link:

https://hf-mirror.com/datasets/kk456123/VisRAG-Ret-Train-Synthetic-data

Resource overview:

---
dataset_info:
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 162661189879.306
    num_examples: 239358
  download_size: 160347370819
  dataset_size: 162661189879.306
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- synthetic
---

## Dataset Description

This dataset is the synthetic part of the training set of [VisRAG](https://huggingface.co/openbmb/VisRAG). It includes 239,358 Query-Document (Q-D) pairs from a synthetic dataset made up of pages from web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o).

Our training data is organized with a batch size of 128, ensuring that all data within the same batch comes from the same dataset.

| Name           | Source                                                     | Description                                        | # Pages |
|----------------|------------------------------------------------------------|----------------------------------------------------|---------|
| Textbooks      | [https://openstax.org/](https://openstax.org/)             | College-level textbooks including various subjects | 10,000  |
| ICML Papers    | ICML 2023                                                  | ICML papers on various topics                      | 5,000   |
| NeurIPS Papers | NeurIPS 2023                                               | NeurIPS papers on various topics                   | 5,000   |
| Manuallib      | [https://www.manualslib.com/](https://www.manualslib.com/) | Manuals of various kinds of products               | 20,000  |

### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", split="train")
```
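
The batching rule described above (batches of 128 in which every example comes from the same dataset) can be illustrated with a small sketch. This is not the VisRAG training code: the helper name `single_source_batches`, the `drop_last` option, and the toy `source` values `"A"` and `"B"` are assumptions made for illustration. The only thing taken from the dataset card is that each record carries a string `source` field.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def single_source_batches(
    examples: Iterable[dict],
    batch_size: int = 128,
    drop_last: bool = True,
) -> Iterator[List[dict]]:
    """Yield batches whose examples all share the same `source` value.

    Hypothetical helper: keeps one buffer per source and emits a batch
    as soon as that buffer reaches `batch_size`.
    """
    buffers: Dict[str, List[dict]] = defaultdict(list)
    for example in examples:
        buf = buffers[example["source"]]
        buf.append(example)
        if len(buf) == batch_size:
            yield list(buf)  # copy, so clearing the buffer is safe
            buf.clear()
    if not drop_last:
        # Emit the incomplete per-source remainders as smaller batches.
        for buf in buffers.values():
            if buf:
                yield list(buf)


# Toy rows standing in for dataset records (image column omitted;
# the source labels "A" and "B" are made up for this example).
rows = [{"query": f"q{i}", "source": "A" if i % 2 == 0 else "B"} for i in range(10)]
batches = list(single_source_batches(rows, batch_size=4))
```

Because the helper only iterates once and buffers per source, it should also work on an iterable that is read lazily, which may matter here given the roughly 160 GB download size.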

Application scenarios: