tomaarsen/llamaindex-vdr-en-train-preprocessed

Name: tomaarsen/llamaindex-vdr-en-train-preprocessed
Creator: tomaarsen
Published: 2026-03-27 15:30:58
License: apache-2.0

Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/tomaarsen/llamaindex-vdr-en-train-preprocessed

Description:

---
dataset_info:
- config_name: eval
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 356052075
    num_examples: 300
  download_size: 328567541
  dataset_size: 356052075
- config_name: full
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 62370383388
    num_examples: 53512
  download_size: 57161629955
  dataset_size: 62370383388
- config_name: train
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 11548548625
    num_examples: 10000
  download_size: 10570074118
  dataset_size: 11548548625
configs:
- config_name: eval
  data_files:
  - split: train
    path: eval/train-*
- config_name: full
  data_files:
  - split: train
    path: full/train-*
- config_name: train
  data_files:
  - split: train
    path: train/train-*
license: apache-2.0
language:
- en
pretty_name: Visual Document Retrieval Dataset
---

# llamaindex-vdr-en-train-preprocessed

This dataset is a preprocessed English subset of [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), prepared for training multimodal [Sentence Transformer](https://sbert.net) embedding models on document screenshot retrieval.

## Changes from the original dataset

The original [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset stores hard negatives as a list of ID strings that reference other rows. This dataset makes two key changes:

1. **English only**: Only the English subset (53,512 samples) is included.
2. **Resolved negatives as images**: 4 of the 16 hard negatives are resolved from IDs into the actual document screenshot images, stored as `negative_0` through `negative_3`. This makes the dataset directly usable for training with Sentence Transformers without any additional preprocessing.

## Dataset Structure

Each sample contains:

| Column | Type | Description |
|---|---|---|
| `query` | `string` | A synthetic text query associated with the document screenshot |
| `image` | `image` | The positive document screenshot (PDF page rendered as an image) |
| `negative_0` | `image` | Hard negative document screenshot (closest) |
| `negative_1` | `image` | Hard negative document screenshot |
| `negative_2` | `image` | Hard negative document screenshot |
| `negative_3` | `image` | Hard negative document screenshot (furthest) |

## Configs

| Config | Samples | Description |
|---|---|---|
| `full` | 53,512 | All English samples |
| `train` | 10,000 | The first 10,000 samples (0–9,999) from the `full` dataset |
| `eval` | 300 | The next 300 samples (10,000–10,299) from the `full` dataset |

You can also train on the `full` config, in which case you should create your own eval/test splits. Do not combine `full` with `eval`: the `eval` samples are a slice of `full`, so you would train and evaluate on the same data.

## Usage

```python
from datasets import load_dataset

# Load the training split
train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")

# Load the evaluation split
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")

# Load all English samples
full_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "full", split="train")
```

## Training

This dataset can be used to finetune a multimodal Sentence Transformer model for document screenshot embedding. See the [training example](https://github.com/huggingface/sentence-transformers/blob/main/examples/sentence_transformer/training/multimodal/training_document_screenshot_embedding.py) for a full training script.

## Source

This dataset is derived from [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), which consists of 500k multilingual query-image samples collected and generated from public internet PDFs. Queries were synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). See the original dataset card for full details on the data collection and curation process.
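
The Configs section suggests making your own eval/test splits when training on `full`. The card does not prescribe how to do this, so the following is only a sketch of one possible approach. It relies on the row ranges stated in the Configs table (`eval` is rows 10,000–10,299 of `full`) and assumes that row order is stable. The helper names `custom_split_indices` and `load_custom_splits`, the held-out size of 1,000, and the seed are illustrative choices, not part of the dataset.

```python
import random

# Sizes and row ranges as stated in the dataset card's Configs table.
FULL_SIZE = 53_512
EVAL_ROWS = range(10_000, 10_300)  # rows of `full` that make up the `eval` config


def custom_split_indices(num_rows=FULL_SIZE, held_out=1_000, seed=42):
    """Return (train_indices, held_out_indices) over `full`.

    The two lists are disjoint. Rows that belong to the published `eval`
    config are excluded from both, so `eval` stays usable as unseen data.
    """
    candidates = [i for i in range(num_rows) if i not in EVAL_ROWS]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(candidates)
    return sorted(candidates[held_out:]), sorted(candidates[:held_out])


def load_custom_splits(held_out=1_000, seed=42):
    """Download `full` (a large download) and materialise both subsets."""
    from datasets import load_dataset  # imported lazily: only needed here

    full = load_dataset(
        "tomaarsen/llamaindex-vdr-en-train-preprocessed", "full", split="train"
    )
    train_idx, held_out_idx = custom_split_indices(len(full), held_out, seed)
    return full.select(train_idx), full.select(held_out_idx)
```

Calling `custom_split_indices()` alone does not touch the network, which makes it easy to check the split sizes before committing to the download that `load_custom_splits()` triggers.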

