tomaarsen/llamaindex-vdr-en-train-preprocessed

Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/tomaarsen/llamaindex-vdr-en-train-preprocessed
Description:
---
dataset_info:
- config_name: eval
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 356052075
    num_examples: 300
  download_size: 328567541
  dataset_size: 356052075
- config_name: full
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 62370383388
    num_examples: 53512
  download_size: 57161629955
  dataset_size: 62370383388
- config_name: train
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 11548548625
    num_examples: 10000
  download_size: 10570074118
  dataset_size: 11548548625
configs:
- config_name: eval
  data_files:
  - split: train
    path: eval/train-*
- config_name: full
  data_files:
  - split: train
    path: full/train-*
- config_name: train
  data_files:
  - split: train
    path: train/train-*
license: apache-2.0
language:
- en
pretty_name: Visual Document Retrieval Dataset
---

# llamaindex-vdr-en-train-preprocessed

This dataset is a preprocessed English subset of [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), prepared for training multimodal [Sentence Transformer](https://sbert.net) embedding models on document screenshot retrieval.

## Changes from the original dataset

The original [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset stores hard negatives as a list of ID strings that reference other rows. This dataset makes two key changes:

1. **English only**: Only the English subset (53,512 samples) is included.
2. **Resolved negatives as images**: 4 out of the 16 hard negatives are resolved from IDs into the actual document screenshot images, stored as `negative_0` through `negative_3`. This makes the dataset directly usable for training with Sentence Transformers without any additional preprocessing.

## Dataset Structure

Each sample contains:

| Column | Type | Description |
|---|---|---|
| `query` | `string` | A synthetic text query associated with the document screenshot |
| `image` | `image` | The positive document screenshot (PDF page rendered as an image) |
| `negative_0` | `image` | Hard negative document screenshot (closest) |
| `negative_1` | `image` | Hard negative document screenshot |
| `negative_2` | `image` | Hard negative document screenshot |
| `negative_3` | `image` | Hard negative document screenshot (furthest) |

## Configs

| Config | Samples | Description |
|---|---|---|
| `full` | 53,512 | All English samples |
| `train` | 10,000 | The first 10,000 samples (0–9,999) from the `full` dataset |
| `eval` | 300 | The next 300 samples (10,000–10,299) from the `full` dataset |

You can train on the `full` config, but you should then create your own eval/test splits. Do not combine `full` and `eval`: the `eval` samples are also part of `full`, so you would train and evaluate on the same data.

## Usage

```python
from datasets import load_dataset

# Load the training split
train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")

# Load the evaluation split
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")

# Load all English samples
full_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "full", split="train")
```

## Training

This dataset can be used to finetune a multimodal Sentence Transformer model for document screenshot embedding. See the [training example](https://github.com/huggingface/sentence-transformers/blob/main/examples/sentence_transformer/training/multimodal/training_document_screenshot_embedding.py) for a full training script.

## Source

This dataset is derived from [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), which consists of 500k multilingual query-image samples collected and generated from public internet PDFs. Queries were synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). See the original dataset card for full details on the data collection and curation process.
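
The overlap between the `full` and `eval` configs can be made concrete with a small sketch. It assumes, as the Configs table states, that `eval` corresponds to rows 10,000–10,299 of `full`; the helper name `full_indices_without_eval` and the constants are hypothetical and not part of the dataset or of any library. This is one possible way to keep the published `eval` config usable while training on the rest of `full`, not the dataset author's documented workflow.

```python
# Row ranges taken from the Configs table; names here are illustrative only.
FULL_SIZE = 53_512
EVAL_START = 10_000  # first row of `full` that also appears in `eval`
EVAL_STOP = 10_300   # exclusive upper bound (last eval row is 10,299)


def full_indices_without_eval(
    full_size: int = FULL_SIZE,
    eval_start: int = EVAL_START,
    eval_stop: int = EVAL_STOP,
) -> list[int]:
    """Return the row indices of `full` that do not overlap with `eval`."""
    return [i for i in range(full_size) if not eval_start <= i < eval_stop]


keep = full_indices_without_eval()
print(len(keep))  # 53,512 - 300 = 53212
```

With the `full_dataset` object from the Usage section, the index list could then be passed to `full_dataset.select(keep)` to obtain a training set that no longer contains the `eval` rows.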

Provider:
tomaarsen