tomaarsen/llamaindex-vdr-en-train-preprocessed
Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/tomaarsen/llamaindex-vdr-en-train-preprocessed
---
dataset_info:
- config_name: eval
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 356052075
    num_examples: 300
  download_size: 328567541
  dataset_size: 356052075
- config_name: full
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 62370383388
    num_examples: 53512
  download_size: 57161629955
  dataset_size: 62370383388
- config_name: train
  features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: negative_0
    dtype: image
  - name: negative_1
    dtype: image
  - name: negative_2
    dtype: image
  - name: negative_3
    dtype: image
  splits:
  - name: train
    num_bytes: 11548548625
    num_examples: 10000
  download_size: 10570074118
  dataset_size: 11548548625
configs:
- config_name: eval
  data_files:
  - split: train
    path: eval/train-*
- config_name: full
  data_files:
  - split: train
    path: full/train-*
- config_name: train
  data_files:
  - split: train
    path: train/train-*
license: apache-2.0
language:
- en
pretty_name: Visual Document Retrieval Dataset
---
# llamaindex-vdr-en-train-preprocessed
This dataset is a preprocessed English subset of [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), prepared for training multimodal [Sentence Transformer](https://sbert.net) embedding models on document screenshot retrieval.
## Changes from the original dataset
The original [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset stores hard negatives as a list of ID strings that reference other rows. This dataset makes two key changes:
1. **English only**: Only the English subset (53,512 samples) is included.
2. **Resolved negatives as images**: 4 out of the 16 hard negatives are resolved from IDs into the actual document screenshot images, stored as `negative_0` through `negative_3`. This makes the dataset directly usable for training with Sentence Transformers without any additional preprocessing.
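The ID-to-image resolution described above can be sketched on toy data. This is only an illustration, not the script used to build this dataset: the key names `id` and `negatives`, the helper `resolve_negatives`, and the choice to keep the first entries of the negatives list are all assumptions, and plain strings stand in for the screenshot images.

```python
from typing import Any


def resolve_negatives(
    rows: list[dict[str, Any]], num_negatives: int = 4
) -> list[dict[str, Any]]:
    """Replace ID-based hard negatives with the images they reference.

    Assumes each row has the hypothetical keys "id", "query", "image",
    and "negatives" (a list of IDs pointing at other rows).
    """
    # Build an ID -> image lookup once so each negative resolves in O(1).
    image_by_id = {row["id"]: row["image"] for row in rows}

    resolved = []
    for row in rows:
        out = {"query": row["query"], "image": row["image"]}
        # Assumption: keep the first `num_negatives` IDs of the list.
        for i, neg_id in enumerate(row["negatives"][:num_negatives]):
            out[f"negative_{i}"] = image_by_id[neg_id]
        resolved.append(out)
    return resolved


# Toy rows: strings stand in for the document screenshots.
toy_rows = [
    {"id": "a", "query": "q-a", "image": "img-a", "negatives": ["b", "c"]},
    {"id": "b", "query": "q-b", "image": "img-b", "negatives": ["c", "a"]},
    {"id": "c", "query": "q-c", "image": "img-c", "negatives": ["a", "b"]},
]
resolved_rows = resolve_negatives(toy_rows, num_negatives=2)
```

After this step each row carries its negatives inline, so no cross-row lookup is needed at training time.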
## Dataset Structure
Each sample contains:
| Column | Type | Description |
|---|---|---|
| `query` | `string` | A synthetic text query associated with the document screenshot |
| `image` | `image` | The positive document screenshot (PDF page rendered as an image) |
| `negative_0` | `image` | Hard negative document screenshot (closest) |
| `negative_1` | `image` | Hard negative document screenshot |
| `negative_2` | `image` | Hard negative document screenshot |
| `negative_3` | `image` | Hard negative document screenshot (furthest) |
## Configs
| Config | Samples | Description |
|---|---|---|
| `full` | 53,512 | All English samples |
| `train` | 10,000 | The first 10,000 samples (0–9,999) from the `full` dataset |
| `eval` | 300 | The next 300 samples (10,000–10,299) from the `full` dataset |
You can also train on the `full` config, in which case you should create your own eval/test splits. Do not train on `full` and evaluate on `eval`: the `eval` samples are part of `full`, so you would be evaluating on data the model has already seen during training.
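One way to train on most of `full` while keeping `eval` unseen is to drop the rows that `eval` was taken from. The sketch below only computes the row indices, using the ranges from the table above; the names `FULL_SIZE`, `EVAL_RANGE`, and `indices_excluding` are illustrative, and it assumes the row order of `full` matches the order used to build `eval`.

```python
FULL_SIZE = 53_512                   # samples in the `full` config
EVAL_RANGE = range(10_000, 10_300)   # rows of `full` that make up `eval`


def indices_excluding(total: int, held_out: range) -> list[int]:
    """Return all row indices in [0, total) that are not in `held_out`."""
    # Membership tests on a range of ints are constant time.
    return [i for i in range(total) if i not in held_out]


train_indices = indices_excluding(FULL_SIZE, EVAL_RANGE)
```

The resulting list could then be passed to `Dataset.select` from the `datasets` library to build the training subset.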
## Usage
```python
from datasets import load_dataset
# Load the training split
train_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "train", split="train")
# Load the evaluation split
eval_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "eval", split="train")
# Load all English samples
full_dataset = load_dataset("tomaarsen/llamaindex-vdr-en-train-preprocessed", "full", split="train")
```
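The configs differ a lot in download size, so it can be worth checking the numbers before loading one. The following sketch converts the `download_size` and `num_examples` values from the metadata above into decimal gigabytes and an approximate per-sample size; the helper `summarize` is illustrative and not part of any library.

```python
# Values copied from the dataset card metadata.
DOWNLOAD_BYTES = {
    "eval": 328_567_541,
    "train": 10_570_074_118,
    "full": 57_161_629_955,
}
NUM_EXAMPLES = {"eval": 300, "train": 10_000, "full": 53_512}


def summarize(config: str) -> tuple[float, float]:
    """Return (download size in GB, average MB per sample) for a config."""
    size_bytes = DOWNLOAD_BYTES[config]
    gigabytes = size_bytes / 1e9
    megabytes_per_sample = size_bytes / NUM_EXAMPLES[config] / 1e6
    return round(gigabytes, 2), round(megabytes_per_sample, 2)


for name in ("eval", "train", "full"):
    gb, mb_per_sample = summarize(name)
    print(f"{name}: {gb} GB download, about {mb_per_sample} MB per sample")
```

Each sample holds five images (one positive and four negatives), which is why the average per-sample size is around a megabyte.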
## Training
This dataset can be used to finetune a multimodal Sentence Transformer model for document screenshot embedding. See the [training example](https://github.com/huggingface/sentence-transformers/blob/main/examples/sentence_transformer/training/multimodal/training_document_screenshot_embedding.py) for a full training script.
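To give a feel for how one row (a query, its positive screenshot, and four hard negatives) can drive training, here is a toy softmax cross-entropy ranking loss over precomputed similarity scores. It is a simplified illustration, not the loss configured in the linked training script: the function `ranking_loss`, its `scale` parameter, and the example scores are assumptions, and real training would compute the scores from model embeddings.

```python
import math


def ranking_loss(
    scores: list[float], positive_index: int = 0, scale: float = 1.0
) -> float:
    """Cross-entropy of picking the positive among all candidates.

    `scores` holds the query's similarity to each candidate, e.g.
    [image, negative_0, negative_1, negative_2, negative_3].
    """
    scaled = [scale * s for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scaled))
    return log_sum_exp - scaled[positive_index]


# Hypothetical similarities: the positive scores highest, negative_0 is close.
good_ranking = ranking_loss([0.9, 0.8, 0.4, 0.3, 0.1], scale=20.0)
# Same scores, but the positive and the closest negative are swapped.
bad_ranking = ranking_loss([0.8, 0.9, 0.4, 0.3, 0.1], scale=20.0)
```

The loss is lower when the positive outranks the hard negatives, which is the signal that pushes a query closer to its own screenshot than to similar-looking pages.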
## Source
This dataset is derived from [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train), which consists of 500k multilingual query-image samples collected and generated from public internet PDFs. Queries were synthetically generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B). See the original dataset card for full details on the data collection and curation process.
Provider: tomaarsen



