vdr-multilingual-train
Source: ModelScope community. Updated 2025-12-05; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/llamaindex/vdr-multilingual-train
# Multilingual Visual Document Retrieval Dataset

> This dataset consists of **500k multilingual query-image samples**, collected and generated from scratch using public internet PDFs. The queries are synthetic and were generated using VLMs (gemini-1.5-pro and Qwen2-VL-72B).
It was used to train [vdr-2b-multi-v1](https://huggingface.co/llamaindex/vdr-2b-multi-v1), a multimodal, multilingual embedding model for retrieval.
## How it was created
This section describes the entire data pipeline used to create the Italian subset of this dataset. Each step of the process is explained in detail below.

#### Data gathering
For each language, we generate a long list of search queries covering many different topics, which are then used to search for PDFs. We use the language filtering capabilities of the search engine to scrape only documents in the specified language. This "search by topic" technique exposes the model to a wide range of topics and domains, which helps it perform well in real-life scenarios.
The scraping process produced ~50k multilingual documents. Unlike the method used for the previous [`mcdse-2b-v1`](https://huggingface.co/marco/mcdse-2b-v1) model, pages were not extracted randomly. Instead, each page of each PDF was run through a document layout analysis model to determine whether the page contained more textual or visual elements. The result is a number that classifies the page as text-only, visual-only or mixed. These labels were then used to sample ~100k pages, ensuring they were evenly distributed by page type.
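The balanced sampling step can be sketched as follows. This is a minimal illustration, assuming each page already carries a layout label. The record fields (`page_id`, `page_type`), the label names, the function `sample_balanced`, and the fixed seed are all hypothetical, since the source does not describe how the sampler was implemented.

```python
import random
from collections import defaultdict

PAGE_TYPES = ("text", "visual", "mixed")  # assumed label names, not from the source


def sample_balanced(pages, total, seed=0):
    """Sample up to `total` pages, split as evenly as possible across page types."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for page in pages:
        by_type[page["page_type"]].append(page)

    per_type = total // len(PAGE_TYPES)
    sampled = []
    for page_type in PAGE_TYPES:
        pool = by_type[page_type]
        # Take the whole pool when a type has fewer pages than its quota.
        sampled.extend(rng.sample(pool, min(per_type, len(pool))))
    return sampled


# Toy input: 60 text pages, 30 visual pages, 10 mixed pages.
pages = (
    [{"page_id": f"t{i}", "page_type": "text"} for i in range(60)]
    + [{"page_id": f"v{i}", "page_type": "visual"} for i in range(30)]
    + [{"page_id": f"m{i}", "page_type": "mixed"} for i in range(10)]
)
subset = sample_balanced(pages, total=30)
```

Even though text pages dominate the toy input, the sampled subset contains the same number of pages of each type.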
#### Synthetic generation
The queries were then generated using gemini-1.5-pro and Qwen2-VL-72B. Each model was asked to come up with both a specific and a general question for a page. Only the specific question is used to train the model, but forcing the model to distinguish between the two often resulted in stronger specific questions for information retrieval training.
After generation, a further cleaning step ensures that the questions are good enough for training. This includes:
- Ensuring the language is correct
- Fixing formatting problems
- Removing markdown
- Ensuring that only one question is posed
- Removing grounding phrases (e.g. "according to Figure 1", "this document", ...)
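Some of the cleaning rules above could be approximated with a few regular expressions. The sketch below is an assumption-laden illustration, not the authors' cleaning code: the pattern list `GROUNDING_PATTERNS` and the functions `strip_markdown` and `clean_query` are hypothetical, the patterns are English-only, and language verification is left out because it would need a language identification model.

```python
import re

# Hypothetical grounding phrases; a real list would be per-language and longer.
GROUNDING_PATTERNS = [
    r"\baccording to (?:figure|table) \d+,?\s*",
    r"\bin this document,?\s*",
]


def strip_markdown(text):
    """Remove common inline markdown markers (leading '#', emphasis, code ticks)."""
    text = re.sub(r"^\s*#+\s*", "", text)
    return re.sub(r"[*_`]", "", text)


def clean_query(raw):
    """Return a cleaned query, or None if it does not pose exactly one question."""
    text = strip_markdown(raw)
    for pattern in GROUNDING_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text).strip()
    # Reject empty strings, multi-question strings, and non-questions.
    if text.count("?") != 1 or not text.endswith("?"):
        return None
    # Re-capitalise in case a leading grounding phrase was removed.
    return text[0].upper() + text[1:]


cleaned = clean_query("**According to Figure 1, what** is the 2023 revenue?")
rejected = clean_query("What is the revenue? And what is the profit?")
```

Here `cleaned` is a single plain-text question without the grounding phrase, and `rejected` is `None` because the input poses two questions.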
#### Filtering and hard-negative mining
This cleaning step ensures that the queries are syntactically correct and follow some strict guidelines, but it does not guarantee that the queries are good enough for information retrieval.
To filter out bad questions, we embedded and indexed each broad (general) query with the voyage-3 embedding model. For each specific question, we search the index. The query is marked as 'good' if its associated broad question appears in the top 100 results. This method removes low-entropy, duplicate, or overly similar questions. On average, 40% of queries were removed from each language dataset.
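The filtering rule can be sketched with plain cosine similarity. This is a minimal illustration under stated assumptions: toy 2-D vectors stand in for voyage-3 embeddings, a brute-force ranking replaces a real vector index, and the function name `filter_queries` and its `top_k` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_queries(specific_vecs, broad_vecs, top_k=100):
    """Keep index i if broad query i ranks in the top_k results for specific query i."""
    kept = []
    for i, query_vec in enumerate(specific_vecs):
        scores = [cosine(query_vec, broad_vec) for broad_vec in broad_vecs]
        ranked = sorted(range(len(broad_vecs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:top_k]:
            kept.append(i)
    return kept


# Toy vectors: specific[i] and broad[i] were generated for the same page.
broad = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
specific = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.8]]
kept = filter_queries(specific, broad, top_k=1)
```

With `top_k=1`, the second specific query is dropped because it retrieves another page's broad query ahead of its own. The source uses a cutoff of 100 over a much larger index.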
Hard negatives were then mined using voyage-3 only on specific questions with a fixed threshold of 0.75. Experiments were also carried out using positive-aware negative mining, as used by [nvidia/NV-Retriever-v1](https://huggingface.co/nvidia/NV-Retriever-v1), but on this dataset it seems to produce negatives that are too easy or too distant.
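The source does not say how the 0.75 threshold is applied. One plausible reading, sketched below, treats it as an absolute upper bound on similarity: candidates scoring above it are discarded as likely false negatives, and the most similar remaining candidates become hard negatives. That interpretation, the toy vectors, and the names `mine_hard_negatives`, `max_score`, and `num_negatives` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vecs, page_ids, max_score=0.75, num_negatives=2):
    """Map each page id to the ids of its hardest negatives below the threshold."""
    negatives = {}
    for i, query_vec in enumerate(query_vecs):
        scored = [
            (cosine(query_vec, other_vec), page_ids[j])
            for j, other_vec in enumerate(query_vecs)
            if j != i  # never use the positive as its own negative
        ]
        # Assumed rule: scores above the threshold are likely false negatives.
        eligible = [(score, pid) for score, pid in scored if score <= max_score]
        eligible.sort(key=lambda item: item[0], reverse=True)
        negatives[page_ids[i]] = [pid for _, pid in eligible[:num_negatives]]
    return negatives


# Toy query embeddings, one per page.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]
ids = ["p0", "p1", "p2", "p3"]
mined = mine_hard_negatives(vecs, ids)
```

In the toy data, `p1` is nearly identical to `p0`, so it is excluded from `p0`'s negatives rather than being selected as its hardest one.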
# Info and usage
The training dataset consists of 496,167 PDF pages, of which only 280,679 are associated with the filtered queries (using the method described above). The images that remain without a query are still used as hard negatives.
| Language | # filtered queries | # unfiltered queries |
|----------:|-------------------:|---------------------:|
| English | 53,512 | 94,225 |
| Spanish | 58,738 | 102,685 |
| Italian | 54,942 | 98,747 |
| German | 58,217 | 100,713 |
| French | 55,270 | 99,797 |
| **TOTAL** | **280,679** | **496,167** |
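As a quick consistency check, the per-language counts in the table can be summed and compared with the stated totals. The snippet below uses only the figures from the table; the variable names are illustrative.

```python
# (filtered, unfiltered) query counts per language, copied from the table above.
counts = {
    "English": (53_512, 94_225),
    "Spanish": (58_738, 102_685),
    "Italian": (54_942, 98_747),
    "German": (58_217, 100_713),
    "French": (55_270, 99_797),
}

total_filtered = sum(filtered for filtered, _ in counts.values())
total_unfiltered = sum(unfiltered for _, unfiltered in counts.values())

# Share of queries removed by filtering, per language.
removed_share = {
    language: 1 - filtered / unfiltered
    for language, (filtered, unfiltered) in counts.items()
}

for language, share in removed_share.items():
    print(f"{language}: {share:.1%} removed")
```

The totals match the **TOTAL** row, and the removed share falls between about 42% and 45% for every language, in line with the roughly 40% figure quoted earlier.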
### Schema
| **Column** | **Type** |
|-----------:|--------------:|
| id | string |
| query | string |
| image | image |
| negatives | array[string] |
| language | string |
The `id` column holds the identifier of the positive image. The `negatives` column contains the ids of all associated negatives, sorted in ascending order by their distance from the positive.
The last rows do not contain any negatives or queries, as their queries were filtered out by the data curation process. Their images are still used as negatives for the earlier queries.
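A minimal sketch of how these columns fit together when building training examples is shown below. It assumes that filtered rows carry an empty or `None` value for `query` and `negatives`, and it uses placeholder strings where the real dataset holds images. The function `build_training_triplets` and its `max_negatives` parameter are hypothetical.

```python
def build_training_triplets(rows, max_negatives=2):
    """Pair each query with its positive image and its closest negative images."""
    # Every row, with or without a query, can serve as a negative for another row.
    image_by_id = {row["id"]: row["image"] for row in rows}
    triplets = []
    for row in rows:
        if not row["query"]:
            continue  # filtered row: image only, no query to train on
        negative_images = [
            image_by_id[negative_id]
            for negative_id in row["negatives"][:max_negatives]
            if negative_id in image_by_id
        ]
        triplets.append((row["query"], row["image"], negative_images))
    return triplets


# Toy rows mirroring the schema; strings stand in for the image column.
rows = [
    {"id": "a", "query": "What was the 2022 budget?", "image": "<img a>",
     "negatives": ["c", "b"], "language": "en"},
    {"id": "b", "query": "Who chairs the committee?", "image": "<img b>",
     "negatives": ["a", "c"], "language": "en"},
    {"id": "c", "query": None, "image": "<img c>",
     "negatives": None, "language": "en"},
]
triplets = build_training_triplets(rows)
```

Because `negatives` is sorted by ascending distance, slicing from the front keeps the hardest negatives. On the full dataset, an in-memory `image_by_id` dictionary may be too large, so an index from id to row position is likely a better fit.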
The dataset consists of 5 subsets, one per language. You can download languages individually by specifying the language subset in [`load_dataset`](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset):
```python
from datasets import load_dataset

# The second argument selects the language subset by its two-letter code.
italian_dataset = load_dataset("llamaindex/vdr-multilingual-train", "it", split="train")
english_dataset = load_dataset("llamaindex/vdr-multilingual-train", "en", split="train")
french_dataset = load_dataset("llamaindex/vdr-multilingual-train", "fr", split="train")
german_dataset = load_dataset("llamaindex/vdr-multilingual-train", "de", split="train")
spanish_dataset = load_dataset("llamaindex/vdr-multilingual-train", "es", split="train")
```
Provider: maas
Created: 2025-01-20
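Summary: vdr-multilingual-train is a dataset of roughly 500k multilingual query-image samples for training multimodal, multilingual embedding models. Query quality is enforced through a strict generation and filtering pipeline, and the dataset includes subsets for five languages.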



