racineai/VDR_colpali-VisRAG-vdr
Source: Hugging Face · Updated 2025-11-20 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/racineai/VDR_colpali-VisRAG-vdr
Resource overview:
---
license: apache-2.0
language:
- en
- fr
- es
- it
- de
tags:
- synthetic
- RAG
- DSE
- retrieval
size_categories:
- 100K<n<1M
task_categories:
- visual-document-retrieval
- text-retrieval
---
# WIP - there might be issues with the negatives
# VDR - Organized, Grouped, Cleaned
> **Intended for image/text to vector (DSE)**
## Dataset Composition
The dataset merges, shuffles, and formats data from:
- [vidore/colpali_train_set](https://huggingface.co/datasets/vidore/colpali_train_set)
- [openbmb/VisRAG-Ret-Train-Synthetic-data](https://huggingface.co/datasets/openbmb/VisRAG-Ret-Train-Synthetic-data)
- [llamaindex/vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train)
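The merge-and-shuffle step described above could look roughly like the following. This is a minimal sketch in plain Python, not the curators' actual pipeline: the toy records, the `query` field, and the added `source` field are all assumptions made for illustration.

```python
import random


def merge_and_shuffle(sources, seed=0):
    """Concatenate records from several sources, tag each with its origin, and shuffle."""
    merged = []
    for name, rows in sources.items():
        for row in rows:
            # "source" is a hypothetical field recording the upstream dataset.
            merged.append({**row, "source": name})
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(merged)
    return merged


# Toy stand-ins for the three upstream datasets (field names are assumptions).
sources = {
    "vidore/colpali_train_set": [{"query": "q1"}, {"query": "q2"}],
    "openbmb/VisRAG-Ret-Train-Synthetic-data": [{"query": "q3"}],
    "llamaindex/vdr-multilingual-train": [{"query": "q4"}, {"query": "q5"}],
}

merged = merge_and_shuffle(sources, seed=42)
print(len(merged), "records after merging")
```

Tagging each record with its origin is optional, but it makes it easier to trace a problematic row back to its upstream dataset later.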
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total rows | 700,000+ |
| Rows with negatives | ≈ 33% |
| Rows without queries (image negatives only) | ≈ 25% |
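To check ratios like the ones in the table against a local copy, rows can be bucketed by whether they carry a query and whether they carry negatives. The sketch below assumes hypothetical column names `query` and `negatives`; the real schema may differ, so adjust the keys before use. It also counts a row without a query as `no_query` whether or not it has negatives, which is one possible reading of the table.

```python
from collections import Counter


def row_kind(row):
    """Classify a row by the presence of a query and of negatives."""
    has_query = bool(row.get("query"))          # hypothetical column name
    has_negatives = bool(row.get("negatives"))  # hypothetical column name
    if not has_query:
        return "no_query"
    return "query_with_negatives" if has_negatives else "query_only"


def kind_ratios(rows):
    """Return the share of rows falling into each kind."""
    counts = Counter(row_kind(r) for r in rows)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Toy rows for illustration only; they are not taken from the dataset.
toy_rows = [
    {"query": None, "negatives": ["img_7"]},
    {"query": "what is the revenue?", "negatives": ["img_3"]},
    {"query": "who signed the report?", "negatives": []},
    {"query": "total headcount?", "negatives": None},
]

ratios = kind_ratios(toy_rows)
print(ratios)
```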
## Language Distribution
| Language | Ratio |
|--------|-------|
| English | ≈ 52% |
| French | ≈ 12% |
| Spanish | ≈ 12% |
| Italian | ≈ 12% |
| German | ≈ 12% |
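As a rough sanity check, the approximate ratios can be turned into per-language row estimates. The sketch below uses 700,000 as the total, which is only the lower bound given in the statistics table, so the results are order-of-magnitude estimates rather than exact counts.

```python
TOTAL_ROWS = 700_000  # lower bound from the card ("700,000+"), not an exact count

# Approximate percentages taken from the language distribution table.
LANGUAGE_PERCENT = {
    "English": 52,
    "French": 12,
    "Spanish": 12,
    "Italian": 12,
    "German": 12,
}

# Integer arithmetic avoids floating-point rounding noise.
approx_rows = {lang: TOTAL_ROWS * pct // 100 for lang, pct in LANGUAGE_PERCENT.items()}

for lang, n in approx_rows.items():
    print(f"{lang}: about {n:,} rows")
```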
## Creators
Dataset curated by:
- **Paul Lemaistre**
- **Léo Appourchaux**
Provider:
racineai



