racineai/VDR_MEGA_2
Hugging Face · Updated 2025-11-20 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/racineai/VDR_MEGA_2
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
- text-retrieval
language:
- en
- fr
- de
- it
- es
- ar
tags:
- multimodal
- technical-documents
- RAG
- DSE
- merged-datasets
---
# VDR_MEGA_2
## Dataset Summary
**VDR_MEGA_2** is a multimodal dataset built by merging multiple domain-specific datasets and applying enhanced data processing. It combines filtering algorithms with improved AI-assisted content generation, and is intended for RAG, DSE, question answering, document search, and vision-language model training.
## Source Datasets
This merged dataset combines the following datasets:
| Dataset (split) | Domain | Language(s) |
|---------|---------|-------------|
| [`racineai/VDR_Military (filtered)`](https://huggingface.co/datasets/racineai/VDR_Military) | Military | EN, FR |
| [`racineai/VDR_Energy (filtered)`](https://huggingface.co/datasets/racineai/VDR_Energy) | Energy/Regulation | EN, FR |
| [`racineai/VDR_Quantum (filtered)`](https://huggingface.co/datasets/racineai/VDR_Quantum) | Quantum | EN, FR |
| [`racineai/VDR_ibm-research_REAL-MM-RAG (train)`](https://huggingface.co/datasets/racineai/VDR_ibm-research_REAL-MM-RAG) | Technical/Research | EN |
| [`racineai/VDR_Cooking_Recipes (filtered)`](https://huggingface.co/datasets/racineai/VDR_Cooking_Recipes) | Culinary Arts | Multiple |
| [`racineai/VDR_Geotechnie (filtered)`](https://huggingface.co/datasets/racineai/VDR_Geotechnie) | Geotechnical Engineering | EN, FR |
| [`racineai/VDR_Nuclear (filtered)`](https://huggingface.co/datasets/racineai/VDR_Nuclear) | Nuclear/Regulation | EN, FR, DE, IT, ES |
| [`racineai/VDR_2_vdr-visRAG-colpali (filtered)`](https://huggingface.co/datasets/racineai/VDR_2_vdr-visRAG-colpali) | Various | EN, FR, DE, IT, ES |
| [`racineai/VDR_Renewable_Regulation (filtered)`](https://huggingface.co/datasets/racineai/VDR_Renewable_Regulation) | Energy/Regulation | Multiple |
| [`racineai/VDR_Quantum_Circuit_Papers (filtered)`](https://huggingface.co/datasets/racineai/VDR_Quantum_Circuit_Papers) | Quantum Computing | EN |
| [`racineai/VDR_Hydrogen (filtered)`](https://huggingface.co/datasets/racineai/VDR_Hydrogen) | Hydrogen/Regulation | EN, FR |
| [`racineai/VDR_History_Geography (filtered)`](https://huggingface.co/datasets/racineai/VDR_History_Geography) | Education/History | Multiple |
| [`racineai/VDR_Energy_Arabic (train)`](https://huggingface.co/datasets/racineai/VDR_Energy_Arabic) | Energy | AR |
| [`racineai/VDR_CATIE-AQ_XMRec (train)`](https://huggingface.co/datasets/racineai/VDR_CATIE-AQ_XMRec) | Various | FR |
## Data Fields
Each entry contains:
- **`id`** (string): Unique identifier
- **`query`** (string): High-quality technical/domain-specific question
- **`image`** (PIL.Image): High-resolution visual rendering of source document page
- **`language`** (string): Detected language of the document image (the query language sometimes differs intentionally)
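
The fields above can be explored with a few small helpers. This is a minimal sketch, not an official loader: the split name `"train"`, the helper names (`stream_records`, `missing_fields`, `filter_by_language`, `language_counts`), and the assumption that `language` holds short codes such as `en` or `fr` are illustrative and should be checked against the dataset viewer.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, List

# Field names as listed in the "Data Fields" section.
EXPECTED_FIELDS = ("id", "query", "image", "language")


def stream_records(
    repo_id: str = "racineai/VDR_MEGA_2", split: str = "train"
) -> Iterator[Dict[str, Any]]:
    """Lazily stream records from the Hub (needs `pip install datasets`).

    The split name "train" is an assumption, not confirmed by the card.
    """
    # Imported here so the offline helpers below work without the library.
    from datasets import load_dataset

    yield from load_dataset(repo_id, split=split, streaming=True)


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def filter_by_language(
    records: Iterable[Dict[str, Any]], language: str
) -> Iterator[Dict[str, Any]]:
    """Yield records whose `language` field matches, ignoring case."""
    wanted = language.lower()
    for record in records:
        if str(record.get("language", "")).lower() == wanted:
            yield record


def language_counts(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count records per lower-cased `language` value."""
    return Counter(str(r.get("language", "unknown")).lower() for r in records)
```

With the `datasets` library installed and network access, `filter_by_language(stream_records(), "fr")` would iterate over the records tagged as French without downloading the full dataset first. Because `language` describes the image rather than the query, a filtered record may still carry a query in another language.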
## Dataset Curators
- **Léo Appourchaux**
- **Paul Lemaistre**
- **Yumeng Ye**
- **Mattéo KHAN**
- **André-Louis Rochet**
Provider:
racineai



