
racineai/VDR_MEGA_2

Source: Hugging Face. Updated 2025-11-20. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/racineai/VDR_MEGA_2
Resource overview:
---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
- text-retrieval
language:
- en
- fr
- de
- it
- es
- ar
tags:
- multimodal
- technical-documents
- RAG
- DSE
- merged-datasets
---

# VDR_MEGA_2

## Dataset Summary

**VDR_MEGA_2** is a high-quality multimodal dataset created by merging multiple domain-specific datasets with enhanced data processing techniques. It represents our most refined approach to multimodal data generation, combining filtering algorithms with improved AI-assisted content generation. It targets RAG, DSE, question answering, document search, and vision-language model training tasks.

## Source Datasets

This merged dataset combines the following datasets:

| Dataset (split) | Domain | Language(s) |
|---------|---------|-------------|
| [`racineai/VDR_Military (filtered)`](https://huggingface.co/datasets/racineai/VDR_Military) | Military | EN, FR |
| [`racineai/VDR_Energy (filtered)`](https://huggingface.co/datasets/racineai/VDR_Energy) | Energy/Regulation | EN, FR |
| [`racineai/VDR_Quantum (filtered)`](https://huggingface.co/datasets/racineai/VDR_Quantum) | Quantum | EN, FR |
| [`racineai/VDR_ibm-research_REAL-MM-RAG (train)`](https://huggingface.co/datasets/racineai/VDR_ibm-research_REAL-MM-RAG) | Technical/Research | EN |
| [`racineai/VDR_Cooking_Recipes (filtered)`](https://huggingface.co/datasets/racineai/VDR_Cooking_Recipes) | Culinary Arts | Multiple |
| [`racineai/VDR_Geotechnie (filtered)`](https://huggingface.co/datasets/racineai/VDR_Geotechnie) | Geotechnie | EN, FR |
| [`racineai/VDR_Nuclear (filtered)`](https://huggingface.co/datasets/racineai/VDR_Nuclear) | Nuclear/Regulation | EN, FR, DE, IT, ES |
| [`racineai/VDR_2_vdr-visRAG-colpali (filtered)`](https://huggingface.co/datasets/racineai/VDR_2_vdr-visRAG-colpali) | Various | EN, FR, DE, IT, ES |
| [`racineai/VDR_Renewable_Regulation (filtered)`](https://huggingface.co/datasets/racineai/VDR_Renewable_Regulation) | Energy/Regulation | Multiple |
| [`racineai/VDR_Quantum_Circuit_Papers (filtered)`](https://huggingface.co/datasets/racineai/VDR_Quantum_Circuit_Papers) | Quantum Computing | EN |
| [`racineai/VDR_Hydrogen (filtered)`](https://huggingface.co/datasets/racineai/VDR_Hydrogen) | Hydrogen/Regulation | EN, FR |
| [`racineai/VDR_History_Geography (filtered)`](https://huggingface.co/datasets/racineai/VDR_History_Geography) | Education/History | Multiple |
| [`racineai/VDR_Energy_Arabic (train)`](https://huggingface.co/datasets/racineai/VDR_Energy_Arabic) | Energy | Arabic |
| [`racineai/VDR_CATIE-AQ_XMRec (train)`](https://huggingface.co/datasets/racineai/VDR_CATIE-AQ_XMRec) | Various | FR |

## Data Fields

Each entry contains:

- **`id`** (string): Unique identifier
- **`query`** (string): High-quality technical/domain-specific question
- **`image`** (PIL.Image): High-resolution visual rendering of the source document page
- **`language`** (string): Detected language of the image (the query language sometimes differs on purpose)

## Dataset Curators

- **Léo Appourchaux**
- **Paul Lemaistre**
- **Yumeng Ye**
- **Mattéo KHAN**
- **André-Louis Rochet**
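
The four fields listed under Data Fields (`id`, `query`, `image`, `language`) can be checked and summarized with a few lines of Python. The following is a minimal sketch, not part of the original card: the helper names `missing_fields`, `language_counts`, and `load_stream` are hypothetical, and the split name `"train"` is an assumption that the card does not confirm. Only `load_dataset` from the Hugging Face `datasets` library is a real API call.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Field names as listed in the dataset card's "Data Fields" section.
EXPECTED_FIELDS = ("id", "query", "image", "language")


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names absent from one record, in card order."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def language_counts(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count records per value of the `language` field.

    Records without a `language` key are tallied under "unknown".
    """
    return Counter(record.get("language", "unknown") for record in records)


def load_stream(split: str = "train"):
    """Stream the dataset from the Hugging Face Hub without a full download.

    Assumption: a split named "train" exists; the card does not state this.
    Requires network access and the `datasets` package.
    """
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset("racineai/VDR_MEGA_2", split=split, streaming=True)
```

With network access, and if the split assumption holds, `language_counts(load_stream().take(100))` would tally the detected page languages of the first 100 streamed records. Streaming is used here because each record carries a high-resolution page image, so a full download may be large.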
Provider:
racineai