OGC_MEGA_2
ModelScope community. Updated 2025-11-12; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/racineai/OGC_MEGA_2
# OGC_MEGA_2
## Dataset Summary
**OGC_MEGA_2** is a high-quality multimodal dataset created by merging multiple domain-specific datasets and applying enhanced data processing techniques. It represents our most refined approach to multimodal data generation: filtering algorithms and improved AI-assisted content generation are combined to produce data for RAG, DSE, question answering, document search, and vision-language model training.
## Source Datasets
This merged dataset combines the following datasets:
| Dataset (split) | Domain | Language(s) |
|---------|---------|-------------|
| [`racineai/OGC_Military (filtered)`](https://huggingface.co/datasets/racineai/OGC_Military) | Military | EN, FR |
| [`racineai/OGC_Energy (filtered)`](https://huggingface.co/datasets/racineai/OGC_Energy) | Energy/Regulation | EN, FR |
| [`racineai/OGC_Quantum (filtered)`](https://huggingface.co/datasets/racineai/OGC_Quantum) | Quantum | EN, FR |
| [`racineai/OGC_ibm-research_REAL-MM-RAG (train)`](https://huggingface.co/datasets/racineai/OGC_ibm-research_REAL-MM-RAG) | Technical/Research | EN |
| [`racineai/OGC_Cooking_Recipes (filtered)`](https://huggingface.co/datasets/racineai/OGC_Cooking_Recipes) | Culinary Arts | Multiple |
| [`racineai/OGC_Geotechnie (filtered)`](https://huggingface.co/datasets/racineai/OGC_Geotechnie) | Geotechnical Engineering | EN, FR |
| [`racineai/OGC_Nuclear (filtered)`](https://huggingface.co/datasets/racineai/OGC_Nuclear) | Nuclear/Regulation | EN, FR, DE, IT, ES |
| [`racineai/OGC_2_vdr-visRAG-colpali (filtered)`](https://huggingface.co/datasets/racineai/OGC_2_vdr-visRAG-colpali) | Various | EN, FR, DE, IT, ES |
| [`racineai/OGC_Renewable_Regulation (filtered)`](https://huggingface.co/datasets/racineai/OGC_Renewable_Regulation) | Energy/Regulation | Multiple |
| [`racineai/OGC_Quantum_Circuit_Papers (filtered)`](https://huggingface.co/datasets/racineai/OGC_Quantum_Circuit_Papers) | Quantum Computing | EN |
| [`racineai/OGC_Hydrogen (filtered)`](https://huggingface.co/datasets/racineai/OGC_Hydrogen) | Hydrogen/Regulation | EN, FR |
| [`racineai/OGC_History_Geography (filtered)`](https://huggingface.co/datasets/racineai/OGC_History_Geography) | Education/History | Multiple |
| [`racineai/OGC_Energy_Arabic (train)`](https://huggingface.co/datasets/racineai/OGC_Energy_Arabic) | Energy | Arabic |
| [`racineai/OGC_CATIE-AQ_XMRec (train)`](https://huggingface.co/datasets/racineai/OGC_CATIE-AQ_XMRec) | Various | FR |
| [`racineai/OGC_Memes (train)`](https://huggingface.co/datasets/racineai/OGC_Memes) | Cultural/Visual/Jokes | Multiple |
## Data Fields
Each entry contains:
- **`id`** (string): Unique identifier
- **`query`** (string): High-quality technical/domain-specific question
- **`image`** (PIL.Image): High-resolution visual rendering of source document page
- **`language`** (string): Detected language of the image (the query language sometimes differs on purpose)
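The field list above can be checked against loaded records with a small helper. The following is a minimal sketch, not the curators' tooling: the sample records are hypothetical stand-ins, `image` is set to `None` instead of a `PIL.Image` so the example has no dependencies, and the language values `"en"` and `"fr"` are assumed, since the card does not specify how languages are encoded.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping

# Field names taken from the "Data Fields" list.
EXPECTED_FIELDS = ("id", "query", "image", "language")


def missing_fields(entry: Mapping[str, Any]) -> List[str]:
    """Return the expected field names that are absent from one entry."""
    return [name for name in EXPECTED_FIELDS if name not in entry]


def count_by_language(entries: Iterable[Mapping[str, Any]]) -> Counter:
    """Count entries per value of the `language` field."""
    return Counter(entry["language"] for entry in entries)


# Hypothetical records for illustration only. A real entry would hold a
# PIL.Image in `image`; the language codes are an assumption.
sample = [
    {"id": "a1", "query": "What is the rated voltage?", "image": None, "language": "en"},
    {"id": "b2", "query": "Quelle est la pression maximale ?", "image": None, "language": "fr"},
    {"id": "c3", "query": "Which reactor type is shown?", "image": None, "language": "fr"},
]

# Every sample record carries all four documented fields.
assert all(not missing_fields(entry) for entry in sample)
print(count_by_language(sample))
```

Because `language` describes the image rather than the query, a per-language count like this reflects document pages, not question languages; the third sample record shows an English query attached to a page tagged as French.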
## Dataset Curators
- **Léo Appourchaux**
- **Paul Lemaistre**
- **Yumeng Ye**
- **Mattéo KHAN**
- **André-Louis Rochet**
Provider:
maas
Created:
2025-09-03



