OGC_Qualitative
Source: ModelScope community. Updated 2025-12-05; indexed 2025-09-20.
Download link:
https://modelscope.cn/datasets/racineai/OGC_Qualitative
Description:
# VDR_Qualitative
## Dataset Summary
**VDR_Qualitative** is a high-quality multimodal dataset created by merging multiple domain-specific datasets, each of which was first filtered for quality. It combines algorithmic quality filtering with improved AI-assisted question generation, and is intended for retrieval-augmented generation (RAG), DSE, question answering, document search, and vision-language model training.
## Source Datasets
This merged dataset combines the filtered, high-quality versions of the following datasets:
| Dataset | Domain | Language(s) |
|---------|---------|-------------|
| [`racineai/VDR_Cooking_Recipes`](https://huggingface.co/datasets/racineai/VDR_Cooking_Recipes) | Culinary Arts | Multiple |
| [`racineai/VDR_CATIE-AQ_XMRec`](https://huggingface.co/datasets/racineai/VDR_CATIE-AQ_XMRec) | Research/Academic | FR |
| [`racineai/VDR_ibm-research_REAL-MM-RAG`](https://huggingface.co/datasets/racineai/VDR_ibm-research_REAL-MM-RAG) | Technical/Research | EN |
| [`racineai/VDR_Quantum_Circuit_Papers`](https://huggingface.co/datasets/racineai/VDR_Quantum_Circuit_Papers) | Quantum Computing | EN |
| [`racineai/VDR_Renewable_Regulation`](https://huggingface.co/datasets/racineai/VDR_Renewable_Regulation) | Energy/Regulations | Multiple |
| [`racineai/VDR_Nuclear`](https://huggingface.co/datasets/racineai/VDR_Nuclear) | Nuclear/Regulations | EN, FR, DE, IT, ES |
| [`racineai/VDR_History_Geography`](https://huggingface.co/datasets/racineai/VDR_History_Geography) | Education | Multiple |
| [`racineai/VDR_Energy_Arabic`](https://huggingface.co/datasets/racineai/VDR_Energy_Arabic) | Energy | Arabic |
## Dataset Creation Process
### Phase 1: Individual Dataset Enhancement
1. **Source Collection**: Gather high-quality PDFs from public sources
2. **Manual Curation**: Manually clean and filter source documents
### Phase 2: Advanced Generation
1. **AI-Powered Generation**: Use **Gemini 2.5 Flash** to create diverse, expert-level questions
2. **Multimodal Integration**: Ensure tight coupling between textual and visual elements
### Phase 3: Quality Filtering
1. **Algorithmic Assessment**: Apply quality-filtering algorithms to identify and remove substandard samples
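The card does not say which filtering algorithms were used. As a purely illustrative sketch of what a filtering pass over `query`/`image` samples could look like, the following applies a few simple heuristics: a minimum word count, a trailing question mark, a non-empty image, and de-duplication of near-identical queries. The threshold `MIN_WORDS`, the function names, and the heuristics themselves are assumptions and are not taken from the dataset's actual pipeline.

```python
import re

MIN_WORDS = 5  # hypothetical threshold, not taken from the dataset card


def passes_quality_checks(sample: dict) -> bool:
    """Return True if a sample clears a few simple, illustrative heuristics."""
    query = (sample.get("query") or "").strip()
    if len(query.split()) < MIN_WORDS:
        return False  # too short to be a meaningful question
    if not query.endswith(("?", "؟")):
        return False  # expect a Latin or Arabic question mark
    if sample.get("image") is None:
        return False  # every query must be paired with a page image
    return True


def filter_samples(samples):
    """Keep the first occurrence of each normalized query that passes the checks."""
    seen = set()
    kept = []
    for sample in samples:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = re.sub(r"\s+", " ", (sample.get("query") or "").strip().lower())
        if key in seen:
            continue
        if passes_quality_checks(sample):
            seen.add(key)
            kept.append(sample)
    return kept
```

A real pipeline would likely add model-based scoring or image checks; the sketch only shows the shape of a filter that drops samples before the merge.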
### Phase 4: Strategic Merging & Shuffling
1. **Dataset Merge**: Combine all source datasets
2. **Shuffle**: Randomize the order of all samples so that domains are interleaved rather than grouped by source, which reduces ordering bias during training
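The merge-and-shuffle step can be sketched with the standard library alone. This is a minimal illustration, assuming each source dataset is available as a list of dictionaries; the `source` key added to each sample and the fixed `seed` are hypothetical conveniences for traceability and reproducibility, not fields or settings documented for this dataset.

```python
import random


def merge_and_shuffle(sources: dict, seed: int = 42) -> list:
    """Concatenate samples from all sources, then shuffle them reproducibly.

    `sources` maps a source-dataset name to a list of sample dicts.
    """
    merged = []
    for name, samples in sources.items():
        for sample in samples:
            # Copy the sample and tag it with its origin (hypothetical field).
            merged.append({**sample, "source": name})
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(merged)
    return merged
```

Using a seeded `random.Random` instance means the same inputs always produce the same order, which makes a merged release reproducible.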
## Data Fields
Each entry contains:
- **`id`** (string): Unique identifier
- **`query`** (string): High-quality technical/domain-specific question
- **`image`** (PIL.Image): High-resolution rendering of the source document page
- **`language`** (string): Detected language of the content
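To make the schema concrete, here is a small sketch that checks a record for the four fields above and tallies samples per language. The helper names are illustrative. The loader assumes the dataset is published under the repository ID `racineai/OGC_Qualitative` (taken from the download link) with a `train` split and can be read by the Hugging Face `datasets` library; neither assumption is confirmed by the card.

```python
from collections import Counter

EXPECTED_FIELDS = ("id", "query", "image", "language")


def missing_fields(sample: dict) -> list:
    """Return the expected field names that are absent from a sample."""
    return [field for field in EXPECTED_FIELDS if field not in sample]


def language_counts(samples) -> Counter:
    """Count samples per value of the `language` field."""
    return Counter(sample["language"] for sample in samples)


def load_stream(repo_id: str = "racineai/OGC_Qualitative", split: str = "train"):
    """Stream the dataset without downloading it in full.

    The repo ID and split name are assumptions; adjust them to the actual hosting.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id, split=split, streaming=True)
```

For example, `language_counts(sample for _, sample in zip(range(1000), load_stream()))` would tally the first 1,000 streamed samples, provided the assumed repository and split exist.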
## Dataset Curators
- **Yumeng Ye**
- **Léo Appourchaux**
- **Mattéo KHAN**
Provider: maas
Created: 2025-09-03



