
OGC_Qualitative

ModelScope community, updated 2025-12-05, indexed 2025-09-20
Download link:
https://modelscope.cn/datasets/racineai/OGC_Qualitative
Description:
# VDR_Qualitative

## Dataset Summary

**VDR_Qualitative** is a high-quality multimodal dataset created through the merge of multiple domain-specific datasets with enhanced data processing techniques. This dataset represents our most refined approach to multimodal data generation, incorporating filtering algorithms and improved AI-assisted content generation to deliver superior quality for RAG, DSE, question answering, document search, and vision-language model training tasks.

## Source Datasets

This merged dataset combines the filtered, high-quality versions of the following datasets:

| Dataset | Domain | Language(s) |
|---------|---------|-------------|
| [`racineai/VDR_Cooking_Recipes`](https://huggingface.co/datasets/racineai/VDR_Cooking_Recipes) | Culinary Arts | Multiple |
| [`racineai/VDR_CATIE-AQ_XMRec`](https://huggingface.co/datasets/racineai/VDR_CATIE-AQ_XMRec) | Research/Academic | FR |
| [`racineai/VDR_ibm-research_REAL-MM-RAG`](https://huggingface.co/datasets/racineai/VDR_ibm-research_REAL-MM-RAG) | Technical/Research | EN |
| [`racineai/VDR_Quantum_Circuit_Papers`](https://huggingface.co/datasets/racineai/VDR_Quantum_Circuit_Papers) | Quantum Computing | EN |
| [`racineai/VDR_Renewable_Regulation`](https://huggingface.co/datasets/racineai/VDR_Renewable_Regulation) | Energy/Regulations | Multiple |
| [`racineai/VDR_Nuclear`](https://huggingface.co/datasets/racineai/VDR_Nuclear) | Nuclear/Regulations | EN, FR, DE, IT, ES |
| [`racineai/VDR_History_Geography`](https://huggingface.co/datasets/racineai/VDR_History_Geography) | Education | Multiple |
| [`racineai/VDR_Energy_Arabic`](https://huggingface.co/datasets/racineai/VDR_Energy_Arabic) | Energy | Arabic |

## Dataset Creation Process

### Phase 1: Individual Dataset Enhancement

1. **Source Collection**: Gather high-quality PDFs from public sources
2. **Manual Curation**: Manually clean and filter source documents

### Phase 2: Advanced Generation

1. **AI-Powered Generation**: Use **Gemini 2.5 Flash** for creating diverse, expert-level questions
2. **Multimodal Integration**: Ensure tight coupling between textual and visual elements

### Phase 3: Quality Filtering

1. **Algorithmic Assessment**: Application of quality filtering algorithms to identify substandard samples

### Phase 4: Strategic Merging & Shuffling

1. **Dataset Merge**: Combine all source datasets
2. **Shuffle**: Randomize all samples to ensure balanced domain distribution and eliminate training biases

## Data Fields

Each entry contains:

- **`id`** (string): Unique identifier
- **`query`** (string): High-quality technical/domain-specific question
- **`image`** (PIL.Image): High-resolution visual rendering of source document page
- **`language`** (string): Detected language of the content

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
- **Mattéo KHAN**
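
## Example: Merge and Shuffle Sketch

The Phase 4 merge-and-shuffle step and the four data fields can be illustrated with a minimal sketch. This is not the curators' pipeline, which the card does not publish. The `Entry` class, the `merge_and_shuffle` helper, the seed, and the sample records are all hypothetical, and the `image` field is stubbed with `None` where the real dataset holds a `PIL.Image`.

```python
import random
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class Entry:
    """One record with the four fields listed under Data Fields."""
    id: str
    query: str
    image: Any  # a PIL.Image in the real dataset; stubbed with None here
    language: str


def merge_and_shuffle(sources: Iterable[List[Entry]], seed: int = 0) -> List[Entry]:
    """Concatenate per-domain lists, then shuffle so domains are interleaved.

    Hypothetical helper: a fixed seed keeps the shuffle reproducible.
    """
    merged = [entry for source in sources for entry in source]
    random.Random(seed).shuffle(merged)
    return merged


# Made-up sample records standing in for two source datasets.
cooking = [Entry(f"cook-{i}", f"cooking question {i}", None, "fr") for i in range(3)]
nuclear = [Entry(f"nuc-{i}", f"nuclear question {i}", None, "en") for i in range(3)]

merged = merge_and_shuffle([cooking, nuclear], seed=42)
for entry in merged:
    print(entry.id, entry.language)
```

Shuffling after the merge, rather than concatenating the sources in order, means that any contiguous slice of the result is likely to mix domains and languages, which is the balance the card describes.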

Provider:
maas
Created:
2025-09-03