OGC_Quantum
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link:
https://modelscope.cn/datasets/racineai/OGC_Quantum
Description:
# VDR_Quantum – Overview
**VDR_Quantum** is a curated multimodal dataset focused on **quantum technical documents**. It combines text and image data extracted from real scientific PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.
---
## Dataset Composition
This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).
Quantum-related PDFs were collected from public online sources. Each document was processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries.
We used **Google’s Gemini 2.0 Flash Lite** model in a custom pipeline to generate diverse, expert-level questions that align with the content of each page.
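
The page-by-page flow described above can be sketched as follows. This is a minimal illustration and not the actual VDR_pdf-to-parquet implementation: `build_rows`, its four callables, and the `p{page}_q{n}` id format are all hypothetical, and only the output field names (`id`, `query`, `image`, `language`) come from this card. The rendering, text extraction, query generation, and language detection steps are passed in as functions so that no specific PDF library or model API is assumed.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def build_rows(
    pages: Iterable[bytes],
    render: Callable[[bytes], bytes],
    extract_text: Callable[[bytes], str],
    generate_queries: Callable[[str, bytes], List[str]],
    detect_language: Callable[[str], str],
) -> Iterator[Dict[str, object]]:
    """Yield one row per generated query, processing the document page by page."""
    for page_no, page in enumerate(pages):
        image = render(page)        # page -> high-resolution image bytes
        text = extract_text(page)   # page -> extracted text
        # The query generator sees both the text and the rendered page.
        for q_no, query in enumerate(generate_queries(text, image)):
            yield {
                "id": f"p{page_no}_q{q_no}",  # hypothetical id scheme
                "query": query,
                "image": image,
                "language": detect_language(query),
            }


if __name__ == "__main__":
    # Stub callables stand in for the real rendering and generation steps.
    rows = list(
        build_rows(
            pages=[b"page-one", b"page-two"],
            render=lambda p: p + b"-img",
            extract_text=lambda p: p.decode(),
            generate_queries=lambda t, i: [f"{t} question {n}" for n in range(4)],
            detect_language=lambda q: "en",
        )
    )
    print(len(rows), rows[0]["id"])
```

Injecting the steps as callables keeps the sketch independent of any particular renderer or model client; a real pipeline would replace the stubs with its own implementations.
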
---
## Dataset Structure
Each entry in the dataset contains:
- `id`: A unique identifier for the sample
- `query`: A synthetic technical question generated from that page
- `image`: A visual rendering of a PDF page
- `language`: The detected language of the query
> Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query.
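
Because each page is stated to produce four entries, one simple sanity check is to group rows by page and flag any page whose query count differs. The sketch below rests on two assumptions that this card does not confirm: that `image` is available as raw bytes (a decoded image object would need to be serialized first), and that the four entries of a page carry byte-identical images. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_queries_by_page(rows: Iterable[Mapping[str, object]]) -> Dict[str, List[str]]:
    """Group queries by a SHA-256 digest of the page image bytes."""
    pages: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        # Assumption: row["image"] is bytes shared by all entries of one page.
        key = hashlib.sha256(row["image"]).hexdigest()
        pages[key].append(row["query"])
    return dict(pages)


def pages_with_unexpected_count(
    rows: Iterable[Mapping[str, object]], expected: int = 4
) -> Dict[str, int]:
    """Return {page digest: query count} for pages that do not have `expected` queries."""
    grouped = group_queries_by_page(rows)
    return {key: len(qs) for key, qs in grouped.items() if len(qs) != expected}


if __name__ == "__main__":
    sample = [{"id": str(i), "query": f"q{i}", "image": b"page-A", "language": "en"}
              for i in range(4)]
    sample += [{"id": "x", "query": "lonely", "image": b"page-B", "language": "en"}]
    print(pages_with_unexpected_count(sample))
```

Hashing the image avoids relying on any structure in `id`, whose format is not documented here.
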
---
## Purpose
This dataset is designed to support:
- Training and evaluating **vision-language models**
- Developing **multimodal search or retrieval systems**
- Research in **automated question generation** from technical sources
- Enhancing tools for **quantum document analysis and understanding**
---
## Authors
- **Yumeng Ye**
- **Léo Appourchaux**
---
Provider:
maas
Created:
2025-07-31



