OGC_Quantum

Name: OGC_Quantum
Creator: maas
Published: 2025-12-05 11:58:17
License: No description available

ModelScope Community · Updated 2025-12-05 · Indexed 2025-08-16

Download link:

https://modelscope.cn/datasets/racineai/OGC_Quantum

Resource description:

# VDR_Quantum – Overview

**VDR_Quantum** is a curated multimodal dataset focused on **quantum technical documents**. It combines text and image data extracted from real scientific PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.

---

## Dataset Composition

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet). Quantum-related PDFs were collected from public online sources. Each document was processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries.

We used **Google’s Gemini 2.0 Flash Lite** model in a custom pipeline to generate diverse, expert-level questions that align with the content of each page.

---

## Dataset Structure

Each entry in the dataset contains:

- `id`: A unique identifier for the sample
- `query`: A synthetic technical question generated from that page
- `image`: A visual rendering of a PDF page
- `language`: The detected language of the query

> Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query.

---

## Purpose

This dataset is designed to support:

- Training and evaluating **vision-language models**
- Developing **multimodal search or retrieval systems**
- Research in **automated question generation** from technical sources
- Enhancing tools for **quantum document analysis and understanding**

---

## Authors

- **Yumeng Ye**
- **Léo Appourchaux**

---
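The entry layout described under Dataset Structure can be illustrated with a minimal Python sketch. Only the four field names (`id`, `query`, `image`, `language`) come from the dataset card. The sample records, the id format, the `en`/`fr` language codes, the placeholder image bytes, and the helper names `is_valid_entry` and `filter_by_language` are all hypothetical and may not match the real data.

```python
from collections import Counter

# Field names come from the dataset card; everything else here is illustrative.
REQUIRED_FIELDS = ("id", "query", "image", "language")


def is_valid_entry(entry: dict) -> bool:
    """Return True if the entry has every documented field and a non-empty query."""
    has_fields = all(field in entry for field in REQUIRED_FIELDS)
    return has_fields and bool(str(entry.get("query", "")).strip())


def filter_by_language(entries: list[dict], language: str) -> list[dict]:
    """Keep only valid entries whose detected query language matches."""
    return [e for e in entries if is_valid_entry(e) and e["language"] == language]


# Hypothetical sample records: ids, queries, language codes, and image bytes
# are made up for this sketch and are not taken from the dataset.
sample_entries = [
    {"id": "doc1-p1-0", "query": "What limits qubit coherence time?",
     "image": b"<page image bytes>", "language": "en"},
    {"id": "doc1-p1-1", "query": "How is gate fidelity measured?",
     "image": b"<page image bytes>", "language": "en"},
    {"id": "doc2-p3-0", "query": "Quel est le rôle de l'intrication ?",
     "image": b"<page image bytes>", "language": "fr"},
    # Missing the `image` field, so it is treated as invalid.
    {"id": "doc2-p3-1", "query": "What does the circuit diagram show?",
     "language": "en"},
]

english = filter_by_language(sample_entries, "en")
language_counts = Counter(
    e["language"] for e in sample_entries if is_valid_entry(e)
)
```

A schema check of this kind may be useful before feeding entries into a retrieval or training pipeline, since it surfaces records with missing fields early.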

Provider:

maas

Created:

2025-07-31
