
OGC_Nuclear

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/racineai/OGC_Nuclear
Resource description:
# VDR_Nuclear - Overview

## Dataset Summary

**VDR_Nuclear is a curated multimodal dataset focused on nuclear technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).

Nuclear-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the nuclear domain. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers. We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the query

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.
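
As a rough illustration of the entry layout described under Data Fields, the sketch below checks that a record carries the five listed fields and tallies entries by `language`. The helper names `check_entry`, `language_counts`, and `load_entries` are hypothetical and not part of the dataset or its tooling. The loader assumes the `modelscope` package is installed and that a `train` split exists; neither is confirmed by this card.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping

# String-typed fields as listed under "Data Fields".
STRING_FIELDS = ("id", "query", "answer", "language")


def check_entry(entry: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for name in STRING_FIELDS:
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], str):
            problems.append(f"field {name} is not a string")
    # The card describes `image` as a PIL.Image; only presence is checked
    # here so the sketch does not depend on Pillow.
    if entry.get("image") is None:
        problems.append("missing or empty field: image")
    return problems


def language_counts(entries: Iterable[Mapping[str, Any]]) -> Counter:
    """Count entries per detected query language."""
    return Counter(entry["language"] for entry in entries)


def load_entries(split: str = "train"):
    """Hypothetical loader: the split name is an assumption."""
    # Imported lazily so the helpers above work without modelscope.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load("racineai/OGC_Nuclear", split=split)
```

Running `check_entry` over a handful of loaded records is a cheap way to confirm that the schema matches the card before building a training or retrieval pipeline on top of it.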

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on nuclear regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical nuclear documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Nuclear document analysis, compliance checking, and regulatory understanding

## Dataset Use Cases

- Training and evaluating vision-language models on nuclear regulatory content
- Developing multimodal search or retrieval systems for legal and technical nuclear documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for nuclear document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in nuclear policy and regulation

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**

Provider:
maas
Created:
2025-08-22