
VDR_Nuclear

ModelScope Community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/racineai/VDR_Nuclear
Resource description:
# VDR_Nuclear - Overview

## Dataset Summary

**VDR_Nuclear is a curated multimodal dataset focused on nuclear technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet). Nuclear-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the nuclear domain.

Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers. We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the query

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.
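The field list and the four-entries-per-page rule described above can be checked with a few lines of Python. This is only a minimal sketch, not part of the dataset tooling: the names `EXPECTED_FIELDS`, `validate_entry`, and `expected_entry_count` are hypothetical, the sample values are placeholders, and `image` is checked only for presence so that the example runs without Pillow installed.

```python
from typing import Any, List, Mapping

# Field names and types as listed under "Data Fields".
# `image` is documented as a PIL.Image; it is mapped to `object` here
# (an assumption for this sketch) so only its presence is checked.
EXPECTED_FIELDS = {
    "id": str,
    "query": str,
    "answer": str,
    "image": object,
    "language": str,
}

# "Each page produces 4 unique entries" (see "Data Generation").
ENTRIES_PER_PAGE = 4


def validate_entry(entry: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(entry[name]).__name__}"
            )
    return problems


def expected_entry_count(num_pages: int) -> int:
    """Number of entries a document with `num_pages` cleaned pages should yield."""
    return num_pages * ENTRIES_PER_PAGE


# Hypothetical entry; every value is a placeholder, not real dataset content.
sample = {
    "id": "example-0001",
    "query": "Placeholder question about the page.",
    "answer": "Placeholder answer text.",
    "image": object(),  # stand-in for a PIL.Image page rendering
    "language": "en",
}

print(validate_entry(sample))
print(expected_entry_count(10))
```

A check like this can be run over a loaded split before training, to catch entries with missing fields or documents whose entry count does not match the number of pages.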
## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on nuclear regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical nuclear documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Nuclear document analysis, compliance checking, and regulatory understanding

## Dataset Use Cases

- Training and evaluating vision-language models on nuclear regulatory content
- Developing multimodal search or retrieval systems for legal and technical nuclear documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for nuclear document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in nuclear policy and regulation

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**

Provider:
maas
Created:
2025-11-21