racineai/VDR_Nuclear

Name: racineai/VDR_Nuclear
Creator: racineai
Published: 2025-11-20 14:39:16
License: Not specified on the listing page (the dataset card metadata lists apache-2.0)

Source: Hugging Face. Updated: 2025-11-20. Indexed: 2026-01-03.

Download link:

https://hf-mirror.com/datasets/racineai/VDR_Nuclear

Description:

---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
- text-retrieval
language:
- en
- fr
- de
- it
- es
tags:
- nuclear
- regulations
- legal
- multimodal
- technical-documents
- RAG
- DSE
configs:
- config_name: train
  data_files: "train-*.parquet"
- config_name: filtered
  data_files: "filtered-*.parquet"
---

# VDR_Nuclear - Overview

## Dataset Summary

**VDR_Nuclear is a curated multimodal dataset focused on nuclear technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet). Nuclear-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the nuclear domain.

Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers.

We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the query

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on nuclear regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical nuclear documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Nuclear document analysis, compliance checking, and regulatory understanding

## Dataset Use Cases

- Training and evaluating vision-language models on nuclear regulatory content
- Developing multimodal search or retrieval systems for legal and technical nuclear documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for nuclear document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in nuclear policy and regulation

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
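
## Loading Sketch

The card itself gives no loading example, so the following is only a minimal sketch of how the two configs and the documented fields might be used with the Hugging Face `datasets` library. The helper names (`load_config`, `check_entry`, `count_by_language`) are hypothetical and not part of the dataset or its tooling. The split name `"train"` is an assumption based on the library's default for a plain `data_files` glob.

```python
from collections import Counter
from typing import Any, Iterable

REPO_ID = "racineai/VDR_Nuclear"   # dataset name from the card
CONFIGS = ("train", "filtered")    # config names from the card's YAML header
# String-typed fields listed under "Data Fields"; `image` is handled separately.
STRING_FIELDS = ("id", "query", "answer", "language")


def load_config(config_name: str = "train"):
    """Load one config from the Hub (needs the `datasets` package and network access)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    # Imported lazily so the offline helpers below work without `datasets` installed.
    from datasets import load_dataset
    # Assumption: each config exposes a single split named "train".
    return load_dataset(REPO_ID, config_name, split="train")


def check_entry(entry: dict[str, Any]) -> list[str]:
    """Return schema problems for one row; an empty list means the row looks fine."""
    problems = [
        f"{name} is not a string"
        for name in STRING_FIELDS
        if not isinstance(entry.get(name), str)
    ]
    if entry.get("image") is None:
        problems.append("image is missing")
    return problems


def count_by_language(entries: Iterable[dict[str, Any]]) -> Counter:
    """Count rows per value of the `language` field."""
    return Counter(entry["language"] for entry in entries)
```

With network access, `ds = load_config("filtered")` followed by `check_entry(ds[0])` would be one way to confirm that a row matches the fields described above. Counting languages by iterating over full rows decodes every image, so for a large config it may be cheaper to read the `language` column directly.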


Provider:

racineai
