
OGC_Renewable_Regulation

Source: ModelScope. Updated: 2025-11-27. Indexed: 2025-09-06.
Download link:
https://modelscope.cn/datasets/racineai/OGC_Renewable_Regulation
Description:
# VDR_Renewable_Regulation - Overview

## Dataset Summary

**VDR_Renewable_Regulation is a curated multimodal dataset focused on renewable energy technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).

Renewable energy-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the renewable energy domain. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers. We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the document

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.
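As an illustration of the field list above, the following is a minimal sketch of a sanity check that could be run over entries after loading. The helper names (`validate_entry`, `language_counts`) and the sample rows are hypothetical and are not part of the dataset or its tooling. The ids, placeholder text, and language codes are made up for the example. The `image` field is checked only for presence, so the sketch runs without Pillow installed.

```python
from collections import Counter

# Field names as listed under "Data Fields". The four string fields are
# type-checked; the image field (documented as PIL.Image) is only checked
# for presence so this sketch has no third-party dependencies.
EXPECTED_STRING_FIELDS = ("id", "query", "answer", "language")
IMAGE_FIELD = "image"


def validate_entry(entry):
    """Return a list of problems found in one entry (empty list if none)."""
    problems = []
    for name in EXPECTED_STRING_FIELDS:
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], str):
            problems.append(f"field {name} is not a string")
    if entry.get(IMAGE_FIELD) is None:
        problems.append(f"missing field: {IMAGE_FIELD}")
    return problems


def language_counts(entries):
    """Count entries per value of the `language` field."""
    return Counter(entry["language"] for entry in entries)


# Hypothetical rows: ids, text, and language codes are placeholders,
# and object() stands in for a rendered page image.
sample = [
    {"id": "sample-0", "query": "placeholder query", "answer": "placeholder answer",
     "image": object(), "language": "en"},
    {"id": "sample-1", "query": "placeholder query", "answer": "placeholder answer",
     "image": object(), "language": "en"},
    {"id": "sample-2", "query": "placeholder query", "answer": "placeholder answer",
     "image": object(), "language": "fr"},
]

for row in sample:
    assert validate_entry(row) == []
print(language_counts(sample))
```

Because each page is described as producing four entries, a check like this may also help spot pages where generation yielded fewer rows, although the dataset card does not say how entries map back to pages.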
## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on renewable energy regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical renewable energy documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Renewable energy document analysis, compliance checking, and regulatory understanding

## Dataset Use Cases

- Training and evaluating vision-language models on renewable energy regulatory content
- Developing multimodal search or retrieval systems for legal and technical renewable energy documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for renewable energy document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in renewable energy policy and regulation

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
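For the document retrieval use case, one common way to evaluate a search system on query/page pairs is recall@k: the fraction of queries whose source page appears among the top k results. The sketch below is a toy illustration only. It ranks pages by token overlap, which stands in for whatever text or image embedding model would actually be used. The page texts, queries, and function names are all hypothetical and do not come from the dataset.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_pages(query, pages):
    """Return page ids sorted by descending token overlap with the query.

    Ties are broken by page id so the ranking is deterministic.
    """
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(text)), page_id)
              for page_id, text in pages.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [page_id for _, page_id in scored]


def recall_at_k(examples, pages, k):
    """Fraction of (query, gold_page_id) pairs whose gold page is in the top k."""
    hits = sum(1 for query, gold in examples
               if gold in rank_pages(query, pages)[:k])
    return hits / len(examples)


# Toy stand-ins for extracted page text (made up for this example).
pages = {
    "p1": "wind turbine permit procedure",
    "p2": "solar panel grid connection tariff",
    "p3": "biomass sustainability criteria",
}

# Toy (query, source page) pairs, mirroring the dataset's query-to-page link.
examples = [
    ("What is the permit procedure for a wind turbine?", "p1"),
    ("Which tariff applies to a solar grid connection?", "p2"),
    ("How is sustainability of a wind turbine permit assessed?", "p3"),
]

print(recall_at_k(examples, pages, k=1))
print(recall_at_k(examples, pages, k=2))
```

The third query deliberately shares more words with page `p1` than with its source page `p3`, so it is missed at k=1 and recovered at k=2. This is the kind of lexical confusion a stronger retriever would be expected to reduce.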

Provider:
maas
Created:
2025-08-25