racineai/VDR_Renewable_Regulation

Name: racineai/VDR_Renewable_Regulation
Creator: racineai
Published: 2025-11-20 14:39:56
License: apache-2.0

Source: Hugging Face. Updated 2025-11-20. Indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/racineai/VDR_Renewable_Regulation

Description:

---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
- text-retrieval
language:
- en
- fr
- de
- it
- es
- bg
- cs
- da
- et
- el
- ga
- hr
- lv
- lt
- hu
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- fi
- sv
tags:
- renewable
- energy
- regulations
- legal
- multimodal
- technical-documents
- RAG
- DSE
- multilingual
configs:
- config_name: train
  data_files: "train-*.parquet"
- config_name: filtered
  data_files: "filtered-*.parquet"
---

# VDR_Renewable_Regulation - Overview

## Dataset Summary

**VDR_Renewable_Regulation is a curated multimodal dataset focused on renewable energy technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).

Renewable energy-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the renewable energy domain. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers. We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

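The card metadata declares two configurations, `train` and `filtered`, each backed by a Parquet glob pattern. The following is a minimal sketch of how one of them might be loaded with the Hugging Face `datasets` library. The helper name `load_config` and the `CONFIG_PATTERNS` mapping are illustrative and not part of the dataset; the split name `"train"` is an assumption based on the library's default when `data_files` is a single pattern.

```python
REPO_ID = "racineai/VDR_Renewable_Regulation"

# Config names and file patterns copied from the card's YAML metadata.
CONFIG_PATTERNS = {
    "train": "train-*.parquet",
    "filtered": "filtered-*.parquet",
}


def load_config(config_name: str = "train"):
    """Load one configuration of the dataset from the Hugging Face Hub.

    Hypothetical helper: it only validates the config name and then
    delegates to `datasets.load_dataset`.
    """
    if config_name not in CONFIG_PATTERNS:
        raise ValueError(
            f"unknown config {config_name!r}; "
            f"expected one of {sorted(CONFIG_PATTERNS)}"
        )
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: each config exposes a single split named "train".
    return load_dataset(REPO_ID, config_name, split="train")
```

Calling `load_config("train")` requires network access and the `datasets` package. Validating the name up front gives a short, local error message instead of a failed Hub lookup.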
## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the document

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on renewable energy regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical renewable energy documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Renewable energy document analysis, compliance checking, and regulatory understanding

## Dataset Use Cases

- Training and evaluating vision-language models on renewable energy regulatory content
- Developing multimodal search or retrieval systems for legal and technical renewable energy documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for renewable energy document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in renewable energy policy and regulation

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
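The record layout under Data Fields and the four-entries-per-page rule under Data Generation can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' pipeline. The `QUERY_KINDS` labels, the `entries_for_page` and `count_by_language` helpers, and the `page_id-kind` identifier scheme are all hypothetical; the dataset itself documents only the five fields listed above.

```python
from collections import Counter
from typing import Any, Iterable, Sequence, TypedDict


class Sample(TypedDict):
    """One dataset row, mirroring the fields listed under Data Fields."""
    id: str
    query: str
    answer: str
    image: Any  # a PIL.Image in the real dataset; any object in this sketch
    language: str


# Hypothetical labels for the four query kinds named under Data Generation.
QUERY_KINDS = ("main", "secondary", "visual", "multimodal_semantic")


def entries_for_page(
    page_id: str,
    image: Any,
    language: str,
    qa_pairs: Sequence[tuple[str, str]],
) -> list[Sample]:
    """Expand one page into four rows that share the same image and language."""
    if len(qa_pairs) != len(QUERY_KINDS):
        raise ValueError(f"expected {len(QUERY_KINDS)} (query, answer) pairs")
    return [
        # The id scheme below is invented for illustration only.
        Sample(id=f"{page_id}-{kind}", query=q, answer=a,
               image=image, language=language)
        for kind, (q, a) in zip(QUERY_KINDS, qa_pairs)
    ]


def count_by_language(samples: Iterable[Sample]) -> Counter:
    """Count rows per detected language code."""
    return Counter(s["language"] for s in samples)
```

Because the four rows of a page share one image, a retrieval evaluation built on this layout would typically treat that page image as the positive target for each of its four queries.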

Provider:

racineai
