racineai/VDR_Renewable_Regulation
Source: Hugging Face. Updated 2025-11-20; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/racineai/VDR_Renewable_Regulation
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- visual-question-answering
- text-retrieval
language:
- en
- fr
- de
- it
- es
- bg
- cs
- da
- et
- el
- ga
- hr
- lv
- lt
- hu
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- fi
- sv
tags:
- renewable
- energy
- regulations
- legal
- multimodal
- technical-documents
- RAG
- DSE
- multilingual
configs:
- config_name: train
  data_files: "train-*.parquet"
- config_name: filtered
  data_files: "filtered-*.parquet"
---
# VDR_Renewable_Regulation - Overview
## Dataset Summary
**VDR_Renewable_Regulation is a curated multimodal dataset focused on renewable energy technical documents, regulations, and legal frameworks. It combines text and image data extracted from real scientific and regulatory PDFs to support tasks such as RAG, DSE, question answering, document search, and vision-language model training.**
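The card metadata declares two configurations, `train` and `filtered`; the card does not describe how they differ. A minimal loading sketch with the Hugging Face `datasets` library might look like the following. The helper name `load_config` is hypothetical, and the split name is an assumption based on the Hub's default behavior when `data_files` names no explicit splits.

```python
from typing import Any

REPO_ID = "racineai/VDR_Renewable_Regulation"
# Configuration names taken from the `configs` entry in the card metadata.
KNOWN_CONFIGS = ("train", "filtered")


def load_config(name: str = "train", streaming: bool = True) -> Any:
    """Load one configuration from the Hub (requires `pip install datasets`)."""
    if name not in KNOWN_CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {KNOWN_CONFIGS}")
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    # Assumption: each config exposes a single default split named "train",
    # which is what the Hub does when `data_files` lists no explicit splits.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

With streaming enabled, `next(iter(load_config("filtered")))` should return the first row as a dictionary without downloading every Parquet shard first.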
## Dataset Creation
This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).
Renewable energy-related PDFs were collected from public online sources, focusing primarily on international, European Union, and French regulations and laws in the renewable energy domain. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, tables of contents, and other off-topic content to ensure optimal dataset quality.
The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers.
We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.
## Data Fields
Each entry in the dataset contains:
- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic technical question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the document
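As a rough sketch, the fields above can be checked on a decoded row with a few lines of Python. The helper names `validate_row` and `language_counts` are hypothetical, and the `image` field is only checked for presence so the example runs without Pillow installed.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# String-typed fields as listed in the card.
STRING_FIELDS = ("id", "query", "answer", "language")


def validate_row(row: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    for field in STRING_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], str):
            problems.append(f"{field} should be a string")
    # Presence check only; the card documents this field as a PIL.Image.
    if row.get("image") is None:
        problems.append("missing field: image")
    return problems


def language_counts(rows: Iterable[Mapping[str, Any]]) -> Counter:
    """Count rows per value of the `language` field."""
    return Counter(row["language"] for row in rows)
```

Running `language_counts` over a loaded configuration gives a quick view of how the rows are distributed across the languages listed in the metadata. The exact values stored in `language` (for example two-letter codes) are not stated in the card, so inspect a few rows before filtering on them.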
## Data Generation
Each page produces four distinct entries: a main technical query, a secondary technical query, a visual question, and a multimodal semantic query, each with its corresponding answer.
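The page-to-rows expansion described above can be sketched as follows. This is an illustration, not the code of the VDR_pdf-to-parquet tool: the `PageQA` class, the kind labels, the `expand_page` helper, and the id scheme are all hypothetical, and the sketch assumes the four rows from one page share the same page image.

```python
from dataclasses import dataclass

# The four query kinds named in the card; these label strings are illustrative
# and are not a column in the dataset.
QUERY_KINDS = ("main_technical", "secondary_technical", "visual", "multimodal_semantic")


@dataclass
class PageQA:
    kind: str
    query: str
    answer: str


def expand_page(page_key: str, image: object, language: str, qa: list[PageQA]) -> list[dict]:
    """Turn one page and its four generated Q/A pairs into four dataset rows."""
    if [item.kind for item in qa] != list(QUERY_KINDS):
        raise ValueError("expected exactly one Q/A pair per query kind, in order")
    return [
        {
            "id": f"{page_key}-{i}",  # hypothetical id scheme
            "query": item.query,
            "answer": item.answer,
            "image": image,  # assumption: one page rendering shared by all four rows
            "language": language,
        }
        for i, item in enumerate(qa)
    ]
```

Under this scheme a document with N retained pages contributes 4 × N rows.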
## Supported Tasks
This dataset is designed to support:
- **Question Answering**: Training and evaluating models on renewable energy regulatory content
- **Visual Question Answering**: Multimodal understanding of technical documents
- **Document Retrieval**: Developing search systems for legal and technical renewable energy documents
- **Text Generation**: Automated question-answer generation from regulatory sources
- **Domain-Specific Applications**: Renewable energy document analysis, compliance checking, and regulatory understanding
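For the document retrieval task, one common way to score a system is recall@k: the share of queries for which the page the query was generated from appears among the top k retrieved pages. The sketch below assumes each query has exactly one relevant page and that pages are identified by strings; the function name and inputs are illustrative and not part of the dataset.

```python
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of queries whose gold page appears in the top-k results."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranking per query")
    if not gold_ids:
        return 0.0
    # A query counts as a hit when its gold page id is among the first k ids.
    hits = sum(gold in ranking[:k] for ranking, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)
```

Here `ranked_ids[i]` is the retriever's ordered list of page ids for query `i`, and `gold_ids[i]` is the id of the page that query was generated from.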
## Dataset Use Cases
- Training and evaluating vision-language models on renewable energy regulatory content
- Developing multimodal search or retrieval systems for legal and technical renewable energy documents
- Research in automated question-answer generation from regulatory and technical sources
- Enhancing tools for renewable energy document analysis, compliance checking, and regulatory understanding
- Supporting legal and technical research in renewable energy policy and regulation
## Dataset Curators
- **Yumeng Ye**
- **Léo Appourchaux**
Provider:
racineai



