
racineai/VDR_Energy_Arabic

Source: Hugging Face | Updated 2025-11-20 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/racineai/VDR_Energy_Arabic
Resource description:
---
license: apache-2.0
task_categories:
- visual-document-retrieval
- visual-question-answering
- text-retrieval
language:
- fr
- ar
- en
tags:
- RAG
- DSE
- retrieval
- energy
- arabic
---

# VDR_Energy_Arabic - Overview

## Dataset Summary

**VDR_Energy_Arabic is a curated multimodal dataset focused on Arabic energy sector documents, including reports, financial statements, technical documentation, and industry analyses. It combines text and image data extracted from real energy-related PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training in Arabic.**

## Dataset Details

### Dataset Creation

This dataset was created using our open-source tool **[VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet)**.

Energy sector PDFs were collected from public online sources, focusing primarily on Arabic energy companies' annual reports, financial statements, technical documentation, and industry analyses from the Middle East and North Africa region.

Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic energy sector queries with corresponding answers in Arabic.

We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

### Data Fields

Each entry in the dataset contains:

- `id` (string): A unique identifier for the sample
- `query` (string): A synthetic energy-related question generated from that page in Arabic
- `answer` (string): A comprehensive answer to the corresponding query in Arabic
- `image` (PIL.Image): A visual rendering of a PDF page
- `language` (string): The detected language of the query (Arabic/French/English)

### Data Generation

Each page produces 4 unique entries: a main energy sector query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on Arabic energy sector content
- **Visual Question Answering**: Multimodal understanding of energy documents in Arabic
- **Document Retrieval**: Developing search systems for Arabic energy and industrial documents
- **Text Generation**: Automated question-answer generation from Arabic energy sources
- **Domain-Specific Applications**: Energy sector analysis, financial document understanding, and technical report comprehension

## Dataset Use Cases

- Training and evaluating vision-language models on Arabic energy sector content
- Developing multimodal search or retrieval systems for energy and industrial documents
- Research in automated question-answer generation from Arabic technical and financial sources
- Enhancing tools for energy sector analysis, financial document understanding, and technical report processing
- Supporting Arabic language processing in specialized energy and industrial domains
- Building RAG systems for Arabic energy sector knowledge bases

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
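
## Usage Sketch

The fields listed under Data Fields can be sanity-checked with a few lines of Python. The following is a minimal sketch, not part of the original card: the helper names `check_record`, `language_counts`, and `load_rows` are hypothetical, the split name `"train"` is an assumption, and the exact values stored in `language` should be verified against the actual data.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping

# Field names as listed in the Data Fields section of the card.
TEXT_FIELDS = ("id", "query", "answer", "language")
IMAGE_FIELD = "image"


def check_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name in TEXT_FIELDS:
        value = record.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name}: expected a non-empty string")
    # The image is treated as an opaque object here; only its presence is checked.
    if record.get(IMAGE_FIELD) is None:
        problems.append(f"{IMAGE_FIELD}: missing page image")
    return problems


def language_counts(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Count records per value of the 'language' field."""
    return Counter(r.get("language", "unknown") for r in records)


def load_rows(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    Requires the `datasets` package and network access.
    The default split name "train" is an assumption.
    """
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset
    return load_dataset("racineai/VDR_Energy_Arabic", split=split)
```

With the `datasets` package installed, something like `rows = load_rows()` followed by `language_counts(rows)` would show how queries are distributed across the detected languages, and `check_record(rows[0])` would flag any field that is empty or missing.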
Provider:
racineai