
OGC_Energy_Arabic

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/racineai/OGC_Energy_Arabic
Resource description:
# VDR_Energy_Arabic - Overview

## Dataset Summary

**VDR_Energy_Arabic is a curated multimodal dataset focused on Arabic energy sector documents, including reports, financial statements, technical documentation, and industry analyses. It combines text and image data extracted from real energy-related PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training in Arabic.**

## Dataset Details

### Dataset Creation

This dataset was created using our open-source tool **[VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet)**.

Energy sector PDFs were collected from public online sources, focusing primarily on Arabic energy companies' annual reports, financial statements, technical documentation, and industry analyses from the Middle East and North Africa region. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, table of contents, and other out-of-topic content to ensure optimal dataset quality.

The cleaned documents were then processed page-by-page to extract text, convert pages into high-resolution images, and generate synthetic energy sector queries with corresponding answers in Arabic.

We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.
### Data Fields

Each entry in the dataset contains:

- `id` (string): A unique identifier for the sample
- `query` (string): A synthetic energy-related question generated from that page in Arabic
- `answer` (string): A comprehensive answer to the corresponding query in Arabic
- `image` (PIL.Image): A visual rendering of a PDF page
- `language` (string): The detected language of the query (Arabic/French/English)

### Data Generation

Each page produces 4 unique entries: a main energy sector query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on Arabic energy sector content
- **Visual Question Answering**: Multimodal understanding of energy documents in Arabic
- **Document Retrieval**: Developing search systems for Arabic energy and industrial documents
- **Text Generation**: Automated question-answer generation from Arabic energy sources
- **Domain-Specific Applications**: Energy sector analysis, financial document understanding, and technical report comprehension

## Dataset Use Cases

- Training and evaluating vision-language models on Arabic energy sector content
- Developing multimodal search or retrieval systems for energy and industrial documents
- Research in automated question-answer generation from Arabic technical and financial sources
- Enhancing tools for energy sector analysis, financial document understanding, and technical report processing
- Supporting Arabic language processing in specialized energy and industrial domains
- Building RAG systems for Arabic energy sector knowledge bases

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**
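
As a rough illustration of the record layout listed under Data Fields, the sketch below checks that an entry carries the five documented fields and tallies entries by their `language` value. The helper names `validate_record` and `language_counts` are hypothetical and not part of the dataset or of VDR_pdf-to-parquet. The exact strings stored in `language` are not stated in the card, so the sample values are placeholders, and the `image` field is only checked for presence to keep the sketch free of third-party dependencies.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

# Field names come from the dataset card; everything else is illustrative.
REQUIRED_STRING_FIELDS = ("id", "query", "answer", "language")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems: List[str] = []
    for name in REQUIRED_STRING_FIELDS:
        value = record.get(name)
        if not isinstance(value, str) or not value:
            problems.append(f"{name}: expected non-empty string")
    # The card documents `image` as a PIL.Image; here we only check it exists.
    if record.get("image") is None:
        problems.append("image: missing")
    return problems


def language_counts(records: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count entries per value of the `language` field."""
    return dict(Counter(r.get("language") for r in records))


if __name__ == "__main__":
    # Placeholder entry; real entries hold Arabic text and a rendered page image.
    sample = {
        "id": "sample-0",
        "query": "placeholder question",
        "answer": "placeholder answer",
        "image": object(),  # stand-in for a PIL.Image
        "language": "Arabic",  # assumed label string
    }
    print(validate_record(sample))
    print(language_counts([sample]))
```

A check of this kind could be run over a loaded split before training or indexing, so that entries with missing queries or images are caught early.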

Provider:
maas
Created:
2025-08-28