
OGC_History_Geography

ModelScope community | Updated: 2025-12-05 | Indexed: 2025-10-04
Download link:
https://modelscope.cn/datasets/racineai/OGC_History_Geography
Description:
# VDR_History_Geography - Overview

## Dataset Summary

**VDR_History_Geography is a curated multimodal dataset focused on historical documents, geographical materials, and educational content. It combines text and image data extracted from real academic and educational PDFs to support tasks such as RAG DSE, question answering, document search, and vision-language model training.**

## Dataset Creation

This dataset was created using our open-source tool [VDR_pdf-to-parquet](https://github.com/RacineAIOS/VDR_pdf-to-parquet).

History and geography-related PDFs were collected from public online sources, focusing primarily on academic textbooks, educational materials, and scholarly publications in the history and geography domains. Each document underwent **manual cleaning and curation** before processing, including the removal of blank pages, title pages, tables of contents, and other off-topic content to ensure dataset quality.

The cleaned documents were then processed page by page to extract text, convert pages into high-resolution images, and generate synthetic technical queries with corresponding answers.

We used Google's **Gemini 2.5 Flash** model in a custom pipeline to generate diverse, expert-level questions and comprehensive answers that align with the content of each page.

## Data Fields

Each entry in the dataset contains:

- **`id`** (string): A unique identifier for the sample
- **`query`** (string): A synthetic educational question generated from that page
- **`answer`** (string): A comprehensive answer to the corresponding query
- **`image`** (PIL.Image): A visual rendering of a PDF page
- **`language`** (string): The detected language of the query

## Data Generation

Each page produces 4 unique entries: a main technical query, a secondary one, a visual-based question, and a multimodal semantic query, all with their corresponding answers.
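
The field list and the four-entries-per-page rule above can be turned into a quick sanity check. The following is a minimal sketch, not part of the official tooling: it assumes each entry has already been loaded as a plain Python `dict` keyed by the field names from the card, and the helper names (`validate_entry`, `language_counts`, `expected_entry_count`) are hypothetical.

```python
from collections import Counter

# Field names are taken from the dataset card. Treating them as non-empty
# strings after loading is an assumption of this sketch.
STRING_FIELDS = ("id", "query", "answer", "language")

# The card states that each page produces 4 entries.
ENTRIES_PER_PAGE = 4


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for name in STRING_FIELDS:
        value = entry.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name}: expected a non-empty string")
    # The image is documented as a PIL.Image; here we only check presence
    # so the sketch runs without Pillow installed.
    if entry.get("image") is None:
        problems.append("image: missing")
    return problems


def language_counts(entries) -> Counter:
    """Count entries per detected query language."""
    return Counter(entry["language"] for entry in entries)


def expected_entry_count(num_pages: int) -> int:
    """Number of entries expected from a given number of processed pages."""
    return num_pages * ENTRIES_PER_PAGE


if __name__ == "__main__":
    # Hypothetical in-memory entries standing in for real dataset rows.
    demo = [
        {"id": "p1-q1", "query": "Which river is shown?", "answer": "The Loire.",
         "image": object(), "language": "en"},
        {"id": "p1-q2", "query": "", "answer": "n/a",
         "image": None, "language": "fr"},
    ]
    for row in demo:
        print(row["id"], validate_entry(row))
    print(language_counts(demo))
    print(expected_entry_count(10))
```

Checking only for the presence of `image` keeps the sketch dependency-free; with Pillow available, an `isinstance(entry["image"], PIL.Image.Image)` check would be a stricter alternative.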

## Supported Tasks

This dataset is designed to support:

- **Question Answering**: Training and evaluating models on historical and geographical content
- **Visual Question Answering**: Multimodal understanding of educational documents
- **Document Retrieval**: Developing search systems for academic and educational documents
- **Text Generation**: Automated question-answer generation from educational sources
- **Domain-Specific Applications**: Historical document analysis, geographical information retrieval, and educational content understanding

## Dataset Use Cases

- Training and evaluating vision-language models on historical and geographical educational content
- Developing multimodal search or retrieval systems for academic and educational documents
- Research in automated question-answer generation from educational and scholarly sources
- Enhancing tools for historical document analysis, geographical data interpretation, and educational understanding
- Supporting educational research in history and geography policy and curriculum

## Dataset Curators

- **Yumeng Ye**
- **Léo Appourchaux**

Provider:
maas
Created:
2025-08-27