AncientDoc

Name: AncientDoc
Creator: maas
Published: 2025-12-04 16:49:32
License: No description available

ModelScope community, updated 2025-12-04, indexed 2025-09-13

Download link:

https://modelscope.cn/datasets/ByteDance/AncientDoc

Resource description:

# AncientDoc: A Benchmark for Chinese Ancient Document Understanding

AncientDoc is the first comprehensive benchmark dataset specifically designed for **Chinese Ancient Document Understanding**. It covers multi-task evaluation ranging from **OCR** to **knowledge reasoning**, and aims to promote research on the recognition, understanding, and reasoning capabilities of multimodal large models on ancient documents.

## Dataset Overview

- **Data Scale**: 2,973 pages
- **Number of Works**: Approximately 100 books
- **Document Categories**: 14 categories (e.g., collected works, Chu Ci-style poems, poetry and prose criticism, encyclopedic books, catalogs)
- **Time Span**: From the Warring States period to the Qing Dynasty, covering multiple important historical periods
- **Task Types**:
  1. **Page-level OCR**: Full-page text recognition (including complex scenarios such as vertical typesetting, variant characters, and annotations)
  2. **Vernacular Translation**: Intralingual translation from classical Chinese to modern Chinese
  3. **Reasoning-based QA**: Implicit reasoning QA based on the meaning of the text
  4. **Knowledge-based QA**: QA based on textual facts and background knowledge
  5. **Linguistic Variant QA**: QA related to literary genres, rhetoric, and linguistic styles

## Data Distribution

### Distribution by Dynasty

- Ming Dynasty: 1,148 pages
- Qing Dynasty: 778 pages
- Song Dynasty: 540 pages
- Tang Dynasty: 208 pages
- Han Dynasty: 110 pages
- Yuan Dynasty: 69 pages
- Southern and Northern Dynasties: 54 pages
- Jin Dynasty: 42 pages
- Warring States period: 24 pages

### Distribution by Category (Top 3 by Page Count)

1. Astronomy and Mathematics (238 pages)
2. Confucianism (232 pages)
3. Art (234 pages)

### Distribution by Script Style

- Regular script: Approximately 97%
- Cursive script: Approximately 3%

## Data Format

Data is provided in the form of **images + CSV annotations**.
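The per-dynasty page counts listed under "Distribution by Dynasty" can be cross-checked against the stated total of 2,973 pages. The following is a minimal sketch of that arithmetic only: the counts are copied from the list above, while the names `PAGES_BY_DYNASTY`, `TOTAL_PAGES`, and `dynasty_shares` are illustrative assumptions and are not part of the dataset or any official tooling.

```python
# Page counts per dynasty, copied from the "Distribution by Dynasty" list.
# The variable and function names here are illustrative, not part of AncientDoc.
PAGES_BY_DYNASTY = {
    "Ming": 1148,
    "Qing": 778,
    "Song": 540,
    "Tang": 208,
    "Han": 110,
    "Yuan": 69,
    "Southern and Northern": 54,
    "Jin": 42,
    "Warring States": 24,
}

# Total page count stated under "Data Scale".
TOTAL_PAGES = 2973


def dynasty_shares(counts):
    """Return each dynasty's fraction of the summed page count."""
    total = sum(counts.values())
    return {name: pages / total for name, pages in counts.items()}


if __name__ == "__main__":
    # The per-dynasty counts should add up to the stated total.
    assert sum(PAGES_BY_DYNASTY.values()) == TOTAL_PAGES
    for name, share in dynasty_shares(PAGES_BY_DYNASTY).items():
        print(f"{name}: {share:.1%}")
```

The counts do sum to 2,973, and the Ming and Qing dynasties together account for roughly 65% of all pages.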

Provider:

maas

Created:

2025-09-09

Collected summary

Dataset introduction

Background and challenges

Background overview

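AncientDoc is the first comprehensive benchmark dataset designed specifically for Chinese ancient document understanding. It contains 2,973 pages of ancient document images with CSV annotations and covers multiple document types from the Warring States period to the Qing Dynasty. The dataset supports multi-task evaluation, including OCR, classical Chinese translation, and reasoning QA, and aims to advance research on the recognition and understanding capabilities of multimodal large models on ancient documents. Most pages date from the Ming and Qing dynasties. The task design is comprehensive, and the dataset is suited to ancient document digitization and intelligent processing.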

The content above was collected and summarized by 遇见数据集.
