AncientDoc
Source: ModelScope community (魔搭社区) · Updated 2025-12-04 · Indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/ByteDance/AncientDoc
Resource description:
# AncientDoc: A Benchmark for Chinese Ancient Document Understanding
AncientDoc is the first comprehensive benchmark dataset designed specifically for **Chinese Ancient Document Understanding**. It covers multi-task evaluation ranging from **OCR** to **knowledge reasoning**, and aims to promote research on the recognition, understanding, and reasoning capabilities of multimodal large models on ancient documents.
## Dataset Overview
- **Data Scale**: 2,973 pages
- **Number of Works**: Approximately 100 books
- **Document Categories**: 14 categories (e.g., collected works, Chu Ci-style poems, poetry and prose criticism, encyclopedic books, catalogs, etc.)
- **Time Span**: From the Warring States period to the Qing Dynasty, covering multiple important historical periods
- **Task Types**:
1. **Page-level OCR**: Full-page text recognition (including complex scenarios such as vertical typesetting, variant characters, and annotations)
2. **Vernacular Translation**: Intralingual translation from classical Chinese to modern Chinese
3. **Reasoning-based QA**: Implicit reasoning QA based on the meaning of the text
4. **Knowledge-based QA**: QA based on textual facts and background knowledge
5. **Linguistic Variant QA**: QA related to literary genres, rhetoric, and linguistic styles
## Data Distribution
### Distribution by Dynasty
- Ming Dynasty: 1,148 pages
- Qing Dynasty: 778 pages
- Song Dynasty: 540 pages
- Tang Dynasty: 208 pages
- Han Dynasty: 110 pages
- Yuan Dynasty: 69 pages
- Southern and Northern Dynasties: 54 pages
- Jin Dynasty: 42 pages
- Warring States period: 24 pages
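
The per-dynasty counts above can be cross-checked against the stated total of 2,973 pages. The snippet below is a minimal sketch of that arithmetic; the dictionary and variable names are illustrative and not part of the dataset.

```python
# Page counts copied from the "Distribution by Dynasty" list above.
pages_by_dynasty = {
    "Ming Dynasty": 1148,
    "Qing Dynasty": 778,
    "Song Dynasty": 540,
    "Tang Dynasty": 208,
    "Han Dynasty": 110,
    "Yuan Dynasty": 69,
    "Southern and Northern Dynasties": 54,
    "Jin Dynasty": 42,
    "Warring States period": 24,
}

# The counts should add up to the dataset's stated size.
total_pages = sum(pages_by_dynasty.values())
assert total_pages == 2973

# Share of each dynasty, as a percentage rounded to one decimal place.
shares = {
    dynasty: round(100 * count / total_pages, 1)
    for dynasty, count in pages_by_dynasty.items()
}

for dynasty, share in shares.items():
    print(f"{dynasty}: {share}%")
```

The counts sum to exactly 2,973, and Ming and Qing pages together account for roughly two thirds of the dataset.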
### Distribution by Category (Top 3 by Page Count)
1. Astronomy and Mathematics (238 pages)
2. Art (234 pages)
3. Confucianism (232 pages)
### Distribution by Script Style
- Regular script: Approximately 97%
- Cursive script: Approximately 3%
## Data Format
Data is provided as **images + CSV annotations**.
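
As a rough illustration of working with image + CSV annotations, the sketch below groups annotation rows by the page image they refer to. The column names (`image`, `task`, `question`, `answer`) and the sample rows are hypothetical, since the listing does not describe the CSV schema; check the header of the actual file and adjust accordingly.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: hypothetical column name; the real CSV header may differ.
IMAGE_COL = "image"


def group_by_image(csv_text):
    """Group annotation rows by the page image they refer to."""
    rows_by_image = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows_by_image[row[IMAGE_COL]].append(row)
    return dict(rows_by_image)


# Made-up sample rows, only to show the shape of the result.
sample = (
    "image,task,question,answer\n"
    "page_0001.png,ocr,,placeholder text\n"
    "page_0001.png,translation,,placeholder translation\n"
    "page_0002.png,ocr,,placeholder text\n"
)

pages = group_by_image(sample)
for name, rows in pages.items():
    print(name, [row["task"] for row in rows])
```

For a real file, the same function could be fed the contents of the CSV read with `open(path, encoding="utf-8", newline="")`, with the image files resolved relative to wherever the dataset was downloaded.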
Provider:
maas
Created:
2025-09-09
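Background overview
AncientDoc is the first comprehensive benchmark dataset designed specifically for Chinese ancient document understanding. It contains 2,973 pages of ancient document images with CSV annotations, covering a range of document types from the Warring States period to the Qing Dynasty. The dataset supports multi-task evaluation, including OCR, classical Chinese translation, and reasoning-based QA, and aims to advance research on the recognition and understanding capabilities of multimodal large models on ancient documents. Most pages come from the Ming and Qing dynasties. The task design is broad, and the dataset is suited to ancient document digitization and intelligent processing.
The summary above was collected and generated by 遇见数据集.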



