Chinese_Documents_Dataset_PDF
ModelScope community. Updated 2025-12-18; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/Kratos-AI/Chinese_Documents_Dataset_PDF
Description:
# Chinese Documents Dataset (PDF)
*This dataset consists of a curated collection of Chinese-language documents in PDF format. It includes textbooks, research papers, articles, public-domain books, and official documents written in Simplified and Traditional Chinese. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*
## Contact
For queries or collaborations related to this dataset, contact:
- anoushka@kgen.io
- abhishek.vadapalli@kgen.io
## Supported Tasks
- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for Chinese
- **Supported Tasks**:
  - Extraction of Chinese text from PDF documents
  - Classification by topic (academic, legal, educational, literary)
  - OCR for Simplified and Traditional Chinese scripts
  - Benchmarking AI models for Chinese-language document parsing
## Languages
- **Primary Language**: Chinese (Simplified and Traditional)
- **Secondary Presence**: English, numbers, and technical symbols (common in bilingual or academic PDFs)
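Because the documents mix Chinese with English, numbers, and symbols, it can be useful to measure that mix in text already extracted from a PDF. The following is a minimal sketch, not part of the dataset's tooling: the function name `script_profile` is hypothetical, and treating only the CJK Unified Ideographs block and Extension A as "Han" is a simplifying assumption.

```python
from collections import Counter


def script_profile(text: str) -> dict[str, float]:
    """Return the share of Han, Latin, digit, and other non-space characters."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        code = ord(ch)
        # Assumption: CJK Unified Ideographs plus Extension A cover "Han" here.
        if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
            counts["han"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isascii() and ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1  # punctuation, symbols, other scripts
    total = sum(counts.values())
    keys = ("han", "latin", "digit", "other")
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: counts[k] / total for k in keys}
```

A low `han` share on a page may indicate a bilingual or formula-heavy document, or a failed extraction from a scanned page.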
## Dataset Creation
### Curation Rationale
The dataset was curated to accelerate the development of AI models that can process, recognize, and understand Chinese-language PDFs with complex layouts and mixed scripts.
### Source Data
- **Contributors**: Open-access Chinese digital libraries, educational institutions, and volunteer data contributors.
- **Collection Process**: All PDFs were collected from legally available, open-licensed repositories and public-domain sources.
### Other Known Limitations
- **Bias**: Overrepresentation of educational and academic documents; fewer informal or handwritten materials
- **Format Variation**: Some PDFs contain scanned pages with varying print clarity
- **Script Variation**: Includes both Simplified (Mainland China) and Traditional (Taiwan, Hong Kong) content
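Since the collection mixes Simplified and Traditional content, users may want a rough per-document script label before training or evaluation. The sketch below is only an illustrative heuristic under stated assumptions: the name `guess_script` and the small hand-picked character table are hypothetical, and a real pipeline would likely rely on a full conversion table instead.

```python
# Hand-picked (simplified, traditional) pairs; illustrative, far from complete.
_PAIRS = [
    ("们", "們"), ("这", "這"), ("国", "國"), ("说", "說"),
    ("时", "時"), ("书", "書"), ("学", "學"), ("对", "對"),
    ("会", "會"), ("个", "個"), ("为", "為"), ("来", "來"),
]
SIMPLIFIED_ONLY = {s for s, _ in _PAIRS}
TRADITIONAL_ONLY = {t for _, t in _PAIRS}


def guess_script(text: str) -> str:
    """Label text as simplified, traditional, mixed, or unknown."""
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simp == trad:
        # No marker characters at all, or an exact tie between the two sets.
        return "unknown" if simp == 0 else "mixed"
    return "simplified" if simp > trad else "traditional"
```

The label reflects only which marker set dominates, so short or noisy OCR output can easily be misclassified.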
## Intended Uses
### ✅ Direct Use
- Training OCR models for Simplified and Traditional Chinese
- Research on multilingual document understanding
- Digitization of Chinese educational and archival materials
### ❌ Out-of-Scope Use
- Identifying individuals from document data
- Commercial reuse of copyrighted materials
- Use in surveillance or profiling applications
## License
CC BY 4.0
Provider: maas
Created: 2025-11-06



