Chinese_Documents_Dataset_PDF

Name: Chinese_Documents_Dataset_PDF
Creator: maas
Published: 2025-12-18 16:55:33
License: No description provided

ModelScope community. Updated: 2025-12-18. Indexed: 2025-11-08.

Download link:

https://modelscope.cn/datasets/Kratos-AI/Chinese_Documents_Dataset_PDF


Resource description:

# Chinese Documents Dataset (PDF)

*This dataset consists of a curated collection of Chinese-language documents in PDF format. It includes textbooks, research papers, articles, public-domain books, and official documents written in Simplified and Traditional Chinese. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for Chinese
- **Supported Tasks**:
  - Extraction of Chinese text from PDF documents
  - Classification by topic (academic, legal, educational, literary)
  - OCR for Simplified and Traditional Chinese scripts
  - Benchmarking AI models for Chinese-language document parsing

## Languages

- **Primary Language**: Chinese (Simplified and Traditional)
- **Secondary Presence**: English, numbers, and technical symbols (common in bilingual or academic PDFs)

## Dataset Creation

### Curation Rationale

The dataset was curated to accelerate the development of AI models that can process, recognize, and understand Chinese-language PDFs with complex layouts and mixed scripts.

### Source Data

- **Contributors**: Open-access Chinese digital libraries, educational institutions, and volunteer data contributors.
- **Collection Process**: All PDFs were collected from legally available, open-licensed repositories and public-domain sources.

### Other Known Limitations

- **Bias**: Overrepresentation of educational and academic documents; fewer informal or handwritten materials
- **Format Variation**: Some PDFs contain scanned pages with varying print clarity
- **Script Variation**: Includes both Simplified (Mainland China) and Traditional (Taiwan, Hong Kong) content

## Intended Uses

### ✅ Direct Use

- Training OCR models for Simplified and Traditional Chinese
- Research on multilingual document understanding
- Digitization of Chinese educational and archival materials

### ❌ Out-of-Scope Use

- Identifying individuals from document data
- Commercial reuse of copyrighted materials
- Use in surveillance or profiling applications

## License

CC BY 4.0
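
## Example: Profiling Mixed-Script Text

The Languages section above notes that Chinese text is mixed with English, numbers, and technical symbols. As a rough illustration only, the sketch below measures that mix in a string that has already been extracted from a PDF. The function name `script_profile` and the choice of Unicode ranges (CJK Unified Ideographs and Extension A) are assumptions made for this example, not part of the dataset or its tooling, and the function does not distinguish Simplified from Traditional characters.

```python
def script_profile(text: str) -> dict[str, float]:
    """Return the share of CJK ideographs, ASCII letters, digits, and other
    characters among the non-whitespace characters of `text`.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    counts = {"cjk": 0, "latin": 0, "digit": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue  # whitespace is ignored entirely
        cp = ord(ch)
        # CJK Unified Ideographs (U+4E00-U+9FFF) and Extension A (U+3400-U+4DBF)
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            counts["cjk"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1  # punctuation, symbols, other scripts
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(script_profile("中文PDF 2025"))
```

A profile like this could serve as a quick sanity check on OCR or text-extraction output, for instance to flag pages where the share of CJK characters is unexpectedly low. Any threshold for that check would need to be chosen empirically.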


Provider:

maas

Created:

2025-11-06
