
Chinese_Documents_Dataset_PDF

ModelScope Community | Updated: 2025-12-18 | Indexed: 2025-11-08
Download link:
https://modelscope.cn/datasets/Kratos-AI/Chinese_Documents_Dataset_PDF
Resource overview:
# Chinese Documents Dataset (PDF)

*This dataset consists of a curated collection of Chinese-language documents in PDF format. It includes textbooks, research papers, articles, public-domain books, and official documents written in Simplified and Traditional Chinese. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for Chinese
- **Supported Tasks**:
  - Extraction of Chinese text from PDF documents
  - Classification by topic (academic, legal, educational, literary)
  - OCR for Simplified and Traditional Chinese scripts
  - Benchmarking AI models for Chinese-language document parsing

## Languages

- **Primary Language**: Chinese (Simplified and Traditional)
- **Secondary Presence**: English, numbers, and technical symbols (common in bilingual or academic PDFs)

## Dataset Creation

### Curation Rationale

The dataset was curated to accelerate the development of AI models that can process, recognize, and understand Chinese-language PDFs with complex layouts and mixed scripts.

### Source Data

- **Contributors**: Open-access Chinese digital libraries, educational institutions, and volunteer data contributors.
- **Collection Process**: All PDFs were collected from legally available, open-licensed repositories and public-domain sources.
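The mixed scripts mentioned in the curation rationale can be measured once text has been extracted from a PDF. The following is a minimal sketch, not part of the dataset's tooling: the function name `script_profile` and the sample string are hypothetical, and the CJK check covers only the CJK Unified Ideographs block and Extension A, so rarer ideographs fall into the "other" bucket.

```python
from collections import Counter


def script_profile(text: str) -> dict[str, float]:
    """Return the share of CJK, Latin, digit, and other non-space characters."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue  # whitespace carries no script information
        code = ord(ch)
        if 0x4E00 <= code <= 0x9FFF or 0x3400 <= code <= 0x4DBF:
            counts["cjk"] += 1  # CJK Unified Ideographs and Extension A
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isascii() and ch.isdigit():
            counts["digit"] += 1
        else:
            counts["other"] += 1  # punctuation, symbols, rarer scripts
    total = sum(counts.values())
    keys = ("cjk", "latin", "digit", "other")
    if total == 0:
        return {k: 0.0 for k in keys}
    return {k: counts[k] / total for k in keys}


# Hypothetical sample standing in for text extracted from one page.
sample = "深度学习 Deep Learning 2024"
profile = script_profile(sample)
print(profile)
```

A profile like this could help flag pages where extraction likely failed (for example, a near-zero CJK share in a document expected to be Chinese), though any threshold would need to be chosen empirically.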
### Other Known Limitations

- **Bias**: Overrepresentation of educational and academic documents; fewer informal or handwritten materials
- **Format Variation**: Some PDFs contain scanned pages with varying print clarity
- **Script Variation**: Includes both Simplified (Mainland China) and Traditional (Taiwan, Hong Kong) content

## Intended Uses

### ✅ Direct Use

- Training OCR models for Simplified and Traditional Chinese
- Research on multilingual document understanding
- Digitization of Chinese educational and archival materials

### ❌ Out-of-Scope Use

- Identifying individuals from document data
- Commercial reuse of copyrighted materials
- Use in surveillance or profiling applications

## License

CC BY 4.0

Provider:
maas
Created:
2025-11-06