Japanese_Documents_Dataset_PDF

Name: Japanese_Documents_Dataset_PDF
Creator: maas
Published: 2025-12-05 16:56:06
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-11-08.

Download link:

https://modelscope.cn/datasets/Kratos-AI/Japanese_Documents_Dataset_PDF


Resource description:

# Japanese Documents Dataset (PDF)

*This dataset contains a curated collection of Japanese-language documents in PDF format. The corpus includes textbooks, research papers, news articles, public-domain books, and government publications written in Japanese. It is intended to support AI research in OCR, document understanding, and multilingual text recognition.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for Japanese
- **Supported Tasks**:
  - Extraction of Japanese text from PDFs
  - OCR for Kanji, Hiragana, and Katakana scripts
  - Classification of documents by type (academic, literary, official, educational)
  - Benchmarking AI systems for Japanese document parsing and layout understanding

## Languages

- **Primary Language**: Japanese
- **Secondary Presence**: English numerals, Romanized Japanese, and technical symbols (common in bilingual or academic PDFs)

## Dataset Creation

### Curation Rationale

The dataset was curated to improve AI systems’ ability to read and understand Japanese-language documents, including vertically written text and mixed-script layouts, enabling more accurate OCR and multilingual NLP research.

### Source Data

- **Contributors**: Open-access Japanese libraries, educational institutions, and public-domain repositories.
- **Collection Process**: PDFs were collected from publicly available sources with clear open licenses permitting use for research and AI model training.
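The mixed-script content mentioned in this card (Kanji, Hiragana, and Katakana alongside Latin letters and digits) can be profiled on text once it has been extracted from a PDF. The following is a minimal sketch and not part of the dataset's own tooling: the names `SCRIPT_RANGES`, `classify_char`, and `script_profile` are hypothetical, and the classification relies on Unicode block ranges, which is only an approximation (for example, the long-vowel mark "ー" is counted as Katakana because it sits in that block).

```python
from collections import Counter

# Unicode block ranges (inclusive), chosen here as an assumption for illustration.
SCRIPT_RANGES = {
    "hiragana": [(0x3040, 0x309F)],
    # Katakana, Katakana Phonetic Extensions, half-width Katakana
    "katakana": [(0x30A0, 0x30FF), (0x31F0, 0x31FF), (0xFF66, 0xFF9F)],
    # CJK Unified Ideographs Extension A and the main CJK Unified Ideographs block
    "kanji": [(0x3400, 0x4DBF), (0x4E00, 0x9FFF)],
}


def classify_char(ch: str) -> str:
    """Return a coarse script label for a single character."""
    code = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= code <= hi for lo, hi in ranges):
            return script
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"


def script_profile(text: str) -> Counter:
    """Count characters per script label, ignoring whitespace."""
    return Counter(classify_char(ch) for ch in text if not ch.isspace())


if __name__ == "__main__":
    sample = "東京タワーは333mです"
    for script, count in script_profile(sample).most_common():
        print(f"{script}: {count}")
```

A profile like this could help flag pages where extraction returned little or no Japanese text, though any threshold for "too little" would need to be chosen per use case.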
### Other Known Limitations

- **Bias**: Primarily educational and official documents; limited representation of informal or handwritten material
- **Layout Complexity**: Some documents contain vertical text or mixed Japanese-English layouts that may challenge OCR accuracy
- **Script Variation**: Coverage includes modern Japanese; limited inclusion of historical or classical scripts

## Intended Uses

### ✅ Direct Use

- Training OCR systems for Japanese text recognition
- Research in document layout understanding and vertical text OCR
- Digitization and analysis of Japanese-language academic or archival materials

### ❌ Out-of-Scope Use

- Identifying or profiling individuals from document content
- Commercial redistribution of copyrighted PDFs
- Use in surveillance, handwriting identification, or behavioral analysis

## License

CC BY 4.0


Provider:

maas

Created:

2025-11-06
