Russian_Documents_Dataset_PDF

Name: Russian_Documents_Dataset_PDF
Creator: maas
Published: 2025-11-12 16:54:32
License: No description provided

ModelScope community. Updated 2025-11-12; indexed 2025-11-08.

Download link:

https://modelscope.cn/datasets/Kratos-AI/Russian_Documents_Dataset_PDF


Resource description:

# Russian Documents Dataset (PDF)

*This dataset contains a curated collection of Russian-language documents in PDF format. The corpus includes books, academic papers, government publications, articles, and educational materials written in Russian. It is designed to support AI research in OCR, document understanding, and multilingual text recognition.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for Russian
- **Supported Tasks**:
  - Extraction of printed Russian text from PDF files
  - Classification of documents by category (academic, legal, educational, literary)
  - OCR for Cyrillic script with varying fonts and structures
  - Benchmarking of AI models for Russian-language document processing

## Languages

- **Primary Language**: Russian
- **Secondary Presence**: English and other Slavic languages (common in bilingual or regional publications)

## Dataset Creation

### Curation Rationale

This dataset was curated to support research on AI systems that process Russian-language documents, handle Cyrillic text layouts, and improve OCR and multilingual document understanding performance.

### Source Data

- **Contributors**: Open-access Russian digital libraries, academic repositories, and public-domain archives.
- **Collection Process**: All PDFs were collected from open-license and publicly available repositories, ensuring legal and ethical reuse.

### Other Known Limitations

- **Bias**: Overrepresentation of formal and academic materials; fewer informal or regional texts
- **Format Variation**: Some PDFs contain scanned pages or multi-column layouts
- **Language Scope**: Focused on Standard Russian; limited inclusion of minority languages or dialects

## Intended Uses

### ✅ Direct Use

- Training OCR and NLP models for Russian document understanding
- Research in Cyrillic-based text recognition and multilingual AI
- Digitization of Russian-language archives, literature, and educational resources

### ❌ Out-of-Scope Use

- Extraction of private or identifiable information from PDFs
- Commercial redistribution of copyrighted materials
- Use in profiling, surveillance, or handwriting recognition applications

## License

CC BY 4.0
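
## Example: Triage of Extracted Page Text

The card notes that some PDFs contain scanned pages and that English or other languages appear alongside Russian. The following is a minimal sketch, not part of the dataset or its tooling, of how extracted page text might be triaged before training. It assumes the text has already been pulled from each page with a PDF library of your choice (not shown); the function names, labels, and thresholds are hypothetical and would need tuning.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names for Cyrillic letters contain "CYRILLIC".
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def classify_page(text: str, min_chars: int = 20, threshold: float = 0.5) -> str:
    """Label a page from its extracted text (hypothetical labels and cutoffs)."""
    stripped = text.strip()
    # Little or no extractable text may indicate a scanned page needing OCR.
    if len(stripped) < min_chars:
        return "likely_scanned"
    if cyrillic_ratio(stripped) >= threshold:
        return "mostly_cyrillic"
    return "mostly_other"
```

Pages labeled `likely_scanned` could be routed to an OCR step, while `mostly_other` pages could be reviewed for bilingual or non-Russian content. Note that a Cyrillic ratio alone cannot distinguish Russian from other languages written in Cyrillic script.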


Provider:

maas

Created:

2025-11-06
