French_Documents_Dataset_PDF

Name: French_Documents_Dataset_PDF
Creator: maas
Published: 2025-11-12 16:54:32
License: Not specified in the listing metadata (the dataset card states CC BY 4.0)

ModelScope community. Updated: 2025-11-12. Indexed: 2025-11-08.

Download link:

https://modelscope.cn/datasets/Kratos-AI/French_Documents_Dataset_PDF

Description:

# French Documents Dataset (PDF)

*This dataset contains a curated collection of French-language documents in PDF format. It includes educational materials, books, news articles, government publications, and public-domain literature written in French. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for French
- **Supported Tasks**:
  - Automatic extraction of French text from PDFs
  - Classification of documents by topic (literary, educational, official, news)
  - OCR for printed French text with diacritics and typographic variation
  - Benchmarking AI models for multilingual and French-language document understanding

## Languages

- **Primary Language**: French
- **Secondary Presence**: English and other European languages (common in academic or bilingual contexts)

## Dataset Creation

### Curation Rationale

The dataset was curated to improve AI systems' ability to process and understand French-language documents with varied fonts, layouts, and linguistic structures. It is intended for research in multilingual OCR and document intelligence.

### Source Data

- **Contributors**: Open-access French repositories, public libraries, and volunteer data curators.
- **Collection Process**: All documents were sourced from legally accessible, open-licensed PDF archives and public-domain resources.

### Other Known Limitations

- **Bias**: Primarily formal and academic content; limited representation of informal or regional French
- **Format Variation**: Some PDFs may include embedded scanned pages or mixed formats
- **Geographical Bias**: Mostly European French; limited coverage of African and Canadian French variants

## Intended Uses

### ✅ Direct Use

- Training and evaluation of OCR systems for French text
- Research in document classification and layout analysis
- Digitization of French-language archives and educational material

### ❌ Out-of-Scope Use

- Identification of individuals or private data from PDFs
- Commercial use of copyrighted works without permission
- Use in profiling, surveillance, or handwriting analysis

## License

CC BY 4.0
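
## Example: Scoring OCR Output

The card lists evaluation of OCR systems on French text with diacritics as a direct use, but does not prescribe a metric. One common choice is character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below is an illustration only, not part of the dataset's tooling. The function names and the sample strings are hypothetical, and it assumes you already have a reference transcription and an OCR output as plain strings. Text is normalized to Unicode NFC first, so that a precomposed `é` and an `e` followed by a combining accent count as the same character.

```python
import unicodedata


def normalize(text: str) -> str:
    """Compose diacritics (NFC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the normalized reference."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty as well.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "L'été à Besançon"
    ocr_output = "L'ete a Besancon"  # hypothetical output with diacritics dropped
    print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

Because each dropped accent counts as one substitution, a system that systematically strips diacritics is penalized even when every base letter is correct, which is the behavior you would want when benchmarking on French text.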

Provider: maas
Created: 2025-11-06
