
French_Documents_Dataset_PDF

Source: ModelScope community. Updated 2025-11-12. Indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/Kratos-AI/French_Documents_Dataset_PDF
Description:
# French Documents Dataset (PDF)

*This dataset contains a curated collection of French-language documents in PDF format. It includes educational materials, books, news articles, government publications, and public-domain literature written in French. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*

## Contact

For queries or collaborations related to this dataset, contact:

- anoushka@kgen.io
- abhishek.vadapalli@kgen.io

## Supported Tasks

- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for French
- **Supported Tasks**:
  - Automatic extraction of French text from PDFs
  - Classification of documents by topic (literary, educational, official, news)
  - OCR for printed French text with diacritics and typographic variation
  - Benchmarking AI models for multilingual and French-language document understanding

## Languages

- **Primary Language**: French
- **Secondary Presence**: English and other European languages (common in academic or bilingual contexts)

## Dataset Creation

### Curation Rationale

The dataset was curated to improve AI systems' ability to process and understand French-language documents with varied fonts, layouts, and linguistic structures. It is intended for research in multilingual OCR and document intelligence.

### Source Data

- **Contributors**: Open-access French repositories, public libraries, and volunteer data curators.
- **Collection Process**: All documents were sourced from legally accessible, open-licensed PDF archives and public-domain resources.

### Other Known Limitations

- **Bias**: Primarily formal and academic content; limited representation of informal or regional French
- **Format Variation**: Some PDFs may include embedded scanned pages or mixed formats
- **Geographical Bias**: Mostly European French; limited coverage of African and Canadian French variants

## Intended Uses

### ✅ Direct Use

- Training and evaluation of OCR systems for French text
- Research in document classification and layout analysis
- Digitization of French-language archives and educational material

### ❌ Out-of-Scope Use

- Identification of individuals or private data from PDFs
- Commercial use of copyrighted works without permission
- Use in profiling, surveillance, or handwriting analysis

## License

CC BY 4.0
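
## Example: Scoring OCR Output on French Text

The card lists OCR for French text with diacritics and model benchmarking among its supported tasks, but it does not prescribe an evaluation metric. As an illustration only, the sketch below computes a character error rate (CER), a common choice for OCR evaluation. The function names (`normalize`, `edit_distance`, `character_error_rate`) are hypothetical and not part of the dataset; the only assumption is that reference and OCR text are available as Python strings. Both strings are normalized to Unicode NFC first, so an accented letter counts as one character whether it was stored precomposed or as a base letter plus a combining mark.

```python
import unicodedata


def normalize(text: str) -> str:
    """Return the NFC form so precomposed and decomposed accents compare equal."""
    return unicodedata.normalize("NFC", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length, after NFC normalization."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # An OCR system that drops both accents in a five-letter word.
    print(character_error_rate("élève", "eleve"))
    # The same word written with combining accents versus precomposed letters.
    print(character_error_rate("e\u0301le\u0300ve", "élève"))
```

Normalizing before comparison keeps encoding differences between PDF text layers and OCR engines from being counted as recognition errors, while genuinely dropped diacritics still count as substitutions.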

Provider:
maas
Created:
2025-11-06