French_Documents_Dataset_PDF
ModelScope community. Updated 2025-11-12; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/Kratos-AI/French_Documents_Dataset_PDF
Description:
# French Documents Dataset (PDF)
*This dataset contains a curated collection of French-language documents in PDF format. It includes educational materials, books, news articles, government publications, and public-domain literature written in French. The dataset supports AI research in OCR, document understanding, and multilingual text extraction.*
## Contact
For queries or collaborations related to this dataset, contact:
- anoushka@kgen.io
- abhishek.vadapalli@kgen.io
## Supported Tasks
- **Task Categories**:
  - Document Classification
  - OCR and Text Recognition
  - Layout and Structure Analysis
  - Language Modeling for French
- **Supported Tasks**:
  - Automatic extraction of French text from PDFs
  - Classification of documents by topic (literary, educational, official, news)
  - OCR for printed French text with diacritics and typographic variation
  - Benchmarking AI models for multilingual and French-language document understanding
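When benchmarking OCR on French text with diacritics, one common pitfall is that the same accented letter can be encoded either as a single precomposed code point or as a base letter plus a combining mark. The sketch below is not part of the dataset; it shows one possible way to compute a character error rate (CER) after Unicode NFC normalization, using only the Python standard library. The function names `normalize_fr`, `edit_distance`, and `cer` are hypothetical helpers chosen for illustration.

```python
import unicodedata


def normalize_fr(text: str) -> str:
    """NFC-normalize so 'e' + combining acute compares equal to precomposed 'é'."""
    return unicodedata.normalize("NFC", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = normalize_fr(reference)
    hyp = normalize_fr(hypothesis)
    if not ref:
        # Assumption: an empty reference scores 0.0 only against an empty hypothesis.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

With this normalization, an OCR output that uses combining accents is not penalized relative to a precomposed reference, while an output that drops the accents entirely still counts as errors.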
## Languages
- **Primary Language**: French
- **Secondary Presence**: English and other European languages (common in academic or bilingual contexts)
## Dataset Creation
### Curation Rationale
The dataset was curated to improve AI systems' ability to process and understand French-language documents with varied fonts, layouts, and linguistic structures. It is intended for research in multilingual OCR and document intelligence.
### Source Data
- **Contributors**: Open-access French repositories, public libraries, and volunteer data curators.
- **Collection Process**: All documents were sourced from legally accessible, open-licensed PDF archives and public-domain resources.
### Other Known Limitations
- **Bias**: Primarily formal and academic content; limited representation of informal or regional French
- **Format Variation**: Some PDFs may include embedded scanned pages or mixed formats
- **Geographical Bias**: Mostly European French; limited coverage of African and Canadian French variants
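Because some PDFs may mix born-digital and scanned pages, a pipeline may want to route pages with little or no embedded text to OCR. The following is a minimal sketch under stated assumptions: it takes per-page text that has already been extracted by whatever PDF library you use, and the function name `flag_scanned_pages` and the `min_chars` threshold of 20 are hypothetical choices, not values from the dataset authors.

```python
from typing import List, Optional


def flag_scanned_pages(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 0-based indices of pages whose extracted text is nearly empty.

    Assumption: a page with fewer than `min_chars` non-whitespace characters
    in its text layer is probably a scanned image and needs OCR.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged
```

The threshold is a heuristic and would need tuning; pages that are legitimately sparse (title pages, blank versos) will also be flagged.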
## Intended Uses
### ✅ Direct Use
- Training and evaluation of OCR systems for French text
- Research in document classification and layout analysis
- Digitization of French-language archives and educational material
### ❌ Out-of-Scope Use
- Identification of individuals or private data from PDFs
- Commercial use of copyrighted works without permission
- Use in profiling, surveillance, or handwriting analysis
## License
CC BY 4.0
Provider:
maas
Created:
2025-11-06



