
GaloisField2718/French-PD-Newspapers

Hugging Face. Updated 2026-02-17; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/GaloisField2718/French-PD-Newspapers
---
task_categories:
- text-generation
language:
- fr
tags:
- ocr
pretty_name: French-Public Domain-Newspapers
---

# 🇫🇷 French Public Domain Newspapers 🇫🇷

**French-Public Domain-Newspapers** or **French-PD-Newspapers** is a large collection aiming to aggregate all the French newspapers and periodicals in the public domain.

The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for [Gallicagram](https://shiny.ens-paris-saclay.fr/app/gallicagram), and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project giving access to word and n-gram search on very large cultural heritage datasets in French and other languages.

## Content

As of January 2024, the collection contains nearly three million unique newspaper and periodical editions (69,763,525,347 words) from the French National Library (Gallica). Each parquet file has the full text of a few thousand editions selected at random and, when available, a few core metadata fields (Gallica id, title, author, word counts…). The metadata can be easily expanded thanks to the BNF API.

This initial aggregation was made possible thanks to the open data program of the French National Library and the consolidation of public domain status for cultural heritage works in the EU with the 2019 Copyright Directive (art. 14).

The composition of the dataset adheres to the French criteria for public domain of collective works (any publication more than 70 years old) and individual works (any publication whose author has been dead for more than 70 years). In accordance with the rule of the shorter term, the dataset is in the public domain everywhere.

## Uses

The primary use of the collection is for cultural analytics projects on a wide scale.

The collection also aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
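The two public-domain criteria described under Content can be read as a simple year-based rule. The sketch below is only an illustration of that reading: the function name `is_public_domain` and its parameters are hypothetical, it is not the procedure actually used to build the collection, and it does not capture the legal subtleties of how copyright terms are computed.

```python
from datetime import date
from typing import Optional

TERM_YEARS = 70  # term stated in the dataset card


def is_public_domain(
    publication_year: int,
    author_death_year: Optional[int] = None,
    collective: bool = True,
    today: Optional[date] = None,
) -> bool:
    """Simplified, year-based reading of the two criteria in the card.

    Hypothetical helper for illustration only.
    """
    current_year = (today or date.today()).year
    if collective:
        # Collective work: published more than 70 years ago.
        return current_year - publication_year > TERM_YEARS
    if author_death_year is None:
        # Individual work with unknown death year: stay conservative.
        return False
    # Individual work: author dead for more than 70 years.
    return current_year - author_death_year > TERM_YEARS
```

For instance, with `today=date(2024, 1, 1)`, a collective work published in 1900 passes the check, while one published in 1960 does not.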
## License

The entire collection is in the public domain everywhere. This means that the patrimonial rights of each individual or collective rightholder have expired.

The French National Library claims additional rights in its terms of use and restricts commercial use: "La réutilisation commerciale de ces contenus est payante et fait l'objet d'une licence. Est entendue par réutilisation commerciale la revente de contenus sous forme de produits élaborés ou de fourniture de service ou toute autre réutilisation des contenus générant directement des revenus." (In English: "Commercial reuse of this content is subject to a fee and requires a licence. Commercial reuse means the resale of content in the form of finished products or the provision of a service, or any other reuse of the content that directly generates revenue.")

There has been a debate for years in Europe over the definition of public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)

## Future developments

This dataset is not a one-time work but will continue to evolve significantly in three directions:

* Correction of computer-generated errors in the text. All the texts have been transcribed automatically through the use of Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s) and some documents still contain flaws. Future versions will strive either to re-OCRize the original text or use experimental LLM models for partial OCR correction.
* Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layouts are unlikely to be well formatted. Major enhancements could be expected from applying new SOTA layout recognition models (like COLAF) to the original PDF files.
* Expansion of the collection to other cultural heritage holdings, especially coming from Hathi Trust, Internet Archive and Google Books.

## Acknowledgements

The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).

Corpus collection has been largely facilitated thanks to the open science LLM community insights and cooperation (Occiglot, Eleuther AI, Allen AI).

<div style="text-align: center;">
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
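Until the structural enhancements listed under Future developments are available, users may want a rough cleanup pass of their own. The following is a minimal sketch of one such heuristic, dropping lines that look like bare page numbers. The pattern, the function name `drop_page_number_lines`, and the sample text are all made up for illustration; this is not part of the dataset's tooling, and a line containing only a year (for example `1893`) would also be dropped.

```python
import re

# Hypothetical pattern: a line made only of a short number,
# optionally wrapped in hyphens, e.g. "12" or "- 3 -".
PAGE_NUMBER_LINE = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")


def drop_page_number_lines(text: str) -> str:
    """Remove lines that look like bare page numbers; keep all others."""
    kept = [
        line
        for line in text.splitlines()
        if not PAGE_NUMBER_LINE.match(line)
    ]
    return "\n".join(kept)


# Invented sample resembling OCR output with page-number residue.
sample = (
    "LE PETIT JOURNAL\n"
    "- 3 -\n"
    "Paris, le 4 mars.\n"
    "1893 fut une année difficile.\n"
    "12"
)
cleaned = drop_page_number_lines(sample)
```

A regex filter like this only targets the simplest residue. Running headers, tables, and multi-column layouts would need the layout-aware approaches mentioned in the list above.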

Provider:
GaloisField2718