
French-PD-Books

ModelScope community · updated 2026-05-09 · indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/French-PD-Books
Description:
# 🇫🇷 French Public Domain Books 🇫🇷

**French-Public Domain-Book** or **French-PD-Books** is a large collection that aims to aggregate all French monographs in the public domain.

The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for [Gallicagram](https://shiny.ens-paris-saclay.fr/app/gallicagram), and in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project that gives access to word and ngram search on very large cultural heritage datasets in French and other languages.

## Content

As of January 2024, the collection contains 289,000 books (16,407,292,362 words) from the French National Library (Gallica). Each parquet file holds the full text of 2,000 books selected at random, along with a few core metadata fields (Gallica id, title, author, word counts…). The metadata can easily be expanded through the BNF API.

This initial aggregation was made possible by the open data program of the French National Library and by the consolidation of public domain status for cultural heritage works in the EU under the 2019 Copyright Directive (art. 14).

The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years.

## Uses

The primary use of the collection is for large-scale cultural analytics projects. It is already in use by the Gallicagram project, an open and significantly enhanced version of the ngram viewer.

The collection also aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.

## License

The entire collection is in the public domain everywhere. This means that the patrimonial rights of every individual or collective rightsholder have expired.

The French National Library claims additional rights in its terms of use and restricts commercial use: "La réutilisation commerciale de ces contenus est payante et fait l'objet d'une licence. Est entendue par réutilisation commerciale la revente de contenus sous forme de produits élaborés ou de fourniture de service ou toute autre réutilisation des contenus générant directement des revenus." (In English: "Commercial reuse of this content is subject to a fee and requires a license. Commercial reuse means the resale of content in the form of derived products or the provision of a service, or any other reuse of the content that directly generates revenue.")

There has been a debate for years in Europe over the definition of the public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)

## Future developments

This dataset is not a one-off effort and will continue to evolve significantly in three directions:

* Correction of computer-generated errors in the text. All the texts have been transcribed automatically with Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents have transcription defects. Future versions will strive either to re-OCR the original text or to use experimental LLM models for partial OCR correction.
* Enhancement of the structure and editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (headers, page counts…). Additionally, some advanced document structures such as tables or multi-column layouts are unlikely to be well formatted. Major enhancements could be expected from applying new SOTA layout recognition models (like COLAF) to the original PDF files.
* Expansion of the collection to other cultural heritage holdings, especially those of Hathi Trust, Internet Archive and Google Books.

## Acknowledgements

The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).

Corpus collection has been largely facilitated by the insights and cooperation of the open science LLM community (Occiglot, Eleuther AI, Allen AI).

<div style="text-align: center;"> <img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/> <img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/> <img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/> </div>
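
The layout described under Content (289,000 books, 2,000 per parquet file, each row carrying a word count and an author) lends itself to a quick sanity check. The following is a minimal sketch, not the maintainers' tooling: the column names `word_count` and `author` are hypothetical, since the card only lists the metadata informally, and the sample rows are invented placeholders. Rows from a real file could be obtained with something like `pandas.read_parquet(path).to_dict("records")`, assuming pandas and a parquet engine are installed.

```python
import math
from collections import Counter
from typing import Iterable, Mapping

# Figures quoted in the dataset card (as of January 2024).
TOTAL_BOOKS = 289_000
TOTAL_WORDS = 16_407_292_362
BOOKS_PER_FILE = 2_000


def expected_shard_count(total_books: int = TOTAL_BOOKS,
                         books_per_file: int = BOOKS_PER_FILE) -> int:
    """Number of parquet files needed if every file holds `books_per_file` books."""
    return math.ceil(total_books / books_per_file)


def summarize_shard(records: Iterable[Mapping],
                    word_count_key: str = "word_count",  # hypothetical column name
                    author_key: str = "author") -> dict:  # hypothetical column name
    """Aggregate basic statistics over the metadata rows of one file."""
    books = 0
    words = 0
    authors = Counter()
    for row in records:
        books += 1
        # Treat a missing or empty word count as zero rather than failing.
        words += int(row.get(word_count_key) or 0)
        authors[row.get(author_key) or "unknown"] += 1
    return {
        "books": books,
        "words": words,
        "mean_words": words / books if books else 0.0,
        "top_authors": authors.most_common(3),
    }


# Invented placeholder rows standing in for one file's metadata.
SAMPLE_ROWS = [
    {"id": "placeholder-1", "title": "Example A", "author": "Author A", "word_count": 10_000},
    {"id": "placeholder-2", "title": "Example B", "author": "Author A", "word_count": 20_000},
    {"id": "placeholder-3", "title": "Example C", "author": "Author B", "word_count": 30_000},
]

if __name__ == "__main__":
    print("expected files:", expected_shard_count())
    print("mean words per book (whole collection):", TOTAL_WORDS / TOTAL_BOOKS)
    print("sample summary:", summarize_shard(SAMPLE_ROWS))
```

Keeping the summary function independent of any parquet reader means the same check works on rows streamed from whichever loader is at hand; only the two key names would need adjusting to the real schema.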

Provider: maas
Created: 2025-06-19