Latin-PD
ModelScope community, updated 2025-12-05, indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/Latin-PD
Resource description:
# 🇲🇪 Latin Public Domain Books (Latin) 🇲🇪
**Latin-Public Domain** or **Latin-PD** is a large collection that aims to aggregate all Latin monographs and periodicals in the public domain. As of June 2024, it is the largest open Latin corpus.
## Dataset summary
The collection contains 16,521,454,086 words (159,070 titles) recovered from multiple sources, including the Internet Archive and various European national libraries and cultural heritage institutions (BDH, BNF). Each parquet file has the full text of 1,000 books selected at random.
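As a rough sanity check on the figures above, the following sketch derives two numbers the summary does not state: the approximate number of parquet files and the mean length of a title. It assumes every file holds exactly 1,000 books except possibly the last one, which the summary does not confirm; the constant names are illustrative.

```python
import math

# Figures quoted in the dataset summary.
TOTAL_WORDS = 16_521_454_086
TOTAL_TITLES = 159_070
BOOKS_PER_FILE = 1_000

# Assumption: uniform sharding, with one final partial file.
expected_files = math.ceil(TOTAL_TITLES / BOOKS_PER_FILE)

# Mean number of words per title across the whole collection.
mean_words_per_title = TOTAL_WORDS / TOTAL_TITLES

print(expected_files)               # 160
print(round(mean_words_per_title))  # 103863
```

Under that assumption, the collection would span about 160 files, with an average title running to roughly 104,000 words.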
## Curation method
The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).
As of June 2024, to limit rights verification, we have retained exclusively titles published prior to 1884.
The corpus will be expanded at a later stage to encompass late 19th century and early 20th century publications, after checking for public domain validity.
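The two criteria above can be sketched as simple predicates. This is an illustration only, not the curators' actual pipeline: the function names are hypothetical, and the code assumes the usual EU convention that the 70-year term is counted from 1 January of the year following the author's death.

```python
# Hypothetical helpers illustrating the curation criteria described above.
CUTOFF_YEAR = 1884        # titles must be published before this year
POST_MORTEM_TERM = 70     # years of protection after the author's death


def author_term_expired(death_year: int, reference_year: int) -> bool:
    """True if the post-mortem term has lapsed by reference_year.

    Assumption: protection runs through 31 December of
    death_year + 70, so the work is free from the following 1 January.
    """
    return reference_year > death_year + POST_MORTEM_TERM


def passes_publication_cutoff(publication_year: int) -> bool:
    """True if the title falls under the conservative pre-1884 filter."""
    return publication_year < CUTOFF_YEAR


print(author_term_expired(1953, 2024))   # True
print(author_term_expired(1954, 2024))   # False
print(passes_publication_cutoff(1883))   # True
```

Under these assumptions, the author of a pre-1884 title would need to have died in 1954 or later, more than 70 years after publication, for the work to still be protected in 2024. That is presumably why a publication-date cutoff can stand in for per-author verification.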
## Uses
The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
The rationale for creating this collection is manifold:
* **Scientific**: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act with its obligations in terms of copyright law compliance for the pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.
## License
The entire collection is in the public domain in all regions. This means that the patrimonial rights of all individual or collective rights holders have expired.
There has been a debate for years in Europe over the definition of public domain and the possibility to restrict its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)
## Future work
This dataset is not a one-time work but will continue to evolve significantly in three directions:
* Expansion of the dataset to late 19th and early 20th century works, and further enhancement with currently unexploited collections from European cultural heritage data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically using Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents should be reprocessed. Future versions will strive either to re-run OCR on the original text or to use experimental LLMs for partial OCR correction.
* Enhancement of the structure and editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (headers, page numbers…). Additionally, some advanced document structures, such as tables or multi-column layouts, are unlikely to be well formatted.
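To make the last point concrete, here is a minimal sketch of how running headers and bare page numbers might be stripped from OCR output. It is not the method the maintainers use or plan to use: the function name, the `min_repeats` threshold, and the assumption that the text is available page by page are all illustrative.

```python
import re
from collections import Counter

# A line holding only a page number, optionally wrapped in dashes ("- 14 -").
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")


def strip_page_furniture(pages, min_repeats=3):
    """Drop bare page numbers and repeated running headers.

    pages: list of strings, one per page (assumed to be available).
    A line counts as a running header if it is the first non-numeric,
    non-empty line of at least min_repeats pages.
    """
    first_lines = Counter()
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped and not PAGE_NUMBER.match(stripped):
                first_lines[stripped] += 1
                break
    headers = {line for line, n in first_lines.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            stripped = line.strip()
            # Note: a header string is removed wherever it occurs on the page.
            if PAGE_NUMBER.match(stripped) or stripped in headers:
                continue
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "DE BELLO GALLICO\nGallia est omnis divisa\n12",
    "DE BELLO GALLICO\nin partes tres\n13",
    "DE BELLO GALLICO\nquarum unam incolunt\n- 14 -",
]
print(strip_page_furniture(pages))
```

A heuristic like this is deliberately conservative. It would miss headers that OCR renders differently on each page, and it cannot repair tables or multi-column layouts.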
## Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language Technologies EDIC (ALT-EDIC).
Corpus collection was greatly facilitated by the insights, cooperation, and support of the open science LLM community (Occiglot, Eleuther AI, OpenLLM France, Allen AI).
<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
Provider:
maas
Created:
2025-06-19



