Spanish-PD-Books
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-06-21.
Download link:
https://modelscope.cn/datasets/PleIAs/Spanish-PD-Books
Resource description:
# 🇪🇸 Spanish Public Domain Books 🇪🇸
**Spanish-Public Domain-Books** or **Spanish-PD-Books** is a large collection aiming to aggregate all Spanish monographs in the public domain. As of March 2024, together with Spanish-PD-Newspapers, it is the largest open Spanish corpus.
## Dataset summary
The collection contains 302,640 individual texts totaling 13.9 billion words, recovered from multiple sources, including Spain's leading cultural heritage institution, the Biblioteca Digital Hispánica (BDH), and the Internet Archive. Each parquet file holds the full text of 2,000 books selected at random.
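The figures above imply a rough file count and average text length. The following is a minimal sketch of that arithmetic, plus one possible way to split texts into random groups of 2,000. The function names, the seed, and the shuffling strategy are assumptions for illustration; the card does not describe how the files were actually built, and the published file count may differ.

```python
import math
import random

# Figures quoted in the dataset summary (the word count is rounded).
TOTAL_TEXTS = 302_640
TOTAL_WORDS = 13_900_000_000
BOOKS_PER_FILE = 2_000


def expected_file_count(total_texts=TOTAL_TEXTS, per_file=BOOKS_PER_FILE):
    """Number of files needed if every file holds at most `per_file` texts."""
    return math.ceil(total_texts / per_file)


def random_shards(book_ids, per_file=BOOKS_PER_FILE, seed=0):
    """Hypothetical sharding: shuffle the ids, then cut them into fixed-size groups."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i:i + per_file] for i in range(0, len(ids), per_file)]


shards = random_shards(range(TOTAL_TEXTS))
print(expected_file_count())             # 152
print(len(shards), len(shards[-1]))      # 152 640
print(round(TOTAL_WORDS / TOTAL_TEXTS))  # 45929 (approximate words per text)
```

Under these assumptions the last group is only partly filled (640 texts), and the average text runs to roughly 46,000 words, which is consistent with book-length documents.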
## Curation method
The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).
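The 70-year criterion can be sketched as a simple date check. This is an illustrative simplification, not the curators' actual filter: it assumes a single author with a known death year, and it assumes the term is counted from 1 January of the year following the author's death, as EU term rules do. It does not model joint authorship, anonymous works, or national exceptions. The function names are hypothetical.

```python
from datetime import date

TERM_YEARS = 70  # post-mortem term stated in the curation criteria


def public_domain_from(death_year: int) -> date:
    """First day on which the work is out of copyright under this simplified rule."""
    # The term runs for 70 full calendar years after the year of death.
    return date(death_year + TERM_YEARS + 1, 1, 1)


def is_public_domain(death_year: int, on: date) -> bool:
    """True if an author who died in `death_year` is in the public domain on `on`."""
    return on >= public_domain_from(death_year)


print(public_domain_from(1936))                    # 2007-01-01
print(is_public_domain(1953, date(2024, 3, 1)))    # True
print(is_public_domain(1954, date(2024, 3, 1)))    # False
```

With these assumptions, a March 2024 snapshot would admit authors who died in 1953 or earlier.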
## Uses
The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
The rationales for creating this collection are manifold:
* **Scientific**: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act and its copyright compliance obligations for pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated among players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.
## License
The entire collection is in the public domain in all regions. This means that the patrimonial rights of every individual or collective rights holder have expired.
There has been a debate for years in Europe over the definition of public domain and the possibility to restrict its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)
## Future work
This dataset is not a one-time work but will continue to evolve significantly in three directions:
* Expansion of the dataset to the late 19th and early 20th century works and its further enhancement with currently unexploited collections coming from European patrimonial data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically using Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents contain errors. Future versions will strive either to re-OCR the original text or to use experimental LLMs for partial OCR correction.
* Enhancement of the structure and editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (headers, page numbers…). Additionally, some advanced document structures such as tables or multi-column layouts are unlikely to be well formatted.
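Until such cleanup ships, users may want to filter page furniture themselves. The following is a minimal heuristic sketch, not the maintainers' pipeline: it drops lines that contain only a page number and short lines that repeat several times (a common signature of running headers). The function name, the 60-character limit, and the `min_repeats` threshold are assumptions chosen for illustration.

```python
import re
from collections import Counter

# A line holding only a page number, optionally wrapped in dashes: "12", "- 13 -".
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")


def strip_page_furniture(text: str, min_repeats: int = 3) -> str:
    """Remove page-number lines and short lines repeated at least `min_repeats` times."""
    lines = text.splitlines()
    # Count short non-empty lines; frequent ones are likely running headers.
    counts = Counter(
        line.strip() for line in lines if 0 < len(line.strip()) <= 60
    )
    kept = []
    for line in lines:
        stripped = line.strip()
        if PAGE_NUMBER.match(stripped):
            continue  # page number only
        if stripped and counts[stripped] >= min_repeats:
            continue  # repeated short line, treated as a header
        kept.append(line)
    return "\n".join(kept)


sample = (
    "DON QUIJOTE\nEn un lugar de la Mancha\n12\n"
    "DON QUIJOTE\nde cuyo nombre\n- 13 -\n"
    "DON QUIJOTE\nno quiero acordarme"
)
print(strip_page_furniture(sample))
```

This heuristic is deliberately crude: it will also remove legitimate short lines that recur in the body (a repeated line of dialogue, for instance), so the threshold should be tuned on a sample before applying it at scale.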
## Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language Technologies EDIC (ALT-EDIC).
Corpus collection was greatly facilitated by the insights, cooperation, and support of the open science LLM community (Occiglot, Eleuther AI, OpenLLM France, Allen AI).
<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
Provider:
maas
Created:
2025-06-19



