PleIAs/BDH-Books
Source: Hugging Face. Updated 2024-03-19; indexed 2024-07-06.
Download link:
https://hf-mirror.com/datasets/PleIAs/BDH-Books
Resource description:
# 🇪🇸 Biblioteca Digitale Hispanica - Books 🇪🇸
**Biblioteca Digitale Hispanica-Books** or **BDH-Books** is a large collection that aims to aggregate all public domain Spanish books from the Biblioteca Digital Hispánica.
## Dataset summary
The collection contains 139,932 individual titles mostly published in the 19th century and the first half of the 20th century, making up nearly 11 billion words (10,753,912,288 space-separated words).
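The word figure above is a count of space-separated tokens. A minimal sketch of how such a count could be reproduced, assuming each document's text is available as a plain string. The helper names `count_words` and `corpus_word_count` and the sample strings are illustrative and not part of the dataset; note that `str.split()` splits on any whitespace, which may differ slightly from the method used for the reported figure.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in a single document."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


def corpus_word_count(texts) -> int:
    """Sum token counts over an iterable of document strings."""
    return sum(count_words(t) for t in texts)


# Illustrative sample, not taken from the dataset.
sample = ["En un lugar de la Mancha", "de cuyo nombre no quiero acordarme"]
total = corpus_word_count(sample)
print(total)
```

Because the function accepts any iterable, the same code could be fed a generator that streams documents one at a time rather than holding the whole corpus in memory.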
## Curation method
The dataset follows the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).
As of March 2024, to limit rights verification, we have retained exclusively titles published prior to 1884.
The corpus will be expanded at a later stage to encompass late 19th century and early 20th century publications, after checking for public domain validity.
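The publication-date cutoff amounts to a filter on catalogue metadata. A minimal sketch, assuming each record is a dictionary with a hypothetical integer `year` field (the actual schema of the dataset is not described here, so the field names and sample records are illustrative only):

```python
CUTOFF_YEAR = 1884  # titles published before this year are retained


def published_before_cutoff(record: dict, cutoff: int = CUTOFF_YEAR) -> bool:
    """Keep a record only if it has a usable year strictly below the cutoff."""
    year = record.get("year")
    # Records with a missing or non-integer year are excluded,
    # since their status cannot be checked by date alone.
    return isinstance(year, int) and year < cutoff


# Illustrative records, not taken from the dataset.
records = [
    {"title": "A", "year": 1850},
    {"title": "B", "year": 1884},
    {"title": "C", "year": None},
]
kept = [r["title"] for r in records if published_before_cutoff(r)]
print(kept)
```

Excluding records with unknown dates is a conservative choice made for this sketch; the curators' actual handling of undated titles is not stated.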
## Uses
The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
The rationales for creating this collection are manifold:
* **Scientific**: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act with its obligations in terms of copyright law compliance for the pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.
## License
The entire collection is in the public domain in all regions. This means that the patrimonial rights of each individual or collective rights holder have expired.
There has been a debate for years in Europe over the definition of public domain and the possibility to restrict its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)
## Future work
This dataset is not a one-time work but will continue to evolve significantly in three directions:
* Expansion of the dataset to the late 19th and early 20th century works and its further enhancement with currently unexploited collections coming from European patrimonial data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically using Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents should be reprocessed. Future versions will either re-OCR the original text or use experimental LLMs for partial OCR correction.
* Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layout are unlikely to be well-formatted.
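As an illustration of the kind of clean-up the last point describes, here is a rough heuristic sketch, not the curators' method. It assumes the text is available as one string per page (an assumption about the input, not a documented property of the dataset). It drops lines that contain only a page number, and lines matching a running header, detected as a first line that repeats on several pages. The function name and threshold are hypothetical.

```python
import re
from collections import Counter

# A line holding only a short number, optionally wrapped in hyphens ("- 13 -").
PAGE_NUMBER = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")


def strip_page_furniture(pages: list[str], min_repeats: int = 3) -> list[str]:
    """Remove page-number lines and repeated running headers from pages."""
    # Count how often each non-empty first line occurs across pages.
    first_lines = Counter()
    for page in pages:
        non_empty = [ln.strip() for ln in page.splitlines() if ln.strip()]
        if non_empty:
            first_lines[non_empty[0]] += 1
    headers = {ln for ln, n in first_lines.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            # Drops header text wherever it appears on the page, not only at the top.
            if not PAGE_NUMBER.match(ln) and ln.strip() not in headers
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

A heuristic like this also removes legitimate lines that consist only of a number, such as a year on its own line, so it would need tuning against real pages before use.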
## Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
Corpus collection has been largely facilitated thanks to the open science LLM community insights and cooperation (Occiglot, Eleuther AI, Allen AI).
<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
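Provider:
PleIAs
Original information summary
Biblioteca Digitale Hispanica - Books (BDH-Books)
Dataset overview
- Name: Biblioteca Digitale Hispanica - Books (BDH-Books)
- Content: 139,932 Spanish public domain books, mostly published in the 19th century and the first half of the 20th century.
- Word count: about 11 billion words (10,753,912,288 space-separated words).
Dataset composition
- Copyright status: Meets the public domain criteria of the EU and Berne Convention countries, i.e. publications whose author has been dead for more than 70 years.
- Time range: As of March 2024, only books published before 1884 are retained.
- Future expansion: Planned extension to late 19th century and early 20th century publications, after verifying their public domain status.
Uses
- Scientific research: Provides open text resources for training large language models, addressing the problem of closed training corpora.
- Legal compliance: Supports the copyright compliance requirements of the EU AI Act and encourages change in the European AI ecosystem.
- Cultural diversity: Improves the representation of the EU's linguistic diversity; compared with web archives, these texts are of higher quality, multilingual, and editorialized.
- Economic benefit: Providing a royalty-free dataset reduces economic dependence on dominant data collectors and promotes innovation.
License
- Public domain: The entire dataset is in the public domain in all regions, with no copyright restrictions.
- EU Copyright Directive: Since 2019, the EU Copyright Directive provides that once the term of protection of a work of visual art has expired, any material resulting from a reproduction of that work is not subject to copyright or related rights, unless the reproduction is itself original.
Future work
- Data expansion: Planned extension to late 19th century and early 20th century works, and integration of currently unexploited collections from European cultural heritage data repositories.
- Text correction: All texts were transcribed automatically with OCR software; future versions will either re-run OCR or use experimental LLMs for partial OCR correction.
- Structure improvement: Improve the structure and editorial presentation of the original text, remove parts unsuited to large-scale analysis or model training (headers, page numbers, and so on), and improve the handling of complex document structures (tables, multi-column layouts).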

The content above was collected and summarized by 遇见数据集.



