
BDH-Books

ModelScope community. Updated 2025-12-04; indexed 2025-06-21.
Download link:
https://modelscope.cn/datasets/PleIAs/BDH-Books
Description:
# 🇪🇸 Biblioteca Digital Hispánica - Books 🇪🇸

**Biblioteca Digital Hispánica-Books**, or **BDH-Books**, is a large collection that aims to aggregate all public domain Spanish books from the Biblioteca Digital Hispánica.

## Dataset summary

The collection contains 139,932 individual titles, mostly published in the 19th century and the first half of the 20th century, totaling nearly 11 billion words (10,753,912,288 space-separated words).

## Curation method

The composition of the dataset follows the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).

As of March 2024, to limit rights verification, we have retained only titles published prior to 1884. The corpus will be expanded at a later stage to cover late 19th century and early 20th century publications, after checking their public domain validity.

## Uses

The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.

There are several reasons for creating this collection:

* **Scientific**: The closure of training corpora is a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act and its obligations regarding copyright compliance for pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated among players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.

## License

The entire collection is in the public domain in all regions. This means that the patrimonial rights of all individual or collective rights holders have expired.

There has been a debate for years in Europe over the definition of the public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)

## Future work

This dataset is not a one-time work but will continue to evolve significantly in three directions:

* Expansion of the dataset to late 19th and early 20th century works, and further enrichment with currently unexploited collections from European patrimonial data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically with Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents contain recognition errors. Future versions will strive either to re-OCR the original text or to use experimental LLMs for partial OCR correction.
* Enhancement of the structure and editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (header, page count…). Additionally, some advanced document structures such as tables or multi-column layouts are unlikely to be well formatted.

## Acknowledgements

The corpus was stored and processed with the generous support of Scaleway. It was built with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language Technologies EDIC (ALT-EDIC).

Corpus collection was greatly facilitated by the insights and cooperation of the open science LLM community (Occiglot, Eleuther AI, Allen AI).

<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
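
For readers who want to work with the figures and caveats in this card, the following is a minimal Python sketch, not the maintainers' pipeline. It illustrates three points from the card: counting space-separated words, applying the pre-1884 publication cutoff, and dropping page-number lines of the kind described as unwanted for training. The record fields (`title`, `year`, `text`), the helper names, and the page-number heuristic are all assumptions made for illustration; the actual dataset schema is not described here.

```python
import re

# From the card: only titles published prior to 1884 are retained.
CUTOFF_YEAR = 1884

# Assumption: a line holding only a short number (optionally wrapped in
# hyphens, e.g. "- 12 -") is a page number. This heuristic will also drop
# a line that contains nothing but a year.
PAGE_NUMBER_LINE = re.compile(r"^\s*-?\s*\d{1,4}\s*-?\s*$")


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (approximates 'space-separated words')."""
    return len(text.split())


def published_before_cutoff(year: int, cutoff: int = CUTOFF_YEAR) -> bool:
    """Return True when the publication year is strictly before the cutoff."""
    return year < cutoff


def drop_page_number_lines(text: str) -> str:
    """Remove lines that look like bare page numbers."""
    kept_lines = [ln for ln in text.splitlines() if not PAGE_NUMBER_LINE.match(ln)]
    return "\n".join(kept_lines)


# Hypothetical records; field names are illustrative only.
records = [
    {"title": "Ejemplo A", "year": 1850, "text": "En un lugar\n12\nde la Mancha"},
    {"title": "Ejemplo B", "year": 1901, "text": "Texto posterior"},
]

kept = [r for r in records if published_before_cutoff(r["year"])]
cleaned = [drop_page_number_lines(r["text"]) for r in kept]
total_words = sum(count_words(t) for t in cleaned)

# Average title length implied by the card's own totals.
average_words_per_title = 10_753_912_288 / 139_932

print(len(kept), total_words, round(average_words_per_title))
```

Dividing the card's word total by its title count gives roughly 76,851 words per title, which is consistent with a collection of full-length books. The page-number filter is deliberately conservative; running headers and multi-column residue would need document-specific rules.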

Provider:
maas
Created:
2025-06-19