
PleIAs/Spanish-PD-Newspapers

Source: Hugging Face. Updated 2024-03-21. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/PleIAs/Spanish-PD-Newspapers
Resource description:
# 🇪🇸 Spanish Public Domain Newspapers 🇪🇸

**Spanish-Public Domain-Newspapers** or **Spanish-PD-Newspapers** is a large collection aiming to aggregate all Spanish monographs in the public domain. As of March 2024, together with Spanish-PD-Books, it is the biggest Spanish open corpus.

## Dataset summary

The collection contains 247,491 individual texts making up 2,697,414,811 words recovered from multiple sources, including Spain's leading cultural heritage institution, Biblioteca Digital Hispanica (BDH), and the Internet Archive.

Each parquet file has the full text of 2,000 books selected at random.

## Curation method

The composition of the dataset adheres to the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. Additionally, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).

## Uses

The collection aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.

The rationales for the creation of this collection are manifold:

* **Scientific**: We observe that the closure of training corpora represents a major barrier to AI research. Large language models face a real crisis of reproducibility.
* **Legal**: With the adoption of the AI Act and its obligations regarding copyright law compliance for pretraining corpora, the European AI ecosystem will have to change its provenance practices.
* **Cultural**: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
* **Economic**: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependencies on dominant actors.

## License

The entire collection is in the public domain in all regions. This means that the patrimonial rights of each individual or collective rights holder have expired.

There has been a debate for years in Europe over the definition of the public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)

## Future work

This dataset is not a one-time work but will continue to evolve significantly in three directions:

* Expansion of the dataset to late 19th and early 20th century works, and its further enhancement with currently unexploited collections coming from European patrimonial data repositories.
* Correction of computer-generated errors in the text. All the texts have been transcribed automatically through the use of Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s) and some documents should be re-OCRed. Future versions will strive either to re-OCRize the original text or to use experimental LLM models for partial OCR correction.
* Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layout are unlikely to be well-formatted.

## Acknowledgements

The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).

Corpus collection has been largely facilitated thanks to the insights, cooperation and support of the open science LLM community (Occiglot, Eleuther AI, OpenLLM France, Allen AI).

<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
Provider:
PleIAs
Summary of original information

🇪🇸 Spanish Public Domain Newspapers 🇪🇸

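Spanish-Public Domain-Newspapers, or Spanish-PD-Newspapers, is a large collection that aims to aggregate all Spanish monographs in the public domain. As of March 2024, together with Spanish-PD-Books, it is the largest open Spanish corpus.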

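Dataset summary

The collection contains 247,491 individual texts totaling 2,697,414,811 words, recovered from multiple sources including Spain's leading cultural heritage institution, Biblioteca Digital Hispanica (BDH), and the Internet Archive. Each parquet file contains the full text of 2,000 books selected at random.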
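To give a feel for the scale these figures imply, here is a minimal sketch that derives the average text length and the number of parquet files from the three numbers quoted above. The file count assumes every file except the last holds exactly 2,000 texts, which the card does not confirm. The `peek_first_record` helper is hypothetical: it assumes the Hugging Face `datasets` library is installed and that the default split is named `train`, and it only lists column names because the schema is not described in the card.

```python
import math

# Figures quoted in the dataset summary.
TOTAL_TEXTS = 247_491
TOTAL_WORDS = 2_697_414_811
TEXTS_PER_FILE = 2_000


def average_words_per_text(total_words=TOTAL_WORDS, total_texts=TOTAL_TEXTS):
    """Mean number of words per text across the collection."""
    return total_words / total_texts


def implied_file_count(total_texts=TOTAL_TEXTS, texts_per_file=TEXTS_PER_FILE):
    """Parquet files needed if every file but the last is full (assumption)."""
    return math.ceil(total_texts / texts_per_file)


def peek_first_record():
    """Hypothetical helper: stream one record and return its column names.

    Assumes the `datasets` library is installed and that the split is
    called "train"; neither is stated in the dataset card.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(
        "PleIAs/Spanish-PD-Newspapers", split="train", streaming=True
    )
    first = next(iter(stream))
    return sorted(first.keys())


if __name__ == "__main__":
    print(f"{average_words_per_text():,.0f} words per text on average")  # 10,899
    print(f"{implied_file_count()} parquet files implied")  # 124
```

Streaming avoids downloading the whole corpus just to inspect its layout, which matters for a collection of roughly 2.7 billion words.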

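Curation method

The composition of the dataset follows the criteria for public domain works in the EU and, consequently, in all Berne Convention countries for EU authors: any publication whose author has been dead for more than 70 years. In addition, the initial consolidation of public domain status for cultural heritage operates in the EU under the 2019 Copyright Directive (art. 14).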
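The 70-year rule can be expressed as a one-line check. The sketch below is only an illustration of the stated criterion, not the curators' actual pipeline: the function name and the idea of a per-text author death year are assumptions, and real newspaper issues often have several or anonymous authors, which this does not handle.

```python
def is_public_domain(author_death_year: int, reference_year: int) -> bool:
    """Return True if the author has been dead for more than 70 years.

    Works on whole calendar years, so a work qualifies from the year
    `author_death_year + 71` onward.
    """
    return reference_year - author_death_year > 70


if __name__ == "__main__":
    # An author who died in 1953 qualifies in 2024 (71 years).
    print(is_public_domain(1953, 2024))
    # An author who died in 1954 does not yet qualify in 2024 (70 years).
    print(is_public_domain(1954, 2024))
```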

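Uses

The collection aims to expand the availability of open works for training large language models. The text can be used for model training and republished without restriction for reproducibility purposes.

The reasons for creating this collection include:

  • Scientific: We observe that the closure of training corpora is a major barrier to AI research. Large language models face a real crisis of reproducibility.
  • Legal: With the adoption of the AI Act and its obligations regarding copyright law compliance for pretraining corpora, the European AI ecosystem will have to change its provenance practices.
  • Cultural: The linguistic diversity of the European Union is currently underrepresented. Unlike web archives, open, heritage, administrative, or scientific texts are often of high quality: they are long, multilingual, and editorialized publications.
  • Economic: Today, value capture is concentrated on players whose financial resources are already considerable, allowing them to collect or purchase data at a high price. Making a royalty-free corpus available to as many people as possible frees innovation in uses and minimizes economic dependence on dominant actors.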

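License

The entire collection is in the public domain in all regions. This means that the patrimonial rights of each individual or collective rights holder have expired.

Europe has debated for years the definition of the public domain and the possibility of restricting its use. Since 2019, the EU Copyright Directive states that "Member States shall provide that, when the term of protection of a work of visual art has expired, any material resulting from an act of reproduction of that work is not subject to copyright or related rights, unless the material resulting from that act of reproduction is original in the sense that it is the author's own intellectual creation." (art. 14)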

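Future work

This dataset is not a one-time work but will continue to evolve significantly in three directions:

  • Expansion of the dataset to late 19th and early 20th century works, and further enhancement with currently unexploited collections from European patrimonial data repositories.
  • Correction of computer-generated errors in the text. All texts were transcribed automatically with Optical Character Recognition (OCR) software. The original files were digitized over a long period (since the mid-2000s), and some documents should be re-OCRed. Future versions will strive either to re-OCR the original text or to use experimental LLM models for partial OCR correction.
  • Enhancement of the structure and editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (headers, page numbers, and so on). In addition, some advanced document structures, such as tables or multi-column layouts, are unlikely to be well formatted.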
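Until such cleanup lands upstream, users may want a rough filter for the unwanted page furniture mentioned above. The following is a minimal heuristic sketch, not the maintainers' method. It assumes the text can be split into pages (the card does not say whether page boundaries are preserved), treats any line that opens at least `min_repeats` pages as a running header, and drops lines made up only of a short number. The function name, the threshold, and the sample newspaper title are all illustrative.

```python
import re
from collections import Counter

# A line holding only a 1-4 digit number, optionally wrapped in dashes.
PAGE_NUMBER = re.compile(r"^\s*[-–]?\s*\d{1,4}\s*[-–]?\s*$")


def strip_page_furniture(pages, min_repeats=3):
    """Remove likely running headers and page numbers from a list of pages.

    A line counts as a header if it is the first non-empty line of at
    least `min_repeats` pages; every occurrence of that line is dropped.
    """
    first_lines = Counter()
    for page in pages:
        non_empty = [line.strip() for line in page.splitlines() if line.strip()]
        if non_empty:
            first_lines[non_empty[0]] += 1
    headers = {line for line, count in first_lines.items() if count >= min_repeats}

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if line.strip()
            and line.strip() not in headers
            and not PAGE_NUMBER.match(line)
        ]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    sample = [
        "EL IMPARCIAL\nNoticias de Madrid.\n1",
        "EL IMPARCIAL\nTeatro y sociedad.\n- 2 -",
        "EL IMPARCIAL\nÚltima hora.\n3",
    ]
    for page in strip_page_furniture(sample):
        print(page)
```

One trade-off to keep in mind: a line that contains only a year, such as 1898, also matches the page-number pattern and would be dropped, so the pattern may need tightening for date-heavy material.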