
US-PD-Newspapers

ModelScope community · updated 2025-12-04 · indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/US-PD-Newspapers
Description:
# 🇺🇸 US Public Domain Newspapers 🇺🇸

**US-PD-Newspapers** is an aggregation of all the archives of US newspapers digitized by the Library of Congress for the Chronicling America digital library.

With nearly 100 billion words, it is one of the largest open corpora in the United States. All the materials are now part of the public domain and have no intellectual property rights remaining.

## Content

As of January 2024, the collection contains nearly 21 million unique newspaper and periodical editions published from 1690 to 1963 (98,742,987,471 words).

The collection was compiled by Pierre-Carl Langlais based on the [dumps](https://chroniclingamerica.loc.gov/data/ocr/) made available by the Library of Congress. Each parquet file matches one of the 2618 original dump files, including their code name. Each file contains the full text of a few thousand editions selected at random and a few core metadata fields (edition id, date, word counts…). The metadata can be easily expanded thanks to the LOC APIs and other data services.

The [American Stories dataset](https://huggingface.co/datasets/dell-research-harvard/AmericanStories) is a curated and enhanced version of the same resource, with significant progress in text quality and documentation. It currently retains about 20% of the original material.

## Language

While most of the collection is in English, it also covers a wide variety of European languages, especially German (600k editions) and Spanish (400k editions).

## Uses

The primary use of the collection is large-scale cultural analytics. It has been instrumental for some major digital humanities projects like [Viral Texts](https://viraltexts.org/).

The collection also aims to expand the availability of open works for the training of Large Language Models. The text can be used for model training and republished without restriction for reproducibility purposes.
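The per-edition metadata described under Content (edition id, date, word counts) lends itself to simple aggregate statistics. The following is a minimal sketch, not the author's tooling: the field names `id`, `date` and `word_count` and the sample rows are assumptions made for illustration, and the real parquet schema should be checked before reuse. It sums word counts per decade of publication using only the standard library.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def words_per_decade(records: Iterable[Mapping[str, object]]) -> dict[int, int]:
    """Sum word counts by decade of publication.

    Assumes each record has a `date` string starting with a four-digit year
    (e.g. "1897-03-14") and an integer `word_count`. Both field names are
    hypothetical and should be checked against the actual parquet schema.
    """
    totals: dict[int, int] = defaultdict(int)
    for record in records:
        year = int(str(record["date"])[:4])
        decade = year - year % 10
        totals[decade] += int(record["word_count"])
    # Return a plain dict ordered by decade.
    return dict(sorted(totals.items()))


# Made-up rows standing in for one parquet file. With pandas installed, real
# rows could be obtained with pandas.read_parquet(path).to_dict("records").
sample = [
    {"id": "ed-1", "date": "1897-03-14", "word_count": 5200},
    {"id": "ed-2", "date": "1893-11-02", "word_count": 4100},
    {"id": "ed-3", "date": "1921-07-30", "word_count": 8800},
]

print(words_per_decade(sample))  # {1890: 9300, 1920: 8800}
```

Because the function only iterates over records, it can be applied file by file and the per-decade totals merged afterwards, which avoids loading all 2618 files at once.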
## License

The composition of the dataset adheres to the US criteria for public domain (any publication without a copyright renewal). In agreement with the rule of the shorter term, the dataset is in the public domain for all countries with a Berne author-right model.

The Library of Congress does not claim any additional rights: "As a publicly supported institution, we generally do not own the rights to materials in our collections. You should determine for yourself whether or not an item is protected by copyright or in the public domain, and then satisfy any copyright or use restrictions when publishing or distributing materials from our collections."

## Future developments

This dataset is not a one-time effort but will continue to evolve significantly in several directions:

* Correction of computer-generated errors in the text. All the texts have been transcribed automatically through the use of Optical Character Recognition (OCR) software. The original files have been digitized over a long time period (since the mid-2000s).
* Enhancement of the structure/editorial presentation of the original text. Some parts of the original documents are likely unwanted for large-scale analysis or model training (header, page count…). Additionally, some advanced document structures like tables or multi-column layout are unlikely to be well formatted. Major enhancements could be expected from applying new SOTA layout recognition models to the original PDF files.
* Expansion of the collection to other cultural heritage holdings, especially coming from Hathi Trust, Internet Archive and Google Books.

The American Stories dataset already includes some of these features (especially better OCR and article-level segmentation) and may be a preferable solution if text quality is a concern.

## Acknowledgements

The corpus was stored and processed with the generous support of [OpenLLM France](https://www.openllm-france.fr/) and Scaleway.
It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC). Corpus collection has been largely facilitated thanks to the open science LLM community insights and cooperation (Occiglot, Eleuther AI, Allen AI).

<div style="text-align: center;">
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
  <img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>

Provider:
maas
Created:
2025-06-19