newswire
ModelScope community · updated 2025-12-04 · indexed 2024-08-31
Download link: https://modelscope.cn/datasets/AI-ModelScope/newswire
# Dataset Card for NewsWire
## Dataset Description
- **Homepage:** [Dell Research homepage](https://dell-research-harvard.github.io/)
- **Repository:** [GitHub repository](https://github.com/dell-research-harvard)
- **Paper:** [arxiv submission](https://arxiv.org/abs/2406.09490)
- **Point of Contact:** [Melissa Dell](mailto:melissadell@fas.harvard.edu)
### Dataset Summary
NewsWire contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model.
### Languages
English (en)
## Dataset Structure
Each year in the dataset is stored in a separate file (e.g., `1952_data_clean.json`).
### Data Instances
An example from the NewsWire dataset looks like:
```
{
"year": 1880,
"dates": ["Feb-23-1880"],
"article": "SENATE Washington, Feb. 23.--Bayard moved that in respect of the
memory of George Washington the senate adjourn ... ",
"byline": "",
"newspaper_metadata": [
{
"lccn": "sn92053943",
"newspaper_title": "the rock island argus",
"newspaper_city": "rock island",
"newspaper_state": " illinois "
},
...
],
"antitrust": 0,
"civil_rights": 0,
"crime": 0,
"govt_regulation": 1,
"labor_movement": 0,
"politics": 1,
"protests": 0,
"ca_topic": "Federal Government Operations",
"ner_words": ["SENATE", "Washington", "Feb", "23", "Bayard", "moved", "that",
"in", "respect", "of", "the", "memory", "of", "George", "Washington",
"the", "senate", "adjourn", ... ],
"ner_labels": ["B-ORG", "B-LOC", "O", "B-PER", "B-PER", "O", "O", "O", "O",
"O", "O", "O", "O", "B-PER", "I-PER", "O", "B-ORG", "O", ...],
"wire_city": "Washington",
"wire_state": "district of columbia",
"wire_country": "United States",
"wire_coordinates": [38.89511, -77.03637],
"wire_location_notes": "",
"people_mentioned": [
{
"wikidata_id": "Q23",
"person_name": "George Washington",
"person_gender": "man",
"person_occupation": "politician"
},
...
],
"cluster_size": 8
}
```
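The `ner_words` and `ner_labels` lists in the example are aligned token by token, and the labels appear to follow the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The following is a minimal sketch, under that assumption, of grouping the tokens into entity spans. The helper name `decode_bio` is hypothetical and not part of the dataset or any library.

```python
from typing import List, Optional, Tuple


def decode_bio(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_entity() -> None:
        # Emit the open entity, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            close_entity()
            current_tokens, current_type = [word], label[2:]
        elif label.startswith("I-") and label[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(word)
        else:
            # "O" (or an I- tag with no matching open entity) closes the entity.
            close_entity()
    close_entity()
    return entities


# Tokens and labels copied from the example record above (the lists are
# truncated there, so only the visible prefix is used here).
ner_words = ["SENATE", "Washington", "Feb", "23", "Bayard", "moved", "that",
             "in", "respect", "of", "the", "memory", "of", "George", "Washington",
             "the", "senate", "adjourn"]
ner_labels = ["B-ORG", "B-LOC", "O", "B-PER", "B-PER", "O", "O", "O", "O",
              "O", "O", "O", "O", "B-PER", "I-PER", "O", "B-ORG", "O"]

entities = decode_bio(ner_words, ner_labels)
print(entities)
```

In this excerpt the token "23" carries a `B-PER` label, so it comes out as a person span. The labels should therefore be treated as possibly noisy, and downstream code may want to filter implausible spans.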
### Data Fields
- `year`: year of article publication.
- `dates`: list of dates on which this article was published, as strings in the form mmm-DD-YYYY.
- `byline`: article byline, if any.
- `article`: article text.
- `newspaper_metadata`: list of newspapers that carried the article. Each newspaper is represented as a dictionary, where `lccn` is the newspaper's Library of Congress identifier, `newspaper_title` is the name of the newspaper, and `newspaper_city` and `newspaper_state` give the location of the newspaper.
- `antitrust`: binary variable. 1 if the article was classified as being about antitrust.
- `civil_rights`: binary variable. 1 if the article was classified as being about civil rights.
- `crime`: binary variable. 1 if the article was classified as being about crime.
- `govt_regulation`: binary variable. 1 if the article was classified as being about government regulation.
- `labor_movement`: binary variable. 1 if the article was classified as being about the labor movement.
- `politics`: binary variable. 1 if the article was classified as being about politics.
- `protests`: binary variable. 1 if the article was classified as being about protests.
- `ca_topic`: predicted Comparative Agendas topic of article.
- `ner_words`: list of the article's tokens, as used for named entity recognition.
- `ner_labels`: list of named entity tags (e.g. `B-PER`, `I-PER`, `B-LOC`, `B-ORG`, `O`), aligned one-to-one with `ner_words`.
- `wire_city`: City of wire service bureau that wrote the article.
- `wire_state`: State of wire service bureau that wrote the article.
- `wire_country`: Country of wire service bureau that wrote the article.
- `wire_coordinates`: Coordinates of city of wire service bureau that wrote the article.
- `wire_location_notes`: Contains the wire dispatch location if it is not a geographic location. Can be one of "Pacific Ocean (WWII)", "Supreme Headquarters Allied Expeditionary Force (WWII)", "North Africa", "War Front (WWI)", "War Front (WWII)" or "Johnson Space Center".
- `people_mentioned`: list of disambiguated people mentioned in the article. Each disambiguated person is represented as a dictionary, where `wikidata_id` is their ID in Wikidata, `person_name` is their name on Wikipedia, `person_gender` is their gender from Wikidata and `person_occupation` is the first listed occupation on Wikidata.
- `cluster_size`: Number of newspapers that ran the wire article. Equals length of `newspaper_metadata`.
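As an illustration of working with these fields, here is a minimal sketch that parses the `dates` strings, lists the topic flags set to 1, and strips stray whitespace from newspaper metadata (the example record has `" illinois "` with padding). The helper names and the abbreviated `record` are assumptions made for illustration. Note that `%b` parsing of month abbreviations depends on the locale and assumes English month names.

```python
from datetime import date, datetime
from typing import Any, Dict, List

# Binary topic fields listed in the field descriptions above.
TOPIC_FLAGS = ["antitrust", "civil_rights", "crime", "govt_regulation",
               "labor_movement", "politics", "protests"]


def parse_dates(record: Dict[str, Any]) -> List[date]:
    """Convert strings such as 'Feb-23-1880' into datetime.date objects."""
    return [datetime.strptime(d, "%b-%d-%Y").date() for d in record["dates"]]


def active_topics(record: Dict[str, Any]) -> List[str]:
    """Return the names of the binary topic fields that are set to 1."""
    return [name for name in TOPIC_FLAGS if record.get(name) == 1]


def clean_newspapers(record: Dict[str, Any]) -> List[Dict[str, str]]:
    """Strip leading and trailing whitespace from each metadata value."""
    return [{key: value.strip() for key, value in paper.items()}
            for paper in record["newspaper_metadata"]]


# Abbreviated copy of the example record; only one newspaper is included
# here, so cluster_size is omitted rather than made inconsistent.
record = {
    "year": 1880,
    "dates": ["Feb-23-1880"],
    "newspaper_metadata": [
        {"lccn": "sn92053943",
         "newspaper_title": "the rock island argus",
         "newspaper_city": "rock island",
         "newspaper_state": " illinois "},
    ],
    "antitrust": 0, "civil_rights": 0, "crime": 0, "govt_regulation": 1,
    "labor_movement": 0, "politics": 1, "protests": 0,
}

print(parse_dates(record))
print(active_topics(record))
print(clean_newspapers(record)[0]["newspaper_state"])
```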
### Accessing the Data
The whole dataset can be easily downloaded using the `datasets` library:
```
from datasets import load_dataset
dataset_dict = load_dataset("dell-research-harvard/newswire")
```
Specific files can be downloaded by specifying them:
```
from datasets import load_dataset
# Load only the files for the years of interest.
dataset_dict = load_dataset(
    "dell-research-harvard/newswire",
    data_files=["1929_data_clean.json", "1969_data_clean.json"]
)
```
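To load a contiguous range of years, the file names can be generated rather than typed out. This is a sketch that assumes every year in the range has a file following the `<year>_data_clean.json` pattern seen in the examples above. The helpers `year_files` and `load_years` are hypothetical names, not part of the `datasets` library.

```python
from typing import List


def year_files(start_year: int, end_year: int) -> List[str]:
    """File names for every year from start_year to end_year, inclusive."""
    return [f"{year}_data_clean.json" for year in range(start_year, end_year + 1)]


def load_years(start_year: int, end_year: int):
    """Load only the files for the requested years (requires network access)."""
    # Imported lazily so year_files() works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(
        "dell-research-harvard/newswire",
        data_files=year_files(start_year, end_year),
    )


files = year_files(1929, 1931)
print(files)
# dataset_dict = load_years(1929, 1931)  # uncomment to download the data
```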
## Dataset Creation
### Curation Rationale
The dataset was created to provide researchers with a large, high-quality corpus of historical news articles.
These texts provide a massive repository of information about historical topics and events, and about which newspapers covered them.
The dataset will be useful to a wide variety of researchers including historians, other social scientists, and NLP practitioners.
### Source Data
#### Initial Data Collection and Normalization
Dataset construction is described in the associated paper.
#### Who are the source language producers?
The source language was produced by people: newspaper editors, columnists, and other sources.
### Annotations
#### Annotation process
Not Applicable
#### Who are the annotators?
The dataset does not contain any additional annotations.
### Personal and Sensitive Information
The dataset may contain information about individuals, to the extent that this is covered in news stories. However, we make no additional information about individuals publicly available.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.
The dataset could also be added to the external database of a retrieval-augmented language model to make historical information more widely accessible.
### Discussion of Biases
This dataset contains unfiltered content composed by newspaper editors, columnists, and other sources.
In addition to other potentially harmful content, the corpus may contain factual errors and intentional misrepresentations of news events.
All content should be viewed as individuals' opinions and not as a purely factual account of events of the day.
## Additional Information
### Dataset Curators
Emily Silcock (Harvard), Abhishek Arora (Harvard), Luca D'Amico-Wong (Harvard), Melissa Dell (Harvard)
### Licensing Information
The dataset has a CC-BY 4.0 license.
### Citation Information
You can cite this dataset using:
```
@misc{silcock2024newswirelargescalestructureddatabase,
title={Newswire: A Large-Scale Structured Database of a Century of Historical News},
author={Emily Silcock and Abhishek Arora and Luca D'Amico-Wong and Melissa Dell},
year={2024},
eprint={2406.09490},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.09490},
}
```
### Contributions
Coming Soon
Provider: maas
Created: 2024-07-03



