Dataset and Models for Detection of News Agency Releases in Historical Newspapers
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8333932
Description:
This record contains the annotated datasets and models used and produced for the Master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers".
Please cite this report if you use the models or datasets, or find the work relevant to your research:
@article{Marxen:305129,
title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
author = {Marxen, Lea},
pages = {114p},
year = {2023},
url = {http://infoscience.epfl.ch/record/305129},
}
1. DATA
The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, each split into train, dev and test sets. The data is annotated at token level in CoNLL format with IOB tagging.
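To illustrate what token-level IOB annotation in a CoNLL-style file looks like in practice, here is a minimal reading sketch. It rests on several assumptions that are not confirmed by this record: that columns are tab-separated with the token in the first column and the tag in the second, that blank lines separate sentences, and that lines starting with `#` without a tab are metadata. The tag name `AGENCY` and the sample lines are hypothetical; check the actual files and the report for the real column layout and label set.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group token lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: metadata lines start with '#' and contain no tab.
        if line.startswith("#") and "\t" not in line:
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        # Assumption: token in column 0, IOB tag in column 1.
        current.append((columns[0], columns[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from B-/I-/O tags."""
    mentions: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        # An I- tag continues the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
            continue
        if tokens:
            mentions.append((label, " ".join(tokens)))
        if tag.startswith(("B-", "I-")):
            tokens, label = [token], tag[2:]
        else:
            tokens, label = [], None
    if tokens:
        mentions.append((label, " ".join(tokens)))
    return mentions


# Hypothetical sample, not taken from the dataset.
sample = [
    "# hypothetical example",
    "Paris\tO",
    "(\tO",
    "Agence\tB-AGENCY",
    "Havas\tI-AGENCY",
    ")\tO",
    "",
    "Reuter\tB-AGENCY",
    "meldet\tO",
]

sentences = read_conll(sample)
for sentence in sentences:
    print(extract_mentions(sentence))
```

With the sample above, the first sentence yields one two-token mention and the second yields one single-token mention. Counting such spans per split is one way to compare against the mention counts reported for the dataset.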
The distribution of articles in the different sets is as follows:
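Dataset Statistics

| Split | Lg. | Docs | Agency Mentions |
|-------|-----|------|-----------------|
| Train | de  | 333  | 493             |
| Train | fr  | 903  | 1,122           |
| Dev   | de  | 32   | 26              |
| Dev   | fr  | 110  | 114             |
| Test  | de  | 32   | 58              |
| Test  | fr  | 120  | 163             |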
Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
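If the duplicates matter for your evaluation, they can be filtered out after loading. The sketch below assumes that articles have already been loaded as `(article_id, content)` pairs and that a duplicated article appears more than once under the same ID; both are assumptions about how you load the data, not something this record specifies. The function name and the placeholder contents are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Article IDs reported as duplicated in the French test set.
KNOWN_DUPLICATES = {
    "courriergdl-1847-10-02-a-i0002",
    "courriergdl-1852-02-14-a-i0002",
    "courriergdl-1860-10-31-a-i0016",
    "courriergdl-1864-12-15-a-i0005",
    "lunion-1860-11-27-a-i0004",
    "lunion-1865-02-05-a-i0012",
    "lunion-1866-02-16-a-i0009",
}


def drop_duplicate_articles(
    articles: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each article ID, preserving order."""
    seen = set()
    unique = []
    for article_id, content in articles:
        if article_id in seen:
            continue
        seen.add(article_id)
        unique.append((article_id, content))
    return unique


# Placeholder contents; only the IDs come from the record.
articles = [
    ("lunion-1860-11-27-a-i0004", "first copy"),
    ("lunion-1865-02-05-a-i0012", "another article"),
    ("lunion-1860-11-27-a-i0004", "second copy"),
]

counts = Counter(article_id for article_id, _ in articles)
repeated = {article_id for article_id, n in counts.items() if n > 1}
deduplicated = drop_duplicate_articles(articles)

print(sorted(repeated))
print(len(deduplicated))
```

Comparing `repeated` against `KNOWN_DUPLICATES` is a quick check that the IDs found in your copy of the French test set match the ones listed above.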
2. MODELS
The two agency detection and classification models used for inference on the impresso Corpus are released as well:
newsagency-model-de: based on German BERT (with maximum sequence length 128), fine-tuned with the German training set of the newsagency-dataset
newsagency-model-fr: based on French Europeana BERT (with maximum sequence length 128), fine-tuned with the French training set of the newsagency-dataset
The models perform multitask classification with two prediction heads: one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.
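Because both models use a maximum sequence length of 128, longer inputs have to be truncated or split before inference. The sketch below shows one simple way to split a sequence of subword IDs into non-overlapping windows that fit the limit. It is an illustration only, not the preprocessing used by the authors: the function name is hypothetical, and reserving two positions per window for special tokens (such as `[CLS]` and `[SEP]`) is an assumption. The newsagency-classification repository is the reference for how inputs are actually prepared.

```python
from typing import List, Sequence

MAX_SEQ_LEN = 128       # maximum sequence length stated for both models
NUM_SPECIAL_TOKENS = 2  # assumption: one [CLS] and one [SEP] per window


def split_into_windows(
    subword_ids: Sequence[int],
    max_len: int = MAX_SEQ_LEN,
    reserved: int = NUM_SPECIAL_TOKENS,
) -> List[List[int]]:
    """Split subword IDs into consecutive windows that fit the model limit."""
    budget = max_len - reserved
    if budget <= 0:
        raise ValueError("max_len must be larger than the reserved positions")
    return [
        list(subword_ids[start:start + budget])
        for start in range(0, len(subword_ids), budget)
    ]


# 300 dummy IDs with a budget of 126 per window.
windows = split_into_windows(list(range(300)))
print([len(window) for window in windows])  # [126, 126, 48]
```

Non-overlapping windows can cut an agency mention in half at a boundary. Splitting at sentence boundaries or using overlapping windows avoids this, at the cost of extra bookkeeping when merging the token-level predictions.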
Please refer to the report for further information or contact us.
3. CODE
https://github.com/impresso/newsagency-classification
4. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Emanuela Boros (EPFL-DHLAB)
Created:
2023-09-12



