Dataset and Models for Detection of News Agency Releases in Historical Newspapers

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/8333932
Description:
This record contains the annotated datasets and models used and produced for the work reported in the Master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers". Please cite this report if you use the models or datasets, or find it relevant to your research:

@article{Marxen:305129,
  title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
  author = {Marxen, Lea},
  pages = {114p},
  year = {2023},
  url = {http://infoscience.epfl.ch/record/305129},
}

1. DATA

The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, each split into a train, dev and test set. The data is annotated at token level in the CoNLL format with IOB tagging. The distribution of articles across the sets is as follows:

Dataset Statistics

Split   Lg.   Docs   Agency Mentions
Train   de    333    493
        fr    903    1,122
Dev     de    32     26
        fr    110    114
Test    de    32     58
        fr    120    163

Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).

2. MODELS

The two agency detection and classification models used for inference on the impresso Corpus are released as well:

newsagency-model-de: based on German BERT (with maximum sequence length 128), fine-tuned on the German training set of the newsagency-dataset

newsagency-model-fr: based on French Europeana BERT (with maximum sequence length 128), fine-tuned on the French training set of the newsagency-dataset

The models perform multitask classification with two prediction heads, one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.

Please refer to the report for further information or contact us.

3. CODE

https://github.com/impresso/newsagency-classification

4. CONTACT

Maud Ehrmann (EPFL-DHLAB)
Emanuela Boros (EPFL-DHLAB)
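As an illustration of the token-level IOB annotation described under "1. DATA", the following minimal sketch reads CoNLL-style text, collects agency mention spans, and derives a sentence-level flag comparable in spirit to the has_agency output described under "2. MODELS". The file layout assumed here (one token per line, tab-separated, tag in the second column, blank line between sentences) and the label name AGENCY are assumptions made for illustration, as are the function names read_conll and extract_mentions; the actual column layout and label set of the released files should be checked against the dataset and the report.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token, IOB tag)


def read_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences of (token, tag) pairs.

    Assumed layout: tab-separated columns, token first, IOB tag second,
    blank line between sentences, lines starting with '#' are comments.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        current.append((cols[0], cols[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: List[Token]) -> List[str]:
    """Join consecutive B-/I- tagged tokens into mention strings."""
    mentions: List[str] = []
    span: List[str] = []
    for token, tag in sentence:
        if tag.startswith("B-"):
            # A new mention starts; close any open one first.
            if span:
                mentions.append(" ".join(span))
            span = [token]
        elif tag.startswith("I-") and span:
            span.append(token)
        else:
            if span:
                mentions.append(" ".join(span))
            span = []
    if span:
        mentions.append(" ".join(span))
    return mentions


# Hypothetical sample; the tag name AGENCY is an illustrative placeholder.
SAMPLE = (
    "Paris\tO\n,\tO\n3\tO\nmai\tO\n(\tO\n"
    "Agence\tB-AGENCY\nHavas\tI-AGENCY\n)\tO\n"
    "\n"
    "Le\tO\ntemps\tO\nest\tO\nbeau\tO\n"
)

sentences = read_conll(SAMPLE)
mentions_per_sentence = [extract_mentions(s) for s in sentences]
# Sentence-level flag: True when at least one mention was found.
has_agency = [bool(m) for m in mentions_per_sentence]

if __name__ == "__main__":
    for mentions, flag in zip(mentions_per_sentence, has_agency):
        print(flag, mentions)
```

Counting the mentions returned per file in this way is one possible route to checking figures such as those in the Dataset Statistics table, provided the duplicated French test articles listed above are taken into account.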
Created:
2023-09-12