SIGARRA News Corpus
Source: DataCite Commons | Updated: 2020-07-23 | Indexed: 2024-07-13
Download link:
https://rdm.inesctec.pt/dataset/cs-2017-004
Description:
This dataset was taken from the SIGARRA information system at the University of Porto (UP). Every organic unit has its own domain and produces academic news. We collected a sample of 1,000 news articles and manually annotated 905 of them using the Brat rapid annotation tool. The dataset consists of three files. The first is a CSV file containing news published between 2016-12-14 and 2017-03-01. The second is a ZIP archive containing one directory per organic unit, with a text file and an annotations file per news article. The third is an XML file containing the complete set of news in a format similar to that of the HAREM dataset. The dataset is particularly well suited to training named entity recognition models.
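As a rough illustration of how the per-article annotation files might be read, the sketch below parses text-bound entity lines in the general Brat standoff format (`ID<TAB>TYPE START END<TAB>TEXT`). It assumes the annotation files follow that standard layout. The function name `parse_brat_entities`, the `Entity` tuple, the sample sentence, and the labels `Person` and `Organization` are hypothetical and are not taken from the corpus itself.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str  # annotation identifier, e.g. "T1"
    label: str   # entity type as written in the .ann file
    start: int   # character offset where the mention starts
    end: int     # character offset just past the mention
    text: str    # surface text of the mention


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Extract text-bound annotations from the contents of a Brat .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Only lines whose ID starts with "T" are text-bound (entity) annotations.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines
        ann_id, type_and_span, text = parts[0], parts[1], parts[2]
        label, _, span = type_and_span.partition(" ")
        # Discontinuous spans are separated by ";"; keep the outermost offsets.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical article text and annotation content, for illustration only.
sample_text = "Ana Silva visitou a Universidade do Porto."
sample_ann = (
    "T1\tPerson 0 9\tAna Silva\n"
    "T2\tOrganization 20 41\tUniversidade do Porto\n"
)

entities = parse_brat_entities(sample_ann)
for entity in entities:
    print(entity.label, repr(sample_text[entity.start:entity.end]))
```

In practice, the sample strings would be replaced by the contents of a text file and its matching annotations file from one of the organic unit directories in the ZIP archive. The actual entity labels used in the corpus should be checked against the files themselves.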
Provider:
INESC TEC
Created:
2020-02-19



