
Global English-language Corpus of Humanitarian News

Source: DataCite Commons. Updated: 2025-05-19. Indexed: 2024-07-13.
Download link:
https://datashare.ed.ac.uk/handle/10283/8748
Description:
A document-feature matrix (DFM) with word frequencies for 1,118,397 news articles that mention the term "humanitarian*", published in English-language media between 1 January 2010 and 15 August 2020. Features in the DFM have not been lowercased, but the following elements have been removed: punctuation, symbols, numbers, separators, commonly used stopwords, and words with two or fewer characters. Stopwords were removed using Porter's Snowball list of 175 common English-language words. The DFM includes 79,229 features. For each news article in the DFM, the following metadata are included: the news organisation where it was published (news_source_name), the country of the news organisation in ISO-3 format (country_iso3), the continent of the news organisation in ISO-2 format (continent_iso2), the reach of the news organisation (media_reach), and the publication date in YYYYMMDD format (publication_date). The file is in RDS format, which must be read with the R programming language. It contains a single object called humanitarian_dfm, created with R's quanteda package. This object can be converted to other formats commonly used for text mining, NLP, and computational text analysis using quanteda's "convert" function.
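To make the preprocessing described above concrete, here is a minimal Python sketch of how a case-preserving document-feature matrix could be built from raw text. It is not the authors' pipeline (the dataset was created in R with quanteda, whose tokenizer behaves differently in edge cases). The names `STOPWORDS`, `tokenize`, and `build_dfm` are hypothetical, the stopword set is a tiny illustrative subset rather than the 175-word Snowball list, and case-insensitive stopword matching is an assumption.

```python
import re
from collections import Counter

# Illustrative subset only; NOT the 175-word Snowball list used for the dataset.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "was", "are"}


def tokenize(text):
    """Keep runs of letters only, which drops punctuation, symbols,
    numbers and separators. Case is preserved, as in the dataset."""
    tokens = re.findall(r"[^\W\d_]+", text)
    # Drop tokens of two or fewer characters and stopwords.
    # Assumption: stopwords are matched case-insensitively.
    return [t for t in tokens if len(t) > 2 and t.lower() not in STOPWORDS]


def build_dfm(docs):
    """Return (features, matrix): sorted feature names and one row of
    word counts per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    features = sorted(set().union(*counts))
    matrix = [[c[f] for f in features] for c in counts]
    return features, matrix


docs = [
    "The UN sent humanitarian aid to 3 regions in 2015.",
    "Aid groups and the Aid agency met.",
]
features, matrix = build_dfm(docs)
for name, column in zip(features, zip(*matrix)):
    print(name, column)
```

Because features are not lowercased, "Aid" and "aid" end up as separate columns in this sketch, which is worth keeping in mind when aggregating counts from the real DFM.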
Provider:
University of Edinburgh, School of Social and Political Science
Created:
2024-03-28