
Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis

Source: DataCite Commons. Updated: 2025-06-10. Indexed: 2026-05-05.
Download link:
https://dataverse.tdl.org/citation?persistentId=doi:10.18738/T8/UKJZ3E
Description:
This dataset contains a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords removed; features lowercased, lemmatized and POS-tagged) and stored in formats commonly used for text mining and computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.

This dataset includes the following items:
- 31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as tibbles).
- A single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object). The DFM comes with the following metadata for each document: date of publication and source URL.
- A metadata table with the following fields: document id, publication date, source URL, news source and country of the news source.
- A list of sources included in the corpus, grouped by country name.

All items are stored in formats readable in R. The documentation provides instructions on how to load the RDS files into R.

If you use the data for your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that these can be addressed.
Provider:
Texas Data Repository
Created:
2021-02-12
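
The description mentions two related structures: long-format token tables (one row per token) and a document-feature matrix of raw counts. The dataset itself ships as RDS files intended for R, so the following is only a language-neutral sketch in Python of how the two relate. The rows, column order and document ids are hypothetical and are not taken from the actual files.

```python
from collections import Counter

# Hypothetical rows mimicking a daily token table:
# (doc_id, sentence_id, token_id, lemma, pos). Not real data from the corpus.
tokens = [
    ("doc1", 1, 1, "government", "NOUN"),
    ("doc1", 1, 2, "announce", "VERB"),
    ("doc1", 2, 1, "government", "NOUN"),
    ("doc2", 1, 1, "election", "NOUN"),
    ("doc2", 1, 2, "announce", "VERB"),
]


def build_dfm(rows):
    """Count how often each lemma occurs in each document (raw frequencies)."""
    counts = {}
    for doc_id, _sentence_id, _token_id, lemma, _pos in rows:
        counts.setdefault(doc_id, Counter())[lemma] += 1
    # Columns are the sorted vocabulary; rows are the sorted document ids.
    features = sorted({lemma for counter in counts.values() for lemma in counter})
    docs = sorted(counts)
    matrix = [[counts[doc][feature] for feature in features] for doc in docs]
    return docs, features, matrix


docs, features, matrix = build_dfm(tokens)
for doc, row in zip(docs, matrix):
    print(doc, dict(zip(features, row)))
```

Each row of `matrix` corresponds to one article and each column to one feature, which is the shape a DFM of raw counts takes; the published DFM additionally carries publication date and source URL per document.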