Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis

Name: Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis
Creator: Texas Data Repository
Published: 2025-06-10 08:32:26
License: No description available

Source: DataCite Commons. Updated: 2025-06-10. Indexed: 2026-05-05.

Download link:

https://dataverse.tdl.org/citation?persistentId=doi:10.18738/T8/UKJZ3E

Description:

This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.

This dataset includes the following items:

- 31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as a tibble).
- A single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object). The DFM comes with the following metadata for each document: date of publication and source URL.
- A metadata table with the following fields: document id, publication date, source URL, news source and country of the news source.
- A list of sources included in the corpus, grouped by country name.

All items are stored in formats readable in R. The documentation provides instructions on how to load the RDS files into R. If you decide to use the data for your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that these can be addressed.
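To make the relationship between the daily token tables and the document-feature matrix concrete, here is a minimal Python sketch. It is an illustration only: the dataset itself ships as R objects (tibbles and a quanteda dfm), and the rows, the column order, and the `build_dfm` helper below are hypothetical stand-ins, not the actual data or the authors' pipeline.

```python
from collections import Counter, defaultdict

# Hypothetical rows mimicking the described token tables:
# (document id, sentence id, token id, lemma, POS tag, publication date)
token_rows = [
    ("doc1", 1, 1, "government", "NOUN", "2020-12-04"),
    ("doc1", 1, 2, "announce", "VERB", "2020-12-04"),
    ("doc1", 2, 1, "government", "NOUN", "2020-12-04"),
    ("doc2", 1, 1, "election", "NOUN", "2020-12-05"),
    ("doc2", 1, 2, "announce", "VERB", "2020-12-05"),
]


def build_dfm(rows):
    """Return (doc_ids, features, matrix) holding raw counts per document."""
    counts = defaultdict(Counter)
    for doc_id, _sentence_id, _token_id, lemma, _pos, _date in rows:
        counts[doc_id][lemma] += 1
    doc_ids = sorted(counts)
    features = sorted({f for c in counts.values() for f in c})
    # One row per document, one column per feature; absent features count as 0.
    matrix = [[counts[d][f] for f in features] for d in doc_ids]
    return doc_ids, features, matrix


doc_ids, features, matrix = build_dfm(token_rows)
for doc_id, row in zip(doc_ids, matrix):
    print(doc_id, dict(zip(features, row)))
```

Each row of the resulting matrix corresponds to one article and each column to one lemmatized feature, which is the same shape of information the description attributes to the DFM, here in plain lists rather than a sparse R object.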

Provider:

Texas Data Repository

Created:

2021-02-12