Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis
Source: DataCite Commons | Updated: 2025-06-10 | Indexed: 2026-05-05
Download link:
https://dataverse.tdl.org/citation?persistentId=doi:10.18738/T8/UKJZ3E
Description:
This dataset contains a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process.<br><br>
This dataset includes the following items:<br>
<ul>
<li>31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as a tibble).</li>
<li>A single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object). The DFM comes with the following metadata for each document: date of publication and source URL.</li>
<li>A metadata table with the following fields: document id, publication date, source URL, news source and country of the news source.</li>
<li>A list of sources included in the corpus, grouped by country name.</li>
</ul>
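As a rough illustration of how the daily token tables relate to the document-feature matrix, the sketch below builds a toy token table in base R and tabulates raw lemma counts per document. The column names (`doc_id`, `sentence_id`, `token_id`, `lemma`, `pos`, `pub_date`) and all values are assumptions made for illustration; the actual variable names are given in the dataset documentation. It also checks that the stated collection window spans 31 days, which matches the 31 daily tables.

```r
# Toy token table; column names and values are hypothetical, chosen to
# mirror the variables listed above (not the dataset's actual names).
tokens <- data.frame(
  doc_id      = c("d1", "d1", "d1", "d2", "d2"),
  sentence_id = c(1L, 1L, 2L, 1L, 1L),
  token_id    = c(1L, 2L, 1L, 1L, 2L),
  lemma       = c("election", "result", "election", "vaccine", "result"),
  pos         = c("NOUN", "NOUN", "NOUN", "NOUN", "NOUN"),
  pub_date    = as.Date(c("2020-12-04", "2020-12-04", "2020-12-04",
                          "2020-12-05", "2020-12-05")),
  stringsAsFactors = FALSE
)

# Raw feature counts per document: rows are documents, columns are lemmas.
# This is the same kind of information a document-feature matrix holds.
counts <- table(tokens$doc_id, tokens$lemma)
print(counts)

# The collection window stated in the description, one entry per day.
days <- seq(as.Date("2020-12-04"), as.Date("2021-01-03"), by = "day")
n_days <- length(days)
```

The released DFM is a quanteda object rather than a base R table, so this only conveys the shape of the data, not the way the authors built it.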
All items are stored in formats readable in R. The documentation provides instructions on how to load the RDS files into R. <br><br>
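A minimal sketch of reading an RDS file with base R follows. The file name `example_metadata.rds` and the table contents are hypothetical: the snippet writes a small stand-in file to a temporary directory so that it runs without the dataset. For the real files, the names and loading steps in the documentation take precedence.

```r
# Stand-in metadata table; the fields follow the description above, but
# the column names and values are invented for illustration.
meta <- data.frame(
  doc_id   = c("d1", "d2"),
  pub_date = as.Date(c("2020-12-04", "2020-12-05")),
  country  = c("Kenya", "Ghana"),
  stringsAsFactors = FALSE
)

# Write the stand-in to a temporary RDS file (hypothetical file name).
path <- file.path(tempdir(), "example_metadata.rds")
saveRDS(meta, path)

# readRDS() returns the stored object; assign it to a name of your choice.
loaded <- readRDS(path)
str(loaded)
```

Working with the DFM after loading will likely require the quanteda package to be installed, since the object is stored in that package's class.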
If you use the data in your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that these can be addressed.
Provider:
Texas Data Repository
Created:
2021-02-12



