Kazakhstani news corpus for social significance identification with topic modelling results

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://data.mendeley.com/datasets/hwj24p9gkh

Description:

The corpus consists of 1,142,735 documents from open Kazakhstani news media and from governmental development programs. The dataset is distributed as a zip archive containing 12 CSV (comma-separated values) files, with up to 100,000 documents per file.

Each document (row) consists of the following fields: ID; Title; Text; Source; URL; Datetime; Number of views; 90 columns holding the weights of hand-picked topic groups with semantic names (group_economy, group_politics, etc.), normalized to the range 0 to 1; and 200 columns holding the topic weights obtained through topic modelling. The 200 topic-weight columns represent the theta matrix of the topic model.

The file topic-words.json lists words with their weights for the 200 topics obtained through topic modelling; it is a compressed representation of the phi matrix. The file topic-expert-labelling-sentiment.json contains expert labelling of topic sentiment. It was used to obtain the results described in the cited article.
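As a rough illustration of the row layout described above, the following Python sketch streams documents out of the zip archive and picks the highest-weighted topic group for each row. It uses only the standard library. Several details are assumptions, not facts from the dataset page: that the CSV files are UTF-8 encoded with a header row, and that the topic-group columns are literally named with the `group_` prefix (as in `group_economy`, `group_politics`). The helper names `iter_rows` and `dominant_group` are hypothetical.

```python
import csv
import io
import zipfile

# News texts can be long; raise the csv module's per-field size limit.
csv.field_size_limit(2**31 - 1)


def iter_rows(archive):
    """Yield each document as a dict, reading every .csv member of the zip in turn.

    `archive` is a path or a binary file-like object.
    Assumes UTF-8 encoded CSV files with a header row (not confirmed by the source).
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV part
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def dominant_group(row, prefix="group_"):
    """Return (column_name, weight) for the highest-weighted topic group in a row.

    Assumes topic-group columns share the `group_` prefix; returns None if the
    row has no such columns.
    """
    weights = {
        key: float(value)
        for key, value in row.items()
        if key and key.startswith(prefix) and value
    }
    if not weights:
        return None
    name = max(weights, key=weights.get)
    return name, weights[name]
```

With a local copy of the archive (the filename is whatever the download provides), `for row in iter_rows(path): print(dominant_group(row))` would print one group per document. Streaming row by row keeps memory use low, which matters for a corpus of more than a million documents.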

Created:

2020-12-18