Kazakhstani news corpus for social significance identification with topic modelling results
Source: NIAID Data Ecosystem, indexed 2026-03-12
Download link:
https://data.mendeley.com/datasets/hwj24p9gkh
Description:
The presented news corpus consists of 1,142,735 documents from open Kazakhstani news media and from governmental development programs. The dataset is provided as a zip archive containing 12 CSV (comma-separated values) files, with the dataset split into up to 100,000 documents per file.
Each document (row) consists of the following fields:
ID
Title
Text
Source
URL
Datetime
Number of views
90 columns with weights of hand-picked topic groups with semantic names (group_economy, group_politics, etc.), normalized to the range 0 to 1
200 columns with topic weights obtained through topic modelling. These columns represent a theta-matrix of the topic model
The topic-words.json file contains words with weights for the 200 topics obtained through topic modelling. It is a compressed representation of the phi matrix.
The topic-expert-labelling-sentiment.json file contains expert labelling of topic sentiment. It was used to obtain the results described in the cited article.
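As an illustration of how the group weight columns might be used, the following minimal sketch finds the highest-weighted topic group for each document. It runs on a tiny in-memory sample rather than the real files. Only the group_ prefix and the two group names come from the description above. The lowercase id and title headers, the sample rows and the function name dominant_groups are assumptions made for the example, and the actual header names in the CSV files may differ.

```python
import csv
import io

# Tiny in-memory stand-in for one of the CSV files. The "id" and "title"
# headers and both rows are assumptions for illustration only.
SAMPLE = """id,title,group_economy,group_politics
1,Budget approved,0.8,0.3
2,Election results,0.1,0.9
"""


def dominant_groups(csv_file, prefix="group_"):
    """Return {document id: name of the highest-weighted group column}."""
    reader = csv.DictReader(csv_file)
    # Select every column whose name starts with the group prefix.
    group_cols = [c for c in reader.fieldnames if c.startswith(prefix)]
    result = {}
    for row in reader:
        # Weights are read as strings, so convert before comparing.
        result[row["id"]] = max(group_cols, key=lambda c: float(row[c]))
    return result


dominant = dominant_groups(io.StringIO(SAMPLE))
print(dominant)
```

For a real file, the same function could be given a handle opened with open(path, newline="", encoding="utf-8"), assuming the files are UTF-8 encoded. Reading row by row keeps memory use low for files of this size.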
Created:
2020-12-18



