Kazakhstani news corpus for social significance identification with topic modelling results

Indexed by NIAID Data Ecosystem on 2026-03-12
Download link:
https://data.mendeley.com/datasets/hwj24p9gkh
Description:
The presented news corpus consists of 1,142,735 documents from open Kazakhstani news media and from governmental development programs. The dataset is provided as a zip archive containing 12 CSV (comma-separated values) files, with the dataset split into 100,000 documents per file. Each document (row) consists of the following fields:
- ID
- Title
- Text
- Source
- URL
- Datetime
- Number of views
- 90 columns with weights for hand-picked topic groups with semantic names (group_economy, group_politics, etc.), normalized to the range from 0 to 1
- 200 columns with topic weights obtained through topic modelling; these columns represent the theta matrix of the topic model
The archive also includes two JSON files:
- topic-words.json contains the words with weights for the 200 topics obtained through topic modelling. It is a compressed representation of the phi matrix.
- topic-expert-labelling-sentiment.json contains expert labelling of topic sentiment. It was used to obtain the results described in the cited article.
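The row layout described above can be illustrated with a minimal sketch that separates the metadata fields, the topic-group weights, and the topic weights (one row of the theta matrix) when reading a CSV file. Only the `group_` prefix comes from the description; the `topic_` prefix, the lowercase header names, and the sample values are assumptions made for illustration, so the actual headers should be checked before use.

```python
import csv
import io

GROUP_PREFIX = "group_"  # from the description (group_economy, group_politics, ...)
TOPIC_PREFIX = "topic_"  # hypothetical: the description does not name the topic columns


def split_columns(fieldnames):
    """Partition header names into metadata, topic-group, and topic columns."""
    groups = [c for c in fieldnames if c.startswith(GROUP_PREFIX)]
    topics = [c for c in fieldnames if c.startswith(TOPIC_PREFIX)]
    meta = [c for c in fieldnames if c not in groups and c not in topics]
    return meta, groups, topics


def read_documents(handle):
    """Yield (metadata, group weights, topic weights) for each row."""
    reader = csv.DictReader(handle)
    meta, groups, topics = split_columns(reader.fieldnames)
    for row in reader:
        yield (
            {c: row[c] for c in meta},            # ID, title, text, ... as strings
            {c: float(row[c]) for c in groups},   # topic-group weights in [0, 1]
            [float(row[c]) for c in topics],      # one row of the theta matrix
        )


# Tiny in-memory stand-in for one of the CSV files (made-up values).
sample = io.StringIO(
    "id,title,group_economy,group_politics,topic_0,topic_1\n"
    "1,Example headline,0.8,0.1,0.25,0.75\n"
)
docs = list(read_documents(sample))
```

With a real file, the `io.StringIO` stand-in would be replaced by `open(path, newline="", encoding="utf-8")`, where the path and encoding are again assumptions about the archive contents.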
Created:
2020-12-18