A new corpus of one million articles from four post-Soviet countries and Poland.
Source: DataCite Commons (updated 2025-05-12, indexed 2025-05-17)
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/CEF7RU
Description:
The data are in the .RData format and should be read into R using the load() function.
The file contains two R data frames: tx_pl_lang (articles in Polish) and
tx_ru_lang (articles in Russian).
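A minimal sketch of the loading step follows. The file name corpus_example.RData is hypothetical (the description does not give the actual file name), and a tiny stand-in file with invented rows is created first so that the example runs on its own. Loading into a separate environment is one way to avoid overwriting objects already in the workspace.

```r
# Hypothetical path: replace with the location of the downloaded .RData file.
# A tiny stand-in file is written first so this sketch is self-contained.
rdata_path <- file.path(tempdir(), "corpus_example.RData")
tx_pl_lang <- data.frame(name = "gazeta.pl", art = "example text",
                         stringsAsFactors = FALSE)
tx_ru_lang <- data.frame(name = "iz.ru", art = "example text",
                         stringsAsFactors = FALSE)
save(tx_pl_lang, tx_ru_lang, file = rdata_path)
rm(tx_pl_lang, tx_ru_lang)

# load() restores objects under the names they were saved with and
# returns those names (invisibly) as a character vector.
corpus_env <- new.env()
loaded_names <- load(rdata_path, envir = corpus_env)
print(loaded_names)

# Pull the two data frames out of the environment.
tx_pl <- get("tx_pl_lang", envir = corpus_env)
tx_ru <- get("tx_ru_lang", envir = corpus_env)
stopifnot(is.data.frame(tx_pl), is.data.frame(tx_ru))
```

With the real file, only rdata_path needs to change and the stand-in lines can be dropped.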
Covered newspapers
Table 1. Newspapers and portals included in the analysis and their Alexa ranks
Country     Newspaper site   Number of articles  Alexa global rank  Alexa local rank
Russia      iz.ru            43,782              1,378              45
Russia      kommersant.ru    46,070              1,335              44
Russia      novayagazeta.ru  29,357              9,215              459
Russia      vedomosti.ru     27,797              6,302              288
Kazakhstan  informburo.kz    29,375              38,916             119
Kazakhstan  nur.kz           67,350              951                6
Kazakhstan  tengrinews.kz    44,285              13,036             34
Kazakhstan  zakon.kz         109,442             9,477              30
Belarus     bdg.by           33,447              292,678            746
Belarus     belgazeta.by     21,995              1,392,332          11,041
Belarus     sb.by            83,685              41,015             79
Ukraine     kp.ua            194,792             64,062             860
Ukraine     segodnya.ua      45,835              18,658             256
Ukraine     vesti.ua         90,559              58,573             1,096
Poland      gazeta.pl        53,321              1,749              14
Poland      rp.pl            49,587              20,930             167
Poland      wpolityce.pl     76,625              13,833             105
Note: Alexa rank, 90-day average, checked on 17 February 2021. Total number of articles: 1,047,304.
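The stated total can be checked against the per-site counts in Table 1. The sketch below re-enters those counts by hand; the object names (table1, total, by_country) are illustrative and not part of the dataset.

```r
# Article counts copied from Table 1; object names are illustrative only.
table1 <- data.frame(
  country = c(rep("Russia", 4), rep("Kazakhstan", 4), rep("Belarus", 3),
              rep("Ukraine", 3), rep("Poland", 3)),
  site = c("iz.ru", "kommersant.ru", "novayagazeta.ru", "vedomosti.ru",
           "informburo.kz", "nur.kz", "tengrinews.kz", "zakon.kz",
           "bdg.by", "belgazeta.by", "sb.by",
           "kp.ua", "segodnya.ua", "vesti.ua",
           "gazeta.pl", "rp.pl", "wpolityce.pl"),
  articles = c(43782, 46070, 29357, 27797,
               29375, 67350, 44285, 109442,
               33447, 21995, 83685,
               194792, 45835, 90559,
               53321, 49587, 76625),
  stringsAsFactors = FALSE
)

# Grand total across all 17 sites.
total <- sum(table1$articles)
print(total)  # 1047304, matching the note under Table 1

# Number of articles per country.
by_country <- tapply(table1$articles, table1$country, sum)
print(by_country)
```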
The list of columns is the same for both data frames:
- name: short name of the newspaper
- art: text of the article
- date: date when the article was scraped
- sent: sentiment calculated using the standard sentiment lexicons
- sent.c: sentiment calculated using the Covid-extended lexicons
- tname: name of the topic
- surnames of influential politicians: one 0/1 variable per politician, equal to 1 if the name appears in the article
> colnames(tx_pl_lang)
[1] "name" "art" "date" "sent" "sent.c" "tname"
[7] "putin" "medvedev" "vaino" "shoigu" "bortnikov" "lavrov"
[13] "mishustin" "kirienko" "sechin" "zelensky" "shmygal" "akhmetov"
[19] "avakov" "ermak" "poroshenko" "medvedchuk" "groisman" "sagyntaev"
[25] "mamin" "tokayev" "nnazarbayev" "dnazarbayeva" "kulibayev" "masimov"
[31] "alukashenko" "vakulchik" "vlukashenko" "kobyakov" "makei" "myasnikovich"
[37] "rumas" "golovchenko" "kaczynski" "duda" "morawiecki" "ziobro"
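As an illustration of how these columns might be used, the sketch below builds a toy data frame with a subset of the documented column names and computes a few simple summaries. All values are invented, and the column types (Date for date, numeric for the sentiment scores and the 0/1 indicator) are assumptions, since the description does not state them.

```r
# Toy stand-in with a subset of the documented columns. Values are invented
# for illustration; column types are assumed, not taken from the corpus.
toy <- data.frame(
  name   = c("iz.ru", "iz.ru", "nur.kz", "nur.kz"),
  date   = as.Date(c("2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02")),
  sent   = c(0.10, -0.20, 0.05, 0.30),
  sent.c = c(0.05, -0.25, 0.05, 0.20),
  tname  = c("pol", "hea", "pol", "spo"),
  putin  = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Articles that mention Putin: the indicator column equals 1.
putin_articles <- toy[toy$putin == 1, ]

# Mean standard-lexicon sentiment per newspaper.
mean_sent <- aggregate(sent ~ name, data = toy, FUN = mean)
print(mean_sent)

# Share of articles per newspaper that mention Putin
# (the mean of a 0/1 variable is a proportion).
putin_share <- tapply(toy$putin, toy$name, mean)
print(putin_share)
```

The same expressions should carry over to tx_pl_lang or tx_ru_lang once loaded, provided the real column types match these assumptions.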
List of LDA topics (tname column)
cri crime
ind industry
itr international trade
sov former Soviet republics
pol politics
tou tourism
con construction
air air transport
ban banking
fin finance
med media
pro protests
hou housing
spo sport
reg regional
mkt financial markets
cul culture
acc accidents
edu education
lab labour market
war war
pub public finances
eco economy
int international
eur Europe
mob mobile/internet
com commodities, oil, gas
usa USA
fam family
hea health
his history
tra transport
pop gossip/beauty/weather
rel religion
aut automotive
spa space
misc cannot decide the topic
ussr Soviet history
ukr Ukraine
mos Moscow
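The topic codes above can be turned into readable labels with a named character vector. The sketch below covers only a handful of codes from the list, and the names topic_labels and codes are illustrative; the codes vector stands in for a tname column.

```r
# Lookup from topic code to label, for a few codes taken from the list above.
topic_labels <- c(
  cri  = "crime",
  pol  = "politics",
  hea  = "health",
  spo  = "sport",
  misc = "cannot decide the topic"
)

# Stand-in for a tname column; values are invented for illustration.
codes <- c("pol", "hea", "misc", "spo", "pol")

# Indexing a named vector by code returns the matching labels.
labels <- unname(topic_labels[codes])
print(labels)

# Number of articles per topic label.
topic_counts <- table(labels)
print(topic_counts)
```

Codes missing from the lookup vector come back as NA, which makes gaps in the mapping easy to spot.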
Provider:
Harvard Dataverse
Created:
2021-04-28



