One million articles from five post socialist countries with extracted features: sentiment, basic emotions, LDA topics and presence of influential domestic politicians

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://doi.org/10.7910/DVN/8NMD1U

Resource description:

This is replication data for my paper under blind review.

Abstract

This paper develops a new prediction model for media content presence on a website. It analyses a new corpus of one million articles from five countries (Poland, Russia, Belarus, Kazakhstan and Ukraine) in two languages, Polish and Russian. These articles were scraped daily from seventeen websites in the 2017-2020 period. The research applies a wide range of natural language processing methods to automatically derive several properties of each article: its topic, sentiment, basic emotions, and mentions of influential domestic politicians. The articles’ embeddings and their cosine similarity are used to calculate the news context, such as how an article differs from the main themes of the daily issue. These features are used to estimate a logistic regression assessing the likelihood that the same or a slightly modified article, as measured by cosine similarity, will remain on the main web page the next day. The key, and somewhat unexpected, result is that articles with negative sentiment polarity are less likely to be published for more than one day. This result holds for all countries analyzed. It means that the negative news bias documented in the literature is partly offset by the shorter life cycle of such articles.

The data are in Python pickle format and should be read into Python using the pickle.load() function. Each element (row) of the dataframe or list represents one news article. Each file has the same format. Loading a pickle file returns a list of four elements:
1. A dummy variable equal to 1 when the article was published the next day with identical text
2. A dummy variable equal to 1 when the article was published the next day, allowing for small text modifications (cosine similarity > 0.99)
3. Dataframe with extracted features, described below
4. List with the texts of the articles in Polish or Russian

Regarding item 3:
The columns of the dataframe are as follows (the descriptions refer to row number i):
- pandas index (may appear once or twice in the dataframe)
- maxcosine: maximum cosine similarity between article i and all articles published the next day
- cosine_diff: cosine similarity between article i and the elementwise average of the embeddings of all articles in the current issue. Measures how similar article i is to the core narrative of the current issue
- cosine_std: standard deviation of the cosine similarity measures between all pairs of articles in the current issue. Measures how focused or dispersed the news coverage of the current issue is
- thirteen LDA topic groups: politics, legislation and legal affairs (POL); economy, finance, various sectors of the economy (ECO); military, war, protests, crime, security threats (MIL); international affairs, specific issues concerning foreign countries (INT); technology (TECH); family issues, culture, sport, education (FAM); regional issues and housing (REG); health issues and the Covid-19 pandemic (HEA); media (MED); accidents (ACC); religion (REL); the Soviet Union (USSR); and articles for which no topic could be determined (MISC)
- rsent.c: relative sentiment, that is, the dictionary-based sentiment of article i minus the average sentiment of the newspaper. This approach eliminates newspaper-specific or country-specific sentiment factors. The c stands for Covid: the sentiment lexicon was augmented with Covid-related terms
- dip_*: variable measuring whether influential domestic politicians are mentioned in article i, where * represents a country acronym. If N is the number of occurrences of the names of influential domestic politicians in article i, then dip_* = 0 if N = 0 and dip_* = 1 + log(N) if N > 0
- three or four names of news portals from which the data was scraped
- names of six basic emotions and the emotion scores of article i, calculated using zero-shot learning and the large version of the XLM-RoBERTa model (Conneau et al., 2019) from the Hugging Face transformers library, available at https://huggingface.co/vicgalle/xlm-roberta-large-xnli-anli

Names of the politicians used to calculate the dip variables
Russia: "putin" "medvedev" "vaino" "shoigu" "bortnikov" "lavrov" "mishustin" "kirienko" "sechin"
Ukraine: "zelensky" "shmygal" "akhmetov" "avakov" "ermak" "poroshenko" "medvedchuk" "groisman"
Kazakhstan: "sagyntaev" "mamin" "tokayev" "nnazarbayev" "dnazarbayeva" "kulibayev" "masimov"
Belarus: "alukashenko" "vakulchik" "vlukashenko" "kobyakov" "makei" "myasnikovich" "rumas" "golovchenko"
Poland: "kaczynski" "duda" "morawiecki" "ziobro"

Data coverage
Country      News portal        Number of articles
Russia       iz.ru                          43,782
Russia       kommersant.ru                  46,070
Russia       novayagazeta.ru                29,357
Russia       vedomosti.ru                   27,797
Kazakhstan   informburo.kz                  29,375
Kazakhstan   nur.kz                         67,350
Kazakhstan   tengrinews.kz                  44,285
Kazakhstan   zakon.kz                      109,442
Belarus      bdg.by                         33,447
Belarus      belgazeta.by                   21,995
Belarus      sb.by                          83,685
Ukraine      kp.ua                         194,792
Ukraine      segodnya.ua                    45,835
Ukraine      vesti.ua                       90,559
Poland       gazeta.pl                      53,321
Poland       rp.pl                          49,587
Poland       wpolityce.pl                   76,625

In the provided dataframes the number of observations is smaller, because issues for which there was no next-day issue were removed. Data was scraped daily between 2017 or 2018 (depending on the country) and January 2021.
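
The loading step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the author's code: the helper name load_articles and the file name in the usage comment are hypothetical, and the four-element layout is taken from the description. Unpickling can execute arbitrary code, so only load files obtained from a trusted source. pandas needs to be installed for the real files, since the third element is a pickled dataframe.

```python
import pickle


def load_articles(path):
    """Load one dataset file and unpack its four documented elements.

    `load_articles` is a hypothetical helper; the element order follows
    the description of the dataset.
    """
    # Unpickling can run arbitrary code: only open files you trust.
    with open(path, "rb") as fh:
        data = pickle.load(fh)

    if len(data) != 4:
        raise ValueError(f"expected 4 elements, got {len(data)}")

    # 1. dummy: republished next day with identical text
    # 2. dummy: republished next day, small modifications allowed
    # 3. dataframe of extracted features
    # 4. list of article texts (Polish or Russian)
    same_next_day, similar_next_day, features, texts = data
    return same_next_day, similar_next_day, features, texts


# Usage (the file name is a placeholder, not an actual file in the dataset):
# y_same, y_similar, features, texts = load_articles("some_country.pkl")
# print(features.columns.tolist())
```

Since each row of the dataframe and each entry of the text list is described as one article, comparing their lengths after loading is a cheap sanity check.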

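The dip_* transformation and the cosine-based context measures (maxcosine, cosine_diff, cosine_std) described above can be illustrated with a short sketch. It is not the author's implementation. The function names are hypothetical, embeddings are assumed to be plain lists of floats, and the natural logarithm, the population standard deviation and the use of unordered distinct pairs are assumptions, since the description does not specify them.

```python
import math
from itertools import combinations
from statistics import pstdev


def dip_score(n_mentions):
    """dip_* = 0 if N = 0, else 1 + log(N). Natural log is an assumption."""
    if n_mentions < 0:
        raise ValueError("number of mentions cannot be negative")
    return 0.0 if n_mentions == 0 else 1.0 + math.log(n_mentions)


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxcosine(article, next_issue):
    """Highest similarity between one article and any next-day article."""
    return max(cosine(article, other) for other in next_issue)


def cosine_diff(article, issue):
    """Similarity of one article to the elementwise mean of its issue."""
    centre = [sum(column) / len(issue) for column in zip(*issue)]
    return cosine(article, centre)


def cosine_std(issue):
    """Std. dev. of similarities over all distinct pairs in one issue.

    Uses the population standard deviation (an assumption).
    """
    sims = [cosine(u, v) for u, v in combinations(issue, 2)]
    return pstdev(sims)
```

With this definition a single mention gives dip_* = 1, and further mentions add to the score only logarithmically.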
Created:

2022-03-01