Replication data for "Economic uncertainty and natural language processing; The case of Russia"

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://doi.org/10.7910/DVN/XAMFQK


Description:

The paper proposes a method of constructing text-based, country-specific measures of economic policy uncertainty (EPU). To avoid problems of translation and human validation costs, we apply natural language processing and sentiment analysis to construct such measures for Russia. We compare our measure with one developed earlier using direct translations from English and human validation. In this comparison, our measure does equally well at evaluating the uncertainty related to key events that affected Russia between 1994 and 2018 and performs better at detecting the effects of uncertainty on Russia’s industrial production.

Data used to construct uncertainty indexes

We have constructed the EPU using data from four daily newspapers available electronically:
1. Kommersant (Oct 1992 – Feb 2018), 579,997 articles
2. Moskovskiy Komsomolets (Jan 2005 – Feb 2018), 143,758 articles
3. Novaya Gazeta (Feb 2004 – Feb 2018), 63,884 articles
4. Vedomosti (Dec 2003 – Feb 2018), 342,309 articles

These newspapers cover a broad spectrum of publications aimed at different categories of readers.
Kommersant is a widely circulated daily that focuses primarily, though not exclusively, on business and commerce news for a wide group of readers. According to https://www.kommersant.ru/about/kommersant (accessed 23 January 2020), its daily circulation is around 100,000 to 110,000 copies.
Moskovskiy Komsomolets is a popular newspaper aimed at a general audience, with a print circulation of around 700,000 copies, according to https://ria.ru/20091211/198562973.html.
Vedomosti is a business daily aimed at students and professionals, with quite limited circulation. According to the Russian Wikipedia page https://ru.wikipedia.org/wiki/ведомости, its daily circulation is 75,000 copies.
Novaya Gazeta is regarded as relatively independent and sometimes critical of the Russian government. It is not strictly a daily: in 2019 it was published three times a week.
Its reported circulation in August 2009 was 104,700 (https://web.archive.org/web/20090822153334/http://www.pressaudit.ru/registry).

There are four CSV files, one for each newspaper, named *-sent2.csv, with the following data:
- date
- article's number of words in economy category
- number of words in policy category
- number of words in uncertainty category
- document id
- number of the LDA topic (15 latent topics)
- name of the LDA topic (15 latent topics)
- number of the LDA topic (30 topics, 20 for Kommersant)
- name of the LDA topic (30 topics, 20 for Kommersant)
- *20/50: article's number of words in the word2vec dictionary in the categories uncertainty, policy and economy, using the 20 or 50 words with the smallest cosine distance
- pos/neg/sent: percentage of words with positive/negative inclination, and sent = pos - neg
- 1 for standard sentiment lexicons, 2 for Covid-augmented lexicons

Uncertainty indexes and macroeconomic data

Data description

File U_data: data for different uncertainty indices. Symbols are as in the Appendix in the paper. Pairs of uncertainty indices, each followed by the symbol of its columns:
- U computed for all newspapers: U
- U computed for Kommersant only: U(Kom.)
- U under homogeneity of journalistic style: U(Hom.)
- U under heterogeneity of journalistic style: U(Het.)
- U computed with the use of Loukashevich lexicons: U(Louk.)
- U computed with the use of Kaggle lexicons: U(Kag.)
- U weighted by negative sentiments only: U-

Other files with micro data are named by the following convention: RU_s_LDA_VINTAGE_LEXICON, where the integers s, LDA, VINTAGE and LEXICON describe the different ways of computing sentiments, the topic modelling, the vintage of data and the sentiment lexicons applied. The files contain monthly data, mainly the frequencies of articles selected by different methods and weighted by different sentiment indicators.

In detail: Excel_data_recomp_s, where s=0,...,6, and:
- s=0: indices are weighted by crude sentiment frequencies.
- s=1: indices are weighted by 1 +/- crude sentiment frequencies.
- s=2: as for s=1, but the sentiments are rescaled.
- s=3: as for s=1, but the sentiments are values of an exponential distribution.
- s=4: valence is used as the measure of sentiments; see Ferrara E, Yang Z (2015) ‘Measuring Emotional Contagion in Social Media’. PLoS ONE 10: e0142390. doi:10.1371/journal.pone.0142390. Valence is computed from the sentiment ratios, that is, as if s=0. It is, of course, possible to combine valence with switch_sent 1, 2 and 3.
- s=5 and s=6: weights are classified according to the SentiStrength methodology, where the classes are set according to the quantiles of the frequency of sentiments. Four quantile points are used for dividing the sentiments into classes: 0.15, 0.5, 0.75 and 0.9.
- s=5: quantiles are computed for all journals together (assumption of homogeneity of readers' perception).
- s=6: quantiles are computed separately for each journal and lexicon (assumption of heterogeneity of readers' perception).

In each directory, there are 10 files with data. The files are named by the following convention: RU_LDA_VINTAGE_LEXICON, where:
- LDA=0: U is computed using data from all articles in the newspaper.
- LDA=1: U is computed using data from ‘relevant’ articles, where the ‘relevance’ is decided by the 15-topic LDA.
- LDA=2: U is computed using data from ‘relevant’ articles, where the ‘relevance’ is decided by the 30-topic LDA.
- VINTAGE=0: reduced number of words in descriptors is used.
- VINTAGE=1: extended number of words in descriptors is used.
- LEXICON=0: Loukashevich sentiment lexicon is used.
- LEXICON=1: Kaggle lexicon is used.

In each file, there are 18 sheets containing the following:
Sheet 1: U: monthly EPU frequencies computed using articles with non-stemmed descriptors.
Sheet 2: U1: monthly EPU frequencies computed using articles with stemmed descriptors.
Sheet 3: U20: monthly EPU frequencies computed using Word2vec 20-words descriptors.
Sheet 4: U50: monthly EPU frequencies computed using Word2vec 50-words descriptors.
Sheet 5: U+: monthly EPU frequencies weighted by positive sentiments using all articles with non-stemmed descriptors.
Sheet 6: U1+: monthly EPU frequencies weighted by positive sentiments using all articles with stemmed descriptors.
Sheet 7: U20+: monthly EPU frequencies weighted by positive sentiments using Word2vec 20-words descriptors.
Sheet 8: U50+: monthly EPU frequencies weighted by positive sentiments using Word2vec 50-words descriptors.
Sheet 9: U-: monthly EPU frequencies weighted by negative sentiments using all articles with non-stemmed descriptors.
Sheet 10: U1-: monthly EPU frequencies weighted by negative sentiments using all articles with stemmed descriptors.
Sheet 11: U20-: monthly EPU frequencies weighted by negative sentiments using Word2vec 20-words descriptors.
Sheet 12: U50-: monthly EPU frequencies weighted by negative sentiments using Word2vec 50-words descriptors.
Sheet 13: U+-: monthly EPU frequencies weighted by the balance of sentiments using all articles with non-stemmed descriptors.
Sheet 14: U1+-: monthly EPU frequencies weighted by the balance of sentiments using stemmed descriptors.
Sheet 15: U20+-: monthly EPU frequencies weighted by the balance of sentiments using Word2vec 20-words descriptors.
Sheet 16: U50+-: monthly EPU frequencies weighted by the balance of sentiments using Word2vec 50-words descriptors.
Sheet 17: Sentiment scores: positive, negative and balanced (difference between positive and negative scores) for each article.
Sheet 18: Total number of articles considered for each month.

Except for sheets 17 and 18, columns C to F contain monthly frequencies for the newspapers Kommersant, Vedomosti, Moskovskiy Komsomolets and Novaya Gazeta, respectively. Columns containing zeros should be ignored.
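
The RU_LDA_VINTAGE_LEXICON naming convention and the column layout described above could be decoded with a small helper. The following Python sketch is not part of the dataset. The function name describe_file, the dictionary names, and the assumption that the three codes appear as underscore-separated integers (for example RU_1_0_1) are illustrative only and should be checked against the actual file names.

```python
# Illustrative sketch only: names and the exact file-name pattern are assumptions.

# Meaning of each code, as given in the data description.
LDA = {
    0: "all articles in the newspaper",
    1: "relevant articles selected by the 15-topic LDA",
    2: "relevant articles selected by the 30-topic LDA",
}
VINTAGE = {
    0: "reduced number of words in descriptors",
    1: "extended number of words in descriptors",
}
LEXICON = {
    0: "Loukashevich sentiment lexicon",
    1: "Kaggle lexicon",
}

# Spreadsheet columns C to F in every sheet except sheets 17 and 18.
NEWSPAPER_COLUMNS = {
    "C": "Kommersant",
    "D": "Vedomosti",
    "E": "Moskovskiy Komsomolets",
    "F": "Novaya Gazeta",
}


def describe_file(stem):
    """Decode a file stem assumed to look like 'RU_1_0_1'."""
    parts = stem.split("_")
    if len(parts) != 4 or parts[0] != "RU":
        raise ValueError(f"unexpected file name: {stem!r}")
    try:
        lda, vintage, lexicon = (int(p) for p in parts[1:])
        return {
            "articles": LDA[lda],
            "descriptors": VINTAGE[vintage],
            "lexicon": LEXICON[lexicon],
        }
    except (ValueError, KeyError):
        # Non-integer or out-of-range codes are reported uniformly.
        raise ValueError(f"unexpected codes in file name: {stem!r}") from None


if __name__ == "__main__":
    for key, value in describe_file("RU_1_0_1").items():
        print(f"{key}: {value}")
```

Raising an error on unknown codes, instead of guessing, keeps a misnamed file from being silently attributed to the wrong LDA, vintage or lexicon setting.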

Created:

2022-03-18
