Sentiment analysis of tech media articles using VADER package and co-occurrence analysis
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/2612867
Description:
Sources: more than 140,000 articles (January 2016 to March 2019), with the following share per outlet:
Gigaom 0.5%
Euractiv 0.9%
The Conversation 1.3%
Politico Europe 1.3%
IEEE Spectrum 1.8%
Techforge 4.3%
Fastcompany 4.5%
The Guardian (Tech) 9.2%
Arstechnica 10.0%
Reuters 11%
Gizmodo 17.5%
ZDNet 18.3%
The Register 19.5%
Methodology
The sentiment analysis was carried out with VADER*, an open-source, lexicon- and rule-based sentiment analysis tool. VADER was designed specifically for social media text, but it can also be applied to other text sources. Its sentiment lexicon was compiled from various sources (other sentiment data sets, Twitter, etc.) and validated by human raters. An advantage of VADER is that its rule-based engine accounts for word-order-sensitive relations and degree modifiers.
Because VADER is more robust on shorter, social-media-like texts, the analysed articles were divided into paragraphs. The analysis was carried out for the social issues presented in the co-occurrence exercise.
The process included the following main steps:
1. The 100 most frequently co-occurring terms are identified for every social issue (using the co-occurrence methodology).
2. The articles containing the given social issue and the co-occurring term are identified.
3. The identified articles are divided into paragraphs.
4. The social issue and the co-occurring term are removed from each paragraph.
5. The VADER sentiment analysis is carried out for every identified and modified paragraph.
6. The average over the paragraphs is calculated for the given word pair as the final result.
The procedure is therefore repeated for each of the 100 co-occurring terms of every identified social issue.
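Steps 2 to 6 above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the blank-line paragraph delimiter, whole-word case-insensitive matching, and the choice to score only paragraphs in which at least one of the two terms actually appeared are all assumptions. The scorer is passed in as a function, so the sketch runs without extra packages; `make_vader_scorer` shows how the `vaderSentiment` package could be plugged in if it is installed.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional


def contains(text: str, term: str) -> bool:
    """Whole-word, case-insensitive search (an assumed matching rule)."""
    return re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE) is not None


def split_paragraphs(article: str) -> list[str]:
    """Step 3: split on blank lines (assumed paragraph delimiter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", article) if p.strip()]


def remove_terms(paragraph: str, terms: Iterable[str]) -> str:
    """Step 4: delete every occurrence of the given terms, then tidy whitespace."""
    for term in terms:
        paragraph = re.sub(rf"\b{re.escape(term)}\b", " ", paragraph, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", paragraph).strip()


def pair_sentiment(
    articles: Iterable[str],
    issue: str,
    cooc_term: str,
    score: Callable[[str], float],
) -> Optional[float]:
    """Average paragraph score for one (social issue, co-occurring term) pair."""
    scores = []
    for article in articles:
        # Step 2: keep only articles that mention both terms.
        if not (contains(article, issue) and contains(article, cooc_term)):
            continue
        for paragraph in split_paragraphs(article):
            # Assumption: only paragraphs mentioning one of the terms are scored.
            if not (contains(paragraph, issue) or contains(paragraph, cooc_term)):
                continue
            cleaned = remove_terms(paragraph, [issue, cooc_term])
            if cleaned:
                scores.append(score(cleaned))  # Step 5
    return mean(scores) if scores else None  # Step 6


def make_vader_scorer() -> Callable[[str], float]:
    """Return a compound-score function backed by vaderSentiment (must be installed)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]
```

With the package available, `pair_sentiment(articles, "privacy", "encryption", make_vader_scorer())` would return the average compound score for that hypothetical word pair, or `None` if no paragraph qualifies.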
The sentiment analysis yields a compound score for every paragraph. The score is calculated from the sum of the valence scores of the words in the paragraph and normalised to the range from -1 (most extreme negative) to +1 (most extreme positive). Finally, the average is calculated over the paragraph results. The social issue and the co-occurring term are removed so that their own sentiment does not distort the score. Such a term can be misleading, for example when a technology or company attempts to solve a negative issue: the surrounding text would be positive, but the negative term itself would pull the paragraph's score down.
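The normalisation step can be illustrated with a short sketch. It assumes the form used in the VADER reference implementation, x / sqrt(x² + α) with α = 15; the full tool also applies its rule-based adjustments (degree modifiers, negation, punctuation) before this step, which are omitted here. The valence numbers in the example are invented purely to show why removing a negative term changes the result.

```python
import math


def normalize(valence_sum: float, alpha: float = 15.0) -> float:
    """Map an unbounded valence sum into [-1, 1] as x / sqrt(x^2 + alpha)."""
    norm = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, norm))


# Hypothetical word valences for one paragraph (illustrative values only).
surrounding = [1.9, 2.1]   # positive words around the issue term
issue_term = -3.1          # the negative issue term itself

with_term = normalize(sum(surrounding) + issue_term)
without_term = normalize(sum(surrounding))

# Removing the negative term leaves a clearly higher compound score.
print(f"with term: {with_term:.3f}, without term: {without_term:.3f}")
```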
The presented tables include the most extreme co-occurring terms for the analysed social issue. The examples are chosen from the 30 words with the most positive sentiment and the 30 words with the most negative sentiment. The presented graphs show the evolution of sentiment for the social issues. The paragraphs analysed for the graphs are selected as follows:
1. The articles containing the given social issue are identified.
2. The paragraphs containing the social issue are selected for sentiment analysis.
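The selection above, combined with a time axis, could look like the sketch below. The description does not state the time granularity or the input format, so the monthly buckets, the `(publication date, text)` tuples, and the function names are assumptions. Since no term removal is mentioned for the graphs, the paragraphs are scored unmodified. As before, `score` is any function returning a compound score for a text.

```python
import re
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Callable, Iterable


def issue_paragraphs(text: str, issue: str) -> list[str]:
    """Return the paragraphs (split on blank lines) that mention the issue."""
    pattern = re.compile(rf"\b{re.escape(issue)}\b", re.IGNORECASE)
    return [p.strip() for p in re.split(r"\n\s*\n", text) if pattern.search(p)]


def monthly_sentiment(
    articles: Iterable[tuple[date, str]],
    issue: str,
    score: Callable[[str], float],
) -> dict[str, float]:
    """Average paragraph score per publication month, keyed as 'YYYY-MM'."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for published, text in articles:
        for paragraph in issue_paragraphs(text, issue):
            buckets[published.strftime("%Y-%m")].append(score(paragraph))
    # Months without any matching paragraph are simply absent from the result.
    return {month: mean(values) for month, values in sorted(buckets.items())}
```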
*Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Files
- sentiments_mod11.csv: sentiment scores based on the chosen unigrams
- sentiments_mod22.csv: sentiment scores based on the chosen bigrams
- sentiments_cooc_mod11.csv, sentiments_cooc_mod12.csv, sentiments_cooc_mod21.csv, sentiments_cooc_mod22.csv: combinations of co-occurrences (unigrams-unigrams, unigrams-bigrams, bigrams-unigrams, bigrams-bigrams, respectively)
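The column layout of these files is not documented here, so a sensible first step is to look at the header and a few rows before doing any analysis. The helper below is a small sketch for that purpose; the function name, the comma delimiter, and the UTF-8 encoding are assumptions that may need adjusting for the actual files.

```python
import csv
from itertools import islice


def peek_csv(path: str, n_rows: int = 5, delimiter: str = ","):
    """Return the header row and the first n_rows data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))
    return header, rows
```

For example, `peek_csv("sentiments_cooc_mod11.csv")` would show which columns the unigram-unigram co-occurrence file contains, assuming it has been downloaded to the working directory.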
Created:
2020-01-24



