Data for manuscript: "Using Word Embeddings to Probe Sentiment Associations of Politically Loaded Terms in News and Opinion Articles from News Media Outlets"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/4797463
Description:
This data set contains material for the purpose of scientific reproducibility of the accompanying manuscript "Using Word Embeddings to Probe Sentiment Associations of Politically Loaded Terms in News and Opinion Articles from News Media Outlets".
Note that this data set is distributed with an Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License. NonCommercial means you may not use the material for commercial purposes. NoDerivatives means if you remix, transform, or build upon the material, you may not distribute the modified material. Attribution means you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. See attached license terms for details.
The work "Using Word Embeddings to Probe Sentiment Associations of Politically Loaded Terms in News and Opinion Articles from News Media Outlets" describes an analysis of political associations in 27 million diachronic (1975-2019) news and opinion articles from 47 news media outlets popular in the United States. We use embedding models trained on each outlet's content to quantify outlet-specific latent associations between positive/negative sentiment words and terms loaded with political connotations, such as those describing political orientation or party affiliation, and the names of influential politicians and ideologically aligned public figures.
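This description does not specify the exact association metric used in the manuscript. As a minimal sketch of one common approach, and only as an assumption, the example below scores a target term by its mean cosine similarity to a set of positive words minus its mean cosine similarity to a set of negative words. The function names and the toy two-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sentiment_association(target, positive, negative, vectors):
    """Mean similarity to positive words minus mean similarity to negative words.

    Assumption: this is an illustrative score, not necessarily the
    metric used in the manuscript.
    """
    pos = [cosine(vectors[target], vectors[w]) for w in positive]
    neg = [cosine(vectors[target], vectors[w]) for w in negative]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Toy vectors (hypothetical); real models in this data set use 300 dimensions.
toy_vectors = {
    "good": [1.0, 0.0],
    "bad": [0.0, 1.0],
    "term_a": [0.9, 0.1],
    "term_b": [0.1, 0.9],
}

score_a = sentiment_association("term_a", ["good"], ["bad"], toy_vectors)
score_b = sentiment_association("term_b", ["good"], ["bad"], toy_vectors)
```

With a trained gensim model, the `vectors` mapping could be replaced by the model's keyed vectors, which support lookup by word.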
News and opinion articles from the outlets listed in Figure 3 are available in the outlets' online domains and/or public cache repositories such as Google cache, the Internet Wayback Machine [31] and Common Crawl [32]. This work has not analyzed video or audio content of news media organizations, except when the outlet explicitly provides a transcript of such content in article form.
The temporal coverage of articles is not uniform across news outlets. For most media organizations, the availability of news articles in their online domains or Internet cache backups becomes sparser as the articles' age increases. Some news outlets are exceptions, with articles available as far back as the 1970s. The Supplementary Material (SM) illustrates the time ranges of the article data analyzed, based on the online availability of each outlet's articles.
Textual content included in our analysis is limited to the articles' headlines and main text and does not include other article elements such as figure captions. Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased, and markup language tags, URLs, non-alphanumeric characters, punctuation, digits, 330 common stop words and multiple spaces were removed prior to estimating the word embedding models.
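The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the stop word set is a short hypothetical stand-in for the 330-word list (which is not reproduced in this description), the regular expressions are assumptions, and the XPath extraction step is omitted.

```python
import re

# Hypothetical stand-in for the 330 common stop words used by the authors.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for"}

TAG_RE = re.compile(r"<[^>]+>")                 # markup language tags
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # URLs
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # digits, punctuation, symbols


def preprocess(text, stop_words=STOP_WORDS):
    """Return a list of cleaned, lowercased tokens.

    Simplification: only ASCII letters are kept, so accented
    characters are dropped as well.
    """
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    # str.split() with no arguments also collapses multiple spaces.
    return [tok for tok in text.split() if tok not in stop_words]


example = "<p>The Senate passed 3 bills, see https://example.com/a?b=1 for details!</p>"
tokens = preprocess(example)
```

Here `tokens` contains only lowercase words that are not in the stop word set.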
All the analysis scripts and the diachronic word embedding models built from each of the 47 news media outlets analyzed in this work are available in this repository.
For the purpose of reproducibility, we also provide in the above repository the article text used to train the outlets' embedding models, with the caveat that articles not accessible without a subscription have been excluded. In addition, for the included articles, stop words have been removed and the remaining words have been randomly scrambled within a sliding window of size 10 to render the articles incomprehensible to a human reader. These steps were taken to avoid infringing the articles' copyright. They have only a minor impact on Continuous Bag of Words (CBOW) word2vec models, and the results reported in this work are similar when the scrambled article text is used to train the outlet-specific embedding models.
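The exact scrambling procedure is not described here. One simple reading, offered only as an assumption, is to shuffle tokens within consecutive blocks of 10, so that no word moves far from its original position. The function name `scramble` and the block-wise interpretation of the sliding window are hypothetical.

```python
import random


def scramble(tokens, window=10, seed=None):
    """Shuffle tokens inside consecutive blocks of `window` tokens.

    Assumption: a block-wise shuffle stands in for the sliding-window
    scrambling used by the authors, whose details are not given here.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(tokens), window):
        block = list(tokens[start:start + window])
        rng.shuffle(block)
        out.extend(block)
    return out


original = [f"w{i}" for i in range(25)]
scrambled = scramble(original, window=10, seed=0)
```

Because each token stays within its block, most words that shared a 10-word neighbourhood before scrambling still do so afterwards, which is consistent with the limited effect on CBOW reported above.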
We derived outlet-specific word embedding models for every five-year interval within the 1975-2019 time range. The gensim [33] implementation of word2vec was used to train the embedding models. The continuous bag of words (CBOW) architecture performed slightly better than the Skip-Gram architecture on commonly used validation metrics, so it was used for all subsequent analyses.
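As a small sketch of the time slicing, assuming the intervals are aligned to 1975 (1975-1979, 1980-1984, and so on, which is not stated explicitly here), the helpers below enumerate the intervals and map a publication year to its interval. Both function names are hypothetical.

```python
def five_year_intervals(first=1975, last=2019, width=5):
    """List (start_year, end_year) pairs covering first..last inclusive."""
    return [(s, min(s + width - 1, last)) for s in range(first, last + 1, width)]


def interval_for_year(year, first=1975, last=2019, width=5):
    """Return the (start_year, end_year) interval that contains `year`."""
    if not first <= year <= last:
        raise ValueError(f"year {year} is outside {first}-{last}")
    start = first + ((year - first) // width) * width
    return (start, min(start + width - 1, last))


intervals = five_year_intervals()
```

Under this assumption the 1975-2019 range yields nine intervals, so each outlet would have at most nine models, fewer where article coverage is missing.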
For training the word embedding models, the following parameters were used: vector dimensions = 300, window size = 10, negative sampling = 10, downsampling threshold for frequent words = 0.0001, minimum frequency count = 5 (only terms that appear more than 5 times in the corpus were included in the word embedding model vocabulary), and number of training iterations (epochs) through the corpus = 5. The exponent used to shape the negative sampling distribution was the default 0.75.
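The parameters above map onto gensim's `Word2Vec` constructor roughly as shown below. This is a sketch rather than the authors' training script: the keyword names follow gensim 4.x (`vector_size`, `epochs`; older releases used `size` and `iter`), and the names `W2V_PARAMS` and `train_outlet_model` are hypothetical. The gensim import is placed inside the function so that the parameter mapping can be inspected without the library installed.

```python
# Parameter values taken from the description; keyword names are gensim 4.x.
W2V_PARAMS = {
    "sg": 0,              # 0 selects CBOW, 1 would select Skip-Gram
    "vector_size": 300,   # vector dimensions
    "window": 10,         # context window size
    "negative": 10,       # number of negative samples
    "sample": 1e-4,       # downsampling threshold for frequent words
    "min_count": 5,       # gensim ignores words with total frequency below this
    "epochs": 5,          # passes through the corpus
    "ns_exponent": 0.75,  # exponent shaping the negative sampling distribution
}


def train_outlet_model(sentences, **overrides):
    """Train a CBOW word2vec model on an iterable of token lists.

    `sentences` is assumed to hold the articles of one outlet for one
    five-year interval, already tokenized.
    """
    from gensim.models import Word2Vec  # third-party dependency

    params = {**W2V_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

A call such as `train_outlet_model(list_of_token_lists)` returns a model whose `wv` attribute holds the word vectors.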
The performance of the outlet-specific embedding models across a range of commonly used semantic and syntactic tasks (similarity, association and word analogy) was similar to that of popular pre-trained embedding models trained on corpora such as Twitter or Google Books; see the Supplementary Material of the manuscript for detailed validation test results.
Created:
2024-07-19



