Replication data for: A Method of Automated Nonparametric Content Analysis for Social Science

Name: Replication data for: A Method of Automated Nonparametric Content Analysis for Social Science
Creator: Harvard Dataverse
Published: 2025-05-11 23:52:27
License: No description available

DataCite Commons. Updated 2025-05-11. Indexed 2025-05-17.

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/NV0SZJ

Description:

The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method that classifies a high percentage of individual documents correctly can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis. This article led to the formation of Crimson Hexagon. See also: Software for Automated Content Analysis (http://gking.harvard.edu/category/research-interests/applications/automated-text-analysis)
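The point that an accurate classifier can still give biased category proportions can be illustrated with a little arithmetic. The sketch below is not the method from the article or its software. It assumes a hypothetical two-category problem with made-up error rates, and it uses a simple error-rate correction only to show why counting classifier labels differs from estimating a proportion directly. All function names and numbers are assumptions chosen for illustration.

```python
def expected_labeled_share(true_share, sensitivity, specificity):
    """Expected share of documents a classifier labels as 'in category'.

    true_share:  actual proportion of documents in the category
    sensitivity: P(labeled in category | truly in category)
    specificity: P(labeled out of category | truly out of category)
    """
    false_positive_rate = 1.0 - specificity
    return true_share * sensitivity + (1.0 - true_share) * false_positive_rate


def corrected_share(labeled_share, sensitivity, specificity):
    """Invert the relation above to recover the category proportion."""
    false_positive_rate = 1.0 - specificity
    denominator = sensitivity - false_positive_rate
    if denominator == 0:
        raise ValueError("classifier labels carry no information about the category")
    return (labeled_share - false_positive_rate) / denominator


# Hypothetical numbers: 10% of documents are in the category, and the
# classifier is right 90% of the time on both kinds of document.
true_share = 0.10
sensitivity = 0.90
specificity = 0.90

accuracy = true_share * sensitivity + (1.0 - true_share) * specificity
labeled = expected_labeled_share(true_share, sensitivity, specificity)
recovered = corrected_share(labeled, sensitivity, specificity)

print(f"accuracy on individual documents: {accuracy:.2f}")
print(f"share obtained by counting labels: {labeled:.2f}")
print(f"share after error-rate correction: {recovered:.2f}")
```

With these assumed numbers the classifier is 90% accurate on individual documents, yet counting its labels gives a category share of about 0.18 when the true share is 0.10. The correction recovers 0.10 only because the error rates are taken as known here; in practice they would have to be estimated from hand-coded documents.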

Provider:

Harvard Dataverse

Created:

2019-02-13
