Objectifying the Subjective: Cognitive Biases in Topic Interpretations

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14711182

Resource description:

Abstract: Interpretation of topics is crucial for their downstream applications. State-of-the-art evaluation measures of topic quality, such as coherence and word intrusion, do not measure how much a topic facilitates the exploration of a corpus. To design evaluation measures grounded in a task and a population of users, we conduct user studies to understand how users interpret topics. We propose constructs of topic quality and ask users to assess them in the context of a topic and to provide the rationale behind their evaluations. We use reflexive thematic analysis to identify themes of topic interpretation from these rationales. Users interpret topics based on availability and representativeness heuristics rather than probability. We propose a theory of topic interpretation based on the anchoring-and-adjustment heuristic: users anchor on salient words and make semantic adjustments to arrive at an interpretation. Topic interpretation can be viewed as a judgment made under uncertainty by an ecologically rational user, and hence cognitive-bias-aware user models and evaluation frameworks are needed.

Datasets: Our first dataset is ACL OCL (Rohatgi et al., 2023) (ACL), a scholarly corpus of papers hosted by the ACL Anthology and published from 1952 to September 2022. As a second dataset, we consider U.S. Senate Speeches (SENATE), the texts of U.S. Senate speeches provided by (Gentzkow et al., 2019). We focus on the speeches in the 114th session of Congress (2015-2017). The third dataset consists of “Software Design” (DESIGN) related posts on StackOverflow (SO). Each SO post may have up to five tags that indicate the topic of the post. (Mahadi et al., 2020) use 10 software-design-related SO tags (viz. “design-patterns”, “software-design”, “class-design”, “design-principles”, “system-design”, “code-design”, “api-design”, “language-design”, “dependency-injection” and “architecture”) to identify design-related posts on SO. We used the SO API to extend this set by querying tag descriptions (identified by the PostTypes TagWikiExcerpt and TagWiki) for the keyword “design” and keeping tags with more than 100 posts. We manually removed tags that were not related to software design, resulting in a set of 61 tags. We excluded UI-design-related tags such as css. We identified over 227 thousand design-related question-answer(s) pairs published through the end of December 2020 using the SOTorrent dataset (Baltes et al., 2019). (Datasets.zip contains the three datasets.)

Preprocessing: We tokenized texts using the spaCy library with the model “en_core_web_sm”. We discarded documents with fewer than five words. We used (Hoyle et al., 2021)'s library to do the pre-processing. Following (Vafa et al., 2020), we considered all the unigrams appearing in at least 0.1% and at most 30% of the documents in the corpus. After pre-processing, the SENATE dataset had 17,573 documents, the DESIGN dataset had 174,416 and the ACL dataset had 71,736.

User Interface: We customized the Potato text annotation tool by (Pei et al., 2022) as an interface for the annotation task. (UserInterface.zip contains our customization of Potato.)

Annotations: (Annotations.zip contains user annotations of topics specific to each dataset.)
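The two filtering rules in the preprocessing step (a minimum document length of five words, and a unigram vocabulary restricted to terms appearing in at least 0.1% and at most 30% of documents) can be illustrated with a minimal sketch. This is not the authors' pipeline, which uses spaCy and the (Hoyle et al., 2021) library. The function names `tokenize` and `filter_corpus` are hypothetical, whitespace splitting stands in for spaCy tokenization, and applying the length filter before computing document frequencies is an assumption, since the description does not state the order.

```python
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): plain whitespace splitting,
    # not the spaCy "en_core_web_sm" tokenization used for the datasets.
    return text.split()


def filter_corpus(docs, min_words=5, min_df=0.001, max_df=0.30):
    """Filter tokenized documents by length and document frequency.

    docs: list of token lists.
    Returns (filtered_docs, vocab), where vocab is the set of kept unigrams.
    """
    # Discard documents with fewer than `min_words` tokens.
    kept = [doc for doc in docs if len(doc) >= min_words]
    n_docs = len(kept)

    # Document frequency: number of kept documents containing each unigram.
    doc_freq = Counter()
    for doc in kept:
        doc_freq.update(set(doc))

    # Keep unigrams whose document frequency lies within the given fractions.
    vocab = {
        word
        for word, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    }

    # Restrict each kept document to the retained vocabulary.
    filtered = [[word for word in doc if word in vocab] for doc in kept]
    return filtered, vocab
```

With the default thresholds, a unigram in a corpus of 17,573 documents (the size reported for SENATE) would need to appear in at least 18 and at most 5,271 documents to be retained under this sketch.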

Created:

2025-01-28
