
coastalcph/fair-rationales

Source: Hugging Face. Updated: 2023-10-13. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/coastalcph/fair-rationales
Resource description:
---
license: mit
language:
- en
annotations_creators:
- crowdsourced
source_datasets:
- extended
task_categories:
- text-classification
task_ids:
- sentiment-classification
- open-domain-qa
tags:
- bias
- fairness
- rationale
- demographic
pretty_name: FairRationales
---

# Dataset Card for "FairRationales"

## Dataset Summary

We present a new collection of annotations for a subset of CoS-E [[1]](#1), DynaSent [[2]](#2), and SST [[3]](#3)/Zuco [[4]](#4) with demographics-augmented annotations, balanced across age and ethnicity. We asked participants to choose a label and then provide supporting evidence (rationales) based on the input sentence for their answer.

Existing rationale datasets are typically constructed by giving annotators 'gold standard' labels, and having them provide rationales for these labels. Instead, we let annotators provide rationales for labels they choose themselves. This lets them engage in the decision process, but it also acknowledges that annotators with different backgrounds may disagree on classification decisions. Explaining other people’s choices is error-prone [[5]](#5), and we do not want to bias the rationale annotations by providing labels that align better with the intuitions of some demographics than with those of others.

Our annotators are balanced across age and ethnicity for six demographic groups, defined by ethnicity {Black/African American, White/Caucasian, Latino/Hispanic} and age {Old, Young}. Therefore, we can refer to our groups as their cross-product: **{BO, BY, WO, WY, LO, LY}**.

## Dataset Details

### DynaSent

We re-annotate N=480 instances six times (for six demographic groups), comprising 240 instances labeled as positive, and 240 instances labeled as negative in the DynaSent Round 2 **test** set (see [[2]](#2)). This amounts to 2,880 annotations, in total.

To annotate rationales, we formulate the task as marking 'supporting evidence' for the label, following how the task is defined by [[6]](#6).
Specifically, we ask annotators to mark all the words in the sentence that they think show evidence for their chosen label.

#### >Our annotations:

negative 1555 | positive 1435 | no sentiment 470

Total 3460

Note that all the data is uploaded under a single 'train' split (read [## Uses](#uses) for further details).

### SST2

We re-annotate N=263 instances six times (for six demographic groups), which are all the positive and negative instances from the Zuco\* dataset of Hollenstein et al. (2018), comprising a **mixture of train, validation and test** set instances from SST-2, *which should be removed from the original SST data before training any model*.

These 263 re-annotated instances do not contain any instances originally marked as `neutral` (or not conveying sentiment) because rationale annotation for neutral instances is ill-defined. Yet, we still allow annotators to evaluate a sentence as neutral, since we do not want to force our annotators to provide rationales for positive and negative sentiment that they do not see.

\*The Zuco data contains eye-tracking data for 400 instances from SST. By annotating some of these with rationales, we add an extra layer of information for future research.

#### >Our annotations:

positive 1027 | negative 900 | no sentiment 163

Total 2090

Note that all the data is uploaded under a single 'train' split (read [## Uses](#uses) for further details).

### CoS-E

We use the simplified version of CoS-E released by [[6]](#6).

We re-annotate N=500 instances from the CoS-E **test** set six times (for six demographic groups) and ask annotators to first select the answer to the question that they find most correct and sensible, and then mark the words that justify that answer.
Following [[7]](#7), we specify the rationale task with a wording that should guide annotators to make short, precise rationale annotations: ‘For each word in the question, if you think that removing it will decrease your confidence toward your chosen label, please mark it.’

#### >Our annotations:

Total 3760

Note that all the data is uploaded under a single 'train' split (read [## Uses](#uses) for further details).

### Dataset Sources

<!-- Provide the basic links for the dataset. -->

- **Repository:** https://github.com/terne/Being_Right_for_Whose_Right_Reasons
- **Paper:** [Being Right for Whose Right Reasons?](https://aclanthology.org/2023.acl-long.59/)

<a id="uses"></a>

## Uses

<!-- Address questions around how the dataset is intended to be used. -->

In our paper, we present a collection of three existing datasets (SST2, DynaSent and CoS-E) with demographics-augmented annotations to enable profiling of models, i.e., quantifying their alignment (or agreement) with rationales provided by different socio-demographic groups. Such profiling enables us to ask whose right reasons models are being right for and fosters future research on performance equality/robustness.

For each dataset, we provide the data under a single **'train'** split because it is currently not possible to upload a dataset with only a *'test'* split. Note, however, that the original intended use of this collection of datasets was to **test** the quality and alignment of post-hoc explainability methods. If you use it with different splits, please state so clearly to ease reproducibility of your work.

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->

| Variable | Description |
| --- | --- |
| QID | The ID of the Question (i.e. the annotation element/sentence) in the Qualtrics survey. The survey alternated between two kinds of question: one asked for the classification and the next asked for the rationale for that classification to be marked. These two questions and answers for the same sentence are merged into one row, and therefore the QID looks as if every second one is skipped. |
| text_id | A numerical ID given to each unique text/sentence for easy sorting before comparing annotations across groups. |
| sentence | The text/sentence that is annotated, in its original formatting. |
| label | The (new) label given by the respective annotator/participant from Prolific. |
| label_index | The numerical format of the (new) label. |
| original_label | The label from the original dataset (CoS-E/DynaSent/SST). |
| rationale | The tokens marked as rationales by our annotators. |
| rationale_index | The indices of the tokens marked as rationales. In the processed files the indices start at 0. However, in the unprocessed files ("_all.csv", "_before_exclussions.csv") the indices start at 1. |
| rationale_binary | A binary version of the rationales where a token marked as part of the rationale = 1 and tokens not marked = 0. |
| age | The reported age of the annotator/participant (i.e. their survey response). This may be different from the age interval the participant was recruited by (see recruitment_age). |
| recruitment_age | The age interval specified for the Prolific job to recruit the participant by. A mismatch between this and the participant's reported age, when asked in our survey, may mean a number of things, such as: Prolific's information is wrong or outdated; the participant made a mistake when answering the question; the participant was inattentive. |
| ethnicity | The reported ethnicity of the annotator/participant. This may be different from the ethnicity the participant was recruited by (see recruitment_ethnicity). |
| recruitment_ethnicity | The ethnicity specified for the Prolific job to recruit the participant by. Sometimes there is a mismatch between the information Prolific has on participants (which we use for recruitment) and what the participants report when asked again in the survey/task. This seems especially prevalent with some ethnicities, likely because participants may in reality identify with more than one ethnic group. |
| gender | The reported gender of the annotator/participant. |
| english_proficiency | The reported English-speaking ability (proxy for English proficiency) of the annotator/participant. Options were "Not well", "Well" or "Very well". |
| attentioncheck | All participants were given a simple attention check question at the very end of the Qualtrics survey (i.e. after annotation) which was either PASSED or FAILED. Participants who failed the check were still paid for their work, but their responses should be excluded from the analysis. |
| group_id | An ID describing the socio-demographic subgroup a participant belongs to and was recruited by. |
| originaldata_id | The ID given to the text/sentence in the original dataset. In the case of SST data, this refers to IDs within the Zuco dataset, a subset of SST which was used in our study. |
| annotator_ID | Anonymised annotator ID to enable analyses such as annotator (dis)agreement. |
| sst2_id | The processed SST annotations contain an extra column with the index of the text in the SST-2 dataset. -1 means that we were unable to match the text to an instance in SST-2. |
| sst2_split | The processed SST annotations contain an extra column referring to the set in which the instance appears within SST-2. Some instances are part of the train set and should therefore be removed before training a model on SST-2 and testing on our annotations. |

## Dataset Creation

### Curation Rationale

Terne Sasha Thorn Jakobsen, Laura Cabello, Anders Søgaard. Being Right for Whose Right Reasons? In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
#### Annotation process

We refer to our [paper](https://aclanthology.org/2023.acl-long.59/) for further details on the data (Section 3), and specifically on the Annotation Process (Section 3.1) and Annotator Population (Section 3.2).

#### Who are the annotators?

Annotators were recruited via Prolific and consented to the use of their responses and demographic information for research purposes.

The annotation tasks were conducted through Qualtrics surveys. The exact surveys can be found [here](https://github.com/terne/Being_Right_for_Whose_Right_Reasons/tree/main/data/qualtrics_survey_exports).

## References

<a id="1">[1]</a> Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.

<a id="2">[2]</a> Christopher Potts, Zhengxuan Wu, Atticus Geiger, and Douwe Kiela. 2021. DynaSent: A Dynamic Benchmark for Sentiment Analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2388–2404, Online. Association for Computational Linguistics.

<a id="3">[3]</a> Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

<a id="4">[4]</a> Nora Hollenstein, Jonathan Rotsztejn, Marius Troendle, Andreas Pedroni, Ce Zhang, and Nicolas Langer. 2018. ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Scientific Data.

<a id="5">[5]</a> Kate Barasz and Tami Kim. 2022. Choice perception: Making sense (and nonsense) of others’ decisions. Current Opinion in Psychology, 43:176–181.

<a id="6">[6]</a> Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2019. ERASER: A benchmark to evaluate rationalized NLP models.

<a id="7">[7]</a> Cheng-Han Chiang and Hung-yi Lee. 2022. Reexamining human annotations for interpretable NLP.

## Citation

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

```bibtex
@inproceedings{thorn-jakobsen-etal-2023-right,
    title = "Being Right for Whose Right Reasons?",
    author = "Thorn Jakobsen, Terne Sasha and Cabello, Laura and S{\o}gaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.59",
    doi = "10.18653/v1/2023.acl-long.59",
    pages = "1033--1054",
    abstract = "Explainability methods are used to benchmark the extent to which model predictions align with human rationales i.e., are {`}right for the right reasons{'}. Previous work has failed to acknowledge, however, that what counts as a rationale is sometimes subjective. This paper presents what we think is a first of its kind, a collection of human rationale annotations augmented with the annotators demographic information. We cover three datasets spanning sentiment analysis and common-sense reasoning, and six demographic groups (balanced across age and ethnicity). Such data enables us to ask both what demographics our predictions align with and whose reasoning patterns our models{'} rationales align with. We find systematic inter-group annotator disagreement and show how 16 Transformer-based models align better with rationales provided by certain demographic groups: We find that models are biased towards aligning best with older and/or white annotators. We zoom in on the effects of model size and model distillation, finding {--}contrary to our expectations{--} negative correlations between model size and rationale agreement as well as no evidence that either model size or model distillation improves fairness.",
}
```

## Dataset Card Contact

Thanks to [@lautel](https://github.com/lautel) for adding this dataset.

The FairRationales dataset is a collection of annotations for subsets of CoS-E, DynaSent, and SST/Zuco datasets, with demographics-augmented annotations balanced across age and ethnicity. Participants were asked to choose a label and provide supporting evidence (rationales) based on the input sentence. The annotators are balanced across six demographic groups defined by ethnicity and age. The dataset is designed to enable profiling of models by quantifying their alignment with rationales provided by different socio-demographic groups, fostering research on performance equality and robustness. Due to current upload limitations, all annotation data is merged into a single train split.
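As a rough illustration of what "quantifying alignment with rationales provided by different socio-demographic groups" could look like, the sketch below averages a token-level F1 score per `group_id`. Token-level F1 is one possible agreement measure chosen here for illustration, not necessarily the metric used in the paper. The `model_masks` mapping, the toy rows, and the assumption that `rationale_binary` is already parsed into a list of 0/1 integers are all hypothetical.

```python
from collections import defaultdict


def token_f1(human, model):
    """Token-level F1 between two equal-length binary rationale masks."""
    if len(human) != len(model):
        raise ValueError("masks must have the same length")
    tp = sum(1 for h, m in zip(human, model) if h == 1 and m == 1)
    fp = sum(1 for h, m in zip(human, model) if h == 0 and m == 1)
    fn = sum(1 for h, m in zip(human, model) if h == 1 and m == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_group(rows, model_masks):
    """Average token F1 per group_id.

    rows: dicts with 'text_id', 'group_id' and 'rationale_binary'
          (assumed to be a list of 0/1 ints).
    model_masks: hypothetical mapping text_id -> binary mask produced by
          some explainability method.
    """
    scores = defaultdict(list)
    for row in rows:
        mask = model_masks[row["text_id"]]
        scores[row["group_id"]].append(token_f1(row["rationale_binary"], mask))
    return {group: sum(vals) / len(vals) for group, vals in scores.items()}


# Toy data: two annotators from different groups marking the same sentence.
rows = [
    {"text_id": 0, "group_id": "BO", "rationale_binary": [0, 1, 1, 0]},
    {"text_id": 0, "group_id": "WY", "rationale_binary": [0, 0, 1, 1]},
]
model_masks = {0: [0, 1, 1, 0]}
alignment = mean_f1_by_group(rows, model_masks)
print(alignment)
```

Comparing the per-group averages is what makes between-group differences in model alignment visible; any other agreement measure could be swapped in for `token_f1`.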
Provider:
coastalcph
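Summary of original information

Dataset Card for "FairRationales"

Dataset Summary

We present a new collection of annotations for subsets of the CoS-E, DynaSent, and SST/Zuco datasets, with added annotations balanced across age and ethnicity. We asked participants to choose a label and to provide evidence (rationales) from the input sentence supporting their choice. Existing rationale datasets are typically constructed by giving annotators 'gold standard' labels and having them provide rationales for those labels. Instead, we let annotators provide rationales for labels they choose themselves. This lets them engage in the decision process, and it also acknowledges that annotators with different backgrounds may disagree on classification decisions. Explaining other people's choices is error-prone, and we do not want to bias the rationale annotations by providing labels that align better with the intuitions of some groups.

Our annotators fall into six groups by age and ethnicity, defined as the cross-product of ethnicity {Black/African American, White/Caucasian, Latino/Hispanic} and age {Old, Young}: {BO, BY, WO, WY, LO, LY}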
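The six group codes are simply the cross-product of the ethnicity and age initials. A minimal sketch of that construction follows; the dictionary layout and function name are illustrative, and whether the `group_id` column stores exactly these strings is an assumption.

```python
from itertools import product

# Initials come from the group names in the card; the dict layout is illustrative.
ETHNICITIES = {
    "B": "Black/African American",
    "W": "White/Caucasian",
    "L": "Latino/Hispanic",
}
AGES = {"O": "Old", "Y": "Young"}


def group_codes():
    """Return the six group codes as ethnicity initial + age initial."""
    return [eth + age for eth, age in product(ETHNICITIES, AGES)]


codes = group_codes()
print(codes)  # ['BO', 'BY', 'WO', 'WY', 'LO', 'LY']
```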

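Dataset Details

DynaSent

We re-annotated N=480 instances, each annotated six times (once per group), comprising 240 instances labeled positive and 240 instances labeled negative, from the DynaSent Round 2 test set. This yields 2,880 annotations in total. To annotate rationales, we define the task as marking 'supporting evidence' for the label; specifically, annotators are asked to mark all the words that they think support their chosen label.

Our annotations:

  • Negative: 1555
  • Positive: 1435
  • No sentiment: 470
  • Total: 3460

All data is uploaded under a single train split.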

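SST2

We re-annotated N=263 instances, each annotated six times (once per group). These instances come from the Zuco dataset of Hollenstein et al. and comprise a mixture of instances from the SST-2 train, validation and test sets. The 263 re-annotated instances do not include any instance originally labeled neutral, because rationale annotation for neutral instances is ill-defined. However, we still allow annotators to rate a sentence as neutral, since we do not want to force annotators to provide rationales for positive or negative sentiment that they do not see.

Our annotations:

  • Positive: 1027
  • Negative: 900
  • No sentiment: 163
  • Total: 2090

All data is uploaded under a single train split.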
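Because the re-annotated sentences overlap with the SST-2 train, validation and test sets, the card advises removing them from the SST data before training a model. The sketch below shows one way to do that with the `sst2_id` and `sst2_split` columns. It assumes that `sst2_split` holds the string `"train"` for train-set instances and that each SST-2 training example carries an integer `idx` field; both are assumptions to check against the actual files. An `sst2_id` of -1 marks an unmatched text, as described in the card.

```python
def sst2_train_ids_to_drop(annotation_rows):
    """Collect SST-2 indices of annotated sentences that sit in the SST-2 train set.

    Rows with sst2_id == -1 could not be matched to SST-2 and are skipped.
    The split value "train" is an assumed encoding.
    """
    return {
        row["sst2_id"]
        for row in annotation_rows
        if row["sst2_split"] == "train" and row["sst2_id"] != -1
    }


def filter_sst2_train(train_examples, drop_ids):
    """Keep only the training examples whose (assumed) 'idx' is not in drop_ids."""
    return [ex for ex in train_examples if ex["idx"] not in drop_ids]


# Toy data standing in for the annotation file and the SST-2 train set.
annotation_rows = [
    {"sst2_id": 3, "sst2_split": "train"},
    {"sst2_id": 7, "sst2_split": "test"},
    {"sst2_id": -1, "sst2_split": "train"},
]
train_examples = [{"idx": i, "sentence": f"sentence {i}"} for i in range(5)]

drop_ids = sst2_train_ids_to_drop(annotation_rows)
clean_train = filter_sst2_train(train_examples, drop_ids)
print(sorted(drop_ids), [ex["idx"] for ex in clean_train])
```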

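CoS-E

We use the simplified version of CoS-E released by [6]. We re-annotated N=500 instances from the CoS-E test set, each annotated six times (once per group), and asked annotators to first select the answer they find most correct and sensible, and then mark the words that support that answer.

Our annotations:

  • Total: 3760

All data is uploaded under a single train split.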

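Dataset Structure

| Variable | Description |
| --- | --- |
| QID | The ID of the question (i.e. the annotation element/sentence) in the survey. |
| text_id | A numerical ID for each unique text/sentence, for easy sorting before comparing annotations across groups. |
| sentence | The annotated text/sentence, in its original formatting. |
| label | The new label given by the annotator/participant from Prolific. |
| label_index | The numerical format of the new label. |
| original_label | The label from the original dataset (CoS-E/DynaSent/SST). |
| rationale | The words marked as rationales by the annotators. |
| rationale_index | The indices of the words marked as rationales. |
| rationale_binary | A binary version of the rationales: words marked as rationale = 1, unmarked words = 0. |
| age | The age reported by the annotator/participant. |
| recruitment_age | The age interval used to recruit the participant. |
| ethnicity | The ethnicity reported by the annotator/participant. |
| recruitment_ethnicity | The ethnicity used to recruit the participant. |
| gender | The gender reported by the annotator/participant. |
| english_proficiency | The English proficiency reported by the annotator/participant. |
| attentioncheck | The simple attention check question that participants answered at the end of the survey. |
| group_id | An ID describing the socio-demographic subgroup the participant belongs to. |
| originaldata_id | The ID of the text/sentence in the original dataset. |
| annotator_ID | Anonymised annotator ID, enabling analysis of agreement between annotators. |
| sst2_id | The processed SST annotations contain an extra column with the index of the text in the SST-2 dataset. |
| sst2_split | The processed SST annotations contain an extra column indicating the set in which the instance appears within SST-2. |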
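The relationship between `rationale_index` and `rationale_binary`, and the exclusion of failed attention checks, can be sketched as follows. The full card states that indices start at 0 in the processed files and at 1 in the unprocessed ones, and that `attentioncheck` is either PASSED or FAILED. The function names, the toy rows, and the assumption that tokens are whitespace-separated and that indices are already parsed into integers are illustrative only.

```python
def indices_to_binary(indices, n_tokens, one_based=False):
    """Turn rationale token indices into a 0/1 mask of length n_tokens.

    Set one_based=True for indices that start at 1 (unprocessed files).
    """
    offset = 1 if one_based else 0
    mask = [0] * n_tokens
    for i in indices:
        pos = i - offset
        if not 0 <= pos < n_tokens:
            raise IndexError(f"rationale index {i} is out of range")
        mask[pos] = 1
    return mask


def passed_attention_check(rows):
    """Keep only rows whose annotator passed the attention check."""
    return [row for row in rows if row["attentioncheck"] == "PASSED"]


# Toy example; whitespace tokenisation is an assumption.
tokens = "a truly great film".split()
processed_mask = indices_to_binary([1, 2], len(tokens))                    # 0-based
unprocessed_mask = indices_to_binary([2, 3], len(tokens), one_based=True)  # 1-based

rows = [
    {"annotator_ID": "a1", "attentioncheck": "PASSED"},
    {"annotator_ID": "a2", "attentioncheck": "FAILED"},
]
kept = passed_attention_check(rows)
print(processed_mask, unprocessed_mask, [r["annotator_ID"] for r in kept])
```

Both index conventions mark the same two words here, which is a quick sanity check when mixing processed and unprocessed files.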

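Dataset Creation

Annotation process

Annotators were recruited via Prolific and consented to the use of their responses and demographic information for research purposes. The annotation tasks were conducted through Qualtrics surveys.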

