copenlu/sofa
Hugging Face, updated 2024-11-18, indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/copenlu/sofa
Description:
SoFa (Social Fairness) is a large-scale fairness benchmark for social bias probing, designed for fine-grained fairness benchmarking of language models (LMs). It covers over 400 identities and 11,000 stereotypes, for a total of 1.49 million probes, predominantly in US English. The dataset was curated by Marta Marchiori Manerba and Karolina Stańczak and has five columns: ID, Category, Identity, Stereotype, and Probe. The stereotypes are extracted from the SBIC dataset, whose posts from Reddit, Twitter, and other social media are annotated as containing harmful stereotypes; the identities are taken from Czarnowska et al. (2021). SoFa builds on the conceptual framework of SBIC, which aims to represent implicit bias and offensiveness.
The dataset card also covers the rationale for curating the dataset and its source data, the technical and sociotechnical biases, risks, and limitations, recommendations for responsible use, and citation information.
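The five-column layout described above can be illustrated with a minimal sketch. It assumes one plausible way to work with such rows: grouping probes by stereotype so that the same stereotype can be compared across identities. The `ProbeRow` class, the `group_by_stereotype` helper, the lowercase field names, and all row values are hypothetical placeholders, not taken from the dataset; the published column names and their exact casing should be checked against the dataset itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeRow:
    """One row with the five columns named in the description (hypothetical type)."""
    id: int
    category: str
    identity: str
    stereotype: str
    probe: str


def group_by_stereotype(rows):
    """Map each stereotype to a dict of {identity: probe} for the rows that share it."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row.stereotype][row.identity] = row.probe
    return dict(grouped)


# Placeholder rows for illustration only; these are NOT real dataset entries.
rows = [
    ProbeRow(1, "category_x", "identity_a", "stereotype_1", "probe_1a"),
    ProbeRow(2, "category_x", "identity_b", "stereotype_1", "probe_1b"),
    ProbeRow(3, "category_y", "identity_c", "stereotype_2", "probe_2c"),
]

grouped = group_by_stereotype(rows)
for stereotype, probes in grouped.items():
    print(stereotype, sorted(probes))
```

Keying the inner dictionary by identity keeps each stereotype's probes side by side, which is convenient if a per-probe score from a language model is later attached and compared across identities.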
Provided by:
copenlu



