CaBBQ
Source: ModelScope community (updated 2025-12-18, indexed 2025-07-19)
Download link: https://modelscope.cn/datasets/BSC-LT/CaBBQ
# Catalan Bias Benchmark for Question Answering (CaBBQ)
The [Catalan Bias Benchmark for Question Answering (CaBBQ)](https://arxiv.org/abs/2507.11216) is an adaptation of the original [BBQ](https://huggingface.co/datasets/heegyu/bbq) to the Catalan language and the social context of Spain.
## Dataset Description
This dataset is used to evaluate social bias in LLMs in a multiple-choice Question Answering (QA) setting along 10 social categories: _Age_, _Disability Status_, _Gender_, _LGBTQIA_, _Nationality_, _Physical Appearance_, _Race/Ethnicity_, _Religion_, _Socioeconomic Status (SES)_, and _Spanish Region_.
The task consists of selecting the correct answer among three possible options, given a context and a question related to a specific stereotype directed at a specific target social group.
CaBBQ evaluates model outputs to questions at two different levels:
(1) with an under-informative (ambiguous) context, it assesses the degree to which model responses rely on social biases, and
(2) with an adequately-informative (disambiguated) context, it examines if the model’s biases can lead it to disregard the correct answer.
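
As a rough illustration of these two levels, the sketch below computes plain accuracy separately for ambiguous and disambiguated contexts. It assumes each instance exposes the `context_condition` and `label` fields documented in the dataset structure and that model predictions are answer indices. The `accuracy_by_context` helper and the toy records are hypothetical, and this is a simplification, not the bias score defined in the paper.

```python
from collections import defaultdict


def accuracy_by_context(instances, predictions):
    """Return accuracy per `context_condition` value.

    `instances` is a list of dicts with `context_condition` and `label`;
    `predictions` is a list of predicted answer indices (0, 1 or 2).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        cond = inst["context_condition"]
        total[cond] += 1
        correct[cond] += int(pred == inst["label"])
    return {cond: correct[cond] / total[cond] for cond in total}


# Toy records invented for illustration, not taken from CaBBQ.
toy_instances = [
    {"context_condition": "ambig", "label": 2},
    {"context_condition": "ambig", "label": 2},
    {"context_condition": "disamb", "label": 0},
    {"context_condition": "disamb", "label": 1},
]
toy_predictions = [2, 0, 0, 1]

scores = accuracy_by_context(toy_instances, toy_predictions)
print(scores)
```

Keeping the two conditions apart matters because a model can score well on disambiguated contexts while still falling back on stereotypes when the context is ambiguous.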
The dataset is constructed from templates, out of which all possible combinations of contexts, questions and placeholders are generated.
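
The expansion of a template into instances can be pictured as a Cartesian product over placeholder values and questions. The sketch below is only an illustration under that assumption: the template layout, the placeholder names (`NAME1`, `NAME2`), the filler values and the `expand` helper are all hypothetical and do not reflect the actual CaBBQ generation scripts.

```python
from itertools import product

# Hypothetical template: placeholder names and fillers are invented
# for illustration and do not come from the CaBBQ templates.
template = {
    "context": "{NAME1} and {NAME2} were waiting at the bus stop.",
    "questions": {"neg": "Who was rude?", "nonneg": "Who was polite?"},
    "fillers": {"NAME1": ["group A", "group B"], "NAME2": ["group C"]},
}


def expand(template):
    """Generate one instance per (placeholder values, question) combination."""
    names = list(template["fillers"])
    value_lists = [template["fillers"][name] for name in names]
    instances = []
    for values in product(*value_lists):
        mapping = dict(zip(names, values))
        for polarity, question in template["questions"].items():
            instances.append({
                "context": template["context"].format(**mapping),
                "question": question,
                "question_polarity": polarity,
            })
    return instances


instances = expand(template)
print(len(instances))
```

With two values for one placeholder, one for the other and two questions, this toy template yields four instances.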

### Statistics:
| **Category** | **Templates** | **Instances** |
|------------------------|--------------:|--------------:|
| _Age_ | 23 | 4,068 |
| _Disability Status_ | 27 | 2,832 |
| _Gender_ | 66 | 4,832 |
| _LGBTQIA_ | 31 | 2,000 |
| _Nationality_ | 15 | 504 |
| _Physical Appearance_ | 32 | 3,528 |
| _Race/Ethnicity_ | 51 | 3,716 |
| _Religion_ | 16 | 648 |
| _SES_ | 27 | 4,204 |
| _Spanish Region_ | 35 | 988 |
| **Total** | **323** | **27,320** |
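
The totals in the table can be cross-checked with a few lines of arithmetic. The figures below are copied from the table; the `stats` dictionary and variable names are just illustrative.

```python
# (templates, instances) per category, copied from the statistics table.
stats = {
    "Age": (23, 4068),
    "Disability Status": (27, 2832),
    "Gender": (66, 4832),
    "LGBTQIA": (31, 2000),
    "Nationality": (15, 504),
    "Physical Appearance": (32, 3528),
    "Race/Ethnicity": (51, 3716),
    "Religion": (16, 648),
    "SES": (27, 4204),
    "Spanish Region": (35, 988),
}

total_templates = sum(templates for templates, _ in stats.values())
total_instances = sum(instances for _, instances in stats.values())
largest = max(stats, key=lambda category: stats[category][1])

print(total_templates, total_instances, largest)
```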
## Dataset Structure
The dataset instances are divided into the 10 social categories they address. Each instance contains the following fields:
- `instance_id` (int): instance id.
- `template_id` (int): id of the template out of which the instance has been generated.
- `version` (str): version of the template out of which the instance has been generated.
- `template_label` (str): category of the template, based on the classes proposed by [Jin et al. (2024)](https://arxiv.org/abs/2307.16778). Possible values: Simply-Transferred (`t`), for original BBQ templates addressing stereotypes prevalent in Spain that need no modification; Target-Modified (`m`), for original BBQ templates addressing stereotypes prevalent in Spain that need a modification of the target groups; and Newly-Created (`n`), for new manually created templates.
- `flipped` (str): whether the order of the template placeholders is permuted. Possible values: `original`, if there are no permutations; `ambig`, if the placeholders are flipped only in the ambiguous context; `disambig`, if the placeholders are flipped only in the disambiguating context and answers; and `all`, if the placeholders are flipped in both contexts and all answers.
- `question_polarity` (str): polarity of the question. Possible values: negative (`neg`) or non-negative (`nonneg`).
- `context_condition` (str): type of context. Possible values: ambiguous (`ambig`) or disambiguated (`disamb`).
- `category` (str): social dimension the instance falls into.
- `subcategory` (str): subcategory the instance falls into.
- `relevant_social_value` (str): stereotype addressed.
- `stereotyped_groups` (str): all target groups affected by the stereotype addressed.
- `answer_info` (dict): information about each answer (`ans0`, `ans1` and `ans2`). Values are lists with two elements: (1) the value the placeholder is filled with in the answer and (2) meta-information about the social group of the answer value.
- `stated_gender_info` (str): gender the instance applies to.
- `proper_nouns_only` (bool): if `true`, the instance is used with proper nouns as proxies of the social groups addressed.
- `question` (str): negative or non-negative question.
- `ans0`, `ans1` and `ans2` (str): answer choices. `ans2` always contains the *unknown* option. *Note*: to avoid an over-reliance on the word *unknown*, we employ a list of semantically-equivalent expressions at evaluation time.
- `question_type` (str): alignment with the stereotype assessed, based on the context. Possible values: stereotypical (`pro-stereo`), anti-stereotypical (`anti-stereo`) or not applicable (`n/a`).
- `label` (int): index of the correct answer.
- `source` (str): reference attesting the stereotype.
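
To make the fields concrete, the following sketch formats one instance as a three-option prompt and checks a predicted letter against `label`. Several things here are assumptions: the `context` key (the field list above does not name the field holding the context text), the `UNKNOWN_EXPRESSIONS` list (stand-ins for the semantically equivalent expressions, whose actual wording is not given here), the prompt layout, and the helper names. The toy record is invented and is not a CaBBQ instance.

```python
import random

# Illustrative stand-ins for the semantically equivalent "unknown"
# expressions; the list actually used at evaluation time is not given here.
UNKNOWN_EXPRESSIONS = ["Unknown", "Cannot be determined", "Not enough information"]


def build_prompt(instance, rng):
    """Format one instance as a three-option multiple-choice prompt."""
    # `ans2` is documented as the unknown option, so its wording is varied.
    answers = [instance["ans0"], instance["ans1"], rng.choice(UNKNOWN_EXPRESSIONS)]
    lines = [instance["context"], instance["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip("ABC", answers)]
    return "\n".join(lines)


def is_correct(instance, predicted_letter):
    """Compare a predicted letter (A, B or C) with the `label` index."""
    return "ABC".index(predicted_letter) == instance["label"]


toy = {
    "context": "Two people were waiting at the bus stop.",  # assumed field name
    "question": "Who arrived first?",
    "ans0": "The first person",
    "ans1": "The second person",
    "ans2": "Unknown",
    "label": 2,
}

prompt = build_prompt(toy, random.Random(0))
print(prompt)
print(is_correct(toy, "C"))
```

Passing a seeded `random.Random` keeps the choice of unknown expression reproducible across runs.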
## Dataset Sources
- [GitHub Repository](https://github.com/langtech-bsc/EsBBQ-CaBBQ)
- [Paper](https://arxiv.org/abs/2507.11216)
## Dataset Curators
Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
## Uses
CaBBQ is intended to be used to evaluate _stereotyping_ social bias in language models.
## Out-of-Scope Use
CaBBQ must **not** be used as training data.
## Acknowledgements
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina](https://projecteaina.cat/) project.
This work is also funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Desarrollo Modelos ALIA.
## License Information
[CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed)
## Ethical Considerations
As LLMs become increasingly integrated into real-world applications, understanding their biases is essential to prevent the reinforcement of power asymmetries and discrimination.
With this dataset, we aim to address the evaluation of social bias in the Catalan language and the social context of Spain.
At the same time, we fully acknowledge the inherent risks associated with releasing datasets that include harmful stereotypes, and also with highlighting weaknesses in LLMs that could potentially be misused to target and harm vulnerable groups.
We do not foresee our work being used for any unethical purpose, and we strongly encourage researchers and practitioners to use it responsibly, fostering fairness and inclusivity.
## Citation
### BibTeX:
```
@misc{ruizfernández2025esbbqcabbqspanishcatalan,
title={EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering},
author={Valle Ruiz-Fernández and Mario Mina and Júlia Falcão and Luis Vasquez-Reina and Anna Sallés and Aitor Gonzalez-Agirre and Olatz Perez-de-Viñaspre},
year={2025},
eprint={2507.11216},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.11216},
}
```
Provider: maas
Created: 2025-07-17



