EsBBQ

Name: EsBBQ
Creator: maas
Published: 2025-12-05 16:42:18
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-07-19.

Download link:

https://modelscope.cn/datasets/BSC-LT/EsBBQ

Resource description:

# Spanish Bias Benchmark for Question Answering (EsBBQ)

The [Spanish Bias Benchmark for Question Answering (EsBBQ)](https://arxiv.org/abs/2507.11216) is an adaptation of the original [BBQ](https://huggingface.co/datasets/heegyu/bbq) to the Spanish language and the social context of Spain.

## Dataset Description

This dataset is used to evaluate social bias in LLMs in a multiple-choice Question Answering (QA) setting along 10 social categories: _Age_, _Disability Status_, _Gender_, _LGBTQIA_, _Nationality_, _Physical Appearance_, _Race/Ethnicity_, _Religion_, _Socioeconomic Status (SES)_, and _Spanish Region_. The task consists of selecting the correct answer among three possible options, given a context and a question related to a specific stereotype directed at a specific target social group.

EsBBQ evaluates model outputs to questions at two different levels:

1. With an under-informative (ambiguous) context, it assesses the degree to which model responses rely on social biases.
2. With an adequately informative (disambiguated) context, it examines whether the model's biases can lead it to disregard the correct answer.

The dataset is constructed from templates, from which all possible combinations of contexts, questions and placeholders are generated.

![](./images/example_template.png)

### Statistics:

| **Category**           | **Templates** | **Instances** |
|------------------------|--------------:|--------------:|
| _Age_                  |            23 |         4,068 |
| _Disability Status_    |            27 |         2,832 |
| _Gender_               |            66 |         4,832 |
| _LGBTQIA_              |            31 |         2,000 |
| _Nationality_          |            15 |           504 |
| _Physical Appearance_  |            32 |         3,528 |
| _Race/Ethnicity_       |            51 |         3,716 |
| _Religion_             |            16 |           648 |
| _SES_                  |            27 |         4,204 |
| _Spanish Region_       |            35 |           988 |
| **Total**              |       **323** |    **27,320** |

## Dataset Structure

The dataset instances are divided into the 10 social categories they address. Each instance contains the following fields:

- `instance_id` (int): instance id.
- `template_id` (int): id of the template from which the instance has been generated.
- `version` (str): version of the template from which the instance has been generated.
- `template_label` (str): category of the template, based on the classes proposed by [Jin et al. (2024)](https://arxiv.org/abs/2307.16778). Possible values: Simply-Transferred (`t`), for original BBQ templates addressing stereotypes prevalent in Spain that need no modification; Target-Modified (`m`), for original BBQ templates addressing stereotypes prevalent in Spain that need a modification of the target groups; and Newly-Created (`n`), for new manually created templates.
- `flipped` (str): how the order of the template placeholders is permuted. Possible values: `original`, if there are no permutations; `ambig`, if the placeholders are flipped only in the ambiguous context; `disambig`, if the placeholders are flipped only in the disambiguating context and answers; and `all`, if the placeholders are flipped in both contexts and all answers.
- `question_polarity` (str): polarity of the question. Possible values: negative (`neg`) or non-negative (`nonneg`).
- `context_condition` (str): type of context. Possible values: ambiguous (`ambig`) or disambiguated (`disamb`).
- `category` (str): social dimension the instance falls into.
- `subcategory` (str): subcategory the instance falls into.
- `relevant_social_value` (str): stereotype addressed.
- `stereotyped_groups` (str): all target groups affected by the stereotype addressed.
- `answer_info` (dict): information about each answer (`ans0`, `ans1` and `ans2`). Values are lists with two elements: (1) the value the placeholder is filled with in the answer and (2) meta-information about the social group of the answer value.
- `stated_gender_info` (str): gender the instance applies to.
- `proper_nouns_only` (bool): if `true`, the instance is used with proper nouns as proxies of the social groups addressed.
- `question` (str): negative or non-negative question.
- `ans0`, `ans1` and `ans2` (str): answer choices. `ans2` always contains the *unknown* option. *Note*: to avoid an over-reliance on the word *unknown*, we employ a list of semantically equivalent expressions at evaluation time.
- `question_type` (str): alignment with the stereotype assessed, based on the context. Possible values: stereotypical (`pro-stereo`), anti-stereotypical (`anti-stereo`) or not applicable (`n/a`).
- `label` (int): index of the correct answer.
- `source` (str): reference attesting the stereotype.

## Dataset Sources

- [Github Repository](https://github.com/langtech-bsc/EsBBQ-CaBBQ)
- Paper [More Information Needed]

## Dataset Curators

Language Technologies Unit (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).

## Uses

EsBBQ is intended to be used to evaluate _stereotyping_ social bias in language models.

## Out-of-Scope Use

EsBBQ must **not** be used as training data.

## Acknowledgements

This work has been promoted and financed by the Generalitat de Catalunya through the [Aina](https://projecteaina.cat/) project. This work is also funded by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Desarrollo Modelos ALIA.

## License Information

[CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed)

## Ethical Considerations

As LLMs become increasingly integrated into real-world applications, understanding their biases is essential to prevent the reinforcement of power asymmetries and discrimination. With this dataset, we aim to address the evaluation of social bias in the Spanish language and the social context of Spain.

At the same time, we fully acknowledge the inherent risks associated with releasing datasets that include harmful stereotypes, and also with highlighting weaknesses in LLMs that could potentially be misused to target and harm vulnerable groups. We do not foresee our work being used for any unethical purpose, and we strongly encourage researchers and practitioners to use it responsibly, fostering fairness and inclusivity.

## Citation

### Bibtex:

```
@misc{ruizfernández2025esbbqcabbqspanishcatalan,
  title={EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering},
  author={Valle Ruiz-Fernández and Mario Mina and Júlia Falcão and Luis Vasquez-Reina and Anna Sallés and Aitor Gonzalez-Agirre and Olatz Perez-de-Viñaspre},
  year={2025},
  eprint={2507.11216},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11216},
}
```
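
As a rough illustration of how the fields described under Dataset Structure might be used, the sketch below groups instances by `context_condition` and computes plain accuracy against `label` for each of the two evaluation levels. It is not the authors' evaluation code and does not implement any bias score from the paper. The instances are hypothetical and show only a subset of the documented fields, the `predictions` mapping and the `accuracy_by_context` helper are invented for this example, and the ambiguous instance is assumed to have the *unknown* option (`ans2`, index 2) as its correct answer.

```python
from collections import defaultdict

# Hypothetical instances: only a subset of the documented fields is shown,
# and every value below is invented for illustration.
instances = [
    {"instance_id": 0, "category": "Age", "context_condition": "ambig",
     "question_polarity": "neg", "label": 2},  # assumed: unknown option is correct
    {"instance_id": 1, "category": "Age", "context_condition": "disamb",
     "question_polarity": "neg", "label": 0},
    {"instance_id": 2, "category": "Age", "context_condition": "disamb",
     "question_polarity": "nonneg", "label": 1},
]

# Hypothetical model predictions: instance_id -> index of the chosen answer.
predictions = {0: 2, 1: 1, 2: 1}


def accuracy_by_context(instances, predictions):
    """Return accuracy per `context_condition` value (`ambig` / `disamb`)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        cond = inst["context_condition"]
        total[cond] += 1
        # A prediction counts as correct when it matches the `label` index.
        if predictions.get(inst["instance_id"]) == inst["label"]:
            correct[cond] += 1
    return {cond: correct[cond] / total[cond] for cond in total}


scores = accuracy_by_context(instances, predictions)
print(scores)  # {'ambig': 1.0, 'disamb': 0.5}
```

Reporting the two context conditions separately mirrors the two levels described in the dataset card: reliance on bias when the context is ambiguous, and failure to pick the correct answer when the context is disambiguated.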


Provider:

maas

Created:

2025-07-17
