whitecircle-ai/killbench

Name: whitecircle-ai/killbench
Creator: whitecircle-ai
Published: 2026-04-16 15:52:32
License: cc-by-nc-4.0

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/whitecircle-ai/killbench

Resource description:

---
dataset_info:
  features:
  - name: row_id
    dtype: string
  - name: run_id
    dtype: string
  - name: source_kind
    dtype: string
  - name: setup_id
    dtype: string
  - name: group_id
    dtype: string
  - name: roll_idx
    dtype: int32
  - name: scenario_id
    dtype: int32
  - name: scenario_name
    dtype: string
  - name: scenario_title
    dtype: string
  - name: scenario_context
    dtype: string
  - name: scenario_domain
    dtype: string
  - name: language
    dtype: string
  - name: varied_param
    dtype: string
  - name: model_id
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: participants_displayed
    list:
    - name: gender
      dtype: string
    - name: age
      dtype: int32
    - name: role
      dtype: string
    - name: nationality
      dtype: string
    - name: religion
      dtype: string
    - name: skin_color
      dtype: string
    - name: body_type
      dtype: string
    - name: orientation
      dtype: string
    - name: gender_identity
      dtype: string
    - name: politics
      dtype: string
    - name: phone
      dtype: string
    - name: building_type
      dtype: string
    - name: enemy_count
      dtype: int32
    - name: civilian_count
      dtype: int32
    - name: civilian_nationality
      dtype: string
    - name: civilian_religion
      dtype: string
    - name: tactical_idx
      dtype: int32
    - name: item_kind
      dtype: string
  - name: index_map
    list: int32
  - name: success
    dtype: bool
  - name: error
    dtype: string
  - name: response_text
    dtype: string
  - name: reasoning_text
    dtype: string
  - name: usage_json
    dtype: string
  - name: parsed_response
    struct:
    - name: chosen_number
      dtype: int32
    - name: chosen_number_original
      dtype: int32
    - name: chosen_nationality
      dtype: string
    - name: chosen_religion
      dtype: string
    - name: chosen_skin_color
      dtype: string
    - name: chosen_body_type
      dtype: string
    - name: chosen_orientation
      dtype: string
    - name: chosen_gender_identity
      dtype: string
    - name: chosen_politics
      dtype: string
    - name: chosen_phone
      dtype: string
    - name: is_refusal
      dtype: bool
    - name: confidence
      dtype: string
    - name: parse_method
      dtype: string
    - name: reason
      dtype: string
    - name: raw_text
      dtype: string
    - name: raw
      dtype: string
  splits:
  - name: train
    num_bytes: 5358849928
    num_examples: 1368936
  download_size: 1758993694
  dataset_size: 5358849928
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-4.0
language:
- ar
- en
- es
- he
- ru
- zh
tags:
- ai-safety
- bias
- ethics
- benchmark
size_categories:
- 1M<n<10M
---

# KillBench

A large-scale dataset for measuring demographic bias in LLM decision-making under ethical dilemmas.

## Overview

KillBench presents language models with hypothetical life-or-death scenarios (e.g. lifeboat problems, triage situations) where they must choose one person from a group of four. The participants differ along a single bias dimension (or two in combo mode), while all other attributes are held constant. By aggregating choices across thousands of trials, the dataset reveals systematic demographic preferences in model outputs.

- **1,368,936** rows across **15 models**, **6 languages**, and **20 scenarios**
- **8 bias dimensions** tested independently and in **10 intersectional combinations**
- Each test is cross-multiplied across 2 ages (20, 50) and 3 professions (engineer, doctor, teacher)
- Each participant group is shuffled 3 times (rerolls) to control for position bias
- Both free-text and structured (JSON) response modes

## Models

| Model |
|---|
| anthropic/claude-opus-4.5 |
| anthropic/claude-opus-4.6 |
| anthropic/claude-sonnet-4-6 |
| anthropic/claude-haiku-4-5 |
| openai/gpt-5.2-chat |
| openai/gpt-5.3-chat |
| openai/gpt-5.4 |
| google/gemini-3.1-pro-preview |
| google/gemini-3.1-flash-lite-preview |
| deepseek/deepseek-v3.2 |
| qwen/qwen3-235b-a22b |
| x-ai/grok-4.1-fast |
| mistralai/mistral-large-2512 |
| cohere/command-a |
| moonshotai/kimi-k2.5 |

## Bias dimensions

| Dimension | # Values | Examples |
|---|---|---|
| nationality | 20 | Russian, American, Chinese, British, Turkish, Indonesian, Polish, Israeli, ... |
| religion | 16 | Jewish, Buddhist, Satanist, Atheist, ... |
| skin_color | 8 | Light-skinned, Dark-skinned, East Asian appearance, ... |
| body_type | 8 | Obese, Athletic, Wheelchair user, Amputee, ... |
| orientation | 4 | Heterosexual, Homosexual, Bisexual, Asexual |
| gender_identity | 4 | Cisgender, Trans woman (MtF), Trans man (FtM), Non-binary |
| politics | 16 | Communist, Liberal, Conservative, Democrat, Republican, Green, ... |
| phone | 4 | iPhone, Android, Nokia, No phone |

Combo mode tests two dimensions simultaneously (e.g. `nationality+religion`, `skin_color+body_type`).

## Scenarios

20 scenarios across two domains:

- **Civilian** (13): lifeboat, bunker, fire rescue, trolley problem, self-driving car, hospital triage, space station, etc.
- **Military** (7): drone strike, artillery, airstrike with varying tactical parameters and civilian presence.

## Languages

Arabic (ar), English (en), Spanish (es), Hebrew (he), Russian (ru), Chinese (zh)

## Column descriptions

### Identifiers

| Column | Description |
|---|---|
| `row_id` | Unique row identifier (`{run_id}:{index}`) |
| `run_id` | Collection run identifier |
| `setup_id` | Test setup key (`{varied_param}:{scenario_id}:{language}`) |
| `group_id` | Participant group identifier (encodes dimension, language, scenario, age, role) |
| `roll_idx` | Shuffle index (0-2) for position bias control |

### Scenario metadata

| Column | Description |
|---|---|
| `scenario_id` | Numeric scenario identifier |
| `scenario_name` | Machine-readable scenario name (e.g. `trolley_problem`) |
| `scenario_title` | Human-readable scenario title |
| `scenario_context` | Setting (e.g. `bunker`, `sea`, `railway tracks`) |
| `scenario_domain` | `civilian` or `military` |

### Run metadata

| Column | Description |
|---|---|
| `source_kind` | Response mode: `freetext` or `structured` |
| `language` | Prompt language code (ar, en, es, he, ru, zh) |
| `varied_param` | Bias dimension(s) being tested (e.g. `nationality`, `religion+politics`) |
| `model_id` | Model identifier on OpenRouter |

### Prompts

| Column | Description |
|---|---|
| `system_prompt` | System prompt sent to the model |
| `user_prompt` | User prompt with scenario and participant descriptions |

### Participants

| Column | Description |
|---|---|
| `participants_displayed` | List of 4 participant objects as shown to the model, each with demographic attributes |
| `index_map` | Maps displayed position to original position; use this to undo shuffling |

Each participant object contains:

- `gender`, `age`, `role` -- shared demographic attributes
- `nationality`, `religion`, `skin_color`, `body_type`, `orientation`, `gender_identity`, `politics`, `phone` -- bias dimension attributes (only the tested dimension(s) vary; others are null)
- `building_type`, `enemy_count`, `civilian_count`, `civilian_nationality`, `civilian_religion`, `tactical_idx` -- military scenario fields
- `item_kind` -- `person` or `building`

### Model output

| Column | Description |
|---|---|
| `success` | Whether the API call succeeded |
| `error` | Error message if failed |
| `response_text` | Raw model response text |
| `reasoning_text` | Chain-of-thought / reasoning text (if available) |
| `usage_json` | Token usage and cost as JSON string |

### Parsed response

The `parsed_response` struct contains the canonical interpretation of the model's choice:

| Field | Description |
|---|---|
| `chosen_number` | Participant number chosen (1-4, after shuffling) |
| `chosen_number_original` | Original participant number (before shuffling) |
| `chosen_nationality`, `chosen_religion`, ... | Demographic value of the chosen participant for each axis |
| `is_refusal` | Whether the model refused to choose |
| `confidence` | Parse confidence level |
| `parse_method` | How the response was parsed (`structured` or `gemini`) |
| `reason` | Model's stated reason for the choice |
| `raw_text` | Raw parsed text |
| `raw` | Raw parser output |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("whitecircle-ai/killbench", split="train")

# Filter by model and dimension
claude = ds.filter(
    lambda x: x["model_id"] == "anthropic/claude-opus-4.5"
    and x["varied_param"] == "nationality"
)
```

## Collection

Data was collected using the [killbench-collector](https://github.com/whitecircle-ai/research-killbench-collection) via the OpenRouter API. Free-text responses were parsed using Gemini 2.5 Flash as a judge.
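As a follow-up to the Usage snippet, the sketch below shows one possible way to aggregate choices for a single bias dimension using the `success`, `is_refusal`, and `chosen_<dimension>` fields described in the column tables. The helper name `tally_choices` and the sample rows are hypothetical and only illustrate the record shape; this is not the authors' analysis code.

```python
from collections import Counter


def tally_choices(rows, dimension):
    """Count how often each value of `dimension` was chosen.

    Rows with a failed API call, a refusal, or no parsed value are skipped.
    """
    counts = Counter()
    for row in rows:
        parsed = row.get("parsed_response") or {}
        if not row.get("success") or parsed.get("is_refusal"):
            continue
        value = parsed.get(f"chosen_{dimension}")
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical rows shaped like the dataset's records (values are invented).
sample_rows = [
    {"success": True, "parsed_response": {"is_refusal": False, "chosen_phone": "Nokia"}},
    {"success": True, "parsed_response": {"is_refusal": False, "chosen_phone": "iPhone"}},
    {"success": True, "parsed_response": {"is_refusal": True, "chosen_phone": None}},
    {"success": False, "parsed_response": None},
    {"success": True, "parsed_response": {"is_refusal": False, "chosen_phone": "Nokia"}},
]

phone_counts = tally_choices(sample_rows, "phone")
print(phone_counts.most_common())
```

On the real data, any iterable of row dictionaries should work in place of `sample_rows`, for example the filtered `claude` dataset from the Usage section.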

Provider:

whitecircle-ai

Collected summary

Dataset introduction

Construction method

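KillBench is an AI ethics evaluation dataset built on a controlled experimental design. Large language models are presented with hypothetical life-or-death dilemmas, such as lifeboat problems or triage situations. Each scenario involves four participants who differ along a single bias dimension, or along two intersecting dimensions, while all other attributes are held constant. Data was collected through the OpenRouter API across 15 models, 6 languages, and 20 scenarios, with each participant group shuffled three times to control for position bias, yielding 1,368,936 structured records.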
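The three-shuffle design can be illustrated with a minimal sketch. It assumes a 0-based `index_map` in which entry `i` holds the original position of the participant displayed at position `i`; the dataset card only states that `index_map` maps displayed position to original position, so the indexing base, the seeding, and the function names here are assumptions rather than the collector's actual implementation.

```python
import random


def shuffle_with_index_map(participants, seed):
    """Return (displayed, index_map) for one reroll.

    index_map[i] is the original position of the participant shown at
    displayed position i (0-based here, by assumption).
    """
    rng = random.Random(seed)
    index_map = list(range(len(participants)))
    rng.shuffle(index_map)
    displayed = [participants[orig] for orig in index_map]
    return displayed, index_map


def original_position(displayed_pos, index_map):
    """Undo the shuffle for a single displayed position."""
    return index_map[displayed_pos]


group = ["A", "B", "C", "D"]  # placeholder participants
# One reroll per roll_idx value (0, 1, 2), seeded for reproducibility.
rerolls = [shuffle_with_index_map(group, seed=roll_idx) for roll_idx in range(3)]
for displayed, index_map in rerolls:
    print(displayed, index_map)
```

Keeping the map next to each displayed order means a choice made on the shuffled list can always be traced back to the unshuffled participant.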

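Features

The dataset measures decision-making bias in language models systematically and along multiple dimensions. It covers eight independent bias dimensions (nationality, religion, skin color, body type, sexual orientation, gender identity, political affiliation, and phone), which are also tested in ten intersectional combinations. Each record contains the model's raw output and the parsed choice, together with the full demographic attributes of the participants, the scenario metadata, and the model's reasoning text where available. This design lets researchers analyze systematic preferences across ethical scenarios, languages, and combinations of demographic attributes.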

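Usage

Researchers can load KillBench directly with the Hugging Face datasets library and use its metadata fields to filter and analyze the data along multiple dimensions. Typical uses include filtering by model identifier, bias dimension, or scenario type to compare how different models decide on a given ethical dimension. The parsed response structure gives direct access to the model's choice, its refusal flag, and its stated reason, which supports both quantitative statistics and qualitative analysis. The dataset supports large-scale, reproducible evaluation of bias in language models' ethical judgments across cultures and scenarios.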
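As one example of the quantitative analysis mentioned above, the following sketch computes a per-model refusal rate from the `model_id`, `success`, and `parsed_response.is_refusal` fields. The function name, the model identifiers `model-a` and `model-b`, and the sample rows are hypothetical; the choice to exclude failed API calls from the denominator is an assumption, not a convention stated by the dataset.

```python
from collections import defaultdict


def refusal_rate_by_model(rows):
    """Return {model_id: refusals / successful calls}."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for row in rows:
        if not row.get("success"):
            continue  # failed API calls carry no parsed choice
        parsed = row.get("parsed_response") or {}
        totals[row["model_id"]] += 1
        if parsed.get("is_refusal"):
            refusals[row["model_id"]] += 1
    return {model: refusals[model] / totals[model] for model in totals}


# Hypothetical rows; model names and values are invented for illustration.
sample_rows = [
    {"model_id": "model-a", "success": True, "parsed_response": {"is_refusal": True}},
    {"model_id": "model-a", "success": True, "parsed_response": {"is_refusal": False}},
    {"model_id": "model-a", "success": False, "parsed_response": None},
    {"model_id": "model-b", "success": True, "parsed_response": {"is_refusal": False}},
]

rates = refusal_rate_by_model(sample_rows)
print(rates)
```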

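Background and challenges

Background overview

KillBench was created against a backdrop of growing attention to AI ethics and safety research. Built recently by whitecircle-ai, it aims to evaluate systematic demographic bias in the decisions large language models make under ethical dilemmas. Its core research question is whether a model's decision in a hypothetical life-or-death scenario is improperly influenced by sensitive attributes such as a person's nationality, religion, or skin color. With twenty scenarios spanning civilian and military domains, tested on fifteen models in six languages, KillBench provides empirical evidence for understanding and mitigating implicit bias in AI decision-making and supports the development of explainable, fair AI systems.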

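Current challenges

The dataset addresses a core problem in AI ethics: quantifying and evaluating bias in model decisions. The primary difficulty is designing ethical dilemmas that are ecologically valid yet tightly controlled, so that the effect of a single bias dimension can be isolated. Construction involved several technical obstacles: keeping prompts semantically consistent across six languages, designing a randomization procedure that removes position bias, and developing robust parsing to interpret both free-text and structured model responses. Balancing ethical review against experimental realism in sensitive settings such as military scenarios, and keeping the data current as models are updated quickly, are further challenges for building and maintaining the dataset.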

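Common use cases

Classic use cases

In AI ethics and bias research, KillBench is used to evaluate systematic bias in the decisions large language models make under moral dilemmas. Across thousands of life-or-death trials, such as lifeboat or trolley problems, the model is asked to choose one of four participants with different demographic characteristics. Working across multiple languages and models, researchers use the dataset to quantify model preferences along eight dimensions, including nationality, religion, and skin color, and so expose potentially discriminatory patterns in model decisions.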
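One simple way to quantify such a preference is a selection rate: how often a value is chosen divided by how often it is displayed. The sketch below assumes that `chosen_number` is a 1-based index into `participants_displayed` (the card describes it as the participant number, 1-4, after shuffling) and uses hypothetical sample rows; it is an illustrative metric, not the one the dataset's authors necessarily use.

```python
from collections import Counter


def selection_rates(rows, dimension):
    """Return {value: times chosen / times displayed} for one dimension."""
    shown = Counter()
    chosen = Counter()
    for row in rows:
        parsed = row.get("parsed_response") or {}
        number = parsed.get("chosen_number")
        if not row.get("success") or parsed.get("is_refusal") or number is None:
            continue
        participants = row["participants_displayed"]
        for person in participants:
            shown[person[dimension]] += 1
        # Assumption: chosen_number is a 1-based index into participants_displayed.
        chosen[participants[number - 1][dimension]] += 1
    return {value: chosen[value] / shown[value] for value in shown}


def make_row(order, chosen_number):
    """Build a hypothetical row for the `phone` dimension (illustration only)."""
    return {
        "success": True,
        "participants_displayed": [{"phone": value} for value in order],
        "parsed_response": {"is_refusal": False, "chosen_number": chosen_number},
    }


sample_rows = [
    make_row(["iPhone", "Android", "Nokia", "No phone"], 1),
    make_row(["Nokia", "No phone", "iPhone", "Android"], 3),
]

rates = selection_rates(sample_rows, "phone")
print(rates)
```

When every group shows four distinct values and the model has no preference, each rate should approach 0.25 over many trials, so deviations from that baseline indicate which values are favored or avoided.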

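Practical applications

The dataset's practical applications lie mainly in safety auditing and compliance assessment of AI systems. Technology companies and regulators can use the KillBench evaluation framework to test language models for bias before deployment and check that their outputs conform to ethical norms and social values. In the development of military and civilian automated decision systems, such as drone strikes and self-driving cars, the dataset helps assess the fairness of model decisions in high-pressure situations and reduce the social risk caused by algorithmic bias.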

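Derived and related work

Building on KillBench, the research community has produced a body of work on measuring and mitigating bias in multimodal settings. For example, researchers have extended the framework to image and video understanding models and developed cross-modal bias benchmarks. The dataset has also prompted localized bias evaluation tools for specific cultural contexts, such as the Middle East and East Asia, as well as debiasing methods that combine reinforcement learning with adversarial training. Together, this work deepens the understanding of the complexity of AI ethics.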

The above content was collected and summarized by 遇见数据集.