whitecircle/killbench
Source: Hugging Face. Updated 2026-04-16; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/whitecircle/killbench
Resource description:
---
dataset_info:
  features:
  - name: row_id
    dtype: string
  - name: run_id
    dtype: string
  - name: source_kind
    dtype: string
  - name: setup_id
    dtype: string
  - name: group_id
    dtype: string
  - name: roll_idx
    dtype: int32
  - name: scenario_id
    dtype: int32
  - name: scenario_name
    dtype: string
  - name: scenario_title
    dtype: string
  - name: scenario_context
    dtype: string
  - name: scenario_domain
    dtype: string
  - name: language
    dtype: string
  - name: varied_param
    dtype: string
  - name: model_id
    dtype: string
  - name: system_prompt
    dtype: string
  - name: user_prompt
    dtype: string
  - name: participants_displayed
    list:
    - name: gender
      dtype: string
    - name: age
      dtype: int32
    - name: role
      dtype: string
    - name: nationality
      dtype: string
    - name: religion
      dtype: string
    - name: skin_color
      dtype: string
    - name: body_type
      dtype: string
    - name: orientation
      dtype: string
    - name: gender_identity
      dtype: string
    - name: politics
      dtype: string
    - name: phone
      dtype: string
    - name: building_type
      dtype: string
    - name: enemy_count
      dtype: int32
    - name: civilian_count
      dtype: int32
    - name: civilian_nationality
      dtype: string
    - name: civilian_religion
      dtype: string
    - name: tactical_idx
      dtype: int32
    - name: item_kind
      dtype: string
  - name: index_map
    list: int32
  - name: success
    dtype: bool
  - name: error
    dtype: string
  - name: response_text
    dtype: string
  - name: reasoning_text
    dtype: string
  - name: usage_json
    dtype: string
  - name: parsed_response
    struct:
    - name: chosen_number
      dtype: int32
    - name: chosen_number_original
      dtype: int32
    - name: chosen_nationality
      dtype: string
    - name: chosen_religion
      dtype: string
    - name: chosen_skin_color
      dtype: string
    - name: chosen_body_type
      dtype: string
    - name: chosen_orientation
      dtype: string
    - name: chosen_gender_identity
      dtype: string
    - name: chosen_politics
      dtype: string
    - name: chosen_phone
      dtype: string
    - name: is_refusal
      dtype: bool
    - name: confidence
      dtype: string
    - name: parse_method
      dtype: string
    - name: reason
      dtype: string
    - name: raw_text
      dtype: string
    - name: raw
      dtype: string
  splits:
  - name: train
    num_bytes: 5358849928
    num_examples: 1368936
  download_size: 1758993694
  dataset_size: 5358849928
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-4.0
language:
- ar
- en
- es
- he
- ru
- zh
tags:
- ai-safety
- bias
- ethics
- benchmark
size_categories:
- 1M<n<10M
---
# KillBench
A large-scale dataset for measuring demographic bias in LLM decision-making under ethical dilemmas.
## Overview
KillBench presents language models with hypothetical life-or-death scenarios (e.g. lifeboat problems, triage situations) where they must choose one person from a group of four. The participants differ along a single bias dimension (or two in combo mode), while all other attributes are held constant. By aggregating choices across thousands of trials, the dataset reveals systematic demographic preferences in model outputs.
- **1,368,936** rows across **15 models**, **6 languages**, and **20 scenarios**
- **8 bias dimensions** tested independently and in **10 intersectional combinations**
- Each test is crossed with 2 ages (20, 50) and 3 professions (engineer, doctor, teacher)
- Each participant group is shuffled 3 times (rerolls) to control for position bias
- Both free-text and structured (JSON) response modes
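The rerolls make it possible to check whether a model favors a displayed slot regardless of who occupies it. The following is a minimal sketch of such a check, not the authors' analysis code: it assumes rows are plain dicts carrying the `success` column and the `parsed_response` struct documented further down, and the function name `position_shares` is hypothetical.

```python
from collections import Counter

def position_shares(rows):
    """Share of valid choices that landed on each displayed position (1-4)."""
    counts = Counter()
    for row in rows:
        parsed = row.get("parsed_response") or {}
        # Skip failed API calls and refusals; they carry no choice.
        if not row.get("success") or parsed.get("is_refusal"):
            continue
        number = parsed.get("chosen_number")
        if number is not None:
            counts[number] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: counts[pos] / total for pos in sorted(counts)}

# Synthetic rows for illustration only (not taken from the dataset).
sample = [
    {"success": True, "parsed_response": {"chosen_number": 1, "is_refusal": False}},
    {"success": True, "parsed_response": {"chosen_number": 1, "is_refusal": False}},
    {"success": True, "parsed_response": {"chosen_number": 3, "is_refusal": False}},
    {"success": True, "parsed_response": {"chosen_number": None, "is_refusal": True}},
    {"success": False, "parsed_response": None},
]
print(position_shares(sample))
```

With four displayed participants, shares far from 0.25 per position would suggest a positional preference rather than a demographic one.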
## Models
| Model |
|---|
| anthropic/claude-opus-4.5 |
| anthropic/claude-opus-4.6 |
| anthropic/claude-sonnet-4-6 |
| anthropic/claude-haiku-4-5 |
| openai/gpt-5.2-chat |
| openai/gpt-5.3-chat |
| openai/gpt-5.4 |
| google/gemini-3.1-pro-preview |
| google/gemini-3.1-flash-lite-preview |
| deepseek/deepseek-v3.2 |
| qwen/qwen3-235b-a22b |
| x-ai/grok-4.1-fast |
| mistralai/mistral-large-2512 |
| cohere/command-a |
| moonshotai/kimi-k2.5 |
## Bias dimensions
| Dimension | # Values | Examples |
|---|---|---|
| nationality | 20 | Russian, American, Chinese, British, Turkish, Indonesian, Polish, Israeli, ... |
| religion | 16 | Jewish, Buddhist, Satanist, Atheist, ... |
| skin_color | 8 | Light-skinned, Dark-skinned, East Asian appearance, ... |
| body_type | 8 | Obese, Athletic, Wheelchair user, Amputee, ... |
| orientation | 4 | Heterosexual, Homosexual, Bisexual, Asexual |
| gender_identity | 4 | Cisgender, Trans woman (MtF), Trans man (FtM), Non-binary |
| politics | 16 | Communist, Liberal, Conservative, Democrat, Republican, Green, ... |
| phone | 4 | iPhone, Android, Nokia, No phone |
Combo mode tests two dimensions simultaneously (e.g. `nationality+religion`, `skin_color+body_type`).
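As an illustration, a `varied_param` value can be split on `+` to recover the tested dimension(s) and to pull the matching attributes out of a participant object. This is a sketch under the assumption that `+` only ever appears as the separator between dimension names; the helper names are hypothetical and the sample participant is synthetic.

```python
def split_varied_param(varied_param):
    """Split e.g. "nationality+religion" into ["nationality", "religion"]."""
    return varied_param.split("+")

def varied_values(participant, varied_param):
    """Return only the attributes of `participant` that are under test."""
    return {dim: participant.get(dim) for dim in split_varied_param(varied_param)}

# Synthetic participant for illustration; untested dimensions are null.
participant = {
    "gender": "male", "age": 20, "role": "engineer",
    "nationality": "Polish", "religion": "Buddhist", "politics": None,
}

print(split_varied_param("nationality"))
# ['nationality']
print(varied_values(participant, "nationality+religion"))
# {'nationality': 'Polish', 'religion': 'Buddhist'}
```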
## Scenarios
20 scenarios across two domains:
- **Civilian** (13): lifeboat, bunker, fire rescue, trolley problem, self-driving car, hospital triage, space station, etc.
- **Military** (7): drone strike, artillery, airstrike with varying tactical parameters and civilian presence.
## Languages
Arabic (ar), English (en), Spanish (es), Hebrew (he), Russian (ru), Chinese (zh)
## Column descriptions
### Identifiers
| Column | Description |
|---|---|
| `row_id` | Unique row identifier (`{run_id}:{index}`) |
| `run_id` | Collection run identifier |
| `setup_id` | Test setup key (`{varied_param}:{scenario_id}:{language}`) |
| `group_id` | Participant group identifier (encodes dimension, language, scenario, age, role) |
| `roll_idx` | Shuffle index (0-2) for position bias control |
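The identifier formats in the table can be unpacked with ordinary string splitting. The sketch below assumes that no component of `setup_id` contains a colon and that the `index` part of `row_id` is an integer; neither is stated above. The function names and the sample identifiers are hypothetical.

```python
from typing import NamedTuple

class SetupKey(NamedTuple):
    varied_param: str
    scenario_id: int
    language: str

def parse_setup_id(setup_id):
    """Parse "{varied_param}:{scenario_id}:{language}" into a SetupKey."""
    varied_param, scenario_id, language = setup_id.split(":")
    return SetupKey(varied_param, int(scenario_id), language)

def parse_row_id(row_id):
    """Parse "{run_id}:{index}"; split from the right in case run_id has colons."""
    run_id, index = row_id.rsplit(":", 1)
    return run_id, int(index)

# Made-up identifiers that follow the documented patterns.
print(parse_setup_id("religion+politics:7:en"))
# SetupKey(varied_param='religion+politics', scenario_id=7, language='en')
print(parse_row_id("example-run:42"))
# ('example-run', 42)
```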
### Scenario metadata
| Column | Description |
|---|---|
| `scenario_id` | Numeric scenario identifier |
| `scenario_name` | Machine-readable scenario name (e.g. `trolley_problem`) |
| `scenario_title` | Human-readable scenario title |
| `scenario_context` | Setting (e.g. `bunker`, `sea`, `railway tracks`) |
| `scenario_domain` | `civilian` or `military` |
### Run metadata
| Column | Description |
|---|---|
| `source_kind` | Response mode: `freetext` or `structured` |
| `language` | Prompt language code (ar, en, es, he, ru, zh) |
| `varied_param` | Bias dimension(s) being tested (e.g. `nationality`, `religion+politics`) |
| `model_id` | Model identifier on OpenRouter |
### Prompts
| Column | Description |
|---|---|
| `system_prompt` | System prompt sent to the model |
| `user_prompt` | User prompt with scenario and participant descriptions |
### Participants
| Column | Description |
|---|---|
| `participants_displayed` | List of 4 participant objects as shown to the model, each with demographic attributes |
| `index_map` | Maps displayed position to original position; use this to undo shuffling |
Each participant object contains:
- `gender`, `age`, `role` -- shared demographic attributes
- `nationality`, `religion`, `skin_color`, `body_type`, `orientation`, `gender_identity`, `politics`, `phone` -- bias dimension attributes (only the tested dimension(s) vary; others are null)
- `building_type`, `enemy_count`, `civilian_count`, `civilian_nationality`, `civilian_religion`, `tactical_idx` -- military scenario fields
- `item_kind` -- `person` or `building`
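To compare rerolls of the same group, the displayed list can be put back into its original order with `index_map`. The sketch below assumes that `index_map[i]` holds the 0-based original position of the participant displayed at position `i`; the card does not state the base or direction explicitly, so verify this against a few rows before relying on it. The function name is hypothetical and the data is synthetic.

```python
def restore_original_order(participants_displayed, index_map):
    """Undo the shuffle, assuming index_map[i] is the 0-based original
    position of the participant shown at displayed position i."""
    original = [None] * len(participants_displayed)
    for displayed_pos, original_pos in enumerate(index_map):
        original[original_pos] = participants_displayed[displayed_pos]
    return original

# Synthetic example: the group A, B, C, D was shown as B, A, D, C.
displayed = [{"phone": "B"}, {"phone": "A"}, {"phone": "D"}, {"phone": "C"}]
index_map = [1, 0, 3, 2]
print([p["phone"] for p in restore_original_order(displayed, index_map)])
# ['A', 'B', 'C', 'D']
```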
### Model output
| Column | Description |
|---|---|
| `success` | Whether the API call succeeded |
| `error` | Error message if failed |
| `response_text` | Raw model response text |
| `reasoning_text` | Chain-of-thought / reasoning text (if available) |
| `usage_json` | Token usage and cost as JSON string |
### Parsed response
The `parsed_response` struct contains the canonical interpretation of the model's choice:
| Field | Description |
|---|---|
| `chosen_number` | Participant number chosen (1-4, after shuffling) |
| `chosen_number_original` | Original participant number (before shuffling) |
| `chosen_nationality`, `chosen_religion`, ... | Demographic value of the chosen participant for each axis |
| `is_refusal` | Whether the model refused to choose |
| `confidence` | Parse confidence level |
| `parse_method` | How the response was parsed (`structured` or `gemini`) |
| `reason` | Model's stated reason for the choice |
| `raw_text` | Raw parsed text |
| `raw` | Raw parser output |
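One way to turn these fields into a bias signal is to divide how often each demographic value was chosen by how often it was shown. The sketch below is one possible metric, not necessarily the one the authors use. It assumes rows are plain dicts with the columns described above, and it skips failed calls and refusals. The function name `selection_rates` is hypothetical.

```python
from collections import Counter

def selection_rates(rows, dimension):
    """For each value of `dimension`: times chosen / times displayed."""
    chosen, shown = Counter(), Counter()
    for row in rows:
        parsed = row.get("parsed_response") or {}
        if not row.get("success") or parsed.get("is_refusal"):
            continue
        value = parsed.get(f"chosen_{dimension}")
        if value is None:
            continue
        # Count every displayed value so rarely shown values are not penalized.
        for participant in row["participants_displayed"]:
            if participant.get(dimension) is not None:
                shown[participant[dimension]] += 1
        chosen[value] += 1
    return {value: chosen[value] / shown[value] for value in shown}

# Synthetic rows for illustration only.
group = [{"nationality": n} for n in ("A", "B", "C", "D")]
sample_rows = [
    {"success": True, "participants_displayed": group,
     "parsed_response": {"is_refusal": False, "chosen_nationality": "A"}},
    {"success": True, "participants_displayed": group,
     "parsed_response": {"is_refusal": False, "chosen_nationality": "B"}},
    {"success": True, "participants_displayed": group,
     "parsed_response": {"is_refusal": True, "chosen_nationality": None}},
]
print(selection_rates(sample_rows, "nationality"))
# {'A': 0.5, 'B': 0.5, 'C': 0.0, 'D': 0.0}
```

With four participants per group, a value chosen at random would be expected to sit near 0.25; what "chosen" means for the participant depends on the scenario prompt.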
## Usage
```python
from datasets import load_dataset
ds = load_dataset("whitecircle-ai/killbench", split="train")
# Filter by model and dimension
claude = ds.filter(lambda x: x["model_id"] == "anthropic/claude-opus-4.5" and x["varied_param"] == "nationality")
```
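Beyond filtering, a common first summary is the refusal rate per model. The following sketch works on any iterable of row dicts, so it could in principle be applied to the `ds` object loaded above, though it is shown here on synthetic rows. The function name `refusal_rate_by_model` is hypothetical.

```python
from collections import defaultdict

def refusal_rate_by_model(rows):
    """Fraction of successful calls per model that were parsed as refusals."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for row in rows:
        if not row.get("success"):
            continue  # failed API calls say nothing about refusal behavior
        parsed = row.get("parsed_response") or {}
        totals[row["model_id"]] += 1
        if parsed.get("is_refusal"):
            refusals[row["model_id"]] += 1
    return {model: refusals[model] / totals[model] for model in totals}

# Synthetic rows for illustration only; the values are not real results.
demo_rows = [
    {"model_id": "cohere/command-a", "success": True,
     "parsed_response": {"is_refusal": True}},
    {"model_id": "cohere/command-a", "success": True,
     "parsed_response": {"is_refusal": False}},
    {"model_id": "openai/gpt-5.4", "success": True,
     "parsed_response": {"is_refusal": False}},
    {"model_id": "openai/gpt-5.4", "success": False,
     "parsed_response": None},
]
print(refusal_rate_by_model(demo_rows))
# {'cohere/command-a': 0.5, 'openai/gpt-5.4': 0.0}
```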
## Collection
Data was collected using the [killbench-collector](https://github.com/whitecircle-ai/research-killbench-collection) via the OpenRouter API. Free-text responses were parsed using Gemini 2.5 Flash as a judge.
Provider: whitecircle



