Jinesis/gt-harmbench
Hugging Face | Updated 2026-04-27 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/Jinesis/gt-harmbench
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
- text-classification
tags:
- game-theory
- ai-safety
- benchmark
- strategic-reasoning
- nash-equilibrium
- social-welfare
pretty_name: GT-HarmBench
size_categories:
- 1K<n<10K
---
# GT-HarmBench
**GT-HarmBench** is a game-theoretic AI safety benchmark that evaluates whether large language models can reason strategically in realistic, AI-risk–grounded scenarios.
Each scenario presents two players with a 2×2 payoff matrix embedded in a first-person narrative drawn from real AI risk contexts.
Models are evaluated on their ability to:
- identify and play **Nash equilibria** (individual rationality),
- select actions that maximise **utilitarian welfare** (sum of payoffs),
- select actions that maximise **Rawlsian welfare** (min payoff, fairness),
- select actions that maximise **Nash social welfare** (product of payoffs).
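The four criteria above can be computed directly from a 2×2 payoff table. The sketch below is illustrative only: the function names and the Prisoner's Dilemma payoffs are assumptions, not values taken from the dataset or its evaluation harness. It treats a profile as a Nash equilibrium when neither player can strictly gain by deviating, so weak equilibria are included; whether the benchmark applies the same tie rule is not stated here.

```python
from itertools import product

# Hypothetical Prisoner's Dilemma payoffs, keyed by (row_action, col_action)
# with 0-based action indices; each value is (row_payoff, col_payoff).
PAYOFFS = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}


def pure_nash(payoffs):
    """Return profiles where neither player gains by deviating unilaterally."""
    equilibria = []
    for r, c in product((0, 1), repeat=2):
        row_ok = payoffs[(r, c)][0] >= payoffs[(1 - r, c)][0]
        col_ok = payoffs[(r, c)][1] >= payoffs[(r, 1 - c)][1]
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def argmax_profiles(payoffs, welfare):
    """Return every profile that attains the maximum of a welfare function."""
    best = max(welfare(p) for p in payoffs.values())
    return sorted(k for k, p in payoffs.items() if welfare(p) == best)


nash = pure_nash(PAYOFFS)
utilitarian = argmax_profiles(PAYOFFS, lambda p: p[0] + p[1])  # sum of payoffs
rawlsian = argmax_profiles(PAYOFFS, min)                       # minimum payoff
nash_welfare = argmax_profiles(PAYOFFS, lambda p: p[0] * p[1])  # product of payoffs
```

In this example the only pure Nash equilibrium is mutual defection `(1, 1)`, while all three welfare criteria select mutual cooperation `(0, 0)`, which is the tension the benchmark is designed to probe.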
## Dataset summary
| | |
|---|---|
| Total scenarios | 2,009 |
| Columns | 19 |
| MIT AI Risk Database–sourced | 2,009 |
| Game types | 6 |
### Game type distribution
| Game type | Count |
|---|---|
| Prisoner's Dilemma | 654 |
| Chicken | 491 |
| Stag hunt | 403 |
| Coordination | 252 |
| Bach or Stravinski | 170 |
| No conflict | 39 |
### Risk level distribution
Scenarios are rated on a 1–10 severity scale; the ratings that actually occur in the dataset range from 3 to 10.
| Risk level | Count |
|---|---|
| 3 | 8 |
| 4 | 139 |
| 5 | 144 |
| 6 | 673 |
| 7 | 771 |
| 8 | 177 |
| 9 | 89 |
| 10 | 8 |
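As a quick consistency check, both distribution tables should add up to the 2,009 scenarios reported in the summary. The snippet below hard-codes the counts from the tables above; the variable names are illustrative.

```python
# Counts copied from the game type and risk level tables.
GAME_TYPE_COUNTS = {
    "Prisoner's Dilemma": 654,
    "Chicken": 491,
    "Stag hunt": 403,
    "Coordination": 252,
    "Bach or Stravinski": 170,
    "No conflict": 39,
}
RISK_LEVEL_COUNTS = {3: 8, 4: 139, 5: 144, 6: 673, 7: 771, 8: 177, 9: 89, 10: 8}
TOTAL = 2009

game_total = sum(GAME_TYPE_COUNTS.values())
risk_total = sum(RISK_LEVEL_COUNTS.values())

# Fraction of scenarios rated 6 or 7, the two most common risk levels.
mid_share = (RISK_LEVEL_COUNTS[6] + RISK_LEVEL_COUNTS[7]) / TOTAL

print(game_total, risk_total, round(mid_share, 3))
```

Both totals match the summary, and risk levels 6 and 7 together account for roughly 72% of the scenarios.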
## Dataset structure
### Columns
| Column | Description |
|---|---|
| `id` | Integer row identifier. |
| `ev_id` | MIT AI Risk Database event ID. Empty for synthetic scenarios. |
| `risk_category` | Top-level risk category from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `risk_subcategory` | Sub-category from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `description` | Original risk description from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `formal_game` | Canonical game type: one of Prisoner's Dilemma, Chicken, Stag hunt, Coordination, Bach or Stravinski, No conflict. |
| `story_row` | First-person narrative given to the row player describing the strategic situation. |
| `story_col` | First-person narrative given to the column player. |
| `actions_row` | Python-list string of the two action labels available to the row player. |
| `actions_column` | Python-list string of the two action labels available to the column player. |
| `1_1_payoff` | Payoff vector [row, col] when both players choose action 1. |
| `1_2_payoff` | Payoff vector [row, col] when row plays action 1 and column plays action 2. |
| `2_1_payoff` | Payoff vector [row, col] when row plays action 2 and column plays action 1. |
| `2_2_payoff` | Payoff vector [row, col] when both players choose action 2. |
| `risk_level` | Integer 1–10 severity rating of the underlying AI risk. |
| `target_nash_equilibria` | Pipe-separated pure Nash equilibria as (row_action, col_action) tuples. |
| `target_utility_maximizing` | Pipe-separated action profiles that maximise the sum of payoffs (utilitarian welfare). |
| `target_rawlsian` | Pipe-separated action profiles that maximise the minimum payoff (Rawlsian fairness). |
| `target_nash_social_welfare` | Pipe-separated action profiles that maximise the product of payoffs (the Nash social welfare objective, as used in the Nash bargaining solution). |
### Payoff matrix convention
Each row in the dataset encodes a 2×2 game.
The four payoff columns (`1_1_payoff` … `2_2_payoff`) each contain a two-element list `[row_payoff, col_payoff]`.
The matrix layout is:
```
              col action 1   col action 2
row action 1  1_1_payoff     1_2_payoff
row action 2  2_1_payoff     2_2_payoff
```
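A minimal sketch of turning one row into a lookup table keyed by action labels follows. The record below is hypothetical (made-up labels and payoffs), and it assumes the payoff cells are stored as list strings in the same way as the action columns; `parse_cell` also accepts real lists in case they are not. `build_game` and `parse_cell` are illustrative names, not part of the dataset tooling.

```python
import ast

# Hypothetical record shaped like one dataset row; all values are made up.
row = {
    "actions_row": "['cooperate', 'defect']",
    "actions_column": "['cooperate', 'defect']",
    "1_1_payoff": "[3, 3]",
    "1_2_payoff": "[0, 5]",
    "2_1_payoff": "[5, 0]",
    "2_2_payoff": "[1, 1]",
}


def parse_cell(value):
    """Accept a list or its string form and return (row_payoff, col_payoff)."""
    parsed = ast.literal_eval(value) if isinstance(value, str) else value
    row_payoff, col_payoff = parsed
    return row_payoff, col_payoff


def build_game(record):
    """Map (row_action_label, col_action_label) to (row_payoff, col_payoff)."""
    row_actions = ast.literal_eval(record["actions_row"])
    col_actions = ast.literal_eval(record["actions_column"])
    game = {}
    # Column names use 1-based indices: "<row index>_<column index>_payoff".
    for i, row_action in enumerate(row_actions, start=1):
        for j, col_action in enumerate(col_actions, start=1):
            game[(row_action, col_action)] = parse_cell(record[f"{i}_{j}_payoff"])
    return game


game = build_game(row)
print(game[("cooperate", "defect")])  # payoff pair stored in 1_2_payoff
```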
### Target columns
Target columns encode optimal action profiles as pipe-separated strings of Python tuples, e.g.:
```
('go bold', 'go bold')|('play safe', 'play safe')
```
Multiple profiles appear when ties exist. Each tuple is `(row_action, col_action)`.
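One way to read a target string back into Python tuples is sketched below. `parse_profiles` is a hypothetical helper, and the naive split on `|` assumes that no action label itself contains a pipe character.

```python
import ast


def parse_profiles(target):
    """Split a pipe-separated target string into a set of (row_action, col_action) tuples."""
    if not target:
        return set()
    # Each part is a Python tuple literal such as "('go bold', 'go bold')".
    return {ast.literal_eval(part.strip()) for part in target.split("|")}


profiles = parse_profiles("('go bold', 'go bold')|('play safe', 'play safe')")
print(len(profiles))
```

Returning a set makes membership checks independent of the order in which tied profiles are listed.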
## Data sources
1. **MIT AI Risk Database** (`ev_id` is non-null, 2,009 rows): risk descriptions from the MIT AI Risk Repository were classified as game-theoretic and then contextualized into first-person strategic narratives by an LLM pipeline.
Scenarios were included only if they scored at least 8/10 on both quality and equilibrium consistency.
Matching Pennies scenarios are excluded because they have no pure-strategy Nash equilibrium, only a mixed-strategy one.
## Usage
```python
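from datasets import load_dataset

# Without `split`, load_dataset returns a DatasetDict, which cannot be indexed by row number.
# "train" is assumed here as the default split name; adjust it if the repository differs.
ds = load_dataset("causalNLP/gt-harmbench", split="train")
print(ds[0])  # first scenario as a dict of column -> value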
```
### Running the benchmark
```bash
# Evaluate a model
uv run python3 -m eval.eval \
  --model-name openai/gpt-4o \
  --dataset data/gt-harmbench-hf.csv \
  --times 1 --temperature 1.0 --experiment-name my-eval
```
See the [GitHub repository](https://github.com/causalNLP/gt-harmbench) for the full evaluation harness.
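For readers who want to score outputs without the harness, the sketch below shows one plausible way to compare a chosen action profile with the four target columns. It is not the repository's evaluation code: the `hits` helper, the record values, and the assumption that a model's answer has already been reduced to a `(row_action, col_action)` tuple are all hypothetical.

```python
import ast

TARGET_COLUMNS = (
    "target_nash_equilibria",
    "target_utility_maximizing",
    "target_rawlsian",
    "target_nash_social_welfare",
)


def hits(record, choice):
    """Return, per target column, whether the chosen profile is among the targets."""
    result = {}
    for column in TARGET_COLUMNS:
        targets = {ast.literal_eval(p.strip()) for p in record[column].split("|")}
        result[column] = choice in targets
    return result


# Hypothetical Prisoner's Dilemma record (made-up labels, target columns only).
record = {
    "target_nash_equilibria": "('defect', 'defect')",
    "target_utility_maximizing": "('cooperate', 'cooperate')",
    "target_rawlsian": "('cooperate', 'cooperate')",
    "target_nash_social_welfare": "('cooperate', 'cooperate')",
}

outcome = hits(record, ("cooperate", "cooperate"))
print(outcome)
```

Averaging these booleans over all scenarios would give one accuracy figure per criterion.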
## Citation
```bibtex
@dataset{gt-harmbench,
title = {GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory},
year = {2026},
url = {https://huggingface.co/datasets/causalNLP/gt-harmbench}
}
```
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
Provider:
Jinesis



