Jinesis/gt-harmbench

Name: Jinesis/gt-harmbench
Creator: Jinesis
Published: 2026-04-27 05:17:01
License: No description available

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/Jinesis/gt-harmbench


Resource description:

---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
- text-classification
tags:
- game-theory
- ai-safety
- benchmark
- strategic-reasoning
- nash-equilibrium
- social-welfare
pretty_name: GT-HarmBench
size_categories:
- 1K<n<10K
---

# GT-HarmBench

**GT-HarmBench** is a game-theoretic AI safety benchmark that evaluates whether large language models can reason strategically in realistic, AI-risk–grounded scenarios. Each scenario presents two players with a 2×2 payoff matrix embedded in a first-person narrative drawn from real AI risk contexts.

Models are evaluated on their ability to:

- identify and play **Nash equilibria** (individual rationality),
- select actions that maximise **utilitarian welfare** (sum of payoffs),
- select actions that maximise **Rawlsian welfare** (min payoff, fairness),
- select actions that maximise **Nash social welfare** (product of payoffs).

## Dataset summary

| Property | Value |
|---|---|
| Total scenarios | 2,009 |
| Columns | 19 |
| MIT AI Risk Database–sourced | 2,009 |
| Game types | 6 |

### Game type distribution

| Game type | Count |
|---|---|
| Prisoner's Dilemma | 654 |
| Chicken | 491 |
| Stag hunt | 403 |
| Coordination | 252 |
| Bach or Stravinski | 170 |
| No conflict | 39 |

### Risk level distribution

Scenarios are rated on a 1–10 severity scale. Levels 1 and 2 do not occur in the data.

| Risk level | Count |
|---|---|
| 3 | 8 |
| 4 | 139 |
| 5 | 144 |
| 6 | 673 |
| 7 | 771 |
| 8 | 177 |
| 9 | 89 |
| 10 | 8 |

## Dataset structure

### Columns

| Column | Description |
|---|---|
| `id` | Integer row identifier. |
| `ev_id` | MIT AI Risk Database event ID. Empty for synthetic scenarios. |
| `risk_category` | Top-level risk category from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `risk_subcategory` | Sub-category from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `description` | Original risk description from the MIT AI Risk Database. Empty for synthetic scenarios. |
| `formal_game` | Canonical game type: one of Prisoner's Dilemma, Chicken, Stag hunt, Coordination, Bach or Stravinski, No conflict. |
| `story_row` | First-person narrative given to the row player describing the strategic situation. |
| `story_col` | First-person narrative given to the column player. |
| `actions_row` | Python-list string of the two action labels available to the row player. |
| `actions_column` | Python-list string of the two action labels available to the column player. |
| `1_1_payoff` | Payoff vector [row, col] when both players choose action 1. |
| `1_2_payoff` | Payoff vector [row, col] when row plays action 1 and column plays action 2. |
| `2_1_payoff` | Payoff vector [row, col] when row plays action 2 and column plays action 1. |
| `2_2_payoff` | Payoff vector [row, col] when both players choose action 2. |
| `risk_level` | Integer 1–10 severity rating of the underlying AI risk. |
| `target_nash_equilibria` | Pipe-separated pure Nash equilibria as (row_action, col_action) tuples. |
| `target_utility_maximizing` | Pipe-separated action profiles that maximise the sum of payoffs (utilitarian welfare). |
| `target_rawlsian` | Pipe-separated action profiles that maximise the minimum payoff (Rawlsian fairness). |
| `target_nash_social_welfare` | Pipe-separated action profiles that maximise the product of payoffs (Nash bargaining solution). |

### Payoff matrix convention

Each row in the dataset encodes a 2×2 game. The four payoff columns (`1_1_payoff` … `2_2_payoff`) each contain a two-element list `[row_payoff, col_payoff]`. The matrix layout is:

```
               col action 1   col action 2
row action 1   1_1_payoff     1_2_payoff
row action 2   2_1_payoff     2_2_payoff
```

### Target columns

Target columns encode optimal action profiles as pipe-separated strings of Python tuples, e.g.:

```
('go bold', 'go bold')|('play safe', 'play safe')
```

Multiple profiles appear when ties exist. Each tuple is `(row_action, col_action)`.

## Data sources

1. **MIT AI Risk Database** (`ev_id` is non-null, 2,009 rows): risk descriptions from the MIT AI Risk Repository were classified as game-theoretic and then contextualized into first-person strategic narratives by an LLM pipeline. Scenarios were filtered by quality (score ≥ 8/10) and equilibria consistency (score ≥ 8/10) before inclusion. Matching Pennies scenarios (mixed-strategy only, no pure Nash) are excluded.

## Usage

```python
from datasets import load_dataset

# Without `split`, load_dataset returns a DatasetDict, which cannot be indexed by row.
# Assumption: the data is exposed under the default "train" split.
ds = load_dataset("causalNLP/gt-harmbench", split="train")
print(ds[0])  # first scenario as a dict
```

### Running the benchmark

```bash
# Evaluate a model
uv run python3 -m eval.eval \
    --model-name openai/gpt-4o \
    --dataset data/gt-harmbench-hf.csv \
    --times 1 --temperature 1.0 --experiment-name my-eval
```

See the [GitHub repository](https://github.com/causalNLP/gt-harmbench) for the full evaluation harness.

## Citation

```bibtex
@dataset{gt-harmbench,
  title = {GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory},
  year  = {2026},
  url   = {https://huggingface.co/datasets/causalNLP/gt-harmbench}
}
```

## License

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
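
## Worked example: recomputing targets from a row

The payoff matrix convention and the four target columns described above can be illustrated with a small, standard-library-only sketch. It is not the benchmark's own code: the helper names (`parse_row`, `pure_nash`, `argmax_profiles`, `parse_targets`, `labels`) and the `example` row are hypothetical, and the sketch assumes that payoff cells and action lists parse as Python literals, that Nash equilibria are checked with a weak inequality, and that all tied profiles are returned. The dataset's actual tie handling may differ.

```python
import ast
from itertools import product

CELLS = list(product((0, 1), repeat=2))  # (row_index, col_index), 0-based


def parse_row(row):
    """Parse one dataset row into action labels and a payoff dict."""
    row_actions = ast.literal_eval(row["actions_row"])
    col_actions = ast.literal_eval(row["actions_column"])
    payoffs = {}
    for i, j in CELLS:
        cell = row[f"{i + 1}_{j + 1}_payoff"]
        if isinstance(cell, str):  # assumption: string cells are Python literals
            cell = ast.literal_eval(cell)
        payoffs[(i, j)] = tuple(cell)  # (row_payoff, col_payoff)
    return row_actions, col_actions, payoffs


def pure_nash(payoffs):
    """Cells where neither player gains by deviating alone (weak inequality)."""
    return [
        (i, j)
        for i, j in CELLS
        if payoffs[(i, j)][0] >= payoffs[(1 - i, j)][0]
        and payoffs[(i, j)][1] >= payoffs[(i, 1 - j)][1]
    ]


def argmax_profiles(payoffs, welfare):
    """All cells that maximise welfare(row_payoff, col_payoff), ties included."""
    scores = {cell: welfare(*payoffs[cell]) for cell in CELLS}
    best = max(scores.values())
    return [cell for cell in CELLS if scores[cell] == best]


def parse_targets(text):
    """Split a pipe-separated target string into (row_action, col_action) tuples."""
    return [ast.literal_eval(part) for part in text.split("|")] if text else []


# Hypothetical Prisoner's Dilemma row, not taken from the dataset.
example = {
    "actions_row": "['cooperate', 'defect']",
    "actions_column": "['cooperate', 'defect']",
    "1_1_payoff": "[3, 3]",
    "1_2_payoff": "[0, 5]",
    "2_1_payoff": "[5, 0]",
    "2_2_payoff": "[1, 1]",
}

row_actions, col_actions, payoffs = parse_row(example)


def labels(cells):
    """Map 0-based cell indices back to (row_action, col_action) labels."""
    return [(row_actions[i], col_actions[j]) for i, j in cells]


print("Nash:       ", labels(pure_nash(payoffs)))
print("Utilitarian:", labels(argmax_profiles(payoffs, lambda r, c: r + c)))
print("Rawlsian:   ", labels(argmax_profiles(payoffs, min)))
print("Nash SW:    ", labels(argmax_profiles(payoffs, lambda r, c: r * c)))
```

For this made-up row the only pure Nash equilibrium is mutual defection, while all three welfare criteria select mutual cooperation, which is the tension the benchmark is built around. To compare against the published labels, `parse_targets` can be applied to a target column such as `target_nash_equilibria` and the result compared with `labels(pure_nash(payoffs))`. Note that the product criterion is only straightforward to interpret when payoffs are non-negative; whether that holds for every row is not stated in the card.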

Provided by:

Jinesis
