camilablank/sycophancy-datasets

Name: camilablank/sycophancy-datasets
Creator: camilablank
Published: 2026-03-19 18:32:02
License: No description available

Source: Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/camilablank/sycophancy-datasets

Description:

# Sycophancy Evaluation Datasets

Datasets for training and evaluating sycophancy probes and for studying model behavior under user pushback.

## Datasets

### MMLU (Multiple-Choice)

| File | Description | Format |
|------|-------------|--------|
| `mmlu_rated_500.jsonl` | 500 2-turn MMLU conversations with sycophancy ratings (subset of `mmlu_rated.jsonl`) | 2-turn: Q→A, "That's wrong!"→A' |
| `mmlu_rated.jsonl` | Full 2-turn MMLU set with sycophancy ratings | Same structure |
| `mmlu_turn3_rated.jsonl` | 3-turn MMLU with double pushback | 3-turn: Q→A, challenge→A', challenge→A'' |
| `mmlu_single_turn.jsonl` | Single turn: the user states a belief and asks "Do you agree?" | Yes/no sycophancy |
| `mmlu_reiterated_turn2.jsonl` | 2-turn, with the question reiterated in the challenge | The challenge restates the original question |

### TriviaQA

| File | Description |
|------|-------------|
| `triviaqa_rated.jsonl` | Open-ended trivia with "That's wrong!" pushback and sycophancy labels |

### Political Opinions

| File | Description |
|------|-------------|
| `political_opinions_turn2_response_restate.jsonl` | Open-ended opinion questions with "Restate your opinion" pushback (Stanford professor) |

## Probe Training Data Usage

The scripts in `probes/training/` use the following data and splits (seed=42):

| Script | Dataset | Task | Split |
|--------|---------|------|-------|
| `train_probes.py` | `mmlu_rated.jsonl` | Truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_probes_diffmeans.py` | `mmlu_rated.jsonl` | Diff-means truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes.py` | `mmlu_rated.jsonl` | Linear sycophancy probe (2-turn) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes_3turn.py` | `mmlu_turn3_rated.jsonl` | Linear sycophancy probe (3-turn) | 60% train, 20% val, 20% test |

**Split details:** All splits are computed in memory at training time (they are not pre-saved) and use `np.random.RandomState(42)` for reproducibility.
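A minimal sketch of a seeded 60/20/20 split of this kind, assuming the split is stratified by label (one reading of the per-class balancing the card mentions; the actual scripts in `probes/training/` may instead equalize class counts). The helper name `stratified_split` is hypothetical and is not taken from those scripts.

```python
import numpy as np


def stratified_split(labels, seed=42, fractions=(0.6, 0.2, 0.2)):
    """Split sample indices into train/val/test, stratified by label.

    Hypothetical helper, not the code in probes/training/. Each class is
    shuffled with a seeded np.random.RandomState and cut 60/20/20 on its
    own, so every split keeps the class ratio of the full dataset.
    """
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)  # same seed -> same split every run
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)  # indices belonging to this class
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to test
    return np.array(train), np.array(val), np.array(test)
```

Because each class is cut separately, rounding can shift a sample or two between splits for small classes; the seed makes that shift deterministic.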
Splits are balanced per class (correct/incorrect or sycophantic/not-sycophantic).

## Data Format

The datasets are stripped of fields not used in training or evaluation (e.g. `confidence_*`, `metadata`, `extracted_*`, `result_*`). Run `scripts/prepare_hf_datasets.py` to regenerate the stripped copies before upload.

Each JSONL line is a JSON object with:

- `id`: sample identifier
- `history`: list of `{user, bot}` turn dicts
- `sycophancy_rating` (1/2) or `sycophancy_label` (`"maintained_correct"`/`"sycophantic_flip"`)
- Additional fields: `expected_answer`, plus (in unstripped copies only) `confidence_*`, `extracted_t1`, `extracted_t2`, etc.
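A minimal Python sketch for reading these files, based only on the fields listed above. The helper names (`parse_jsonl`, `load_jsonl`, `is_sycophantic`, `final_bot_turn`) are hypothetical. The mapping of `sycophancy_rating` values is an assumption: the card lists `1/2` alongside `"maintained_correct"/"sycophantic_flip"`, and the sketch reads 2 as sycophantic by that ordering, so verify it against the data.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_jsonl(path: str) -> list:
    """Read every record from a JSONL file on disk."""
    with open(path, encoding="utf-8") as f:
        return list(parse_jsonl(f))


def is_sycophantic(record: dict) -> bool:
    """Map either label field to a boolean.

    ASSUMPTION: sycophancy_rating == 2 means sycophantic, mirroring the
    "1/2" vs "maintained_correct"/"sycophantic_flip" order in the card.
    """
    if "sycophancy_label" in record:
        return record["sycophancy_label"] == "sycophantic_flip"
    return record["sycophancy_rating"] == 2


def final_bot_turn(record: dict) -> str:
    """Return the model's reply in the last turn of `history`."""
    return record["history"][-1]["bot"]
```

For example, `records = load_jsonl("mmlu_rated.jsonl")` followed by `[is_sycophantic(r) for r in records]` yields per-sample labels that could feed a split like the one used for probe training.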

Provider:

camilablank