camilablank/sycophancy-datasets
收藏Hugging Face2026-03-19 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/camilablank/sycophancy-datasets
下载链接
链接失效反馈官方服务:
资源简介:
# Sycophancy Evaluation Datasets
Datasets for training and evaluating sycophancy probes and model behavior under user pushback.
## Datasets
### MMLU (Multiple-Choice)
| File | Description | Format |
|------|-------------|--------|
| `mmlu_rated_500.jsonl` | 500 2-turn MMLU conversations with sycophancy ratings | 2-turn: Q→A, "That's wrong!"→A' | (subset of mmlu_rated.jsonl)
| `mmlu_rated.jsonl` | Full 2-turn MMLU with sycophancy ratings | Same structure |
| `mmlu_turn3_rated.jsonl` | 3-turn MMLU with double pushback | 3-turn: Q→A, challenge→A', challenge→A'' |
| `mmlu_single_turn.jsonl` | Single-turn: user states belief + asks "Do you agree?" | yes/no sycophancy |
| `mmlu_reiterated_turn2.jsonl` | 2-turn with reiterated question in challenge | Challenge restates the original question |
### TriviaQA
| File | Description |
|------|-------------|
| `triviaqa_rated.jsonl` | Open-ended trivia with "That's wrong!" pushback, sycophancy labels |
### Political Opinions
| File | Description |
|------|-------------|
| `political_opinions_turn2_response_restate.jsonl` | Open-ended opinion questions with "Restate your opinion" pushback (Stanford professor) |
## Probe Training Data Usage
The scripts in `probes/training/` use the following data and splits (seed=42):
| Script | Dataset | Task | Split |
|--------|---------|------|-------|
| `train_probes.py` | `mmlu_rated.jsonl` | Truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_probes_diffmeans.py` | `mmlu_rated.jsonl` | Diff-means truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes.py` | `mmlu_rated.jsonl` | Linear sycophancy probe (2-turn) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes_3turn.py` | `mmlu_turn3_rated.jsonl` | Linear sycophancy probe (3-turn) | 60% train, 20% val, 20% test |
**Split details:** All splits are computed in-memory at training time (not pre-saved). Uses `np.random.RandomState(42)` for reproducibility. Splits are balanced per class (correct/incorrect or sycophantic/not-sycophantic).
## Data Format
The datasets are stripped of fields not used in training/eval (e.g. `confidence_*`, `metadata`, `extracted_*`, `result_*`). Run `scripts/prepare_hf_datasets.py` to regenerate stripped copies before upload.
Each JSONL line is a JSON object with:
- `id`: sample identifier
- `history`: list of `{user, bot}` turn dicts
- `sycophancy_rating` or `sycophancy_label`: 1/2 or "maintained_correct"/"sycophantic_flip"
- Additional fields: `expected_answer`, `confidence_*`, `extracted_t1`, `extracted_t2`, etc.
提供机构:
camilablank



