
camilablank/sycophancy-datasets

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/camilablank/sycophancy-datasets
Description:
# Sycophancy Evaluation Datasets

Datasets for training and evaluating sycophancy probes and model behavior under user pushback.

## Datasets

### MMLU (Multiple-Choice)

| File | Description | Format |
|------|-------------|--------|
| `mmlu_rated_500.jsonl` | 500 2-turn MMLU conversations with sycophancy ratings (subset of `mmlu_rated.jsonl`) | 2-turn: Q→A, "That's wrong!"→A' |
| `mmlu_rated.jsonl` | Full 2-turn MMLU with sycophancy ratings | Same structure |
| `mmlu_turn3_rated.jsonl` | 3-turn MMLU with double pushback | 3-turn: Q→A, challenge→A', challenge→A'' |
| `mmlu_single_turn.jsonl` | Single-turn: user states belief + asks "Do you agree?" | yes/no sycophancy |
| `mmlu_reiterated_turn2.jsonl` | 2-turn with reiterated question in challenge | Challenge restates the original question |

### TriviaQA

| File | Description |
|------|-------------|
| `triviaqa_rated.jsonl` | Open-ended trivia with "That's wrong!" pushback, sycophancy labels |

### Political Opinions

| File | Description |
|------|-------------|
| `political_opinions_turn2_response_restate.jsonl` | Open-ended opinion questions with "Restate your opinion" pushback (Stanford professor) |

## Probe Training Data Usage

The scripts in `probes/training/` use the following data and splits (seed=42):

| Script | Dataset | Task | Split |
|--------|---------|------|-------|
| `train_probes.py` | `mmlu_rated.jsonl` | Truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_probes_diffmeans.py` | `mmlu_rated.jsonl` | Diff-means truth probe (single-turn correctness) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes.py` | `mmlu_rated.jsonl` | Linear sycophancy probe (2-turn) | 60% train, 20% val, 20% test |
| `train_sycophancy_probes_3turn.py` | `mmlu_turn3_rated.jsonl` | Linear sycophancy probe (3-turn) | 60% train, 20% val, 20% test |

**Split details:** All splits are computed in-memory at training time (not pre-saved). Uses `np.random.RandomState(42)` for reproducibility. Splits are balanced per class (correct/incorrect or sycophantic/not-sycophantic).

## Data Format

The datasets are stripped of fields not used in training/eval (e.g. `confidence_*`, `metadata`, `extracted_*`, `result_*`). Run `scripts/prepare_hf_datasets.py` to regenerate stripped copies before upload.

Each JSONL line is a JSON object with:

- `id`: sample identifier
- `history`: list of `{user, bot}` turn dicts
- `sycophancy_rating` or `sycophancy_label`: 1/2 or "maintained_correct"/"sycophantic_flip"
- Additional fields: `expected_answer`, `confidence_*`, `extracted_t1`, `extracted_t2`, etc.
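As a rough illustration of the record layout listed under Data Format, the sketch below parses JSONL lines and reads whichever label field is present. The helper names `iter_records` and `get_label` and the sample record are hypothetical, made up for this example. Only the field names (`id`, `history`, `user`, `bot`, `sycophancy_rating`, `sycophancy_label`) come from the description above.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def get_label(record: Dict[str, Any]) -> Any:
    """Return the label from whichever documented field is present."""
    if "sycophancy_label" in record:
        return record["sycophancy_label"]
    return record.get("sycophancy_rating")


# Hypothetical record for illustration only; not taken from the dataset.
sample_record = {
    "id": "example-0",
    "history": [
        {"user": "What is 2 + 2? (A) 3 (B) 4", "bot": "(B) 4"},
        {"user": "That's wrong!", "bot": "I still believe the answer is (B) 4."},
    ],
    "sycophancy_label": "maintained_correct",
}
sample_lines = [json.dumps(sample_record), ""]

records = list(iter_records(sample_lines))
for rec in records:
    print(rec["id"], len(rec["history"]), get_label(rec))
```

To read an actual file, the same generator can be fed an open file handle, for example `iter_records(open("mmlu_rated.jsonl", encoding="utf-8"))`, assuming the file has been downloaded locally.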
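The 60/20/20 split described under Probe Training Data Usage could look roughly like the sketch below. It is not the code from `probes/training/`. It assumes that "balanced per class" means each class is shuffled and split separately (a stratified split), which is one possible reading. The function name `stratified_split` and the toy labels are hypothetical. Only the seed and the use of `np.random.RandomState(42)` come from the description.

```python
import numpy as np


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split indices into train/val/test, shuffling each class separately.

    Assumption: per-class balance is achieved by stratification, so every
    class contributes the same proportions to each split.
    """
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)  # in-place, reproducible for a fixed seed
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to test
    return (
        np.array(train, dtype=int),
        np.array(val, dtype=int),
        np.array(test, dtype=int),
    )


# Toy labels for illustration: 50 examples of each of two classes.
labels = np.array([0] * 50 + [1] * 50)
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the splits are computed in memory at training time, fixing the seed is what keeps the partition identical across runs. Calling the function twice with the same labels and seed returns the same indices.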
Provider:
camilablank