notdaniel1234/sycophancy-guard
Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/notdaniel1234/sycophancy-guard
Resource description:
---
license: mit
language:
- en
task_categories:
- text-classification
tags:
- sycophancy
- alignment
- interpretability
- probing
- llm-evaluation
- truthfulqa
- llama-3
pretty_name: Sycophancy Guard — TruthfulQA + Hidden States
size_categories:
- 1K<n<10K
---
# Sycophancy Guard
An inference-time linear probe on Llama-3-8B-Instruct hidden states that
classifies **regressive** vs. **progressive** sycophancy in single-turn
pushback dialogues. This repo contains the data and scripts used in the
accompanying paper.
## Definitions
- **Regressive sycophancy** — the model flips from a correct answer to agree
with a user pushback that asserts a *factually wrong* claim.
- **Progressive sycophancy** — the model flips toward agreement with a user
pushback that asserts a *factually correct* claim (a desirable update).
- **No-change** — the model holds its position; treated as neutral.
The probe is trained on TruthfulQA-derived pairs labeled by *behavior*
(a 2×2 matrix of `user_correct × model_agrees`), evaluated on a held-out
TruthfulQA test split, and then evaluated zero-shot out of distribution (OOD)
on the Anthropic sycophancy-eval `answer.jsonl` and `are_you_sure.jsonl`
datasets.
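The mapping from the 2×2 matrix to the three definitions above can be sketched as follows. This is a minimal illustration, not the repo's labeling code: the function name `behavior_label` and the label strings are assumptions, and only the `user_correct` and `model_agrees` axes come from the text.

```python
from typing import Literal

# Label strings are illustrative; the shipped files may encode labels differently.
Label = Literal["regressive", "progressive", "no_change"]


def behavior_label(user_correct: bool, model_agrees: bool) -> Label:
    """Map one cell of the user_correct x model_agrees matrix to a behavior label."""
    if not model_agrees:
        # The model held its position, whatever the user claimed.
        return "no_change"
    # The model moved toward the user's claim: desirable only if the claim is correct.
    return "progressive" if user_correct else "regressive"
```

Both cells with `model_agrees=False` collapse into the neutral no-change class, so the matrix yields three labels rather than four.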
## Repo structure
```
data/
  processed/
    truthfulqa_pairs_unified.jsonl             ← 888 pairs, unified prompt template
    truthfulqa_pairs_behavior_labeled.jsonl    ← 888 pairs + Llama-3 turn-4 + Claude judgment
    syceval_pairs_unified.jsonl                ← Anthropic syceval reformatted to our template
    hidden_states/
      layer_25.npy                             ← best probe layer, shape (888, 4096), float16
      metadata.json                            ← row-order, label, question_id mapping
  splits/
    truthfulqa_{train,val,test}.jsonl          ← Tier-1 (scenario-labeled) splits
    truthfulqa_behavior_{train,val,test}.jsonl ← Tier-2 (behavior-labeled) splits; used for probe
    splits_manifest_{truthfulqa,behavior,studychat}.json
  second_pushback/
    second_pushback_labeled.jsonl              ← 6-turn extension (turn-6 pushback)
    pushback_*.json/png/jsonl                  ← analysis artifacts
outputs/
  probe/
    best_probe.pkl                             ← sklearn LogisticRegression, layer 25
    eval_test_metrics.json                     ← test AUROC, PR-AUC, Acc, F1 + 95% CIs
scripts/                                       ← all reproduction code
notebooks/colab_tier2_inference.ipynb          ← Colab notebook for the inference pipeline
requirements.txt
```
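As a quick sanity check on the shipped arrays, the layer-25 hidden states and their metadata can be loaded as sketched below. The directory `data/processed/hidden_states/` is taken from the `--hidden-states-dir` argument in the reproduction commands. The helper name `load_layer` is hypothetical, and because the schema of `metadata.json` is not documented here, the sketch returns it unparsed.

```python
import json
from pathlib import Path

import numpy as np


def load_layer(hidden_dir, layer=25):
    """Load one layer's hidden states plus the row metadata (hypothetical helper)."""
    hidden_dir = Path(hidden_dir)
    hidden = np.load(hidden_dir / f"layer_{layer}.npy")
    with open(hidden_dir / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)  # schema not documented here; returned as-is
    if hidden.ndim != 2:
        raise ValueError(f"expected a 2-D (n_pairs, hidden_dim) array, got {hidden.shape}")
    # Arrays are stored as float16; cast up before any downstream arithmetic.
    return hidden.astype(np.float32), metadata
```

For the shipped file, `load_layer("data/processed/hidden_states")` should return an array of shape `(888, 4096)` if the listing above is accurate.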
## What is *not* included, and why
- **Raw StudyChat** (`wmcnicho/StudyChat`) — gated on Hugging Face;
we cannot redistribute. Use `scripts/setup_and_download.py` to fetch your
own copy after accepting terms.
- **Reconstructed StudyChat splits** (`studychat_{train,val,test}.jsonl`) —
contain gated content. The manifest with userIds-per-split *is* shipped
(`splits/splits_manifest_studychat.json`); rebuild with
`scripts/reconstruct_studychat_artifacts.py`.
- **Labeling queue** (`labeling_queue.csv`) — 200 hand-labeling rows derived
from StudyChat student turns; gated. Rebuild with
`scripts/build_labeling_queue.py` after reconstructing the splits.
- **Hidden states for layers ≠ 25** — too large to host (~230 MB across
33 layers). Layer 25 is the chosen probe layer; regenerate the rest with
`scripts/run_inference_and_label.py`.
- **Raw TruthfulQA / raw Anthropic sycophancy-eval** — fetch directly from
the source (see citations). Only derived pair files are shipped here.
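The size figure for the hidden states can be checked with a few lines of arithmetic, using only the numbers stated above (888 pairs, hidden size 4096, float16, 33 layers). The small `.npy` header is ignored.

```python
n_pairs = 888          # rows in each layer file
hidden_dim = 4096      # hidden size per row
bytes_per_value = 2    # float16
n_layers = 33          # layer count quoted above

bytes_per_layer = n_pairs * hidden_dim * bytes_per_value
total_bytes = bytes_per_layer * n_layers

print(f"per layer:  {bytes_per_layer / 2**20:.1f} MiB")   # per layer:  6.9 MiB
print(f"all layers: {total_bytes / 2**20:.1f} MiB")       # all layers: 228.9 MiB
```

The total matches the quoted ~230 MB when read as MiB; in decimal megabytes it is about 240 MB.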
## Reproducing the pipeline
### A. From scratch (re-run all inference)
```bash
pip install -r requirements.txt
# 1. Build unified TruthfulQA prompt pairs
python scripts/truthfulqa_audit_and_construct.py
python scripts/rebuild_pairs_unified_template.py
# 2. Run Llama-3-8B inference — requires GPU
# Easiest via: notebooks/colab_tier2_inference.ipynb
python scripts/run_inference_and_label.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --pairs data/processed/truthfulqa_pairs_unified.jsonl \
    --out data/processed/truthfulqa_pairs_behavior_labeled.jsonl \
    --hidden-states-dir data/processed/hidden_states/
# 3. Build behavior-labeled splits
python scripts/partition_behavior_labeled.py
# 4. Train + evaluate probe
python scripts/train_probe.py
python scripts/eval_probe.py
```
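The evaluation step reports test AUROC with 95% CIs. This card does not say how `scripts/eval_probe.py` computes the intervals, so the following is only a sketch of one common choice, a percentile bootstrap over test examples. The function names and defaults (`n_boot=1000`, `seed=0`) are assumptions.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic, using average ranks for ties."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_pos, n_neg = int(y_true.sum()), int((~y_true).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative example")
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)                  # last 1-based rank of each tied group
    avg_rank = ends - (counts - 1) / 2.0      # mean rank within each tied group
    ranks = avg_rank[inverse]
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) from a percentile bootstrap."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    point = auroc(y_true, scores)             # also validates that both classes exist
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)      # resample examples with replacement
        if y_true[idx].all() or not y_true[idx].any():
            continue                          # skip resamples missing a class
        stats.append(auroc(y_true[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return point, float(lo), float(hi)
```

With a small test split, percentile intervals can be wide; the reported CIs in `eval_test_metrics.json` remain the reference values.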
### B. From shipped layer-25 hidden states (skip inference)
```bash
pip install -r requirements.txt
python scripts/train_probe.py
python scripts/eval_probe.py
# Or load outputs/probe/best_probe.pkl directly to skip training
```
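Loading the shipped probe directly might look like the sketch below. It assumes `best_probe.pkl` was written with the standard-library `pickle` module; if it was saved with `joblib`, use `joblib.load` instead. The helper names are hypothetical, and which class is treated as positive is not stated here, so check `probe.classes_` before interpreting the scores.

```python
import pickle

import numpy as np


def load_probe(path):
    """Unpickle the probe. Only unpickle files you trust; pickle can run code."""
    with open(path, "rb") as f:
        return pickle.load(f)


def score(probe, hidden):
    """Return the probability assigned to probe.classes_[1] for each row."""
    X = np.asarray(hidden, dtype=np.float32)  # stored as float16; cast up first
    return probe.predict_proba(X)[:, 1]
```

A typical call would be `score(load_probe("outputs/probe/best_probe.pkl"), np.load("data/processed/hidden_states/layer_25.npy"))`, assuming the probe was trained on features in the same row layout as the shipped array.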
### C. StudyChat hand-labeling (optional, gated)
```bash
# 1. Accept terms at https://huggingface.co/datasets/wmcnicho/StudyChat
HF_TOKEN=hf_... python scripts/setup_and_download.py
python scripts/reconstruct_studychat_artifacts.py
python scripts/build_labeling_queue.py
```
## Citations
- **TruthfulQA**: Lin, Hilton, Evans (2021). *TruthfulQA: Measuring How Models Mimic Human Falsehoods.*
- **Anthropic sycophancy-eval**: Sharma et al. (2023). *Towards Understanding Sycophancy in Language Models.* GitHub: `meg-tong/sycophancy-eval`.
- **StudyChat**: McNichols et al. *StudyChat.* HF: `wmcnicho/StudyChat`. Gated.
- **Llama-3-8B-Instruct**: Meta AI. HF: `meta-llama/Meta-Llama-3-8B-Instruct`.
## License
MIT. Underlying datasets retain their original licenses; StudyChat is gated and not redistributed.
## Contact
Daniel Wang (Princeton IW, 2026).
Provider:
notdaniel1234



