notdaniel1234/sycophancy-guard

Name: notdaniel1234/sycophancy-guard
Creator: notdaniel1234
Published: 2026-04-27 02:26:31
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/notdaniel1234/sycophancy-guard


Description:

---
license: mit
language:
- en
task_categories:
- text-classification
tags:
- sycophancy
- alignment
- interpretability
- probing
- llm-evaluation
- truthfulqa
- llama-3
pretty_name: Sycophancy Guard — TruthfulQA + Hidden States
size_categories:
- 1K<n<10K
---

# Sycophancy Guard

Inference-time linear probe on Llama-3-8B hidden states for classifying **regressive** vs. **progressive** sycophancy in single-turn pushback dialogues. This repo contains the data and scripts used in the accompanying paper.

## Definitions

- **Regressive sycophancy**: the model flips from a correct answer to agree with a user pushback that asserts a *factually wrong* claim.
- **Progressive sycophancy**: the model flips toward agreement with a user pushback that asserts a *factually correct* claim (a desirable update).
- **No-change**: the model holds its position; treated as neutral.

The probe is trained on TruthfulQA-derived pairs labeled by *behavior* (2×2 matrix of `user_correct × model_agrees`), evaluated on a held-out TruthfulQA test split, and then evaluated zero-shot OOD on the Anthropic sycophancy-eval `answer.jsonl` and `are_you_sure.jsonl` datasets.

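The 2×2 behavior matrix described above can be sketched as a small labeling function. This is a minimal illustration, not the repo's labeling code: the function name `behavior_label` and the label strings are hypothetical, and the sketch assumes that `model_agrees` already means "the model changed its answer to side with the pushback", so both cells with `model_agrees=False` collapse to no-change.

```python
from typing import Literal

# Label strings are illustrative; the shipped JSONL files may use different names.
Label = Literal["regressive", "progressive", "no_change"]


def behavior_label(user_correct: bool, model_agrees: bool) -> Label:
    """Map one cell of the user_correct x model_agrees matrix to a label."""
    if not model_agrees:
        # The model held its position, whatever the user claimed.
        return "no_change"
    # The model sided with the pushback: good if the user was right, bad if not.
    return "progressive" if user_correct else "regressive"


# Enumerate all four cells of the matrix.
cells = {
    (user_correct, model_agrees): behavior_label(user_correct, model_agrees)
    for user_correct in (True, False)
    for model_agrees in (True, False)
}
```

Under these assumptions only the `(False, True)` cell is regressive, which is the case the probe is meant to flag.
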
## Repo structure

```
data/
  processed/
    truthfulqa_pairs_unified.jsonl               ← 888 pairs, unified prompt template
    truthfulqa_pairs_behavior_labeled.jsonl      ← 888 pairs + Llama-3 turn-4 + Claude judgment
    syceval_pairs_unified.jsonl                  ← Anthropic syceval reformatted to our template
    hidden_states/
      layer_25.npy                               ← best probe layer, shape (888, 4096), float16
      metadata.json                              ← row-order, label, question_id mapping
  splits/
    truthfulqa_{train,val,test}.jsonl            ← Tier-1 (scenario-labeled) splits
    truthfulqa_behavior_{train,val,test}.jsonl   ← Tier-2 (behavior-labeled) splits, used for the probe
    splits_manifest_{truthfulqa,behavior,studychat}.json
  second_pushback/
    second_pushback_labeled.jsonl                ← 6-turn extension (turn-6 pushback)
    pushback_*.json/png/jsonl                    ← analysis artifacts
outputs/
  probe/
    best_probe.pkl                               ← sklearn LogisticRegression, layer 25
    eval_test_metrics.json                       ← test AUROC, PR-AUC, Acc, F1 + 95% CIs
scripts/                                         ← all reproduction code
notebooks/colab_tier2_inference.ipynb            ← Colab notebook for the inference pipeline
requirements.txt
```

## What is *not* included, and why

- **Raw StudyChat** (`wmcnicho/StudyChat`): gated on Hugging Face; we cannot redistribute. Use `scripts/setup_and_download.py` to fetch your own copy after accepting terms.
- **Reconstructed StudyChat splits** (`studychat_{train,val,test}.jsonl`): contain gated content. The manifest with userIds-per-split *is* shipped (`splits/splits_manifest_studychat.json`); rebuild with `scripts/reconstruct_studychat_artifacts.py`.
- **Labeling queue** (`labeling_queue.csv`): 200 hand-labeling rows derived from StudyChat student turns; gated. Rebuild with `scripts/build_labeling_queue.py` after reconstructing the splits.
- **Hidden states for layers ≠ 25**: too large to host (~230 MB across 33 layers). Layer 25 is the chosen probe layer; regenerate the rest with `scripts/run_inference_and_label.py`.
- **Raw TruthfulQA / raw Anthropic sycophancy-eval**: fetch directly from the source (see citations).
  Only derived pair files are shipped here.

## Reproducing the pipeline

### A. From scratch (re-run all inference)

```bash
pip install -r requirements.txt

# 1. Build unified TruthfulQA prompt pairs
python scripts/truthfulqa_audit_and_construct.py
python scripts/rebuild_pairs_unified_template.py

# 2. Run Llama-3-8B inference (requires GPU)
#    Easiest via: notebooks/colab_tier2_inference.ipynb
python scripts/run_inference_and_label.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --pairs data/processed/truthfulqa_pairs_unified.jsonl \
  --out data/processed/truthfulqa_pairs_behavior_labeled.jsonl \
  --hidden-states-dir data/processed/hidden_states/

# 3. Build behavior-labeled splits
python scripts/partition_behavior_labeled.py

# 4. Train + evaluate probe
python scripts/train_probe.py
python scripts/eval_probe.py
```

### B. From shipped layer-25 hidden states (skip inference)

```bash
pip install -r requirements.txt
python scripts/train_probe.py
python scripts/eval_probe.py
# Or load outputs/probe/best_probe.pkl directly to skip training
```

### C. StudyChat hand-labeling (optional, gated)

```bash
# 1. Accept terms at https://huggingface.co/datasets/wmcnicho/StudyChat
HF_TOKEN=hf_... python scripts/setup_and_download.py
python scripts/reconstruct_studychat_artifacts.py
python scripts/build_labeling_queue.py
```

## Citations

- **TruthfulQA**: Lin, Hilton, Evans (2021). *TruthfulQA: Measuring How Models Mimic Human Falsehoods.*
- **Anthropic sycophancy-eval**: Sharma et al. (2023). *Towards Understanding Sycophancy in Language Models.* GitHub: `meg-tong/sycophancy-eval`.
- **StudyChat**: McNichols et al. *StudyChat.* HF: `wmcnicho/StudyChat`. Gated.
- **Llama-3-8B-Instruct**: Meta AI. HF: `meta-llama/Meta-Llama-3-8B-Instruct`.

## License

MIT. Underlying datasets retain their original licenses; StudyChat is gated and not redistributed.

## Contact

Daniel Wang (Princeton IW, 2026).
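
## Example: reading the 95% CIs

`eval_test_metrics.json` reports test metrics with 95% confidence intervals, but this card does not say how those intervals are computed. One common choice is a percentile bootstrap over test examples; the sketch below shows that technique for accuracy only. It is an assumption about the general approach, not a description of `scripts/eval_probe.py`, and the function name `bootstrap_accuracy_ci` and its parameters are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, ci_low, ci_high) using a percentile bootstrap."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point = sum(correct) / n

    # Resample examples with replacement and recompute accuracy each time.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))

    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled accuracies.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return point, stats[lo_idx], stats[hi_idx]


# Toy labels for illustration only; these are not results from the dataset.
toy_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
acc, ci_low, ci_high = bootstrap_accuracy_ci(toy_true, toy_pred)
```

The same resampling loop would apply to AUROC or F1 by swapping the per-resample statistic, although rank-based metrics need the resampled scores rather than a per-example correctness flag.
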

Provided by:

notdaniel1234
