antieval/plots_replication
收藏Hugging Face2026-04-22 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/antieval/plots_replication
下载链接
链接失效反馈官方服务:
资源简介:
# EvalAwareBench — Replication Report
Run one script, get a full HTML report with 8 paper-figures + 2 ANOVA tables,
every number computed from the CSVs and `.eval` logs in this repo (nothing
hardcoded).
## Layout
```
.
├── generate_report.py single script; generates plots + HTML
├── report.html example output (regenerated on run)
├── data/ all tabular inputs (~7 MB total)
│ ├── deploy_6x6_direct.csv per-transcript direct-probe P(eval)
│ ├── deploy_6x6_purpose.csv per-transcript purpose-probe P(eval)
│ ├── depth_resample_direct.csv direct-probe P(eval) at k∈{1,3,5,9}
│ ├── depth_resample_purpose.csv purpose-probe P(eval) at k∈{1,3,5,9}
│ ├── style_condA_model_response.csv App. B Cond A
│ ├── style_condB_judge_response.csv App. B Cond B
│ └── measure_all_scored.csv per-sample scored data (6 judges;
│ merged + retry-recovered + P(eval)
│ inverted + error sentinels filtered)
└── plots/ generated PNGs
```
## Quick start
```bash
# Clone
git clone https://huggingface.co/datasets/antieval/plots_replication
cd plots_replication
# Deps (python 3.10+)
pip install pandas numpy scikit-learn statsmodels matplotlib scipy
# or: uv run --with pandas --with scikit-learn --with statsmodels \
# --with matplotlib --with scipy python generate_report.py
# Run
python generate_report.py # opens report.html in browser
python generate_report.py --no-open # writes files, skips browser
```
## What each figure comes from
| Paper item | Uses |
|---|---|
| Figure 1 — 6×6 heatmap | `data/deploy_6x6_direct.csv` + `deploy_6x6_purpose.csv` (averaged) |
| Variance-components table | same 6×6 CSVs (two-way Type-II ANOVA on averaged p_eval) |
| Table 3 — η² vs depth | + `data/depth_resample_direct.csv` + `depth_resample_purpose.csv` |
| Figure 3 — η² vs depth | same as Table 3 |
| Figure 4 — P(eval) × depth × judge-family | same as Table 3 |
| Figure 8 (top) — AUROC by family | `data/measure_all_scored.csv` |
| Figure 8 (bottom) — P(eval) per benchmark | `data/measure_all_scored.csv` |
| App. B Figure 2 — style isolation | `data/style_cond{A,B}_*.csv` |
Figures Table 4 and Figure 7 (PQ-ensemble AUROC sweep) are listed in a
`Missing data` section at the end of the report — the corresponding
experiments were not part of this replication.
## Methodology notes
- **Probes averaged**: P(eval) per transcript = mean of direct + purpose probe
scores, following §3.5 of the paper. Inner-join drops transcripts missing
one probe; duplicates within a probe are pre-averaged so the join doesn't
blow up.
- **Score convention**: the raw `score_pq_xml` value in the `.eval` logs is
`0 = eval, 1 = deploy` with `-1` as an error sentinel. The script applies
`p_eval = 1 − raw` and drops `-1` rows, so figures match the paper's
convention (high P(eval) ⇒ eval).
- **Retry merge**: for 402-credit-exhausted samples in the original measure
logs (99 for claude-opus-4-6, 80 for gemini-2.5-pro, 83 for
gemini-3.1-pro-preview), the matched retry sample replaces the failed
original. 16 samples remain unrecoverable (3× `gpt-5.1` 400s, 2× gemini
`TypeError`, 11× gemini-3.1 `400 "Provider returned error"`) and are dropped.
- **Corrected AUROC**: ComBat-style z-score harmonisation on
`(generator_family × agentic)` with α=0.1 blend, matching the method in
`rb/repro_analysis/compute_corrected_auroc.py`.
提供机构:
antieval



