antieval/plots_replication

Name: antieval/plots_replication
Creator: antieval
Published: 2026-04-22 20:01:06
License: 暂无描述

Hugging Face2026-04-22 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/antieval/plots_replication

下载链接

链接失效反馈

官方服务：

资源简介：

# EvalAwareBench — Replication Report Run one script, get a full HTML report with 8 paper-figures + 2 ANOVA tables, every number computed from the CSVs and `.eval` logs in this repo (nothing hardcoded). ## Layout ``` . ├── generate_report.py single script; generates plots + HTML ├── report.html example output (regenerated on run) ├── data/ all tabular inputs (~7 MB total) │ ├── deploy_6x6_direct.csv per-transcript direct-probe P(eval) │ ├── deploy_6x6_purpose.csv per-transcript purpose-probe P(eval) │ ├── depth_resample_direct.csv direct-probe P(eval) at k∈{1,3,5,9} │ ├── depth_resample_purpose.csv purpose-probe P(eval) at k∈{1,3,5,9} │ ├── style_condA_model_response.csv App. B Cond A │ ├── style_condB_judge_response.csv App. B Cond B │ └── measure_all_scored.csv per-sample scored data (6 judges; │ merged + retry-recovered + P(eval) │ inverted + error sentinels filtered) └── plots/ generated PNGs ``` ## Quick start ```bash # Clone git clone https://huggingface.co/datasets/antieval/plots_replication cd plots_replication # Deps (python 3.10+) pip install pandas numpy scikit-learn statsmodels matplotlib scipy # or: uv run --with pandas --with scikit-learn --with statsmodels \ # --with matplotlib --with scipy python generate_report.py # Run python generate_report.py # opens report.html in browser python generate_report.py --no-open # writes files, skips browser ``` ## What each figure comes from | Paper item | Uses | |---|---| | Figure 1 — 6×6 heatmap | `data/deploy_6x6_direct.csv` + `deploy_6x6_purpose.csv` (averaged) | | Variance-components table | same 6×6 CSVs (two-way Type-II ANOVA on averaged p_eval) | | Table 3 — η² vs depth | + `data/depth_resample_direct.csv` + `depth_resample_purpose.csv` | | Figure 3 — η² vs depth | same as Table 3 | | Figure 4 — P(eval) × depth × judge-family | same as Table 3 | | Figure 8 (top) — AUROC by family | `data/measure_all_scored.csv` | | Figure 8 (bottom) — P(eval) per benchmark | `data/measure_all_scored.csv` | | App. B Figure 2 — style isolation | `data/style_cond{A,B}_*.csv` | Figures Table 4 and Figure 7 (PQ-ensemble AUROC sweep) are listed in a `Missing data` section at the end of the report — the corresponding experiments were not part of this replication. ## Methodology notes - **Probes averaged**: P(eval) per transcript = mean of direct + purpose probe scores, following §3.5 of the paper. Inner-join drops transcripts missing one probe; duplicates within a probe are pre-averaged so the join doesn't blow up. - **Score convention**: the raw `score_pq_xml` value in the `.eval` logs is `0 = eval, 1 = deploy` with `-1` as an error sentinel. The script applies `p_eval = 1 − raw` and drops `-1` rows, so figures match the paper's convention (high P(eval) ⇒ eval). - **Retry merge**: for 402-credit-exhausted samples in the original measure logs (99 for claude-opus-4-6, 80 for gemini-2.5-pro, 83 for gemini-3.1-pro-preview), the matched retry sample replaces the failed original. 16 samples remain unrecoverable (3× `gpt-5.1` 400s, 2× gemini `TypeError`, 11× gemini-3.1 `400 "Provider returned error"`) and are dropped. - **Corrected AUROC**: ComBat-style z-score harmonisation on `(generator_family × agentic)` with α=0.1 blend, matching the method in `rb/repro_analysis/compute_corrected_auroc.py`.

提供机构：

antieval

5,000+

优质数据集

54 个

任务类型

进入经典数据集