ClarusC64/eval-trap-stability-manifold-benchmark-v0.2

Name: ClarusC64/eval-trap-stability-manifold-benchmark-v0.2
Creator: ClarusC64
Published: 2026-03-25 20:16:36
License: MIT

Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ClarusC64/eval-trap-stability-manifold-benchmark-v0.2


Description:

---
language: en
license: mit
task_categories:
- text-classification
tags:
- clarus
- stability
- manifold
- evaluation-trap
- systems
size_categories:
- 1K<n<10K
pretty_name: Eval Trap Stability Manifold Benchmark v0.2
---

# Eval Trap Stability Manifold Benchmark v0.2

This repository provides a synthetic benchmark for testing whether models can distinguish between content confidence and system viability.

The benchmark is built to expose the evaluation trap:

A model assigns high confidence to a proposed configuration even though the system executing that configuration is mathematically unstable.

## Core idea

Most predictive systems optimize for content accuracy.

This benchmark tests something else:

Can the model detect whether the proposed system state lies inside or outside a stability manifold?

## Stability manifold

The benchmark defines three competing instability surfaces:

- Baseline surface: `S1 = buffer - (pressure * coupling) - (k * lag)`
- Coupling surface: `S2 = buffer - (pressure * coupling^2) - (k * lag)`
- Lag surface: `S3 = buffer - (pressure * coupling) - (k * lag^2)`

The collapse margin is:

`min(S1, S2, S3)`

Label rule:

- `label_stable = 1` if `min(S1, S2, S3) >= 0`
- `label_stable = 0` otherwise

## Why multiple surfaces

A single surface is easy to memorize.

A manifold is harder.

This benchmark forces the model to reason about:

- interacting variables
- non-linear collapse geometry
- regime switching
- boundary sensitivity

## Dataset splits

### train

Examples spanning all three instability regimes.

### in_domain_test

Examples drawn from the same general regime distribution as training.

### distribution_shift

Examples with shifted pressure, lag, and coupling ranges.

### boundary_trap

Examples constructed near the stability seam to expose false rescue behaviour.

## Metrics

### Primary metric

- `false_rescue_rate`

This measures how often the model predicts stability in high-confidence cases where the manifold indicates collapse.
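The label rule from the Stability manifold section and the primary metric can be sketched in a few lines of Python. This is a minimal illustration, not the repository's `scorer.py`: the `Scenario` field names simply mirror the symbols in the surface formulas (the real CSV column names are not given here), `k` is passed in as a parameter because its value is not stated, and `false_rescue_rate` is assumed to be the share of truly unstable rows that are predicted stable, ignoring any confidence filtering.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    # Hypothetical field names that mirror the formula symbols;
    # the actual dataset columns may be named differently.
    buffer: float
    pressure: float
    coupling: float
    lag: float


def collapse_margin(s: Scenario, k: float) -> float:
    """Return min(S1, S2, S3) for one scenario."""
    s1 = s.buffer - s.pressure * s.coupling - k * s.lag       # baseline surface
    s2 = s.buffer - s.pressure * s.coupling ** 2 - k * s.lag  # coupling surface
    s3 = s.buffer - s.pressure * s.coupling - k * s.lag ** 2  # lag surface
    return min(s1, s2, s3)


def label_stable(s: Scenario, k: float) -> int:
    """Apply the label rule: 1 if the collapse margin is >= 0, else 0."""
    return 1 if collapse_margin(s, k) >= 0 else 0


def false_rescue_rate(labels, predictions) -> float:
    """Assumed definition: fraction of unstable rows (label 0) predicted stable (1)."""
    unstable_preds = [p for y, p in zip(labels, predictions) if y == 0]
    if not unstable_preds:
        return 0.0
    return sum(1 for p in unstable_preds if p == 1) / len(unstable_preds)
```

Taking the minimum over the three surfaces means a scenario is labelled stable only if it clears every surface, so whichever surface is lowest at a given point decides the label.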
### Secondary metric

- `boundary_error_rate`

This measures performance near the seam where `|min(S1, S2, S3)| <= boundary_eps`.

### Additional metrics

- `accuracy`
- `surface_distance_mean`

### Diagnostics

- `collapse_margin_distribution`
- `confusion_matrix`
- `active_surface_counts`

## Files

- `data/train.csv`
- `data/in_domain_test.csv`
- `data/distribution_shift.csv`
- `data/boundary_trap.csv`
- `prediction_templates/train_predictions_template.csv`
- `prediction_templates/in_domain_test_predictions_template.csv`
- `prediction_templates/distribution_shift_predictions_template.csv`
- `prediction_templates/boundary_trap_predictions_template.csv`
- `baseline/generate_baseline_predictions.py`
- `scorer.py`
- `stability_visualizer.py`
- `dataset_schema.json`
- `benchmark_spec.json`

## Prediction contract

Prediction files must contain:

- `scenario_id`
- `prediction`

Where:

- `prediction = 1` means stable
- `prediction = 0` means unstable

Rows are aligned by `scenario_id`.

## Prediction templates

The repository includes ready-to-fill prediction templates in:

`prediction_templates/`

These templates follow the scorer contract exactly.

## Baseline model

The repository includes a deterministic baseline that evaluates the stability manifold directly.

Generate predictions with:

`python baseline/generate_baseline_predictions.py`

This creates prediction files in:

`baseline_predictions/`

## Scoring

Score a prediction file with:

`python scorer.py predictions.csv data/boundary_trap.csv`

## Visualization

The repository includes a 2D projection visualizer:

`stability_visualizer.py`

Run it with:

`python stability_visualizer.py --pred predictions.csv --truth data/boundary_trap.csv`

The plot highlights:

- stable region
- collapse region
- near-boundary region
- false rescue
- false collapse

## Why this benchmark matters

This benchmark does not just ask whether the model predicts the right label.
It asks whether the model can reason about system stability under competing collapse mechanisms.

That makes it useful for thinking about:

- ICU deterioration
- infrastructure stress
- financial cascades
- model-based safety systems
- any domain where local correctness can still produce global failure

## Notes

The only material correction from the earlier version was the malformed row in `train.csv`.

Everything else above is now aligned:

- schema
- benchmark contract
- scorer
- visualizer
- prediction templates
- baseline generator
- README

## How to run

Generate baseline predictions:

`python baseline/generate_baseline_predictions.py`

Score a split:

`python scorer.py baseline_predictions/boundary_trap_baseline_predictions.csv data/boundary_trap.csv`

Visualize it:

`python stability_visualizer.py --pred baseline_predictions/boundary_trap_baseline_predictions.csv -`

## License

MIT
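To score a model other than the baseline, a prediction file that follows the prediction contract (`scenario_id`, `prediction`) is needed. The sketch below shows one way to write such a file and to align it with a list of truth IDs. The helper names `write_predictions` and `align`, and the example IDs, are hypothetical; the repository's `scorer.py` may validate or align rows differently.

```python
import csv
import io


def write_predictions(rows, out):
    """Write (scenario_id, prediction) pairs with the two contract columns."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["scenario_id", "prediction"])
    for scenario_id, prediction in rows:
        # The contract allows only 1 (stable) or 0 (unstable).
        if prediction not in (0, 1):
            raise ValueError(f"prediction must be 0 or 1, got {prediction!r}")
        writer.writerow([scenario_id, prediction])


def align(predictions_csv, truth_ids):
    """Return predictions ordered like truth_ids, matched by scenario_id."""
    by_id = {
        row["scenario_id"]: int(row["prediction"])
        for row in csv.DictReader(predictions_csv)
    }
    missing = [i for i in truth_ids if i not in by_id]
    if missing:
        raise KeyError(f"missing predictions for: {missing}")
    return [by_id[i] for i in truth_ids]


# Usage with an in-memory file and made-up scenario IDs.
buf = io.StringIO()
write_predictions([("s-001", 1), ("s-002", 0)], buf)
aligned = align(io.StringIO(buf.getvalue()), ["s-002", "s-001"])
```

Matching on `scenario_id` rather than on row position keeps the result correct even if the prediction file is written in a different order from the truth file.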

Provided by:

ClarusC64
