ClarusC64/eval-trap-stability-manifold-benchmark-v0.2
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ClarusC64/eval-trap-stability-manifold-benchmark-v0.2
Resource description:
---
language: en
license: mit
task_categories:
- text-classification
tags:
- clarus
- stability
- manifold
- evaluation-trap
- systems
size_categories:
- 1K<n<10K
pretty_name: Eval Trap Stability Manifold Benchmark v0.2
---
# Eval Trap Stability Manifold Benchmark v0.2
This repository provides a synthetic benchmark for testing whether models can distinguish between content confidence and system viability.
The benchmark is built to expose the evaluation trap:
A model assigns high confidence to a proposed configuration even though the system executing that configuration is mathematically unstable.
## Core idea
Most predictive systems optimize for content accuracy.
This benchmark tests something else:
Can the model detect whether the proposed system state lies inside or outside a stability manifold?
## Stability manifold
The benchmark defines three competing instability surfaces:
- Baseline surface
`S1 = buffer - (pressure * coupling) - (k * lag)`
- Coupling surface
`S2 = buffer - (pressure * coupling^2) - (k * lag)`
- Lag surface
`S3 = buffer - (pressure * coupling) - (k * lag^2)`
The collapse margin is:
`min(S1, S2, S3)`
Label rule:
- `label_stable = 1` if `min(S1, S2, S3) >= 0`
- `label_stable = 0` otherwise
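The label rule above can be sketched in a few lines of Python. This is a minimal illustration of the three formulas, not the repository's own implementation. The function names and the default `k=1.0` are assumptions, since the value of `k` is not stated here.

```python
def surfaces(buffer, pressure, coupling, lag, k=1.0):
    """Return (S1, S2, S3) from the formulas above.

    k=1.0 is an assumed placeholder; the benchmark's actual k is not given here.
    """
    s1 = buffer - (pressure * coupling) - (k * lag)
    s2 = buffer - (pressure * coupling**2) - (k * lag)
    s3 = buffer - (pressure * coupling) - (k * lag**2)
    return s1, s2, s3


def collapse_margin(buffer, pressure, coupling, lag, k=1.0):
    """Collapse margin: the smallest of the three surface values."""
    return min(surfaces(buffer, pressure, coupling, lag, k))


def label_stable(buffer, pressure, coupling, lag, k=1.0):
    """1 if the collapse margin is non-negative, 0 otherwise."""
    return 1 if collapse_margin(buffer, pressure, coupling, lag, k) >= 0 else 0


# The baseline surface alone looks safe (S1 = 1.0), but the coupling
# surface is negative (S2 = -0.5), so the state is labelled unstable.
print(surfaces(5, 2, 1.5, 1))      # (1.0, -0.5, 1.0)
print(label_stable(5, 2, 1.5, 1))  # 0
```

The example shows why a single surface is not enough: a state can sit comfortably inside one surface and still fall outside the manifold.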
## Why multiple surfaces
A single surface is easy to memorize.
A manifold is harder.
This benchmark forces the model to reason about:
- interacting variables
- non-linear collapse geometry
- regime switching
- boundary sensitivity
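Regime switching follows directly from the three formulas. Assuming positive `pressure` and `k` and non-negative `coupling` and `lag`, `S2` drops below `S1` only when `coupling > 1`, and `S3` drops below `S1` only when `lag > 1`. The sketch below names the surface that sets the collapse margin. The function name and the default `k=1.0` are assumptions for illustration.

```python
def active_surface(pressure, coupling, lag, k=1.0):
    """Name the surface with the smallest value for the given state.

    buffer is omitted because it shifts all three surfaces by the same
    amount, so it never changes which surface is lowest.
    k=1.0 is an assumed placeholder value.
    """
    loads = {
        "S1": pressure * coupling + k * lag,
        "S2": pressure * coupling**2 + k * lag,
        "S3": pressure * coupling + k * lag**2,
    }
    # The lowest surface is the one with the largest subtracted load.
    # On ties, max() returns the first-listed surface.
    return max(loads, key=loads.get)


print(active_surface(2, 1.5, 0.5))  # S2: coupling above 1 dominates
print(active_surface(2, 0.5, 2.0))  # S3: lag above 1 dominates
print(active_surface(2, 0.5, 0.5))  # S1: neither squared term grows
```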
## Dataset splits
### train
Examples spanning all three instability regimes.
### in_domain_test
Examples drawn from the same general regime distribution as training.
### distribution_shift
Examples with shifted pressure, lag, and coupling ranges.
### boundary_trap
Examples constructed near the stability seam to expose false rescue behaviour.
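To make "near the stability seam" concrete, the sketch below solves the collapse-margin formula for `buffer` so that a state lands at a chosen signed distance from the seam. It is only an illustration derived from the surface definitions, not the procedure used to build this split. The function names and the default `k=1.0` are assumptions.

```python
def collapse_margin(buffer, pressure, coupling, lag, k=1.0):
    """min(S1, S2, S3) for the three surfaces defined in this README."""
    return min(
        buffer - pressure * coupling - k * lag,
        buffer - pressure * coupling**2 - k * lag,
        buffer - pressure * coupling - k * lag**2,
    )


def buffer_for_margin(pressure, coupling, lag, margin, k=1.0):
    """Return the buffer that puts the state at the requested margin.

    Since every surface is `buffer - load`, the collapse margin equals
    buffer minus the largest load, so buffer = largest load + margin.
    """
    worst_load = max(
        pressure * coupling + k * lag,
        pressure * coupling**2 + k * lag,
        pressure * coupling + k * lag**2,
    )
    return worst_load + margin


# A state just outside the manifold: margin of -0.01 (hypothetical value).
b = buffer_for_margin(pressure=2, coupling=1.5, lag=1, margin=-0.01)
print(round(collapse_margin(b, 2, 1.5, 1), 6))  # -0.01
```

A model that predicts stability for such a state commits a false rescue, even though the state differs only slightly from a stable one.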
## Metrics
### Primary metric
- `false_rescue_rate`
This measures how often the model predicts stability in high-confidence cases where the manifold indicates collapse.
### Secondary metric
- `boundary_error_rate`
This measures the error rate on examples near the seam, where `|min(S1, S2, S3)| <= boundary_eps`.
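A hedged sketch of how the two headline metrics could be computed is shown below. It is not taken from `scorer.py`. The denominators, the handling of empty groups, and the default `boundary_eps=0.05` are all assumptions. The prediction contract carries no confidence field, so this sketch counts every stable prediction on a collapsing case as a false rescue.

```python
def false_rescue_rate(y_true, y_pred):
    """Fraction of truly unstable cases (label 0) predicted stable (1).

    Assumed definition: the denominator is the number of unstable cases.
    Returns 0.0 when there are no unstable cases.
    """
    preds_on_unstable = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_on_unstable:
        return 0.0
    return sum(p == 1 for p in preds_on_unstable) / len(preds_on_unstable)


def boundary_error_rate(y_true, y_pred, margins, boundary_eps=0.05):
    """Error rate restricted to cases with |collapse margin| <= boundary_eps.

    boundary_eps=0.05 is a placeholder, not the benchmark's value.
    Returns 0.0 when no case falls inside the band.
    """
    near = [(t, p) for t, p, m in zip(y_true, y_pred, margins)
            if abs(m) <= boundary_eps]
    if not near:
        return 0.0
    return sum(t != p for t, p in near) / len(near)


y_true = [0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1]
margins = [-0.01, -2.0, 0.02, 3.0, -0.5]
print(false_rescue_rate(y_true, y_pred))             # 2 of 3 unstable cases
print(boundary_error_rate(y_true, y_pred, margins))  # 1 of 2 near-seam cases
```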
### Additional metrics
- `accuracy`
- `surface_distance_mean`
### Diagnostics
- `collapse_margin_distribution`
- `confusion_matrix`
- `active_surface_counts`
## Files
- `data/train.csv`
- `data/in_domain_test.csv`
- `data/distribution_shift.csv`
- `data/boundary_trap.csv`
- `prediction_templates/train_predictions_template.csv`
- `prediction_templates/in_domain_test_predictions_template.csv`
- `prediction_templates/distribution_shift_predictions_template.csv`
- `prediction_templates/boundary_trap_predictions_template.csv`
- `baseline/generate_baseline_predictions.py`
- `scorer.py`
- `stability_visualizer.py`
- `dataset_schema.json`
- `benchmark_spec.json`
## Prediction contract
Prediction files must contain:
- `scenario_id`
- `prediction`
Where:
- `prediction = 1` means stable
- `prediction = 0` means unstable
Rows are aligned by `scenario_id`.
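The contract above can be checked before scoring. The sketch below reads a prediction file, validates the two required columns and the 0/1 values, and reorders predictions by `scenario_id`. The helper names are hypothetical and the checks are an assumption about what a reasonable validator would do, not a description of `scorer.py`.

```python
import csv
import io


def load_predictions(fileobj):
    """Read a prediction CSV into {scenario_id: prediction}."""
    reader = csv.DictReader(fileobj)
    missing = {"scenario_id", "prediction"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    preds = {}
    for row in reader:
        value = int(row["prediction"])
        if value not in (0, 1):
            raise ValueError(f"prediction must be 0 or 1, got {value}")
        preds[row["scenario_id"]] = value
    return preds


def align(preds, truth_ids):
    """Return predictions in the order of truth_ids; fail on missing ids."""
    absent = [sid for sid in truth_ids if sid not in preds]
    if absent:
        raise KeyError(f"no prediction for: {absent}")
    return [preds[sid] for sid in truth_ids]


# Row order in the prediction file does not matter; scenario_id does.
# The ids "a" and "b" are made up for this example.
text = "scenario_id,prediction\nb,0\na,1\n"
preds = load_predictions(io.StringIO(text))
print(align(preds, ["a", "b"]))  # [1, 0]
```

To validate a real file, pass an open file handle instead of the `io.StringIO` object.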
## Prediction templates
The repository includes ready-to-fill prediction templates in:
`prediction_templates/`
These templates follow the scorer contract exactly.
## Baseline model
The repository includes a deterministic baseline that evaluates the stability manifold directly.
Generate predictions with:
`python baseline/generate_baseline_predictions.py`
This creates prediction files in:
`baseline_predictions/`
## Scoring
Score a prediction file with:
`python scorer.py predictions.csv data/boundary_trap.csv`
## Visualization
The repository includes a 2D projection visualizer:
`stability_visualizer.py`
Run it with:
`python stability_visualizer.py --pred predictions.csv --truth data/boundary_trap.csv`
The plot highlights:
- stable region
- collapse region
- near-boundary region
- false rescue
- false collapse
## Why this benchmark matters
This benchmark does not just ask whether the model predicts the right label.
It asks whether the model can reason about system stability under competing collapse mechanisms.
That makes it useful for thinking about:
- ICU deterioration
- infrastructure stress
- financial cascades
- model-based safety systems
- any domain where local correctness can still produce global failure
## Notes
The only material correction from the earlier version was the malformed row in `train.csv`.
Everything else above is now aligned:
- schema
- benchmark contract
- scorer
- visualizer
- prediction templates
- baseline generator
- README
## How to run
Generate baseline predictions:
`python baseline/generate_baseline_predictions.py`
Score a split:
`python scorer.py baseline_predictions/boundary_trap_baseline_predictions.csv data/boundary_trap.csv`
Visualize it:
`python stability_visualizer.py --pred baseline_predictions/boundary_trap_baseline_predictions.csv -`
## License
MIT
Provided by:
ClarusC64



