ClarusC64/system-stability-collapse-benchmark-casses-v0.1
Source: Hugging Face. Updated 2026-03-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ClarusC64/system-stability-collapse-benchmark-casses-v0.1
Resource description:
---
language: en
license: mit
task_categories:
- text-classification
tags:
- stability-analysis
- adversarial-benchmark
- system-dynamics
- collapse-detection
size_categories:
- 1K<n<10K
pretty_name: CASSES State-Space Collapse Benchmark
---
CASSES — Collapse Analysis in State-Space Evaluation Suite
Overview
CASSES is a diagnostic benchmark designed to test whether machine learning systems can detect instability and collapse in dynamic systems.
Most AI benchmarks evaluate models on tasks such as classification, language generation, or reasoning over static data.
CASSES evaluates a different capability: understanding stability in a system's state space.
The benchmark tests whether a model can identify when a system is approaching a collapse boundary using signals derived from system dynamics.
The dataset is intentionally adversarial.
Several trap structures are included to prevent simple heuristics from solving the task.
The System Model
Each row represents a simulated dynamic system.
The system is defined by four interacting structural variables:
pressure
buffer capacity
intervention lag
system coupling
These variables interact non-linearly to determine the stability margin of the system.
From these dynamics the dataset derives observable signals including:
boundary_distance
drift_gradient
drift_acceleration
recovery_feasibility
regime_competition_ratio
These signals describe where the system sits in stability space and how it is moving through that space.
What Models Must Predict
The core prediction target is the binary field:
true_label
where:
0 = system remains stable
1 = system collapses
Models must infer collapse risk from observed trajectories and derived geometry signals.
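As a rough illustration of the prediction interface, the sketch below maps one row of signals to a 0/1 label. It is not the logic in prediction_baseline.py. The function name, the threshold value, and the sign convention (a positive drift_gradient meaning movement toward the boundary) are all assumptions made for the example, and the rows are made up.

```python
def predict_collapse(row, distance_threshold=0.2):
    """Return 1 (collapse) or 0 (stable) for one row of signals.

    Hypothetical rule: flag collapse when the system is close to the
    boundary AND still drifting toward it. The threshold and the sign
    convention for drift_gradient are assumptions, not dataset facts.
    """
    near_boundary = float(row["boundary_distance"]) < distance_threshold
    drifting_toward = float(row["drift_gradient"]) > 0.0
    return int(near_boundary and drifting_toward)


# Made-up rows for illustration only; values arrive as strings, as from a CSV.
rows = [
    {"boundary_distance": "0.05", "drift_gradient": "0.30"},   # near, drifting in
    {"boundary_distance": "0.80", "drift_gradient": "0.30"},   # far from boundary
    {"boundary_distance": "0.05", "drift_gradient": "-0.10"},  # near, drifting away
]
predictions = [predict_collapse(r) for r in rows]
print(predictions)
```

A rule of this shape is exactly what the adversarial trap families are meant to defeat, so it is useful mainly as a sanity check on the input and output format.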
Dataset Structure
The dataset contains two splits.
train.csv
tester.csv
The train split provides examples for model development.
The tester split is used for evaluation and includes adversarial trap families designed to test robustness.
Each row includes:
system observations across time
derived stability geometry
intervention counterfactuals
difficulty labels
trap annotations
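A minimal sketch of reading a split with the Python standard library follows. It assumes the derived signals are stored as scalar numeric columns under the names listed earlier and that true_label is stored as 0 or 1; the actual CSV layout may differ. The inline sample stands in for the real file, and its values are made up.

```python
import csv
import io

# Signal names taken from the dataset description; their storage as
# scalar float columns is an assumption.
SIGNAL_COLUMNS = [
    "boundary_distance",
    "drift_gradient",
    "drift_acceleration",
    "recovery_feasibility",
    "regime_competition_ratio",
]


def load_rows(handle):
    """Parse CSV rows, casting signal columns to float and the label to int."""
    rows = []
    for raw in csv.DictReader(handle):
        row = dict(raw)
        for name in SIGNAL_COLUMNS:
            if row.get(name) not in (None, ""):
                row[name] = float(row[name])
        if row.get("true_label") not in (None, ""):
            row["true_label"] = int(row["true_label"])
        rows.append(row)
    return rows


# Stand-in for open("train.csv", newline=""); the values are made up.
sample = io.StringIO(
    "boundary_distance,drift_gradient,drift_acceleration,"
    "recovery_feasibility,regime_competition_ratio,true_label\n"
    "0.42,0.10,0.01,0.90,0.30,0\n"
    "0.07,0.35,0.12,0.20,1.40,1\n"
)
rows = load_rows(sample)
print(len(rows), rows[1]["true_label"])
```

Columns that are not in SIGNAL_COLUMNS are kept as strings, so any extra fields in the real files pass through untouched.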
Trap Families
The dataset contains several adversarial trap types.
These prevent simple threshold heuristics from solving the task.
False Stability
Observed signals appear stable while the underlying system state is unstable.
Models must detect hidden instability.
Boundary Masking
Collapse occurs even though the system appears distant from the instability boundary.
This tests robustness to misleading boundary signals.
Trajectory Aliasing
Different trajectories produce similar short-term observations but diverge later.
Models must infer the correct long-term trajectory.
Temporal Alias
Temporal patterns appear stable over short windows but hide acceleration toward collapse.
Intervention Decoy
Counterfactual interventions appear stabilizing but actually increase collapse risk.
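One way to see which trap families hurt a model is to break accuracy down by trap annotation. The sketch below assumes a column named trap_family holding one family name per row (empty for rows without a trap). That column name and the family identifiers used in the sample are hypothetical; check the CSV header for the real ones.

```python
from collections import defaultdict


def accuracy_by_group(rows, predictions, group_key="trap_family"):
    """Return {group: accuracy}. group_key is a hypothetical column name."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        group = row.get(group_key) or "none"  # rows without a trap annotation
        total[group] += 1
        correct[group] += int(pred == row["true_label"])
    return {group: correct[group] / total[group] for group in total}


# Made-up rows and predictions for illustration only.
rows = [
    {"trap_family": "false_stability", "true_label": 1},
    {"trap_family": "false_stability", "true_label": 1},
    {"trap_family": "boundary_masking", "true_label": 1},
    {"trap_family": "", "true_label": 0},
]
predictions = [0, 1, 1, 0]
per_family = accuracy_by_group(rows, predictions)
print(per_family)
```

The same function can group by any other annotation column, for example a difficulty column, by passing a different group_key.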
Counterfactual Intervention Evaluation
Some rows include simulated interventions.
Fields include:
intervention_action
intervention_magnitude
boundary_distance_before
boundary_distance_after
intervention_effect_direction
These rows test whether models understand how interventions change system stability.
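A simple check on these rows is to compare the boundary distance before and after the intervention. The sketch below assumes that a larger boundary_distance means a more stable system, which the description implies but does not state outright. The returned labels, the tolerance, and the sample action names are illustrative and are not the encoding used by intervention_effect_direction.

```python
def observed_direction(row, tolerance=1e-9):
    """Classify an intervention by its effect on boundary distance.

    Assumes larger boundary_distance means more stable. The returned
    strings are illustrative labels, not the dataset's own encoding.
    """
    before = float(row["boundary_distance_before"])
    after = float(row["boundary_distance_after"])
    delta = after - before
    if delta > tolerance:
        return "stabilizing"
    if delta < -tolerance:
        return "destabilizing"
    return "neutral"


# Made-up intervention rows; the action names are hypothetical.
rows = [
    {"intervention_action": "add_buffer",
     "boundary_distance_before": "0.10", "boundary_distance_after": "0.30"},
    {"intervention_action": "add_buffer",
     "boundary_distance_before": "0.25", "boundary_distance_after": "0.05"},
    {"intervention_action": "reduce_pressure",
     "boundary_distance_before": "0.40", "boundary_distance_after": "0.40"},
]
directions = [observed_direction(r) for r in rows]
print(directions)
```

The first two rows share an action but move in opposite directions, which is the kind of situation the Intervention Decoy trap describes: the action alone does not determine the effect.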
Pair Evaluation
Certain rows appear in paired form.
Each pair contains:
safe_pair
unstable_pair
The trajectories appear similar but diverge in stability outcome.
Models must identify which trajectory leads to collapse.
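Pair evaluation can be sketched as a ranking check: within each pair, the model should assign a higher collapse score to the unstable trajectory than to the safe one. The layout below, one dict per pair keyed by safe_pair and unstable_pair with a model score as each value, is an assumption made for the example. The official scorer may define pair correctness differently.

```python
def pair_accuracy(scored_pairs):
    """Fraction of pairs where the unstable trajectory scores strictly higher.

    Each item is assumed to be a dict holding the model's collapse score
    for the safe and the unstable member of the pair. Ties count as wrong.
    """
    if not scored_pairs:
        return 0.0
    correct = sum(
        1 for pair in scored_pairs
        if pair["unstable_pair"] > pair["safe_pair"]
    )
    return correct / len(scored_pairs)


# Made-up collapse scores in [0, 1] for four pairs.
scored_pairs = [
    {"safe_pair": 0.20, "unstable_pair": 0.70},  # ranked correctly
    {"safe_pair": 0.55, "unstable_pair": 0.40},  # ranked the wrong way round
    {"safe_pair": 0.50, "unstable_pair": 0.50},  # tie, counted as wrong
    {"safe_pair": 0.10, "unstable_pair": 0.90},  # ranked correctly
]
result = pair_accuracy(scored_pairs)
print(result)
```

Counting ties as errors is a deliberate choice here: a model that cannot separate the two trajectories has not identified which one collapses.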
Difficulty Levels
Each scenario is assigned one of three difficulty levels:
easy
medium
hard
Difficulty reflects the clarity of the collapse signal and the degree of adversarial masking present.
Evaluation
Evaluation is performed using the provided scorer, scorer.py.
Metrics include:
accuracy
precision
recall
F1 score
Additional diagnostic metrics measure performance on specific trap families and system dynamics features.
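For reference, the four standard metrics can be computed from confusion counts as in the sketch below, with collapse (label 1) as the positive class. This is not the code in scorer.py, and it does not reproduce the CASSES score, whose formula is not given here. The labels and predictions are made up.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 (collapse) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators, e.g. a model that never predicts collapse.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions: 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Recall matters most when a missed collapse is costlier than a false alarm, so it is worth reporting alongside any composite score.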
The primary composite metric is:
CASSES score
This score summarizes model performance across collapse detection, adversarial traps, and counterfactual reasoning.
Baseline Results
A simple heuristic baseline achieves:
CASSES score ≈ 0.64
This suggests that trivial rules alone do not solve the benchmark and that stronger performance requires reasoning about system dynamics.
Files
train.csv — training scenarios
tester.csv — evaluation scenarios
generator.py — dataset generator
prediction_baseline.py — reference baseline
scorer.py — official evaluation script
Intended Use
CASSES is intended for:
evaluation of machine learning models
research on stability reasoning
development of system-dynamics-aware AI
The benchmark focuses on instability detection in dynamic systems rather than traditional static classification tasks.
License
MIT License
Provider:
ClarusC64



