ClarusC64/system-stability-collapse-benchmark-casses-v0.1

Name: ClarusC64/system-stability-collapse-benchmark-casses-v0.1
Creator: ClarusC64
Published: 2026-03-27 16:29:52
License: not stated in the listing (the dataset card below declares MIT)

Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ClarusC64/system-stability-collapse-benchmark-casses-v0.1


Description:

---
language: en
license: mit
task_categories:
- text-classification
tags:
- stability-analysis
- adversarial-benchmark
- system-dynamics
- collapse-detection
size_categories:
- 1K<n<10K
pretty_name: CASSES State-Space Collapse Benchmark
---

CASSES: Collapse Analysis in State-Space Evaluation Suite

Overview

CASSES is a diagnostic benchmark designed to test whether machine learning systems can detect instability and collapse in dynamic systems.

Most AI benchmarks evaluate models on tasks such as classification, language generation, or reasoning over static data. CASSES evaluates a different capability: state-space stability understanding.

The benchmark tests whether a model can identify when a system is approaching a collapse boundary using signals derived from system dynamics.

The dataset is intentionally adversarial. Several trap structures are included to prevent simple heuristics from solving the task.

The System Model

Each row represents a simulated dynamic system. The system is defined by four interacting structural variables:

- pressure
- buffer capacity
- intervention lag
- system coupling

These variables interact non-linearly to determine the stability margin of the system.

From these dynamics the dataset derives observable signals including:

- boundary_distance
- drift_gradient
- drift_acceleration
- recovery_feasibility
- regime_competition_ratio

These signals describe where the system sits in stability space and how it is moving through that space.

What Models Must Predict

The core prediction task is: true_label

Where:

- 0 = system remains stable
- 1 = system collapses

Models must infer collapse risk from observed trajectories and derived geometry signals.

Dataset Structure

The dataset contains two splits:

- train.csv
- tester.csv

The train split provides examples for model development. The tester split is used for evaluation and includes adversarial trap families designed to test robustness.
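As a rough sketch of how a split might be inspected, the snippet below counts rows per true_label using only the Python standard library. The inline rows are invented placeholders, not values from the dataset. The names true_label, boundary_distance, and drift_gradient come from the card, but the assumption that they appear as CSV column headers spelled exactly this way has not been checked against the files.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for train.csv.
# The numbers are invented for illustration and are NOT real dataset values.
SAMPLE_CSV = """boundary_distance,drift_gradient,true_label
0.82,-0.01,0
0.15,-0.20,1
0.47,-0.05,0
"""


def label_counts(csv_file):
    """Count rows per true_label (0 = stable, 1 = collapse) in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["true_label"]) for row in reader)


counts = label_counts(io.StringIO(SAMPLE_CSV))
print(counts)

# With the real file (path assumed to be relative to the dataset root):
# with open("train.csv", newline="") as f:
#     print(label_counts(f))
```

Checking the class balance first is useful because accuracy alone can be misleading when stable and collapsing systems are unevenly represented.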
Each row includes:

- system observations across time
- derived stability geometry
- intervention counterfactuals
- difficulty labels
- trap annotations

Trap Families

The dataset contains several adversarial trap types. These prevent simple threshold heuristics from solving the task.

- False Stability: Observed signals appear stable while the underlying system state is unstable. Models must detect hidden instability.
- Boundary Masking: Collapse occurs even though the system appears distant from the instability boundary. This tests robustness to misleading boundary signals.
- Trajectory Aliasing: Different trajectories produce similar short-term observations but diverge later. Models must infer the correct long-term trajectory.
- Temporal Alias: Temporal patterns appear stable over short windows but hide acceleration toward collapse.
- Intervention Decoy: Counterfactual interventions appear stabilizing but actually increase collapse risk.

Counterfactual Intervention Evaluation

Some rows include simulated interventions. Fields include:

- intervention_action
- intervention_magnitude
- boundary_distance_before
- boundary_distance_after
- intervention_effect_direction

These rows test whether models understand how interventions change system stability.

Pair Evaluation

Certain rows appear in paired form. Each pair contains:

- safe_pair
- unstable_pair

The trajectories appear similar but diverge in stability outcome. Models must identify which trajectory leads to collapse.

Difficulty Levels

Each scenario is assigned a difficulty level:

- easy
- medium
- hard

Difficulty reflects the clarity of the collapse signal and the degree of adversarial masking present.

Evaluation

Evaluation is performed using the provided scorer. Metrics include:

- accuracy
- precision
- recall
- F1 score

Additional diagnostic metrics measure performance on specific trap families and system dynamics features.
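The four standard metrics and a per-trap-family breakdown can be sketched as follows. This is not the logic of scorer.py, which the card does not describe, and it does not attempt to reproduce the composite CASSES score. The family strings used here (false_stability, boundary_masking, none) are hypothetical encodings chosen for illustration, and the label vectors are made up.

```python
from collections import defaultdict


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with collapse (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def accuracy_by_family(y_true, y_pred, families):
    """Fraction of correct predictions within each trap family."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, fam in zip(y_true, y_pred, families):
        totals[fam] += 1
        hits[fam] += int(t == p)
    return {fam: hits[fam] / totals[fam] for fam in totals}


# Made-up labels and predictions, purely to exercise the functions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
# Hypothetical family names; the real annotation values may differ.
families = ["false_stability", "false_stability",
            "boundary_masking", "boundary_masking", "none", "none"]

metrics = binary_metrics(y_true, y_pred)
per_family = accuracy_by_family(y_true, y_pred, families)
print(metrics)
print(per_family)
```

Breaking accuracy down by family makes it visible when a model scores well overall but fails on one specific trap type.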
The primary composite metric is: CASSES score

This score summarizes model performance across collapse detection, adversarial traps, and counterfactual reasoning.

Baseline Results

A simple heuristic baseline achieves approximately: CASSES score ≈ 0.64

This indicates the benchmark cannot be solved with trivial rules and requires meaningful reasoning about system dynamics.

Files

- train.csv: training scenarios
- tester.csv: evaluation scenarios
- generator.py: dataset generator
- prediction_baseline.py: reference baseline
- scorer.py: official evaluation script

Intended Use

CASSES is intended for:

- evaluation of machine learning models
- research on stability reasoning
- development of system-dynamics-aware AI

The benchmark focuses on instability detection in dynamic systems rather than traditional static classification tasks.

License

MIT License

Provider:

ClarusC64
