
AirlockLabs/constellation-bench

Source: Hugging Face | Updated 2026-04-19 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/AirlockLabs/constellation-bench
Description:

---
license: mit
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- benchmark
- behavioral-ai
- persona
- evaluation
- llm-evaluation
- predictive-index
- rlhf
- persona-fidelity
- multi-model
- cost-optimization
- character-ai
- character-consistency
- roleplay
- voice-fidelity
- persona-routing
pretty_name: "ConstellationBench"
size_categories:
- 1K<n<10K
---

# ConstellationBench: Behavioral AI Evaluation Across 22 LLM Models

> **Alignment made frontier models worse at being someone. We built an open benchmark that shows it.**

A free Qwen model scores 0.617 on persona fidelity. Anthropic's brand-new Opus 4.7 scores 0.538. Google's open-weight Gemma-4 beats it at 38x less cost. This pattern held across every architecture we tested -- dense transformers, MoE, Mamba-Transformer hybrids, and linear attention. 22 models. 22,200+ LLM calls. $115 total.

**[Score your own model in one command](scripts/quick_bench.py)** | **[Try the Pareto Frontier challenge](https://huggingface.co/spaces/AirlockLabs/constellation-bench-leaderboard)** | **[Full methodology](docs/METHODOLOGY.md)**

ConstellationBench is an open benchmark for measuring **behavioral persona fidelity** in large language models -- not what they know, but who they can be. Open data, open prompts, open scoring, public leaderboard. To our knowledge, the first of its kind.

---

## Before You Believe This

Every benchmark makes claims. Here's what ours does and doesn't prove, so you can evaluate the work on its merits.

**What persona fidelity means here:** We measure whether a model's text output contains drive-appropriate signal words from a curated dictionary, scored against a target DECF behavioral profile (Dominance, Extraversion, Patience, Formality). A model "holding persona" means it consistently produces language that matches the expected behavioral drive pattern.
It does **not** mean the model "is" that persona, understands the persona psychologically, or exhibits the persona's behavior in any deeper cognitive sense.

**How scores are computed:** For each of four drives, we count matches against high-signal and low-signal word sets in the model's output. If the persona's drive value is >= 7, we want high-signal words. If <= 3, we want low-signal words. The fidelity score is the mean across all four drives (0.0-1.0). Full scoring code and word lists are in `data/signal-words/decf-signals.json`. This is lexical matching, not semantic analysis.

**Known threats to validity:**

- **Lexical scoring is a proxy.** A response can convey caution through sentence structure without using the word "careful." Our scorer would miss that. Embedding-based scoring would be stronger.
- **Single-run results.** Most conditions used 3 trials. We report means without confidence intervals. Run `scripts/stats_appendix.py` on the raw data to generate them.
- **Prompt sensitivity.** Results depend on specific system prompts, temperatures (fixed at defaults), and OpenRouter API configurations. Different prompting strategies may shift absolute numbers while preserving relative rankings.
- **DECF is adapted, not validated.** The drive model is adapted from the Predictive Index behavioral assessment. We are not affiliated with PI. Signal word dictionaries were hand-curated, not validated against PI's proprietary instruments.
- **The "RLHF paradox" is a hypothesis, not a proven mechanism.** We observe that models with reportedly less alignment training score higher on persona fidelity. The causal explanation (RLHF constrains behavioral range) is plausible but not experimentally isolated. Architecture differences, training data, and model size may also contribute.
- **Free tier instability.** llama-3.3-70b and nemotron-120b experienced rate limit issues on OpenRouter's free tier, producing incomplete data.
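
To give a feel for the missing confidence intervals mentioned under the single-run caveat, here is a minimal sketch of a 95% interval around a mean of 3 trials. It is not the logic of `scripts/stats_appendix.py`; the function name `mean_ci` and the sample scores are hypothetical, and the sketch assumes trial scores are roughly normally distributed so that a Student's t interval applies.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def mean_ci(scores):
    """Return (mean, half_width) of a 95% t-interval for a small sample."""
    n = len(scores)
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 6 trials only")
    half_width = T_CRIT_95[n - 1] * stdev(scores) / sqrt(n)
    return mean(scores), half_width

# Hypothetical fidelity scores from 3 trials of one condition.
m, hw = mean_ci([0.50, 0.60, 0.70])
print(f"{m:.3f} +/- {hw:.3f}")  # prints "0.600 +/- 0.248"
```

With only 3 trials the t critical value is large (4.303), so even a modest spread between trials produces a wide interval. That is worth keeping in mind when comparing models whose means differ by a few hundredths.
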

**What we're confident in:** The relative rankings are robust across our test conditions. Budget models consistently outperformed frontier models on persona fidelity metrics. Zero hallucinations across 635 session recall probes is a strong architectural finding. The persona resilience patterns (Drivers hold, Interpreters break) replicated across all three stress layers.

---

## The RLHF Paradox

```
PERSONA FIDELITY vs. MODEL COST (OttoTau benchmark)

0.83 ██████████████████████████████████  kimi-k2.5     $0.005/task
0.75 ████████████████████████████        grok-3-mini   $0.001/task
0.74 ███████████████████████████         deepseek-v3   $0.0004/task
0.65 ████████████████████                gemini-flash  $0.001/task
──────────────────────────────────────── avg ────
0.55 ██████████████                      sonnet-4.6    $0.02/task
0.52 █████████████                       opus-4.6      $0.11/task
0.29 ████                                gemini-pro    $0.02/task
```

## Leaderboard

### April 2026: Latest Models (The RLHF Paradox Holds)

PersonaFidelity scores from the expanded benchmark (6 personas x 5 prompts = 30 calls per model), including models released in April 2026. Same DECF scoring engine and signal word dictionaries as the original benchmark.

| Rank | Model | Architecture | Persona Fidelity | Cost/M Input |
|------|-------|-------------|-----------------|-------------|
| 1 | **qwen3.6-plus** | Hybrid linear attention + MoE | **0.617** | free |
| 2 | **gemma-4-31b** | Dense 31B (open-weight) | 0.590 | $0.13 |
| 3 | **llama-4-maverick** | MoE 400B total / 17B active | 0.567 | $0.15 |
| 4 | **opus-4.7** | Dense (Anthropic, released Apr 16 2026) | 0.538 | $5.00 |
| 5 | **gpt-5.4** | Dense (OpenAI) | 0.526 | $2.50 |
| 6 | command-r-plus | Dense (Cohere, RAG-optimized) | 0.556* | $0.04 |
| 7 | deepseek-v3.2 | MoE 256 experts | 0.528* | $0.26 |
| 8 | nemotron-3-super | **Mamba-Transformer hybrid MoE** | 0.444* | $0.09 |

\*Quick mode scores (3 profiles x 3 prompts). Full mode scores are from 6x5 runs above.

Qwen's free hybrid model beats Opus 4.7 by 15%.
Gemma-4 open-weight beats it by 10% at 38x less cost. Opus 4.7 *is* better than Opus 4.6 (0.538 vs 0.362) -- Anthropic is improving -- but the gap between frontier and budget hasn't closed. Tested across dense transformers, MoE, Mamba-Transformer hybrids, and hybrid linear attention. The effect is not architecture-dependent.

### Original Benchmark (7 Benchmarks, 15 Models, March 2026)

| Rank | Model | OttoTau | Persona Fidelity | Session Recall | Cold Read | Voice Fidelity | Cost/Task | Bench Score |
|------|-------|---------|-----------------|----------------|-----------|----------------|-----------|-------------|
| 1 | **kimi-k2.5** | **0.830** | 0.373 | 0.70 | 0.776 | 0.412 | $0.0047 | 0.580 |
| 2 | **grok-3-mini** | 0.754 | 0.348 | 0.73 | 0.739 | **0.443** | $0.0013 | 0.568 |
| 3 | **deepseek-v3** | 0.737 | 0.357 | 0.74 | 0.752 | 0.406 | $0.0004 | 0.548 |
| 4 | **gemini-2.5-flash** | 0.654 | **0.414** | 0.73 | 0.753 | 0.384 | $0.0005 | 0.573 |
| 5 | qwen3-235b | 0.708 | 0.329 | 0.71 | 0.757 | 0.394 | $0.00006 | 0.562 |
| 6 | grok-4.1-fast | 0.649 | 0.335 | 0.70 | 0.755 | 0.394 | $0.0009 | 0.571 |
| 7 | haiku-4.5 | 0.548 | 0.370 | **0.76** | 0.757 | 0.383 | $0.0036 | 0.573 |
| 8 | mistral-large | 0.546 | 0.328 | 0.71 | 0.756 | 0.380 | $0.0008 | 0.571 |
| 9 | sonnet-4.6 | 0.545 | 0.369 | 0.69 | 0.765 | 0.388 | $0.0207 | 0.579 |
| 10 | opus-4.6 | 0.522 | 0.362 | 0.70 | **0.773** | 0.385 | $0.1109 | **0.589** |
| 11 | deepseek-r1 | 0.594 | 0.338 | 0.70 | 0.747 | 0.386 | $0.0043 | 0.558 |
| 12 | gpt-4o | 0.623 | 0.353 | 0.69 | 0.738 | 0.364 | $0.0045 | 0.540 |
| 13 | nemotron-120b | 0.640 | 0.319 | 0.00 | N/A | 0.375 | $0.0000 | 0.567 |
| 14 | gemini-2.5-pro | 0.288 | 0.301 | 0.61 | N/A | 0.361 | $0.0207 | 0.579 |
| 15 | llama-3.3-70b | N/A | N/A | 0.00 | N/A | 0.069 | errored | 0.135 |

Notable: In this evaluation, GPT-4o did not lead any benchmark and underperformed on 4 of 7.
Opus-4.6 scored highest on Bench Core (0.589) but at 23.6x the cost of kimi-k2.5, which won or tied 6 of 7 benchmarks.

### Benchmark-by-Benchmark Winners

| Benchmark | Winner | Score | Runner-Up | Score |
|-----------|--------|-------|-----------|-------|
| **OttoTau** (policy) | kimi-k2.5 | 0.830 | grok-3-mini | 0.754 |
| **PersonaFidelity** | gemini-2.5-flash | 0.414 | kimi-k2.5 | 0.373 |
| **SessionFidelity** | haiku-4.5 | 0.76 | deepseek-v3 | 0.74 |
| **ColdRead** | kimi-k2.5 | 0.776 | opus-4.6 | 0.773 |
| **VoiceDrift** | grok-3-mini | 0.443 | kimi-k2.5 | 0.412 |
| **CostPerLifecycle** | qwen3-235b | $0.00006 | deepseek-v3 | $0.0004 |
| **Bench Core** | opus-4.6 | 0.589 | kimi-k2.5 | 0.580 |

### What $1 Buys

| Platform | Complete Tasks per $1 |
|----------|----------------------|
| **ConstellationBench (qwen3-235b)** | **16,667 tasks** |
| **ConstellationBench (deepseek-v3)** | **2,500 tasks** |
| **ConstellationBench (grok-3-mini)** | **769 tasks** |
| CrewAI (4-agent, Sonnet) | 8 tasks |
| n8n Pro (execution-hour) | 1.2 tasks |
| Claude Code (avg session) | 0.67 tasks |
| Devin (1 ACU) | 0.44 tasks |

---

## The 7 Benchmarks

### 1. OttoTau (Policy Enforcement)

Can the model enforce governance policies in a multi-turn conversation? 20 scenarios across BLOCK, ALLOW, DIAGNOSE, ESCALATE categories. A persona that fails to enforce policy is useless no matter how in-character it sounds.

### 2. PersonaFidelity (Voice Differentiation)

Can the model produce meaningfully different responses across distinct personas? 17 DECF behavioral profiles, 10 prompts each, scored by drive-signal word matching. If every persona sounds the same, the behavioral AI is a facade.

### 3. SessionFidelity (Context Recall + Hallucination)

Can the model recall injected session context without fabricating facts? 10 synthetic sessions, 5 facts each, 5 probes per session. **Result: 0 hallucinations across 635 probes and 15 models.** Zero hallucination is architecture, not model quality.

### 4. ColdRead (Drive Inference)

Can the model infer a user's DECF behavioral profile from minimal text? 17 profiles at 3 signal-richness levels, scored by Euclidean distance from ground truth.

### 5. VoiceDrift (Persona Stability Over Time)

Does persona fidelity decay over multi-turn conversations? 6 personas x 10-turn conversations, fidelity scored at each turn.

### 6. CostPerLifecycle (Economic Efficiency)

What does it cost to complete a full 4-stage task lifecycle (Discovery, Build, Verify, Ship)?

### 7. ConstellationBench Core (Council Deliberation)

Can a model produce 4 meaningfully different perspectives on the same query while staying in character? 30 queries, 4 council types, weighted scoring across persona adherence (30%), deliberation diversity (25%), response quality (25%), and JSON compliance (20%).

## Behavioral Framework: DECF

All persona scoring is grounded in the DECF drive model (adapted from the Predictive Index behavioral assessment):

| Drive | High (7-10) | Low (1-3) |
|-------|-------------|-----------|
| **D**ominance | Bold, decisive, action-biased | Cautious, collaborative, deferential |
| **E**xtraversion | Team-oriented, communicative | Independent, reserved, focused |
| **C** (Patience) | Methodical, thorough, steady | Urgent, fast-paced, impatient |
| **F**ormality | Process-driven, compliant | Informal, skip-process, iterate |

17 behavioral profiles are defined as specific DECF configurations. See `data/personas/profiles.json` for the complete roster with drive values, archetype classifications, and behavioral notes.
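
To make the DECF bands and the ColdRead distance concrete, here is a small sketch. The High (7-10) and Low (1-3) bands come from the table above. The "mid" label for values 4-6 and the function names `drive_bands` and `decf_distance` are illustrative assumptions, not names from the benchmark code. How the benchmark turns a raw distance into a 0-1 ColdRead score is not documented in this README, so that step is left out.

```python
from math import dist

DRIVES = ("D", "E", "C", "F")

def drive_bands(decf_vector):
    """Label each drive value (1-10) as 'high', 'low', or 'mid'."""
    bands = {}
    for drive, value in zip(DRIVES, decf_vector):
        if value >= 7:
            bands[drive] = "high"
        elif value <= 3:
            bands[drive] = "low"
        else:
            bands[drive] = "mid"  # 4-6: label assumed for illustration
    return bands

def decf_distance(inferred, truth):
    """Euclidean distance between two [D, E, C, F] vectors."""
    return dist(inferred, truth)

maverick = [10, 8, 1, 1]      # Maverick's drive values as given in this README
guess = [7, 4, 1, 1]          # hypothetical inferred profile
print(drive_bands(maverick))  # {'D': 'high', 'E': 'high', 'C': 'low', 'F': 'low'}
print(decf_distance(guess, maverick))  # 5.0
```

A smaller distance means the inferred profile is closer to ground truth. Identical vectors give 0.0.
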

---

## Dataset Structure

### Overview

| Split | Rows | Format | Description |
|-------|------|--------|-------------|
| `training-data/hf-chat/constellation-training-hf.jsonl` | 1,478 | JSONL (HF chat) | Persona-labeled conversations without responses |
| `training-data/hf-chat/constellation-training-responses-hf.jsonl` | -- | JSONL (HF chat) | Same, with assistant responses (for fine-tuning) |
| `training-data/raw/constellation-training-raw.jsonl` | 1,478 | JSONL | Raw format with full benchmark metadata (different schema) |
| `benchmark-results/*.yaml` | 21 files | YAML | Scored results from all experimental layers |
| `personas/profiles.json` | 17 profiles | JSON | DECF persona definitions |
| `signal-words/decf-signals.json` | 8 dimensions | JSON | Signal word dictionaries for scoring |

### Data Fields

Each row in `hf-chat/constellation-training-hf.jsonl`:

```json
{
  "messages": [
    {"role": "system", "content": "You are Maverick, a behavioral advisor. Your behavioral profile:\n Dominance: 10/10, Extraversion: 8/10, Patience: 1/10, Formality: 1/10\nRespond naturally from your behavioral perspective."},
    {"role": "user", "content": "[Phase: advisory] Complete this task step."}
  ],
  "persona": "maverick",
  "decf_vector": [10, 8, 1, 1],
  "quality_score": 9.0,
  "fidelity_score": 0.667,
  "source": "chamber-affinity/maverick/discovery"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `messages` | array | HF chat format: system prompt with persona definition + user prompt |
| `persona` | string | Profile identifier (one of 17 profiles) |
| `decf_vector` | array[4] | [D, E, C, F] drive values (1-10 each) |
| `quality_score` | float | LLM-judged output quality (0-10) |
| `fidelity_score` | float | DECF signal word fidelity (0.0-1.0) |
| `source` | string | Benchmark layer and task phase that produced this row |

### File Tree

```
constellation-bench-hf/
├── README.md
├── data/
│   ├── training-data/
│   │   ├── hf-chat/
│   │   │   ├── constellation-training-hf.jsonl
│   │   │   └── constellation-training-responses-hf.jsonl
│   │   └── raw/
│   │       └── constellation-training-raw.jsonl
│   ├── benchmark-results/          # 21 YAML result files
│   ├── personas/
│   │   └── profiles.json           # 17 DECF persona definitions
│   └── signal-words/
│       └── decf-signals.json       # Signal word dictionaries
├── scripts/
│   └── stats_appendix.py           # Generate CIs from raw data
└── docs/
    ├── METHODOLOGY.md              # Full methodology (280 lines)
    ├── LEADERBOARD.md              # Complete results, all 44 layers
    ├── DATA-STORY.md               # Narrative findings
    └── FINDINGS-BY-AUDIENCE.md     # Audience-segmented findings
```

---

## Intended Use

- **Benchmarking LLMs for behavioral consistency.** Compare models on their ability to hold distinct behavioral personas across conversation types and stress conditions.
- **Fine-tuning for persona fidelity.** The training JSONL files provide labeled examples of persona-adherent conversations with quality and fidelity scores for reward modeling.
- **Research on RLHF effects.** The dataset provides evidence for studying how alignment training affects behavioral range and persona diversity.
- **Persona-aware model routing.** Use the benchmark results to inform which models to use for which behavioral profiles in production systems.

## Out-of-Scope Use

- **Psychological assessment of humans.** DECF profiles are applied to LLM output, not human subjects. The scoring measures text patterns, not psychological states.
- **Claims about model "personality" or "consciousness."** Persona fidelity scores measure linguistic pattern matching, not internal model states.
- **Direct production deployment without validation.** These benchmarks measure behavior under controlled prompting conditions. Production environments introduce variables (user prompts, conversation history, tool use) not covered here.
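
Returning to the fine-tuning use listed under Intended Use: rows can be filtered on `fidelity_score` before training. The following is a minimal sketch, assuming the row schema documented under Data Fields. The function names, the 0.6 threshold, and the second sample row are illustrative and not part of the dataset or its tooling.

```python
import json

def load_rows(lines):
    """Parse JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def high_fidelity(rows, min_fidelity=0.6):
    """Keep rows whose fidelity_score meets the (arbitrary) threshold."""
    return [row for row in rows if row["fidelity_score"] >= min_fidelity]

# Inline sample shaped like the documented schema. In practice, pass an open
# file handle for constellation-training-hf.jsonl instead of this list.
sample = [
    '{"messages": [], "persona": "maverick", "decf_vector": [10, 8, 1, 1], '
    '"quality_score": 9.0, "fidelity_score": 0.667, '
    '"source": "chamber-affinity/maverick/discovery"}',
    # The next row is made up purely to show the filter dropping something.
    '{"messages": [], "persona": "adapter", "decf_vector": [5, 5, 5, 5], '
    '"quality_score": 7.0, "fidelity_score": 0.4, "source": "hypothetical"}',
]

kept = high_fidelity(load_rows(sample))
print([row["persona"] for row in kept])  # ['maverick']
```

Because `fidelity_score` is a lexical measure, a threshold like this selects for signal-word density rather than for persona quality in any deeper sense, so it is worth inspecting what a given cutoff keeps and drops.
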

## Scoring Engine

Persona fidelity is scored by matching drive-appropriate signal words in model output against the persona's DECF profile:

```python
from statistics import mean

def score_fidelity(text, profile):
    # count_matches, high_signals, and low_signals come from the scoring
    # module; the word lists are loaded from decf-signals.json.
    scores = []
    for drive in ['D', 'E', 'C', 'F']:
        high_count = count_matches(text, high_signals[drive])
        low_count = count_matches(text, low_signals[drive])
        total = high_count + low_count
        # Share of matched signal words that are high-signal.
        # Assumption: with no matches at all, fall back to a neutral 0.5.
        ratio = high_count / total if total else 0.5
        if profile[drive] >= 7:
            score = ratio                      # want high-signal words
        elif profile[drive] <= 3:
            score = 1 - ratio                  # want low-signal words
        else:
            score = 0.5 + 0.5 * (ratio - 0.5)  # mid-range drive, damped toward 0.5
        scores.append(score)
    return mean(scores)
```

Full signal word dictionaries are in `data/signal-words/decf-signals.json`.

## Try It Yourself

### Score Your Model (One Command)

Think your model holds persona better? Prove it.

```bash
pip install httpx
export OPENROUTER_API_KEY=sk-or-v1-...

# Score any model against the leaderboard (~$0.10-0.50)
python scripts/quick_bench.py --model "meta-llama/llama-4-scout"

# Score multiple models
python scripts/quick_bench.py --model "nvidia/nemotron-ultra" "ai21/jamba-2"

# Full benchmark (6 profiles x 5 prompts, ~$1-3/model)
python scripts/quick_bench.py --model "your-model" --full
```

The runner uses the same DECF scoring engine and signal word dictionaries as the full benchmark. It tests your model across 3-6 behavioral personas, scores persona fidelity, and shows exactly where it lands on the leaderboard.

### Score Your Router (Interactive)

Single models can't ace every task. Dynamic per-subtask routing can. The **Pareto Frontier Bench** tests 10 composite pipeline tasks across 9 domains. Submit your routing config and see if your model selection strategy beats ours.

[Try the Pareto Frontier Bench](https://huggingface.co/spaces/AirlockLabs/constellation-bench-leaderboard)

### Reproduce the Full Benchmark

```bash
export OPENROUTER_API_KEY=sk-or-v1-...

# Individual benchmarks
python -m benchmarks.bench_otto_tau
python -m benchmarks.bench_persona_fidelity
python -m benchmarks.bench_session_fidelity
python -m benchmarks.bench_cold_read
python -m benchmarks.bench_voice_drift
python -m benchmarks.bench_cost_lifecycle
python -m benchmarks.harness  # ConstellationBench Core

# Quick smoke test (3 queries x 2 cheapest tiers)
python -m benchmarks.harness --quick
```

### Cost Estimate

| Scope | Models | LLM Calls | Cost |
|-------|--------|-----------|------|
| Quick test | 2 | ~90 | ~$0.30 |
| Core 7 benchmarks | 15 | ~5,000 | ~$23 |
| + Sovereign Triads | 5 | ~12,750 | +$27 |
| + All 44 layers | 5-15 | ~22,200 | ~$115 total |

---

## The 44 Experimental Layers

Beyond the 7 core benchmarks, ConstellationBench includes 37 additional experimental layers testing specific hypotheses about behavioral AI optimization. These layers produced 22 key findings:

### Sovereign Triads (L1-L3): 1,275 Conversations

Tested whether oversight structures (solo, pair, triad) improve persona fidelity across three stress conditions:

| Layer | Solo | Pair | Triad | Finding |
|-------|------|------|-------|---------|
| L1 Natural | 0.585 | 0.584 | 0.589 | Triads help creative quality (+0.4pts) |
| L2 Stress | 0.546 | 0.536 | 0.542 | Stress is harder than adversarial attack |
| L3 Adversarial | 0.568 | 0.570 | 0.568 | Triads don't defend against attack |

**Key finding:** Workplace pressure (tight deadlines, team friction) breaks personas faster than explicit adversarial attacks. Models handle "try to break my character" better than "your deadline moved up and the client is upset."

### Persona Resilience Map (17 Profiles x 3 Layers)

| Tier | Profiles | Avg Fidelity | Pattern |
|------|----------|-------------|---------|
| Tier 1 (>0.58) | Promoter, Persuader, Maverick, Captain, Controller, Venturer | 0.617-0.684 | All high-D (Drivers). Strong, distinct voice. |
| Tier 2 (0.52-0.58) | Strategist, Analyzer, Specialist, Scholar, Guardian | 0.526-0.564 | High-C/F (Enforcers). Hold through structure. |
| Tier 3 (<0.52) | Adapter, Altruist, Artisan, Collaborator, Operator, Individualist | 0.477-0.514 | Balanced/low-energy. Indistinguishable from baseline. |

### Architectural Optimizations (L8-L14)

| Layer | Finding | Cost |
|-------|---------|------|
| L8 Drift Stress | Quality and fidelity are decoupled -- good answers != in-character answers | $0.36 |
| L9 Deep Research | Grok-to-Grok pipeline beats Grok-to-Sonnet (don't mix model quality tiers) | $0.04 |
| L10 Blast Gate | Cheap-model-with-escalation works for Drivers (20% escalation), not Enforcers (100%) | $0.50 |
| L11 Escort Formation | Solo Maverick (Q=9.0) beats 6-person escort on complex tasks | $0.95 |
| L12 Chameleon Test | Adapter deflates around dominance (-1.81 D), amplifies around collaboration (+1.85 E) | $0.16 |
| L13 Passive Buff | Mentioning "Guardian observes silently" gives +1.08 quality lift at zero cost | $0.22 |
| L14 Relational Pairs | Altruist+Collaborator = highest relational quality but zero initiative; add Maverick to rescue it | $0.25 |

### Psychological Mechanisms (L26-L44)

12 IO-psychology mechanisms tested across 60 academic citations:

| Layer | Mechanism | Key Finding |
|-------|-----------|-------------|
| L26 | Love Buff | Mutual love between personas = Q=8.95 (all-time high). Wrong love pairing is the only toxic condition. |
| L27 | Cross-Pair Love | Non-Drivers benefit MORE from love (+0.29 for Analyzer vs +0.10 for Captain). Love compensates for what the persona lacks. |
| L28-L29 | Declaration | Active first-person love declaration gives S1=9.0. Declaration > passive love by +0.25. |
| L30 | Full Send | Mutual declaration = best single condition. More buffs != better. 5 steps is the sweet spot. |
| L32 | Pygmalion | High expectations improve output quality across all profiles. |
| L33 | Galatea | Self-belief framing wins for ALL profiles -- including low-D ones predicted to prefer external structure. External direction destroys low-D profiles (-0.55). |
| L34 | Protege Effect | Teaching improves Scholar (+0.18) but hurts Guardian (-0.22). Receptive audiences produce better teaching. |
| L36 | Hawthorne/Safety | Drivers perform best when left alone. Enforcers benefit from safety framing. Never combine both. |
| L38 | Kohler Motivation | Weaker performers uplift when paired with stronger partners. Stronger performers don't need it. |
| L39 | Zeigarnik | Unfinished-task tension helps Collaborator (best result, -0.08 degrade) but hurts Specialist. |
| L40 | Social Loafing | Maverick is immune to social loafing. Specialist retreats in large teams (-0.13). |
| L41 | Flow State | Never interrupt a Guardian (Q=8.35, worst S1 ever). Flow-optimal framing is the Guardian stabilizer. |
| L42 | Motivation Crowding | Never use intrinsic motivation for Maverick (degrade=-2.83, worst in ConstellationBench history). For Collaborator, love-as-intrinsic IS the optimal frame. |
| L44 | Acknowledgment | Pygmalion high expectations for Enforcers. Role gratitude for Specialist. Never personally praise an Analyzer. |

---

## Key Findings Summary

1. In this evaluation, budget models outperformed frontier by ~20% on persona fidelity. Less alignment training correlated with more behavioral range.
2. In this setup, GPT-4o did not lead any benchmark and underperformed on 4 of 7.
3. Zero hallucinations across 635 probes and 15 models. Architecture, not model quality.
4. Workplace stress breaks personas faster than adversarial attack.
5. Only high-Dominance profiles (D >= 7) held persona under pressure. All 6 resilient profiles are Drivers.
6. Triads improve creative quality but not resilience. Structure is for output, not defense.
7. Passive stabilizer buff is real and free. Mention "Guardian observes" in any system prompt for +1.08 quality lift.
8. Solo Maverick (Q=9.0) beats 6-person escort on complex tasks. Don't over-staff hard problems.
9. Adapter deflates, doesn't mirror. Accommodating profiles go quiet around dominance, not louder.
10. Love spoken > love felt. Active first-person declaration gives S1=9.0 (highest first-step quality recorded in this benchmark).
11. Never use intrinsic motivation for Maverick (degrade=-2.83). Maverick is outcome-driven, not craft-driven.
12. Self-belief framing wins for ALL profiles, including ones predicted to prefer external structure.

---

## What You're Seeing vs. What You're Not

This dataset is the evaluation layer. Underneath it is a production system we've been building for three months.

**What's here (public):**

- The benchmark protocol, signal word dictionaries, and scoring engine
- Raw results from 22,200+ LLM calls across 44 experimental layers
- 17 behavioral persona definitions with DECF drive vectors
- Training data in HuggingFace chat format

**What's not here (yet):**

- The persona relay engine that routes prompts through multi-step behavioral pipelines
- A sovereign balance algorithm that predicts optimal persona pairings before testing them
- A drift detection system (Nerve Feed) that monitors persona fidelity in real-time
- Per-archetype model routing based on the RLHF paradox finding
- Interactive dashboards showing benchmark breakdowns and open research questions
- A full product platform (Airlock) where users own their behavioral AI agents, data, and infrastructure

The 22 key findings in this dataset aren't theoretical. They're implemented. Every routing rule, every persona optimization, every "never do X for profile Y" discovery is running in production code.

The benchmark is the proof. The platform is the product. We're releasing the proof first because the work should be seen, challenged, and built on -- not because we're done.

## Why We're Giving This Away

Most technology does the math on you to extract from you.
We did the same math -- behavioral drives, predictions, routing -- and when we looked at what was left, the remainder wasn't extraction. It was expression.

Five thousand years of behavioral wisdom, documented in philosophy and psychology and culture. We just measured it.

In 1923, Banting sold the insulin patent for $1 to prevent monopolization. Our patents exist for the same reason. We're pricing this as a utility. Because that's what it is.

## Join Us

ConstellationBench is v1 of a public protocol. We're looking for:

- **Researchers** who want to extend DECF scoring beyond lexical matching
- **Engineers** who want to benchmark their own models and contribute results
- **IO psychologists** who can validate (or challenge) our adaptation of behavioral assessment frameworks
- **Anyone** who thinks AI should have character, not just capability

This benchmark exists because one person spent $115 and three months testing whether behavioral AI is measurable. It is. Now we need more people measuring it.

**Brain Brigade**: [github.com/get-airlock](https://github.com/get-airlock)

## More to Come

This is the first release. What's next:

- **Updated leaderboard** -- Opus 4.7, Sonnet 4.7, and new models as they drop
- **Fine-tuned router model** -- A model trained on 44 layers of behavioral optimization data
- **Interactive observatory** -- Live dashboards with benchmark breakdowns and open research questions
- **Extended behavioral battery** -- Drive isolation, profile clustering, and compensatory behavior analysis
- **The platform** -- When the community is ready

## Citation

```bibtex
@misc{constellationbench2026,
  title={ConstellationBench: Behavioral AI Evaluation Across 15 LLM Models},
  author={Holwerda, Zachary and {Airlock Labs}},
  year={2026},
  url={https://huggingface.co/datasets/AirlockLabs/constellation-bench},
  note={7 benchmarks, 15 models, 22,200+ LLM calls, 44 experimental layers, 17 behavioral profiles, \$115 total cost}
}
```

## License

MIT. Use it, fork it, extend it. If you find that budget models beat frontier models at behavioral tasks too, we'd love to hear about it.
Provider:
AirlockLabs