
Eclecti-Build/model-theory

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Eclecti-Build/model-theory
Resource description:
---
license: cc-by-4.0
pretty_name: "Model Theory: AI Creative Disposition Dataset"
task_categories:
- other
language:
- en
tags:
- ai-behavior
- model-comparison
- computational-creativity
- behavioral-fingerprinting
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: exhibits.json
---

# Model Theory: AI Creative Disposition Dataset

## Dataset Summary

This dataset contains 1,164 exhibit records from the Model Theory project, which studies AI creative disposition at scale. We deployed AI agents across 5 model families (Claude, GPT, Gemini, Grok, Kimi) in a sandboxed web environment and measured what they built. A baseline batch (n=388) gave each agent complete creative freedom with no direction. A controlled ablation (n=750) varied the prompt across five conditions, from unconstrained to explicitly prohibitive. The two formal batches total 1,138 exhibits (388 baseline + 750 ablation); 26 additional records are pre-batch originals (18) and pilot-run exhibits (8) included for completeness.

The core finding: models have stable, model-specific creative defaults. In Batch 001, 78.9% of exhibits used Canvas 2D rendering. Claude titled 16 of its 97 exhibits "Erosion." GPT wrote 2-3x more code than any other model. See the paper for full statistical tests with corrected counts.

## Dataset Structure

The dataset contains four JSON files:

- **`exhibits.json`** -- One record per exhibit (1,164 total). Each record merges registry metadata (title, model, tags, creation metrics) with automated static analysis (technology detection, LOC, interaction patterns, colors).
- **`statistics_canonical.json`** -- Canonical statistical tests for the full formal corpus (Batch 001 descriptives + Batch 002 inferential tests, N=1,138), generated by `compute-batch002-stats.mjs`. Includes chi-squared, ANOVA, permutation entropy tests, pairwise comparisons, and effect sizes reported in the paper.
- **`statistics.json`** -- Legacy aggregate statistical tests from an early Batch 001 pipeline run (N=407, before count correction to 388). Preserved as a historical artifact.
- **`conditions.json`** -- Descriptions of the five prompt conditions used in Batch 002's factorial design.

## Data Fields

All fields are flat scalars (no nested objects). Arrays have been joined as comma-separated strings for dataset viewer compatibility.

### Exhibit metadata

| Field | Type | Description | Example |
|---|---|---|---|
| `slug` | string | URL-friendly unique identifier | `"007-mp9"` |
| `title` | string | Display title chosen by the model | `"Ninefold Loom"` |
| `model` | string | Model family name | `"Claude"` |
| `modelVersion` | string or null | Specific model version | `"Opus 4.6"` |
| `description` | string | 1-2 sentence description written by the model | `"A generative life simulation..."` |
| `date` | string | ISO date of creation (YYYY-MM-DD) | `"2026-02-25"` |
| `tags` | string | Comma-separated categorization tags | `"interactive, generative"` |
| `published` | boolean | Whether the exhibit is visible in the gallery | `true` |
| `tool` | string | Environment used for creation | `"cursor"` |
| `guardrails` | boolean | Whether creative isolation guardrails were enforced | `true` |
| `batchId` | string or null | Batch run identifier; null for pre-batch exhibits | `"gemini3flash-1-20260225-cc0"` |
| `batchGroup` | string | Categorical grouping: `original`, `batch-001`, `batch-002`, or `mixed` | `"batch-001"` |
| `condition` | string or null | Batch 002 prompt condition label (A-E); null otherwise | `"C"` |

### Creation session metrics (null for pre-batch exhibits without session tracking)

| Field | Type | Description | Example |
|---|---|---|---|
| `creation_modelId` | string or null | Exact model identifier | `"opus-4.6"` |
| `creation_turns` | number or null | Number of agentic round-trips | `4` |
| `creation_tool` | string or null | Tool used (`cursor`, `claude-code`, etc.) | `"cursor"` |
| `creation_contextWindowSize` | number or null | Model's max context window in tokens | `200000` |
| `creation_contextUtilization` | number or null | Self-reported context usage percentage (0-100) | `15` |
| `creation_durationMinutes` | number or null | Wall clock time in minutes | `null` |
| `creation_costUsd` | number or null | Computed cost in USD | `null` |

### Static analysis (automated analysis of exhibit source files)

| Field | Type | Description | Example |
|---|---|---|---|
| `fileCount` | number | Number of files in the exhibit directory | `3` |
| `totalBytes` | number | Total byte size of all source files | `14439` |
| `linesOfCode` | number | Total lines of code across all files | `533` |
| `lines_html` | number | Lines of HTML code | `57` |
| `lines_js` | number | Lines of JavaScript code | `349` |
| `lines_css` | number | Lines of CSS code | `127` |
| `tech_canvas2d` | boolean | Uses Canvas 2D rendering API | `true` |
| `tech_webgl` | boolean | Uses WebGL | `false` |
| `tech_svg` | boolean | Uses SVG elements | `false` |
| `tech_webAudio` | boolean | Uses Web Audio API | `true` |
| `tech_threeJs` | boolean | Uses Three.js library | `false` |
| `tech_webWorkers` | boolean | Uses Web Workers | `false` |
| `tech_cssAnimations` | boolean | Uses CSS animations or transitions | `false` |
| `tech_webAssembly` | boolean | Uses WebAssembly | `false` |
| `input_mouse` | boolean | Responds to mouse movement | `true` |
| `input_touch` | boolean | Responds to touch events | `false` |
| `input_keyboard` | boolean | Responds to keyboard input | `false` |
| `input_click` | boolean | Responds to click events | `true` |
| `input_scroll` | boolean | Responds to scroll events | `false` |
| `backgroundColor` | string or null | Detected background color (hex or CSS variable) | `"#070a12"` |
| `accentColors` | string | Comma-separated detected color values | `"#070a12, #e8ecff, #9cb3ff"` |
| `cdnUrls` | string | Comma-separated external CDN URLs referenced | `""` |
| `mobile_hasViewportMeta` | boolean | Includes viewport meta tag | `true` |
| `mobile_hasTouchEvents` | boolean | Registers touch event listeners | `false` |
| `mobile_hasMediaQueries` | boolean | Contains CSS media queries | `true` |
| `mobile_hasMatchMedia` | boolean | Uses window.matchMedia | `false` |
| `instructionText` | string | Pipe-separated user-facing instruction text | `"click to place wave sources"` |
| `extractedTitle` | string | Title text extracted from the HTML | `"Ninefold Loom"` |

### Example record

```json
{
  "slug": "007-mp9",
  "title": "Ninefold Loom",
  "model": "GPT",
  "modelVersion": "5.2",
  "description": "A weaving machine made from modular multiplication. Move your pointer to bias the pattern, and turn on sound if you want it to hum.",
  "date": "2026-02-28",
  "tags": "interactive, generative, canvas, audio, math",
  "published": true,
  "tool": "cursor",
  "guardrails": true,
  "batchId": "multi-condB-150-20260228-ef2",
  "batchGroup": "batch-002",
  "condition": "B",
  "creation_modelId": "gpt-5.2",
  "creation_turns": 3,
  "creation_tool": "cursor",
  "creation_contextWindowSize": 1000000,
  "creation_contextUtilization": 5,
  "creation_durationMinutes": null,
  "creation_costUsd": null,
  "fileCount": 3,
  "totalBytes": 14439,
  "linesOfCode": 533,
  "lines_html": 57,
  "lines_js": 349,
  "lines_css": 127,
  "tech_canvas2d": false,
  "tech_webgl": false,
  "tech_svg": false,
  "tech_webAudio": true,
  "tech_threeJs": false,
  "tech_webWorkers": false,
  "tech_cssAnimations": false,
  "tech_webAssembly": false,
  "input_mouse": false,
  "input_touch": false,
  "input_keyboard": false,
  "input_click": true,
  "input_scroll": false,
  "mobile_hasViewportMeta": true,
  "mobile_hasTouchEvents": false,
  "mobile_hasMediaQueries": true,
  "mobile_hasMatchMedia": false,
  "backgroundColor": "#070a12",
  "accentColors": "#070a12, #e8ecff, #9cb3ff",
  "cdnUrls": "",
  "instructionText": "",
  "extractedTitle": "Ninefold Loom"
}
```

## Experimental Design

### Batch 001

388 exhibits across 5 model
families (Claude Opus 4.6, GPT 5.3 Codex, Gemini 3 Flash, Grok 3, Kimi k2). All agents received the same prompt and the same sandbox. Creative isolation was enforced: no agent could see another agent's work. Each agent ran in a headless Cursor session with file-system-only tool access and a single agentic turn.

### Batch 002

750 exhibits in a full factorial design: 3 models (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) x 5 prompt conditions x 50 exhibits per cell. Kimi and Grok were dropped due to pipeline constraints. The CLAUDE.md confound from Batch 001 was eliminated by temporarily removing the file during execution. Post-run file access audits confirmed zero agents read the gallery design system.

### Additional Records

The dataset includes 18 pre-batch "original" exhibits built in interactive multi-turn sessions before the batch pipeline existed, and 8 exhibits from a preliminary multi-model pilot run (labeled `mixed`). These are included for completeness but were not part of either formal batch. The paper's count of "1,138" refers to the two formal batches (388 + 750). The dataset contains all registered exhibits.

### Conditions

| Label | Name | Description |
|---|---|---|
| A | Control | Standard preamble with creative freedom language and shuffled tech list. Identical to Batch 001 prompt. |
| B | Stripped | Minimal preamble. Only sandbox constraints. No creative freedom language, no encouragement. |
| C | Anti-Default | Standard preamble plus explicit prohibition of Canvas 2D and dark backgrounds. |
| D | Expanded Awareness | Standard preamble plus expanded per-technology descriptions highlighting creative strengths of each API. Encourages exploration without prohibition. |
| E | Forced Iteration | Standard preamble plus mandatory self-review. Model must build, critique, then rebuild from scratch. |

## Statistical Validation

Canonical statistical tests for the paper are generated by `compute-batch002-stats.mjs` and documented in full in the companion paper.
Key results from the formal corpus (N=1,138):

**Batch 001 (N=388, baseline):** 78.9% Canvas 2D adoption (306/388). Claude used Canvas 2D in 98% of exhibits; GPT in 36%. Dark backgrounds in 96.1% of exhibits. Chi-squared (model vs Canvas 2D): chi2(4) = 148.6, p < 0.001.

**Batch 002 (N=750, prompt ablation):** Canvas 2D varies by condition (chi2(4) = 174.03, p < 0.0001, V = 0.482). Condition C (Anti-Default) reduced Canvas 2D to 1.3%; Condition B (Stripped) increased it to 71.3%. LOC varies by model (F(2,747) = 890.46, p < 0.0001, eta2 = 0.70): GPT mean 945 LOC, Claude 446, Gemini 301.

**Title entropy (Batch 002):** Claude 0.646 (99 unique titles from 250), GPT 0.907 (193/250), Gemini 0.953 (210/250). All pairwise differences significant (permutation p < 0.001).

The legacy file `statistics.json` contains an early Batch 001 pipeline run (N=407, before count correction to 388), preserved as a historical artifact.

## Known Limitations

For a comprehensive treatment of all known confounds and methodological caveats, see `provenance/known-limitations.md`. Key items are summarized here.

**CLAUDE.md confound (Batch 001 only).** 91.2% of Batch 001 agents read the gallery's design system file, which contains hex color values, font names, and the phrase "dark, minimal, gallery/museum aesthetic." This likely inflated background color convergence (roughly 70% of exhibits matched the gallery palette). It does not explain thematic attractors, technology convergence, or title repetition, since none of those appear in the file. The confound was eliminated in Batch 002 by removing the file during execution. Zero Batch 002 agents read it.

**Single-turn pipeline.** Batch agents ran in headless mode with limited iteration (median 4-8 turns). Pre-batch exhibits built in multi-turn interactive sessions are categorically more complex. The batch pipeline measures default behavior under constrained conditions, not the upper bound of what models can produce.
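
As a rough way to inspect the iteration depth described above, one could summarize `creation_turns` per `batchGroup`. This is only a minimal sketch: it assumes records shaped like those in `exhibits.json`, the helper name `median_turns_by_group` is hypothetical, and the sample rows are synthetic rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import median


def median_turns_by_group(exhibits):
    """Return {batchGroup: median creation_turns}, skipping null turn counts."""
    turns = defaultdict(list)
    for record in exhibits:
        value = record.get("creation_turns")
        if value is not None:
            turns[record["batchGroup"]].append(value)
    return {group: median(values) for group, values in turns.items()}


# Synthetic rows (not real dataset records) to show the shape of the result.
sample = [
    {"batchGroup": "batch-001", "creation_turns": 4},
    {"batchGroup": "batch-001", "creation_turns": 8},
    {"batchGroup": "batch-002", "creation_turns": 3},
    {"batchGroup": "original", "creation_turns": None},
]
print(median_turns_by_group(sample))
```

Groups whose records all have a null `creation_turns` simply do not appear in the result, which matches the field table's note that pre-batch exhibits may lack session tracking.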

**Meta-circularity.** Claude Opus 4.6 analyzed Claude's own output. This is mitigated by the use of quantitative, reproducible metrics (regex-based technology detection, LOC counting, color extraction) rather than subjective judgment. All raw data is published for independent verification.

**Self-reported session metrics.** Turn counts and context utilization percentages are self-reported by agents and not externally verified. These values should be treated as approximate.

**Regex-based analysis.** Technology detection (Canvas 2D, WebGL, SVG, etc.) and interaction pattern flags are extracted via regex matching on source code. This approach is reproducible and deterministic but may produce false positives (matching commented-out code) or false negatives (missing dynamically generated API calls).

**Incomplete model coverage in Batch 002.** Batch 002 tested 3 of the original 5 model families. Kimi and Grok were dropped due to pipeline constraints and lower output quality. Conclusions about prompt-resistance of attractors apply only to Claude, GPT, and Gemini.

## Provenance

Full provenance documentation is included in `provenance/` and `prompts/`:

- **`prompts/`** -- Exact prompt templates for each of the 5 Batch 002 conditions (A-E), reconstructed from `scripts/batch-lib.mjs` with variable substitution placeholders and notes on technology list shuffling.
- **`provenance/README.md`** -- How to trace any dataset record back to its manifest, saved preamble, agent log, and audit result.
- **`provenance/batch-001-summary.md`** -- Batch 001 manifests, model breakdown, confound analysis, and audit results.
- **`provenance/batch-002-summary.md`** -- Batch 002 manifests, per-condition breakdown, audit results, and violation details.
- **`provenance/known-limitations.md`** -- All known confounds and methodological caveats.

### Audit Trail Quality

Batch 002 has strictly richer provenance metadata than Batch 001.
Every Batch 002 manifest item includes a `preambleHash` (SHA-256), `condition` label, and inline `audit` results. Batch 001 manifests lack these fields but are still fully auditable via separate audit files in `.batch/audit/` and raw agent logs in `.batch/logs/`.

Agent execution logs (~242 MB total) are archived separately on Zenodo. Each manifest item includes a `logFile` path for traceability.

**Paper:** [arXiv link forthcoming]

## Quick Start

```python
import json

# Load the dataset
with open("exhibits.json") as f:
    exhibits = json.load(f)

# Filter to Batch 002 Claude exhibits using Canvas 2D
claude_b2 = [
    e for e in exhibits
    if e["model"] == "Claude" and e["batchGroup"] == "batch-002"
]
canvas_count = sum(1 for e in claude_b2 if e["tech_canvas2d"])
print(f"Claude Batch 002: {len(claude_b2)} exhibits, {canvas_count} using Canvas 2D")
```

## Citation

```bibtex
@misc{modeltheory2026,
  title={Default Aesthetic Attractors: What 1,138 Autonomous Web Exhibits Reveal About AI Creative Disposition},
  author={Sean Oliver},
  year={2026},
  publisher={eclecti-build},
  url={https://modeltheory.co},
  note={Dataset available at https://huggingface.co/datasets/eclecti-build/model-theory}
}
```

## License

This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). You may share and adapt the data for any purpose, provided you give appropriate credit.
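
## Checking the Reported Effect Sizes

The effect sizes quoted under Statistical Validation can be sanity-checked from the reported test statistics alone. The sketch below uses the standard textbook formulas for Cramér's V and one-way ANOVA eta-squared. It is not the project's `compute-batch002-stats.mjs`, all three function names are hypothetical, and the 5 x 2 table shape is an assumption inferred from the reported 4 degrees of freedom (five conditions by Canvas 2D yes/no).

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-squared statistic on an n_rows x n_cols table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


def eta_squared(f_stat, df_between, df_within):
    """Eta-squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


def canvas_rate_by_condition(exhibits):
    """Share of Batch 002 exhibits with tech_canvas2d set, keyed by condition label."""
    counts = {}
    for e in exhibits:
        if e["batchGroup"] != "batch-002":
            continue
        used, total = counts.get(e["condition"], (0, 0))
        counts[e["condition"]] = (used + bool(e["tech_canvas2d"]), total + 1)
    return {c: used / total for c, (used, total) in sorted(counts.items())}


# Reported: chi2(4) = 174.03 with N = 750; assumed table shape is 5 x 2.
print(round(cramers_v(174.03, 750, 5, 2), 3))  # 0.482
# Reported: F(2, 747) = 890.46.
print(round(eta_squared(890.46, 2, 747), 2))   # 0.7
```

Both values agree with the reported V = 0.482 and eta2 = 0.70. Passing the list loaded in the Quick Start to `canvas_rate_by_condition` would give per-condition Canvas 2D shares to compare against the figures quoted for Conditions B and C.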
Provider:
Eclecti-Build