bertybaums/marc

Name: bertybaums/marc
Creator: bertybaums
Published: 2026-03-21 16:38:58
License: No description provided

Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/bertybaums/marc

Description:

---
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- visual-question-answering
- text-classification
tags:
- arc
- metaphor
- figurative-language
- mechanistic-interpretability
- grid-puzzles
- abstraction-and-reasoning
- multimodal-integration
pretty_name: "MARC: Metaphor Abstraction and Reasoning Corpus"
configs:
- config_name: tasks
  data_files: "tasks/train.parquet"
- config_name: task_subsets
  data_files: "task_subsets/train.parquet"
- config_name: descriptions
  data_files: "descriptions/train.parquet"
- config_name: baseline
  data_files: "baseline/train.parquet"
- config_name: figurative
  data_files: "figurative/train.parquet"
dataset_info:
- config_name: tasks
  features:
  - name: task_id
    dtype: int32
  - name: arc_name
    dtype: string
  - name: source
    dtype: string
  - name: num_train
    dtype: int32
  - name: see_description
    dtype: string
  - name: do_description
    dtype: string
  - name: grid_description
    dtype: string
  splits:
  - name: train
    num_examples: 577
- config_name: task_subsets
  features:
  - name: task_id
    dtype: int32
  - name: model_name
    dtype: string
  - name: subset
    dtype: string
  splits:
  - name: train
    num_examples: 1188
- config_name: descriptions
  features:
  - name: fig_id
    dtype: int32
  - name: task_id
    dtype: int32
  - name: generator_model
    dtype: string
  - name: variant
    dtype: string
  - name: source_domain
    dtype: string
  - name: metaphor
    dtype: string
  - name: figurative_see
    dtype: string
  - name: figurative_do
    dtype: string
  - name: figurative_grid
    dtype: string
  splits:
  - name: train
    num_examples: 1500
- config_name: baseline
  features:
  - name: trial_id
    dtype: int32
  - name: task_id
    dtype: int32
  - name: model_name
    dtype: string
  - name: condition
    dtype: string
  - name: num_examples
    dtype: int32
  - name: correct
    dtype: int32
  - name: cell_accuracy
    dtype: float32
  splits:
  - name: train
    num_examples: 3952
- config_name: figurative
  features:
  - name: trial_id
    dtype: int32
  - name: fig_id
    dtype: int32
  - name: task_id
    dtype: int32
  - name: model_name
    dtype: string
  - name: num_examples
    dtype: int32
  - name: correct
    dtype: int32
  - name: cell_accuracy
    dtype: float32
  - name: variant
    dtype: string
  - name: source_domain
    dtype: string
  splits:
  - name: train
    num_examples: 8225
---

# MARC: Metaphor Abstraction and Reasoning Corpus

## What This Is

MARC identifies puzzles where figurative language and visual examples are *genuinely complementary*: the model fails given examples alone, fails given the metaphor alone, but succeeds when both are presented together. We call this the **MARC property**. The corpus provides 78 MARC-verified puzzles with 1,230 domain-diverse figurative descriptions and complete behavioral trial data for three language models.

Suppose you are staring at a grid puzzle: coloured cells in rows and columns, with some pattern lurking beneath the surface. A handful of training examples show input grids paired with their correct outputs, but the transformation rule eludes you. Now someone offers a hint: "Think of it as a garden, where the green cells are plants spreading to fill empty soil." If that helps, and if neither the examples nor the metaphor would have sufficed on its own, then this puzzle exhibits the MARC property.

## The MARC Property

A puzzle satisfies the MARC property for a given model when three conditions hold simultaneously:

1. **Examples alone fail.** The model cannot solve the puzzle from training input-output pairs alone.
2. **Figurative description alone fails.** The model cannot solve the puzzle from the metaphorical clue alone.
3. **Figurative + examples succeeds.** The model solves the puzzle when given both the metaphor and some number of training examples.

This isolates cases where language and perception are genuinely complementary: neither channel suffices on its own, but their combination does.
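
The three conditions above can be written as a single predicate. The following is a minimal sketch, not the authors' verification code: the function name `has_marc_property` and its argument shapes are hypothetical, and it assumes the examples-alone result has already been reduced to one boolean per model and puzzle.

```python
from typing import Mapping


def has_marc_property(
    examples_only_correct: bool,
    figurative_by_k: Mapping[int, bool],
) -> bool:
    """Return True when all three MARC conditions hold for one model and puzzle.

    `figurative_by_k` maps the number of training examples shown alongside
    the metaphor (0 = metaphor alone) to whether the answer was an exact match.
    Both arguments are hypothetical shapes chosen for this illustration.
    """
    # Condition 1: examples alone must fail.
    if examples_only_correct:
        return False
    # Condition 2: the metaphor alone (k = 0) must fail.
    # A missing k = 0 trial is treated as unverified, so the property is not claimed.
    if 0 not in figurative_by_k or figurative_by_k[0]:
        return False
    # Condition 3: the metaphor plus some number of examples (k >= 1) must succeed.
    return any(ok for k, ok in figurative_by_k.items() if k >= 1)


# Metaphor alone fails, metaphor + 1 example fails, metaphor + 3 examples succeeds.
print(has_marc_property(False, {0: False, 1: False, 3: True}))  # True
# The metaphor alone already suffices, so the channels are not complementary.
print(has_marc_property(False, {0: True, 3: True}))  # False
```

Treating a missing metaphor-alone trial as a failure to verify, rather than as a pass, is a conservative choice made for this sketch.
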
## Data Sources

The corpus draws on two sources:

- **LARC tasks** (task IDs 0–399): 400 ARC-AGI puzzles (Chollet, 2019) extended with crowdsourced literal descriptions from the LARC dataset (Acquaviva et al., 2022). Figurative descriptions were generated by Claude and verified behaviorally.
- **MARC submissions** (task IDs 1000–1176): 177 hand-crafted puzzles with human-authored figurative clues.

Grid puzzles themselves (the actual input-output grid pairs) are not included in this dataset. They are available from the original ARC-AGI repository and the LARC dataset. Task IDs and `arc_name` fields provide the link.

## Dataset Configs

### `tasks` (577 rows)

Task metadata. For LARC tasks, `see_description`, `do_description`, and `grid_description` contain literal descriptions of the puzzle. For MARC submissions, these fields contain the figurative clue (no literal descriptions exist).

| Column | Description |
|--------|-------------|
| `task_id` | Unique task identifier (0–399: LARC; 1000–1176: submissions) |
| `arc_name` | Original ARC filename (8-character hex ID) |
| `source` | `'larc'` or `'marc-submission'` |
| `num_train` | Number of training examples (1–10) |
| `see_description` | What structures are visible in the input |
| `do_description` | The transformation rule |
| `grid_description` | How output dimensions relate to input |

### `task_subsets` (1,188 rows)

Classification of each task by each model into one of four categories based on baseline performance.

| Column | Description |
|--------|-------------|
| `task_id` | Links to `tasks` |
| `model_name` | Which model this classification applies to |
| `subset` | `'examples_sufficient'`, `'language_sufficient'`, `'both_required'`, or `'unsolvable'` |

### `descriptions` (1,500 rows)

Figurative descriptions (the metaphorical clues). This is the core contribution. Each MARC-verified puzzle has an original clue plus up to 24 domain-diverse alternatives.

| Column | Description |
|--------|-------------|
| `fig_id` | Unique description identifier |
| `task_id` | Links to `tasks` |
| `generator_model` | `'claude-agent'` (generated) or `'human'` (hand-crafted) |
| `variant` | `'original'`, `'alt-1'`, `'alt-2'`, ... |
| `source_domain` | Metaphor domain: `'biology'`, `'warfare'`, `'cooking'`, etc. (NULL for originals) |
| `metaphor` | One-line metaphor concept |
| `figurative_see` | Figurative "what you see" (empty for submissions) |
| `figurative_do` | Figurative "what to do" (empty for submissions) |
| `figurative_grid` | Figurative grid description (empty for submissions) |

**Domain coverage:** 12 core domains (warfare, biology, cooking, music, gardening, navigation, dance, theater, architecture, astronomy, chemistry, weather) each cover all 78 MARC-verified tasks. 51 additional domains appear in smaller numbers.

### `baseline` (3,952 rows)

Baseline trial results under three conditions (no figurative language involved).

| Column | Description |
|--------|-------------|
| `trial_id` | Unique trial identifier |
| `task_id` | Links to `tasks` |
| `model_name` | Subject model |
| `condition` | `'examples_only'`, `'language_only'`, or `'both'` |
| `num_examples` | Training examples shown |
| `correct` | 1 = exact match, 0 = incorrect |
| `cell_accuracy` | Fraction of cells matching (0.0–1.0) |

### `figurative` (8,225 rows)

Figurative trial results. Each row is one (metaphor variant, model, number of examples) combination.

| Column | Description |
|--------|-------------|
| `trial_id` | Unique trial identifier |
| `fig_id` | Links to `descriptions` |
| `task_id` | Links to `tasks` |
| `model_name` | Subject model |
| `num_examples` | 0 = figurative only, 1–N = figurative plus that many training examples |
| `correct` | 1 = exact match, 0 = incorrect |
| `cell_accuracy` | Fraction of cells matching (0.0–1.0) |
| `variant` | Denormalized from `descriptions` for convenience |
| `source_domain` | Denormalized from `descriptions` for convenience |

## Models Tested

| Model | Parameters | Architecture | Notes |
|-------|-----------|--------------|-------|
| gpt-oss-120b | 120B | MoE, open-weight | Primary verification model |
| gpt-oss-20b | 21B (3.6B active) | MoE, open-weight | Mechanistic interpretability target |
| qwen3.5-400b | 400B | Dense | Baseline only (no figurative trials) |

All trials used temperature 0.0. Reasoning models (gpt-oss-*) use a two-pass protocol: Pass 1 for reasoning, Pass 2 for structured output extraction.

## Key Statistics

| Statistic | Count |
|---|---:|
| Total tasks | 577 |
| MARC-verified puzzles (120b) | 78 |
| Figurative descriptions | 1,500 |
| of which original clues | 270 |
| of which domain-diverse alternatives | 1,230 |
| Distinct source domains | 63 |
| Core domains (full 78-task coverage) | 12 |
| Baseline trials | 3,952 |
| Figurative trials | 8,225 |
| MARC-valid alternatives (120b) | 725/1,230 (59%) |

## Usage

```python
import pandas as pd
from datasets import load_dataset

# Load specific configs (each has a single "train" split)
tasks = load_dataset("bertybaums/marc", "tasks")
descriptions = load_dataset("bertybaums/marc", "descriptions")
figurative = load_dataset("bertybaums/marc", "figurative")

# All descriptions in the biology domain (not yet filtered for MARC validity)
bio = [d for d in descriptions["train"] if d["source_domain"] == "biology"]

# Figurative half of the MARC check, per model and per description:
# figurative-alone (num_examples == 0) fails and figurative+examples succeeds.
# The examples-alone condition lives in the `baseline` config and is not checked here.
fig_df = figurative["train"].to_pandas()
marc_valid = fig_df.groupby(["model_name", "fig_id"]).apply(
    lambda g: bool(
        (g[g.num_examples == 0].correct == 0).all()
        and (g[g.num_examples > 0].correct == 1).any()
    )
)
```

## Intended Uses

- **Mechanistic interpretability:** How do LLMs internally integrate figurative language with visual-spatial pattern recognition? The domain-diverse alternatives enable controlled comparisons: same puzzle, different metaphor.
- **Figurative language understanding:** Which source domains produce more effective metaphors for abstract reasoning tasks? The 12-domain factorial supports systematic comparison.
- **Abstraction and reasoning:** The MARC property identifies a specific failure mode (examples alone insufficient) and a specific remedy (figurative scaffolding). What makes some puzzles amenable to this scaffolding and others not?
- **Scaling analysis:** Comparing MARC validity rates across 20B vs. 120B models reveals how figurative reasoning capacity scales with model size.

## Limitations

- Figurative descriptions were generated by Claude, not humans (except the 177 MARC submissions). The metaphors may reflect Claude's biases in how it maps grid operations to conceptual domains.
- Behavioral trials use temperature 0.0, but reasoning models may still exhibit minor non-determinism across identical prompts.
- The dataset does not include the grid puzzles themselves (input-output pairs). Researchers need the original ARC/LARC data to see what the metaphors describe.
- MARC verification was performed against gpt-oss-120b. A metaphor that is MARC-valid for 120b may not be for smaller models (and vice versa).
- Prompt text is not included in this release to keep the dataset compact. Researchers needing full prompts can reconstruct them from the task data + descriptions, or contact the authors.
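
The cross-model caveat in the limitations above can be explored by computing a MARC-validity rate per model. The sketch below is illustrative only: it runs on toy rows shaped like the `baseline` and `figurative` configs, the model names `model-a` and `model-b` and the helper `marc_valid_rate_by_model` are hypothetical, and it assumes that any correct `examples_only` baseline trial counts as "examples alone succeed". The authors' actual verification procedure may differ.

```python
from collections import defaultdict


def marc_valid_rate_by_model(figurative_rows, baseline_rows):
    """Fraction of tested descriptions that are MARC-valid, per model."""
    # (model, task) pairs solved from examples alone in the baseline trials.
    solved_by_examples = {
        (r["model_name"], r["task_id"])
        for r in baseline_rows
        if r["condition"] == "examples_only" and r["correct"] == 1
    }
    # Group figurative trials by (model, description).
    trials = defaultdict(list)
    for r in figurative_rows:
        trials[(r["model_name"], r["fig_id"])].append(r)

    valid, total = defaultdict(int), defaultdict(int)
    for (model, _fig_id), rows in trials.items():
        task_id = rows[0]["task_id"]
        alone = [r["correct"] for r in rows if r["num_examples"] == 0]
        combined = [r["correct"] for r in rows if r["num_examples"] > 0]
        ok = (
            (model, task_id) not in solved_by_examples  # examples alone fail
            and bool(alone) and not any(alone)          # metaphor alone fails
            and any(combined)                           # metaphor + examples succeeds
        )
        total[model] += 1
        valid[model] += int(ok)
    return {m: valid[m] / total[m] for m in total}


# Toy rows (fabricated for illustration, not taken from the dataset).
baseline = [
    {"task_id": 1, "model_name": "model-a", "condition": "examples_only", "correct": 0},
    {"task_id": 1, "model_name": "model-b", "condition": "examples_only", "correct": 1},
]
figurative = [
    {"fig_id": 10, "task_id": 1, "model_name": "model-a", "num_examples": 0, "correct": 0},
    {"fig_id": 10, "task_id": 1, "model_name": "model-a", "num_examples": 2, "correct": 1},
    {"fig_id": 11, "task_id": 1, "model_name": "model-a", "num_examples": 0, "correct": 1},
    {"fig_id": 11, "task_id": 1, "model_name": "model-a", "num_examples": 2, "correct": 1},
    {"fig_id": 10, "task_id": 1, "model_name": "model-b", "num_examples": 0, "correct": 0},
    {"fig_id": 10, "task_id": 1, "model_name": "model-b", "num_examples": 2, "correct": 1},
]
rates = marc_valid_rate_by_model(figurative, baseline)
print(rates)  # {'model-a': 0.5, 'model-b': 0.0}
```

The same function should accept rows iterated from the real `train` splits, since each row is a plain dictionary keyed by the column names listed in the config tables.
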
## Citation

If you use this dataset, please cite:

```bibtex
@misc{baumgaertner2026marc,
  title={MARC: Metaphor Abstraction and Reasoning Corpus},
  author={Baumgaertner, Bert},
  year={2026},
  url={https://huggingface.co/datasets/bertybaums/marc}
}
```

## Acknowledgments

Grid puzzles are drawn from ARC-AGI (Chollet, 2019) and LARC (Acquaviva et al., 2022). Figurative descriptions were generated using Claude (Anthropic). Behavioral experiments were conducted on the MindRouter infrastructure at the University of Idaho.

## License

This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). The underlying ARC puzzles are licensed under Apache 2.0; LARC descriptions are licensed under CC-BY-4.0.

Provider:

bertybaums
