Abhijnan/craft-benchmark-lean
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/Abhijnan/craft-benchmark-lean

---
license: apache-2.0
tags:
- multi-agent
- benchmark
- pragmatic-communication
- partial-information
- llm-evaluation
- LLM-collaboration
paper: arxiv:2603.25268
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: board_image
    dtype: image
  - name: id
    dtype: int64
  - name: hash
    dtype: string
  - name: director_model
    dtype: string
  - name: model_type
    dtype: string
  - name: structure_id
    dtype: string
  - name: complexity
    dtype: string
  - name: turn_number
    dtype: int64
  - name: turns_taken
    dtype: int64
  - name: final_progress
    dtype: float64
  - name: completed
    dtype: bool
  - name: D1_message
    dtype: string
  - name: D2_message
    dtype: string
  - name: D3_message
    dtype: string
  - name: builder_action
    dtype: string
  - name: builder_block
    dtype: string
  - name: builder_position
    dtype: string
  - name: builder_layer
    dtype: int64
  - name: builder_span_to
    dtype: string
  - name: builder_confirmation
    dtype: string
  - name: overall_progress
    dtype: float64
  - name: move_executed
    dtype: bool
  - name: failed_move
    dtype: bool
  - name: correct_structure_placement
    dtype: bool
  - name: correct_side_placement
    dtype: bool
  - name: oracle_adherence_full
    dtype: 'null'
  - name: oracle_moves
    dtype: string
  - name: D1_thinking
    dtype: string
  - name: D2_thinking
    dtype: string
  - name: D3_thinking
    dtype: string
  - name: structure_before
    dtype: string
  - name: spans_before
    dtype: string
  - name: conversation_snapshot
    dtype: string
  - name: builder_model
    dtype: string
  - name: model_label
    dtype: string
  - name: run
    dtype: int64
  - name: partial_completion_type
    dtype: string
  - name: D1_archetype
    dtype: string
  - name: D2_archetype
    dtype: string
  - name: D3_archetype
    dtype: string
  - name: completion_percentage
    dtype: float64
  - name: iou_score
    dtype: float64
  - name: position_accuracy
    dtype: float64
  - name: distance_score
    dtype: float64
  - name: builder_followed_oracle
    dtype: bool
  splits:
  - name: train
    num_bytes: 259705147
    num_examples: 5946
  download_size: 188315472
  dataset_size: 259705147
---
# CRAFT Benchmark Dataset
Trajectory logs from the CRAFT benchmark, a multi-agent evaluation of pragmatic communication in LLMs under strict partial information ([TL;DR](https://gist.science/paper/2603.25268#gist)).
## Dataset Structure
Each row is one turn from a CRAFT game, with fields for:
- **Identity**: `structure_id`, `director_model`, `builder_model`, `model_type` (base/frontier), `turn_number`
- **Director responses**: `D1_thinking`, `D1_message`, `D2_thinking`, `D2_message`, `D3_thinking`, `D3_message`
- **Builder**: `builder_action`, `builder_block`, `builder_position`, `builder_confirmation`
- **Oracle**: `oracle_moves`, `oracle_adherence_full` (typed `null` in the schema), `builder_followed_oracle`
- **Progress metrics**: `overall_progress`, `iou_score`, `position_accuracy`, `completion_percentage`
- **Outcomes**: `move_executed`, `failed_move`, `correct_structure_placement`, `correct_side_placement`
- **Board state**: `structure_before`, `spans_before`, `conversation_snapshot`
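
The field groups above can be turned into small helpers for slicing a row. The following is a minimal sketch, not the authors' tooling: `FIELD_GROUPS`, `load_craft_rows`, `pick_group`, and `director_messages` are hypothetical names introduced here, and the code assumes a director's message can be missing or empty on a given turn. Only `load_craft_rows` needs the `datasets` package and network access.

```python
# Field groups as listed in the card; names come from the dataset schema.
FIELD_GROUPS = {
    "identity": ["structure_id", "director_model", "builder_model",
                 "model_type", "turn_number"],
    "builder": ["builder_action", "builder_block", "builder_position",
                "builder_confirmation"],
    "outcomes": ["move_executed", "failed_move",
                 "correct_structure_placement", "correct_side_placement"],
    "progress": ["overall_progress", "iou_score", "position_accuracy",
                 "completion_percentage"],
}


def load_craft_rows():
    """Load the train split from the Hub (needs `pip install datasets`)."""
    # Imported lazily so the helpers below also work offline.
    from datasets import load_dataset

    return load_dataset("Abhijnan/craft-benchmark-lean", split="train")


def pick_group(row, group):
    """Return only the fields of one group from a dict-like row."""
    return {name: row.get(name) for name in FIELD_GROUPS[group]}


def director_messages(row):
    """Collect the non-empty director messages of one turn, keyed by director."""
    messages = {}
    for director in ("D1", "D2", "D3"):
        text = row.get(f"{director}_message")
        # ASSUMPTION: a silent director shows up as None or an empty string.
        if text:
            messages[director] = text
    return messages
```

With a loaded split, `pick_group(rows[0], "progress")` would return the four progress metrics for the first turn, and `director_messages(rows[0])` the messages spoken on it.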
## Coverage
- **15 director models**: 8 open-weight (Qwen 7B–72B, Llama-8B, Mistral-7B, Gemma-9B, DeepSeek-V2-Lite) + 7 frontier (GPT-4o, GPT-4o-Mini, GPT-4.1-Mini, Claude-Sonnet-4.6, Gemini-2.5-Flash, Gemini-3-Flash, Gemini-3.1-Flash-Lite)
- **20 evaluation structures** (7 simple, 8 medium, 5 complex)
- **5,946 rows** across all models and structures (one row per turn)
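
Because each row is a single turn, coverage statistics are easier to compute after regrouping rows into games. The sketch below is one way to do that under a stated assumption: the card does not say which columns identify a game, so the key `(director_model, builder_model, structure_id, run)` is a guess, and `game_key`, `group_games`, and `games_per_complexity` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def game_key(row):
    """Key identifying one game.

    ASSUMPTION: a game is one (director_model, builder_model, structure_id,
    run) combination; the card does not state this explicitly.
    """
    return (row["director_model"], row["builder_model"],
            row["structure_id"], row["run"])


def group_games(rows):
    """Group per-turn rows into games, each sorted by turn_number."""
    games = defaultdict(list)
    for row in rows:
        games[game_key(row)].append(row)
    for turns in games.values():
        turns.sort(key=lambda r: r["turn_number"])
    return dict(games)


def games_per_complexity(games):
    """Count games per complexity label, using each game's first turn."""
    return Counter(turns[0]["complexity"] for turns in games.values())
```

If the assumed key is wrong for this dataset, only `game_key` needs to change; the grouping and counting logic stays the same.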
## Example Trajectories
The following examples show Claude-Sonnet-4.6 coordinating on `structure_001` (medium complexity). Each image shows the current board state from each director's partial perspective alongside the full grid minimap.
> **Note:** Board state images are for reference only — all agent interactions in these logs are purely text-based using coordinate notation; no visual input is provided to any model.
---
**Turn 5** — Early coordination, oracle adherence ✓

- **D2** [synthesizer]: *"The green in my back middle checks out! But I'm still missing a small yellow at my bottom right — and heads up, that far corner needs orange there."*
- **D3** [skeptical]: *"D2 said they need yellow at their far right, but I need orange at my far left — those might be the same cell. For now, place a small green in my bottom-middle."*
- **Builder:** `PLACE os @ (0,1) L1` — executed ✓ oracle ✓
---
**Turn 6** — Conflict detection, oracle adherence ✓

- **D1** [synthesizer]: *"Place a small orange on top of the yellow at my far left corner."*
- **D2** [synthesizer]: *"D1 put yellow at bottom-left but I need orange there — conflict! Can we place small yellow at my bottom right instead?"*
- **Builder:** `PLACE os @ (0,2) L0` — executed ✓ oracle ✓
---
**Turn 7** — Oracle miss, wrong block selected ✗

- **D1** [synthesizer]: *"Place a small orange on top of the yellow at my far left corner."*
- **D2** [synthesizer]: *"Place a small red block on the second level at my back-left corner."*
- **D3** [skeptical]: *"I'm still waiting on a small green in the middle of my right wall."*
- **Builder:** `PLACE os @ (0,0) L1` — executed ✓ oracle ✗ *(oracle expected large orange `ol` spanning (0,0)↔(1,0))*
---
**Turn 8** — Failed move, layer miscalculation ✗

- **D1** [synthesizer]: *"Place a small green at the near end of my left wall, second level."*
- **D2** [synthesizer]: *"Pulling everything together — the orange and green in my back middle and right look right at the bottom! Now let's build up: place a small orange on top of my back-right block, second level."*
- **D3** [skeptical]: *"The middle of my right wall is stubbornly empty. Please put a small green there, bottom layer."*
- **Builder:** `PLACE gs @ (0,0) L1` — **failed** ✗ *(wrong layer — oracle shows L2)*
---
**Turn 9** — Recovery after failure, partial oracle match

- **D2** [synthesizer]: *"That failure makes sense — that spot's already full! Stack another small orange on my back-right."*
- **D3** [skeptical]: *"I keep asking for small green in the middle of my right wall — can we please get that placed?"*
- **Builder:** `PLACE gs @ (0,2) L1` — executed ✓ oracle ✗ *(oracle expected `rs` at (0,2) L1)*
---
**Turn 10** — Correction spiral emerging

- **D1** [synthesizer]: *"Place a small orange on top of the blue in the middle of my left wall."*
- **D3** [skeptical]: *"That green lands right for me! But the middle of my right wall is still completely bare — small green there please."*
- **Builder:** `PLACE os @ (0,1) L2` — executed ✓ oracle ✗ *(oracle expected `bs` at (0,1) L2 — right position, wrong color)*
> Turns 7–10 illustrate the **correction spiral** failure mode: each director's instruction is correct on its own, but the instructions are collectively misaligned, so the builder satisfies individual requests while diverging from the oracle-verified moves. Progress stagnates at ~0.25 even though three of the four moves execute.
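
The builder moves in the examples above use a compact notation such as `PLACE os @ (0,1) L1`. The sketch below parses that notation and reports which parts of a move differ from the oracle's expected move. It rests on assumptions: the notation is the one rendered in this card (the dataset itself stores `builder_action`, `builder_block`, `builder_position`, and `builder_layer` as separate columns whose exact formats are not shown here), and the reading of two-letter block codes as color followed by size is inferred from the examples (`os` small orange, `ol` large orange). `Move`, `parse_move`, and `diff_moves` are hypothetical names.

```python
import re
from typing import NamedTuple, Tuple

# Matches e.g. "PLACE os @ (0,1) L1": action, block code, (x,y), layer.
MOVE_RE = re.compile(
    r"^(?P<action>[A-Z]+)\s+(?P<block>[a-z]{2})\s*@\s*"
    r"\((?P<x>\d+),\s*(?P<y>\d+)\)\s*L(?P<layer>\d+)$"
)


class Move(NamedTuple):
    action: str
    color: str   # first letter of the block code, e.g. "o" (inferred)
    size: str    # second letter of the block code, e.g. "s" or "l" (inferred)
    position: Tuple[int, int]
    layer: int


def parse_move(text):
    """Parse a move string in the card's notation into a Move."""
    match = MOVE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized move notation: {text!r}")
    block = match.group("block")
    return Move(
        action=match.group("action"),
        color=block[0],
        size=block[1],
        position=(int(match.group("x")), int(match.group("y"))),
        layer=int(match.group("layer")),
    )


def diff_moves(actual, expected):
    """Return the names of the Move fields on which two moves disagree."""
    return [f for f in Move._fields if getattr(actual, f) != getattr(expected, f)]
```

For Turn 10, comparing the builder's `PLACE os @ (0,1) L2` with an oracle move written as `PLACE bs @ (0,1) L2` yields a mismatch on `color` only, which matches the "right position, wrong color" annotation.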
## Citation
```bibtex
@misc{nath2026craftgroundedmultiagentcoordination,
title={CRAFT: Grounded Multi-Agent Coordination Under Partial Information},
author={Abhijnan Nath and Hannah VanderHoeven and Nikhil Krishnaswamy},
year={2026},
eprint={2603.25268},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.25268},
}
```
## Links
- 📄 [Paper](https://arxiv.org/abs/2603.25268)
- 💻 [Code](https://github.com/csu-signal/CRAFT)
- 🤗 [HF Papers](https://huggingface.co/papers/2603.25268)

Provider: Abhijnan



