
Abhijnan/craft-benchmark-lean

Hugging Face · Updated 2026-04-07 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Abhijnan/craft-benchmark-lean
Resource description:
---
license: apache-2.0
tags:
- multi-agent
- benchmark
- pragmatic-communication
- partial-information
- llm-evaluation
- LLM-collaboration
paper: arxiv:2603.25268
task_categories:
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: board_image
    dtype: image
  - name: id
    dtype: int64
  - name: hash
    dtype: string
  - name: director_model
    dtype: string
  - name: model_type
    dtype: string
  - name: structure_id
    dtype: string
  - name: complexity
    dtype: string
  - name: turn_number
    dtype: int64
  - name: turns_taken
    dtype: int64
  - name: final_progress
    dtype: float64
  - name: completed
    dtype: bool
  - name: D1_message
    dtype: string
  - name: D2_message
    dtype: string
  - name: D3_message
    dtype: string
  - name: builder_action
    dtype: string
  - name: builder_block
    dtype: string
  - name: builder_position
    dtype: string
  - name: builder_layer
    dtype: int64
  - name: builder_span_to
    dtype: string
  - name: builder_confirmation
    dtype: string
  - name: overall_progress
    dtype: float64
  - name: move_executed
    dtype: bool
  - name: failed_move
    dtype: bool
  - name: correct_structure_placement
    dtype: bool
  - name: correct_side_placement
    dtype: bool
  - name: oracle_adherence_full
    dtype: 'null'
  - name: oracle_moves
    dtype: string
  - name: D1_thinking
    dtype: string
  - name: D2_thinking
    dtype: string
  - name: D3_thinking
    dtype: string
  - name: structure_before
    dtype: string
  - name: spans_before
    dtype: string
  - name: conversation_snapshot
    dtype: string
  - name: builder_model
    dtype: string
  - name: model_label
    dtype: string
  - name: run
    dtype: int64
  - name: partial_completion_type
    dtype: string
  - name: D1_archetype
    dtype: string
  - name: D2_archetype
    dtype: string
  - name: D3_archetype
    dtype: string
  - name: completion_percentage
    dtype: float64
  - name: iou_score
    dtype: float64
  - name: position_accuracy
    dtype: float64
  - name: distance_score
    dtype: float64
  - name: builder_followed_oracle
    dtype: bool
  splits:
  - name: train
    num_bytes: 259705147
    num_examples: 5946
  download_size: 188315472
  dataset_size: 259705147
---

# CRAFT Benchmark Dataset

Trajectory logs from the CRAFT benchmark, a multi-agent evaluation of pragmatic communication in LLMs under strict partial information.

- [TL;DR](https://gist.science/paper/2603.25268#gist)

## Dataset Structure

Each row is one turn from a CRAFT game, with fields for:

- **Identity**: `structure_id`, `director_model`, `builder_model`, `model_type` (base/frontier), `turn_number`
- **Director responses**: `D1_thinking`, `D1_message`, `D2_thinking`, `D2_message`, `D3_thinking`, `D3_message`
- **Builder**: `builder_action`, `builder_block`, `builder_position`, `builder_confirmation`
- **Oracle**: `oracle_moves`, `oracle_adherence_full`, `builder_followed_oracle`
- **Progress metrics**: `overall_progress`, `iou_score`, `position_accuracy`, `completion_percentage`
- **Outcomes**: `move_executed`, `failed_move`, `correct_structure_placement`, `correct_side_placement`
- **Board state**: `structure_before`, `spans_before`, `conversation_snapshot`

## Coverage

- **15 director models**: 8 open-weight (Qwen 7B–72B, Llama-8B, Mistral-7B, Gemma-9B, DeepSeek-V2-Lite) + 7 frontier (GPT-4o, GPT-4o-Mini, GPT-4.1-Mini, Claude-Sonnet-4.6, Gemini-2.5-Flash, Gemini-3-Flash, Gemini-3.1-Flash-Lite)
- **20 evaluation structures** (7 simple, 8 medium, 5 complex)
- **5,946 rows** across all models and structures

## Example Trajectories

The following examples show Claude-Sonnet-4.6 coordinating on `structure_001` (medium complexity). Each image shows the current board state from each director's partial perspective alongside the full grid minimap.

> **Note:** Board state images are for reference only. All agent interactions in these logs are purely text-based and use coordinate notation; no visual input is provided to any model.

---

**Turn 5**: Early coordination, oracle adherence ✓

![Turn 5](assets/sample_turn_0.png)

- **D2** [synthesizer]: *"The green in my back middle checks out! But I'm still missing a small yellow at my bottom right — and heads up, that far corner needs orange there."*
- **D3** [skeptical]: *"D2 said they need yellow at their far right, but I need orange at my far left — those might be the same cell. For now, place a small green in my bottom-middle."*
- **Builder:** `PLACE os @ (0,1) L1`, executed ✓, oracle ✓

---

**Turn 6**: Conflict detection, oracle adherence ✓

![Turn 6](assets/sample_turn_1.png)

- **D1** [synthesizer]: *"Place a small orange on top of the yellow at my far left corner."*
- **D2** [synthesizer]: *"D1 put yellow at bottom-left but I need orange there — conflict! Can we place small yellow at my bottom right instead?"*
- **Builder:** `PLACE os @ (0,2) L0`, executed ✓, oracle ✓

---

**Turn 7**: Oracle miss, wrong block selected ✗

![Turn 7](assets/sample_turn_2.png)

- **D1** [synthesizer]: *"Place a small orange on top of the yellow at my far left corner."*
- **D2** [synthesizer]: *"Place a small red block on the second level at my back-left corner."*
- **D3** [skeptical]: *"I'm still waiting on a small green in the middle of my right wall."*
- **Builder:** `PLACE os @ (0,0) L1`, executed ✓, oracle ✗ *(oracle expected large orange `ol` spanning (0,0)↔(1,0))*

---

**Turn 8**: Failed move, layer miscalculation ✗

![Turn 8](assets/sample_turn_3.png)

- **D1** [synthesizer]: *"Place a small green at the near end of my left wall, second level."*
- **D2** [synthesizer]: *"Pulling everything together — the orange and green in my back middle and right look right at the bottom! Now let's build up: place a small orange on top of my back-right block, second level."*
- **D3** [skeptical]: *"The middle of my right wall is stubbornly empty. Please put a small green there, bottom layer."*
- **Builder:** `PLACE gs @ (0,0) L1`, **failed** ✗ *(wrong layer; oracle shows L2)*

---

**Turn 9**: Recovery after failure, partial oracle match

![Turn 9](assets/sample_turn_4.png)

- **D2** [synthesizer]: *"That failure makes sense — that spot's already full! Stack another small orange on my back-right."*
- **D3** [skeptical]: *"I keep asking for small green in the middle of my right wall — can we please get that placed?"*
- **Builder:** `PLACE gs @ (0,2) L1`, executed ✓, oracle ✗ *(oracle expected `rs` at (0,2) L1)*

---

**Turn 10**: Correction spiral emerging

![Turn 10](assets/sample_turn_5.png)

- **D1** [synthesizer]: *"Place a small orange on top of the blue in the middle of my left wall."*
- **D3** [skeptical]: *"That green lands right for me! But the middle of my right wall is still completely bare — small green there please."*
- **Builder:** `PLACE os @ (0,1) L2`, executed ✓, oracle ✗ *(oracle expected `bs` at (0,1) L2: right position, wrong color)*

> Turns 7–10 illustrate the **correction spiral** failure mode: each director's instruction is correct from that director's own view, but the instructions are collectively misaligned, so the builder follows individual requests while diverging from the oracle-verified moves. Progress stagnates at ~0.25 even though the moves execute successfully.

## Citation

```bibtex
@misc{nath2026craftgroundedmultiagentcoordination,
  title={CRAFT: Grounded Multi-Agent Coordination Under Partial Information},
  author={Abhijnan Nath and Hannah VanderHoeven and Nikhil Krishnaswamy},
  year={2026},
  eprint={2603.25268},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.25268},
}
```

## Links

- 📄 [Paper](https://arxiv.org/abs/2603.25268)
- 💻 [Code](https://github.com/csu-signal/CRAFT)
- 🤗 [HF Papers](https://huggingface.co/papers/2603.25268)
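
As a usage sketch for the per-turn fields described under Dataset Structure, the snippet below groups rows by `director_model` and reports how often the builder followed the oracle, how often a move failed, and the mean `overall_progress`. It is an illustration, not the authors' evaluation code: the function name `summarize_by_director`, the model labels `model-a` and `model-b`, and the sample rows are hypothetical. Real rows would presumably come from loading the `train` split (for example with the Hugging Face `datasets` library) and would carry the same column names.

```python
from collections import defaultdict


def summarize_by_director(rows):
    """Aggregate per-turn rows into per-director-model statistics.

    Each row is a dict using column names from the dataset schema:
    director_model, builder_followed_oracle, failed_move, overall_progress.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["director_model"]].append(row)

    summary = {}
    for model, turns in buckets.items():
        n = len(turns)
        summary[model] = {
            "turns": n,
            # Fraction of turns where the builder's move matched the oracle.
            "oracle_rate": sum(bool(t["builder_followed_oracle"]) for t in turns) / n,
            # Fraction of turns where the attempted move failed.
            "failed_rate": sum(bool(t["failed_move"]) for t in turns) / n,
            # Mean structure progress recorded across these turns.
            "mean_progress": sum(t["overall_progress"] for t in turns) / n,
        }
    return summary


# Hypothetical rows for illustration only; these are not values from the dataset.
sample_rows = [
    {"director_model": "model-a", "builder_followed_oracle": True,
     "failed_move": False, "overall_progress": 0.25},
    {"director_model": "model-a", "builder_followed_oracle": False,
     "failed_move": True, "overall_progress": 0.25},
    {"director_model": "model-b", "builder_followed_oracle": True,
     "failed_move": False, "overall_progress": 0.5},
]

summary = summarize_by_director(sample_rows)
for model, stats in sorted(summary.items()):
    print(model, stats)
```

Working on plain dicts keeps the sketch independent of any particular loader. A run of turns with a low `oracle_rate`, a low `failed_rate`, and flat `mean_progress` would be one rough signal of the correction spiral described above.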
Provided by:
Abhijnan