
Ryenhails/ikea-bench

Source: Hugging Face | Updated: 2026-04-02 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Ryenhails/ikea-bench
Resource description:
---
license: cc-by-4.0
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- assembly
- cross-depiction
- vlm-benchmark
- ikea
- procedural-understanding
pretty_name: IKEA-Bench
size_categories:
- 1K<n<10K
---

# IKEA-Bench

**Benchmarking Vision-Language Models for Cross-Depiction Assembly Instruction Alignment**

[[Project Page]](https://ryenhails.github.io/IKEA-Bench/) | [[Paper]](https://arxiv.org/abs/2604.00913) | [[GitHub]](https://github.com/Ryenhails/IKEA-Bench)

## Dataset Description

IKEA-Bench evaluates how well VLMs can align assembly instruction diagrams (like IKEA manuals) with real-world assembly videos. The benchmark contains **1,623 questions** across **6 task types** covering cross-modal alignment and procedural reasoning.

This dataset is **self-contained**: all images (133 manual diagrams + 2,570 video frames) are included. No additional downloads needed.

## Dataset Structure

```
ikea-bench/
├── README.md
├── qa_benchmark.json          # 1,623 benchmark questions
├── step_descriptions.json     # 132 text descriptions of assembly steps
├── manual_img/                # 133 assembly instruction diagrams
│   ├── Bench/{product}/step_{i}/step_{j}.png
│   ├── Chair/{product}/...
│   ├── Desk/{product}/...
│   ├── Misc/{product}/...
│   ├── Shelf/{product}/...
│   └── Table/{product}/...
└── qa_frames/                 # 2,570 video frames
    ├── Bench/{product}/step{i}/{video_id}/frame_*.jpg
    ├── Chair/{product}/...
    └── ...
```

### Question Schema

All image paths in `qa_benchmark.json` are **relative to the dataset root**.

```json
{
  "id": "1a_tjusig_step3_MNGqJ4gXqbA_0",
  "type": "1a",
  "dimension": "cross_modal",
  "task": "step_recognition",
  "product": "tjusig",
  "category": "Bench",
  "question": "Which manual step is being performed in these video frames?",
  "video_frames": ["qa_frames/Bench/tjusig/step3/MNGqJ4gXqbA/frame_00_t52.3s.jpg", ...],
  "options": [
    {"label": "A", "image": "manual_img/Bench/tjusig/step_2/step_5.png", "step_id": 2},
    {"label": "B", "image": "manual_img/Bench/tjusig/step_3/step_7.png", "step_id": 3},
    ...
  ],
  "answer": "B",
  "answer_step_id": 3,
  "visual_tokens_est": 4480,
  "metadata": {...}
}
```

### Task Types

| Code | Task | Type | Questions |
|------|------|------|-----------|
| 1a | Step Recognition | 4-way MC | 320 |
| 1b | Action Verification | Binary | 350 |
| 2a | Progress Tracking | 4-way MC | 334 |
| 2b | Next-Step Prediction | 4-way MC | 204 |
| 1c | Video Discrimination | Binary | 350 |
| 2c | Sequence Ordering | 4-way MC | 65 |

### Alignment Strategies

- **Visual (baseline)**: Video frames + diagram images
- **Visual+Text**: Video frames + diagram images + text descriptions
- **Text Only**: Video frames + text descriptions (no diagram images)

## Quick Start

### Download the entire dataset

```python
from huggingface_hub import snapshot_download

# Downloads everything (~300MB) to a local directory
snapshot_download(
    repo_id="Ryenhails/ikea-bench",
    repo_type="dataset",
    local_dir="data"
)
```

### Load and iterate

```python
import json
from pathlib import Path
from PIL import Image

data_dir = Path("data")

with open(data_dir / "qa_benchmark.json") as f:
    questions = json.load(f)

# Example: load a question with images
q = questions[0]
video_frames = [Image.open(data_dir / p) for p in q["video_frames"]]
option_images = [Image.open(data_dir / o["image"]) for o in q["options"] if "image" in o]

print(f"Question: {q['question']}")
print(f"Answer: {q['answer']}")
print(f"Video frames: {len(video_frames)}, Option images: {len(option_images)}")
```

### Use with evaluation code

```bash
git clone https://github.com/Ryenhails/IKEA-Bench.git
cd IKEA-Bench
pip install -r requirements.txt

# Download data
python -c "from huggingface_hub import snapshot_download; snapshot_download('Ryenhails/ikea-bench', repo_type='dataset', local_dir='data')"

# Run evaluation
python -m ikea_bench.eval \
    --model qwen3-vl-8b \
    --setting baseline \
    --input data/qa_benchmark.json \
    --output results/qwen3-vl-8b_baseline.json
```

## Source Data

This benchmark is built upon the [IKEA Manuals at Work](https://github.com/yunongLiu1/IKEA-Manuals-at-Work) dataset (Liu et al., NeurIPS 2024), which provides:

- 36 furniture products from 6 categories (29 used in this benchmark)
- 98 assembly videos with temporal step annotations
- Wordless assembly instruction diagrams

Manual diagrams are sourced from the original dataset (CC-BY-4.0). Video frames are extracted from assembly videos hosted on the [Stanford Digital Repository](https://purl.stanford.edu/sg200ps4374). Text descriptions (132 entries) are generated by Claude Opus 4.6 and cross-validated against ground-truth annotations (96.2% consistency).

## Citation

```bibtex
@article{liu2026ikeabench,
  title={Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2604.00913},
  year={2026}
}
```

Please also cite the source dataset:

```bibtex
@inproceedings{liu2024ikeamanualsatwork,
  title={IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos},
  author={Liu, Yunong and Eyzaguirre, Cristobal and Li, Manling and Khanna, Shubh and Niebles, Juan Carlos and Ravi, Vineeth and Mishra, Saumitra and Liu, Weiyu and Wu, Jiajun},
  booktitle={NeurIPS 2024 Datasets and Benchmarks},
  year={2024}
}
```

## License

CC-BY-4.0. Original IKEA manual images remain the copyright of Inter IKEA Systems B.V.
The source dataset [IKEA Manuals at Work](https://github.com/yunongLiu1/IKEA-Manuals-at-Work) is also released under CC-BY-4.0.
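
### Scoring predictions by task type (sketch)

Every question in the schema above carries an `id`, a `type` code, and an `answer` label, so per-task accuracy can be computed independently of the evaluation code. The following is a minimal sketch, not the scoring used by `ikea_bench.eval`: it assumes predictions are held in a plain dict mapping question `id` to a predicted label (a hypothetical format), and it counts unanswered questions as incorrect. The toy records and their ids are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_type(questions, predictions):
    """Return {task type code: accuracy} for a list of question dicts.

    `predictions` is a hypothetical mapping of question id -> predicted label;
    questions missing from it count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["type"]] += 1
        if predictions.get(q["id"]) == q["answer"]:
            correct[q["type"]] += 1
    return {t: correct[t] / total[t] for t in total}


# Toy records containing only the fields used here (ids and labels are made up).
toy_questions = [
    {"id": "q1", "type": "1a", "answer": "B"},
    {"id": "q2", "type": "1a", "answer": "C"},
    {"id": "q3", "type": "1b", "answer": "A"},
]
toy_predictions = {"q1": "B", "q2": "A"}  # q3 is left unanswered

print(accuracy_by_type(toy_questions, toy_predictions))
```

With the real data, `questions` would be the list loaded from `qa_benchmark.json`. Reporting accuracy per type rather than one pooled number keeps the binary tasks (50% chance level) from being averaged together with the 4-way tasks (25% chance level).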
Provided by:
Ryenhails