cavendishlabs/rebus

Name: cavendishlabs/rebus
Creator: cavendishlabs
Published: 2024-01-12 01:30:58
License: No description available

Source: Hugging Face | Updated: 2024-01-12 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/cavendishlabs/rebus

Resource description:

---
dataset_info:
  features:
  - name: Filename
    dtype: string
  - name: Solution
    dtype: string
  - name: Also accept
    dtype: string
  - name: Theme
    dtype: string
  - name: Difficulty
    dtype: string
  - name: Exact spelling?
    dtype: string
  - name: Specific reference
    dtype: string
  - name: Reading?
    dtype: string
  - name: Attribution
    dtype: string
  - name: Author
    dtype: string
  - name: image
    dtype: image
  splits:
  - name: train
    num_bytes: 51545282.0
    num_examples: 333
  download_size: 47656838
  dataset_size: 51545282.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# REBUS

REBUS: A Robust Evaluation Benchmark of Understanding Symbols

[**Paper**](https://arxiv.org/abs/2401.05604) | [**🤗 Dataset**](https://huggingface.co/datasets/cavendishlabs/rebus) | [**GitHub**](https://github.com/cvndsh/rebus) | [**Website**](https://cavendishlabs.org/rebus/)

## Introduction

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as an input. Virtually all of these models have been announced within the past year, leading to a significant need for benchmarks evaluating the abilities of these models to reason truthfully and accurately on a diverse set of tasks.

When Google announced Gemini Pro (Gemini Team et al., 2023), they displayed its ability to solve rebuses, wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator's intent.

We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in the table below.

Notably, GPT-4V, the most powerful model we evaluated, answered only 24% of puzzles correctly, highlighting the poor capabilities of MLLMs in new and unexpected domains to which human reasoning generalizes with comparative ease. Open-source models perform even worse, with a median accuracy below 1%. We notice that models often give faithless explanations, fail to change their minds after an initial approach doesn't work, and remain highly uncalibrated on their own abilities.

![image](https://github.com/cvndsh/rebus/assets/10122030/131bde1a-9a09-44cc-abc3-efe874b95b23)

## Evaluation results

| Model             | Overall  | Easy     | Medium   | Hard    |
| ----------------- | -------- | -------- | -------- | ------- |
| GPT-4V            | **24.0** | **33.0** | **13.2** | **7.1** |
| Gemini Pro        | 13.2     | 19.4     | 5.3      | 3.6     |
| LLaVa-1.5-13B     | 1.8      | 2.6      | 0.9      | 0.0     |
| LLaVa-1.5-7B      | 1.5      | 2.6      | 0.0      | 0.0     |
| BLIP2-FLAN-T5-XXL | 0.9      | 0.5      | 1.8      | 0.0     |
| CogVLM            | 0.9      | 1.6      | 0.0      | 0.0     |
| QWEN              | 0.9      | 1.6      | 0.0      | 0.0     |
| InstructBLIP      | 0.6      | 0.5      | 0.9      | 0.0     |
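The card reports accuracy as the share of puzzles answered correctly, and the schema carries both a `Solution` and an `Also accept` column. The following is a minimal sketch of how an answer could be scored against those two fields; it is not the authors' grading procedure. The helper names (`normalize`, `is_correct`, `accuracy`), the normalization rule, the sample rows, and the assumption that `Also accept` separates alternatives with commas or semicolons are all hypothetical.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase the answer and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", answer.lower())


def is_correct(prediction: str, solution: str, also_accept: str = "") -> bool:
    """Return True if the prediction matches the solution or an accepted alternative.

    Assumption (not confirmed by the dataset card): alternatives in the
    "Also accept" column are separated by commas or semicolons.
    """
    candidates = [solution] + re.split(r"[;,]", also_accept or "")
    accepted = {normalize(c) for c in candidates if normalize(c)}
    return normalize(prediction) in accepted


def accuracy(predictions, rows) -> float:
    """Percentage of rows whose prediction is accepted."""
    hits = sum(
        is_correct(pred, row["Solution"], row.get("Also accept") or "")
        for pred, row in zip(predictions, rows)
    )
    return 100.0 * hits / len(rows)


# Hypothetical rows, shaped like the dataset's columns; not real dataset entries.
sample_rows = [
    {"Solution": "Star Wars", "Also accept": ""},
    {"Solution": "New York", "Also accept": "New York City, NYC"},
]
print(accuracy(["star wars", "nyc"], sample_rows))
```

A stricter variant could consult the `Exact spelling?` column before normalizing, but the card does not document that column's values, so it is left out here.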

Provider:

cavendishlabs

Summary of original information

Dataset overview

Dataset information

Feature list:
- Filename: string
- Solution: string
- Also accept: string
- Theme: string
- Difficulty: string
- Exact spelling?: string
- Specific reference: string
- Reading?: string
- Attribution: string
- Author: string
- image: image
Data splits:
- train: training split with 333 examples, 51545282.0 bytes in total.
Dataset size:
- Download size: 47656838 bytes
- Dataset size: 51545282.0 bytes
Configuration:
- default: default configuration; data file path is data/train-*.
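The schema and configuration above can be exercised with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository id `cavendishlabs/rebus` from the download link is reachable; `load_rebus`, `missing_columns`, and `EXPECTED_COLUMNS` are illustrative names, not part of the dataset or the library.

```python
# Column names copied from the feature list on this page.
EXPECTED_COLUMNS = [
    "Filename",
    "Solution",
    "Also accept",
    "Theme",
    "Difficulty",
    "Exact spelling?",
    "Specific reference",
    "Reading?",
    "Attribution",
    "Author",
    "image",
]


def load_rebus(split: str = "train"):
    """Load one split of the dataset (needs network access and `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the helper below works offline

    return load_dataset("cavendishlabs/rebus", split=split)


def missing_columns(column_names):
    """Return the expected columns that are absent from `column_names`, in schema order."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

After `ds = load_rebus()`, `len(ds)` should match the 333 examples listed for the train split, and `missing_columns(ds.column_names)` should come back empty if the hosted data still follows this schema.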

Evaluation results

Model	Overall	Easy	Medium	Hard
GPT-4V	24.0	33.0	13.2	7.1
Gemini Pro	13.2	19.4	5.3	3.6
LLaVa-1.5-13B	1.8	2.6	0.9	0.0
LLaVa-1.5-7B	1.5	2.6	0.0	0.0
BLIP2-FLAN-T5-XXL	0.9	0.5	1.8	0.0
CogVLM	0.9	1.6	0.0	0.0
QWEN	0.9	1.6	0.0	0.0
InstructBLIP	0.6	0.5	0.9	0.0
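The dataset card states that open-source models have a median accuracy below 1%. A short check of that figure against the Overall column, as a sketch: the numbers are copied from the table, treating GPT-4V and Gemini Pro as the proprietary models is this example's own labeling, and the implied puzzle counts are a rough back-calculation from the 333-example total, not reported values.

```python
from statistics import median

# Overall accuracy (%) copied from the evaluation table.
overall_accuracy = {
    "GPT-4V": 24.0,
    "Gemini Pro": 13.2,
    "LLaVa-1.5-13B": 1.8,
    "LLaVa-1.5-7B": 1.5,
    "BLIP2-FLAN-T5-XXL": 0.9,
    "CogVLM": 0.9,
    "QWEN": 0.9,
    "InstructBLIP": 0.6,
}

# Labeling made for this example: these two are the proprietary models.
proprietary = {"GPT-4V", "Gemini Pro"}

open_source_scores = [
    score for model, score in overall_accuracy.items() if model not in proprietary
]
open_source_median = median(open_source_scores)
print(f"Median open-source accuracy: {open_source_median}%")

# Rough back-calculation: puzzles solved out of 333, rounded to the nearest integer.
NUM_PUZZLES = 333
implied_correct = {
    model: round(score / 100 * NUM_PUZZLES) for model, score in overall_accuracy.items()
}
print(implied_correct["GPT-4V"], implied_correct["Gemini Pro"])
```

The median of the six remaining scores is 0.9, which is consistent with the "below 1%" claim.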
