winoground

Name: winoground
Creator: maas
Published: 2025-12-05 12:14:53
License: No description provided

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-05-24.

Download link:

https://modelscope.cn/datasets/facebook/winoground


Resource overview:

# Dataset Card for Winoground

## Dataset Description

Winoground is a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly, but crucially, both captions contain a completely identical set of words/morphemes, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. In our accompanying paper, we probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. In the paper, we perform an extensive analysis to obtain insights into how future work might try to mitigate these models’ shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field.

We are thankful to Getty Images for providing the image data.

## Data

The captions and tags are located in `data/examples.jsonl` and the images are located in `data/images.zip`. You can load the data as follows:

```python
from datasets import load_dataset
examples = load_dataset('facebook/winoground', use_auth_token=<YOUR USER ACCESS TOKEN>)
```

You can get `<YOUR USER ACCESS TOKEN>` by following these steps:

1) log into your Hugging Face account
2) click on your profile picture
3) click "Settings"
4) click "Access Tokens"
5) generate an access token

## Model Predictions and Statistics

The image-caption model scores from our paper are saved in `statistics/model_scores`. To compute many of the tables and graphs from our paper, run the following commands:

```bash
git clone https://huggingface.co/datasets/facebook/winoground
cd winoground
pip install -r statistics/requirements.txt
python statistics/compute_statistics.py
```

## FLAVA Colab notebook code for Winoground evaluation

https://colab.research.google.com/drive/1c3l4r4cEA5oXfq9uXhrJibddwRkcBxzP?usp=sharing

## CLIP Colab notebook code for Winoground evaluation

https://colab.research.google.com/drive/15wwOSte2CjTazdnCWYUm2VPlFbk2NGc0?usp=sharing

## Paper FAQ

### Why is the group score for a random model equal to 16.67%?

<details>
<summary>Click for a proof!</summary>

Intuitively, we might think that we can multiply the probabilities from the image and text score to get 1/16 = 6.25%. But, these scores are not conditionally independent. We can find the correct probability with combinatorics:

For ease of notation, let:

- a = s(c_0, i_0)
- b = s(c_1, i_0)
- c = s(c_1, i_1)
- d = s(c_0, i_1)

The group score is defined as 1 if a > b, a > d, c > b, c > d and 0 otherwise.

As one would say to GPT-3, let's think step by step:

1. There are 4! = 24 different orderings of a, c, b, d.
2. There are only 4 orderings for which a > b, a > d, c > b, c > d:
   - a, c, b, d
   - a, c, d, b
   - c, a, b, d
   - c, a, d, b
3. No ordering is any more likely than another because a, b, c, d are sampled from the same random distribution.
4. We can conclude that the probability of a group score of 1 is 4/24 = 0.166...

</details>

## Citation Information

[https://arxiv.org/abs/2204.03162](https://arxiv.org/abs/2204.03162)

Tristan Thrush and Candace Ross contributed equally.

```bibtex
@inproceedings{thrush_and_ross2022winoground,
  author = {Tristan Thrush and Ryan Jiang and Max Bartolo and Amanpreet Singh and Adina Williams and Douwe Kiela and Candace Ross},
  title = {Winoground: Probing vision and language models for visio-linguistic compositionality},
  booktitle = {CVPR},
  year = 2022,
}
```
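The counting argument in the Paper FAQ can also be checked by brute force. The sketch below is illustrative and not part of the dataset's own tooling: it assigns the ranks 0 to 3 to `a`, `b`, `c`, `d` in every possible order and counts how often each condition holds. The group rule is taken from the FAQ. The split into a text rule (`a > b` and `c > d`) and an image rule (`a > d` and `c > b`) is an assumption about how the paper defines those two scores, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def text_correct(a, b, c, d):
    # Assumed text-score rule: for each image, the matching caption scores higher.
    return a > b and c > d


def image_correct(a, b, c, d):
    # Assumed image-score rule: for each caption, the matching image scores higher.
    return a > d and c > b


def group_correct(a, b, c, d):
    # Group rule as stated in the FAQ: a > b, a > d, c > b, c > d.
    return text_correct(a, b, c, d) and image_correct(a, b, c, d)


def chance_rate(predicate):
    # Enumerate all 4! = 24 rank assignments for (a, b, c, d). If the four
    # scores are drawn independently from the same continuous distribution,
    # every rank order is equally likely, so the hit fraction is the chance rate.
    orders = list(permutations(range(4)))
    hits = sum(1 for a, b, c, d in orders if predicate(a, b, c, d))
    return Fraction(hits, len(orders))


if __name__ == "__main__":
    print("text :", chance_rate(text_correct))
    print("image:", chance_rate(image_correct))
    print("group:", chance_rate(group_correct))
```

Under these assumptions the text and image rules each hold in 6 of the 24 orderings, while the group rule holds in 4 of them rather than the 1.5 that a product of two independent 1/4 events would suggest, which is the dependence the FAQ points out.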


Provider:

maas

Created:

2025-05-20
