WinoGrande
OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/WinoGrande
Resource description:
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011) is a benchmark for commonsense reasoning, consisting of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have achieved approximately 90% accuracy on variants of the WSC. This raises a critical question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense? To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design but adjusted to improve both the scale and the hardness of the dataset. The key steps of dataset construction are (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. State-of-the-art methods on WinoGrande achieve 59.4–79.1% accuracy, which is 15–35 percentage points (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%–100%, respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
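As a quick check of the 15–35 point gap quoted above, the following Python snippet subtracts the reported model accuracies from the human score. The variable names are illustrative only; the three numbers are taken directly from the abstract.

```python
# Accuracy figures as quoted in the abstract (percent).
human = 94.0
model_low, model_high = 59.4, 79.1  # with 2% and 100% of the training data

# Absolute gap to human performance in percentage points, one decimal place.
gap_large = round(human - model_low, 1)
gap_small = round(human - model_high, 1)

print(gap_small, gap_large)
```

The two gaps round to roughly 15 and 35 points, matching the range stated in the abstract.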
Provider:
OpenDataLab
Created:
2022-04-28
Collected summary
Dataset introduction

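Background and challenges
Background overview
WinoGrande is a large-scale commonsense reasoning dataset of 44k problems, modeled on the Winograd Schema Challenge (WSC). It combines crowdsourcing with the AfLite algorithm to reduce dataset bias and make evaluation harder. The dataset is used to test the genuine commonsense ability of natural language processing models; state-of-the-art models reach 59.4–79.1% accuracy, well below the 94.0% human performance, which underscores the importance of algorithmic bias reduction in benchmarks.
The above content was collected and summarized by 遇见数据集.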
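The description above names AfLite but does not spell out its procedure. The Python sketch below illustrates only the general idea of filtering out examples that simple models find too easy: repeatedly split the data at random, fit a lightweight classifier on one part, and drop the examples that are predicted correctly most often when held out. It is not the authors' implementation. The nearest-centroid classifier, the function name `filter_predictable`, and every parameter (`rounds`, `tau`, `k`, `min_size`) are assumptions made for illustration, and the toy 2-D features stand in for precomputed embeddings.

```python
import random

def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest(x, c0, c1):
    """Predict label 0 or 1 by squared distance to the two class centroids."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1

def filter_predictable(X, y, rounds=20, tau=0.75, k=10, min_size=20, seed=0):
    """Return indices of examples that survive filtering.

    Hypothetical parameters: `rounds` random splits per pass, `tau` the
    predictability cutoff, `k` the maximum removed per pass, `min_size`
    the size at which filtering stops.
    """
    rng = random.Random(seed)
    keep = list(range(len(X)))
    while len(keep) > min_size:
        hits = dict.fromkeys(keep, 0)  # times predicted correctly when held out
        seen = dict.fromkeys(keep, 0)  # times held out
        for _ in range(rounds):
            order = keep[:]
            rng.shuffle(order)
            half = len(order) // 2
            train, held = order[:half], order[half:]
            neg = [X[i] for i in train if y[i] == 0]
            pos = [X[i] for i in train if y[i] == 1]
            if not neg or not pos:
                continue  # cannot fit two centroids on this split
            c0, c1 = centroid(neg), centroid(pos)
            for i in held:
                seen[i] += 1
                hits[i] += int(nearest(X[i], c0, c1) == y[i])
        score = {i: hits[i] / seen[i] for i in keep if seen[i]}
        # Remove at most k of the most predictable examples in this pass.
        easy = sorted((i for i in score if score[i] >= tau),
                      key=lambda i: -score[i])[:k]
        if not easy:
            break
        easy_set = set(easy)
        keep = [i for i in keep if i not in easy_set]
    return keep

# Toy data: the first 40 examples leak the label through feature 0,
# the remaining 40 have no planted leak.
rng = random.Random(1)
X, y = [], []
for i in range(80):
    label = i % 2
    if i < 40:
        X.append([5.0 if label else -5.0, rng.gauss(0, 1)])
    else:
        X.append([rng.gauss(0, 1), rng.gauss(0, 1)])
    y.append(label)

kept = filter_predictable(X, y)
print(len(kept), sum(1 for i in kept if i < 40))
```

In this toy setup the examples with the planted leak are predicted correctly on almost every split, so they are the first to be removed. A real pipeline would presumably use stronger features and classifiers, and would need to choose the cutoff and stopping rule with care.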



