GoldenSwag

Name: GoldenSwag
Creator: maas
Published: 2025-12-05 16:39:07
License: no description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-06-21.

Download link:

https://modelscope.cn/datasets/PleIAs/GoldenSwag


Description:

# GoldenSwag

This is a filtered subset of the HellaSwag validation set. The table below lists every stage used to filter the HellaSwag validation set, which consists of 10042 questions.

| Filter | # to remove | # removed | # left |
|---------------------------------------------|------------|----------|--------|
| Toxic content | 6 | 6 | 10036 |
| Nonsense or ungrammatical prompt | 4065 | 4064 | 5972 |
| Nonsense or ungrammatical correct answer | 711 | 191 | 5781 |
| Ungrammatical incorrect answers | 3953 | 1975 | 3806 |
| Wrong answer | 370 | 89 | 3717 |
| All options are nonsense | 409 | 23 | 3694 |
| Multiple correct options | 2121 | 583 | 3111 |
| Relative length difference > 0.3 | 802 | 96 | 3015 |
| Length difference (0.15,0.3] and longest is correct | 1270 | 414 | 2601 |
| Zero-prompt core ≥ 0.3 | 3963 | 1076 | 1525 |

For each filter, we report three numbers: the number of questions in the full HellaSwag validation set that match the filtering criterion ("# to remove"), the number of questions actually removed at this stage because they had not already been removed by an earlier stage ("# removed"), and the number of questions left after the stage ("# left").

After filtering, almost all of the remaining questions come from the WikiHow part of the data: 1498 (98.2%).

To cite the work:

```
@misc{chizhov2025hellaswagvaliditycommonsensereasoning,
  title={What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks},
  author={Pavel Chizhov and Mattia Nee and Pierre-Carl Langlais and Ivan P. Yamshchikov},
  year={2025},
  eprint={2504.07825},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.07825},
}
```

To cite the original HellaSwag dataset:

```
@inproceedings{zellers-etal-2019-hellaswag,
  title = "{H}ella{S}wag: Can a Machine Really Finish Your Sentence?",
  author = "Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin",
  editor = "Korhonen, Anna and Traum, David and M{\`a}rquez, Llu{\'i}s",
  booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
  month = jul,
  year = "2019",
  address = "Florence, Italy",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/P19-1472/",
  doi = "10.18653/v1/P19-1472",
  pages = "4791--4800",
  abstract = "Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as {\textquotedblleft}A woman sits at a piano,{\textquotedblright} a machine must select the most likely followup: {\textquotedblleft}She sets her fingers on the keys.{\textquotedblright} With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans ({\ensuremath{>}}95{\%} accuracy), state-of-the-art models struggle ({\ensuremath{<}}48{\%}). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical {\textquoteleft}Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges."
}
```
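The filtering table in this description can be sanity-checked with a few lines of Python. The sketch below is only an illustration of the arithmetic, not the authors' filtering code: the stage tuples are copied from the table, while the names `STAGES`, `INITIAL`, and `check_cascade` are hypothetical helpers introduced here. It confirms that each "# left" value equals the previous remainder minus "# removed", and that 1498 of the final questions corresponds to the reported 98.2%.

```python
# Values copied from the filtering table: (filter, matched, removed, left).
# All identifiers here are illustrative and are not part of the dataset or paper.
STAGES = [
    ("Toxic content", 6, 6, 10036),
    ("Nonsense or ungrammatical prompt", 4065, 4064, 5972),
    ("Nonsense or ungrammatical correct answer", 711, 191, 5781),
    ("Ungrammatical incorrect answers", 3953, 1975, 3806),
    ("Wrong answer", 370, 89, 3717),
    ("All options are nonsense", 409, 23, 3694),
    ("Multiple correct options", 2121, 583, 3111),
    ("Relative length difference > 0.3", 802, 96, 3015),
    ("Length difference (0.15,0.3] and longest is correct", 1270, 414, 2601),
    ("Zero-prompt core >= 0.3", 3963, 1076, 1525),
]
INITIAL = 10042  # size of the HellaSwag validation set, as stated above


def check_cascade(initial, stages):
    """Replay the cascade and return the final count, raising on any mismatch."""
    remaining = initial
    for name, matched, removed, left in stages:
        # A stage cannot remove more questions than match its criterion.
        if removed > matched:
            raise ValueError(f"{name}: removed {removed} exceeds matched {matched}")
        remaining -= removed
        if remaining != left:
            raise ValueError(f"{name}: expected {left} left, computed {remaining}")
    return remaining


final = check_cascade(INITIAL, STAGES)
wikihow_share = round(100 * 1498 / final, 1)  # 1498 WikiHow questions, per the text
print(final, wikihow_share)
```

The gap between "# to remove" and "# removed" at each stage reflects overlap between criteria: a question that matches several filters is counted as removed only at the first stage that catches it.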


Provider: maas

Created: 2025-06-19
