FRoG

Name: FRoG
Creator: maas
Published: 2025-12-05 16:22:52
License: no description available

ModelScope community: updated 2025-12-05, indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/GAIR/FRoG

Resource description:

### Introduction

**FRoG** is a fuzzy reasoning benchmark of generalized quantifiers, designed to evaluate the fuzzy reasoning abilities of a model. The questions in FRoG are collected from the real-world math word problem benchmarks [GSM8K](https://huggingface.co/datasets/openai/gsm8k) and [MathQA](https://huggingface.co/datasets/allenai/math_qa), and the generalized quantifiers used to introduce fuzziness come from [QuRe](https://huggingface.co/datasets/billli/QuRe).

### Sample Data

```
{
  "id": 1,
  "question": "john and ingrid pay [MASK] and 40 % tax annually , respectively . if john makes $ 60000 and ingrid makes $ 72000 , what is their combined tax rate ?\n\nIf the answer to the question is 35.6 %, then please select the quantifier that is closest to the meaning of [MASK] from the following choices. A. moderate amount B. few C. small amount D. some",
  "target_percentage_mention": "30 %",
  "quantifier": "moderate amount",
  "quantifier_strength_gap": 0.069,
  "answer": "A",
  "origin_question": "john and ingrid pay 30 % and 40 % tax annually , respectively . if john makes $ 60000 and ingrid makes $ 72000 , what is their combined tax rate ?",
  "origin_reasoning": "\"( 1 ) when 30 and 40 has equal weight or weight = 1 / 2 , the answer would be 35 . ( 2 ) when 40 has larger weight than 30 , the answer would be in between 35 and 40 . unfortunately , we have 2 answer choices d and e that fit that condition so we need to narrow down our range . ( 3 ) get 72000 / 132000 = 6 / 11 . 6 / 11 is a little above 6 / 12 = 1 / 2 . thus , our answer is just a little above 35 . answer : d\"",
  "raw_question": "john and ingrid pay [MASK] and 40 % tax annually , respectively . if john makes $ 60000 and ingrid makes $ 72000 , what is their combined tax rate ?\n\nIf the answer to the question is 35.6 %, then please select the quantifier that is closest to the meaning of [MASK] from the following choices.",
  "source": "MathQA_test"
}
```

* id: the question id.
* question: the question posed for a FRoG task.
* target_percentage_mention: the percentage mention that is masked in *question*.
* quantifier: the generalized quantifier that *target_percentage_mention* maps to.
* quantifier_strength_gap: the gap between the average strength of *quantifier* and *target_percentage_mention*.
* answer: the answer to the *question*.
* origin_question: the original math word problem.
* origin_reasoning: the reasoning chain that solves *origin_question*.
* raw_question: the *question* without the answer choices.
* source: the source benchmark.

### Load the Dataset

```python
from datasets import load_dataset
frog = load_dataset("GAIR/FRoG", TASK, SPLIT)
```

where *TASK* is one of {mask_quant, mislead, X, mask_percent} and *SPLIT* is one of {easy, hard}.

More scripts are available on [GitHub](https://github.com/Nativeatom/FRoG).

### Reference

```
@inproceedings{li-etal-2024-frog,
    title = "{FR}o{G}: Evaluating Fuzzy Reasoning of Generalized Quantifiers in {LLM}s",
    author = "Li, Yiyuan and Sun, Shichao and Liu, Pengfei",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.411",
    pages = "7239--7256",
    abstract = "Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.",
}
```
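### Worked Example

The relationship between the fields of the sample record can be checked with a few lines of Python. The sketch below is only an illustration and not part of the FRoG tooling: `parse_choices` and `implied_masked_rate` are hypothetical helper names, the choice parser assumes the `A. ... B. ... C. ... D. ...` layout seen in the sample, and the arithmetic applies only to this particular tax problem. It runs on an abridged copy of the record, so no download is needed.

```python
import re

# Abridged copy of the sample record (only the fields used below).
sample = {
    "question": (
        "john and ingrid pay [MASK] and 40 % tax annually , respectively . "
        "if john makes $ 60000 and ingrid makes $ 72000 , what is their "
        "combined tax rate ?\n\nIf the answer to the question is 35.6 %, then "
        "please select the quantifier that is closest to the meaning of [MASK] "
        "from the following choices. "
        "A. moderate amount B. few C. small amount D. some"
    ),
    "target_percentage_mention": "30 %",
    "quantifier": "moderate amount",
    "answer": "A",
}


def parse_choices(question: str) -> dict[str, str]:
    """Map choice letters to quantifiers (hypothetical helper).

    Assumes the choices follow the phrase 'from the following choices.'
    and use the 'A. x B. y C. z D. w' layout of the sample record.
    """
    tail = question.split("from the following choices.")[-1]
    pairs = re.findall(r"([A-D])\.\s+(.+?)(?=\s+[A-D]\.\s|$)", tail)
    return {letter: text.strip() for letter, text in pairs}


def implied_masked_rate(combined: float, known_rate: float,
                        income_masked: float, income_known: float) -> float:
    """Solve combined = (x * income_masked + known_rate * income_known) / total for x."""
    total = income_masked + income_known
    return (combined * total - known_rate * income_known) / income_masked


choices = parse_choices(sample["question"])
# The answer letter should point at the record's quantifier.
selected = choices[sample["answer"]]

# Tax rate implied for John by the stated combined rate of 35.6 %.
x = implied_masked_rate(0.356, 0.40, 60000, 72000)

print(selected, round(x, 4))
```

For this record the answer letter resolves to `moderate amount`, matching the `quantifier` field, and the implied masked rate comes out at roughly 30.3 %, close to the `target_percentage_mention` of 30 %. The small difference arises because the stated answer of 35.6 % is not the exact weighted rate for a 30 % input.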


Provider: maas

Created: 2025-02-08
