
Synthetic_Unanswerable_Math

ModelScope community. Updated 2025-08-01; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/lime-nlp/Synthetic_Unanswerable_Math
# Dataset Card for Synthetic Unanswerable Math (SUM)

## Dataset Summary

**Synthetic Unanswerable Math (SUM)** is a dataset of high-quality, implicitly unanswerable math problems constructed to probe and improve the refusal behavior of large language models (LLMs). The goal is to teach models to identify when a problem cannot be answered because of incomplete, ambiguous, or contradictory information, and to respond with epistemic humility (e.g., `\boxed{I don't know}`).

Each entry in the dataset includes:

- `answerable_question`: The original, solvable math problem from the [DeepScaleR](https://github.com/PraMamba/DeepScaleR) dataset.
- `unanswerable_question`: A synthetically modified version of the original problem, designed to be unsolvable according to one of the criteria below.

## Unanswerability Criteria

Following the taxonomy introduced in the paper, unanswerable questions were generated according to five types of information degradation:

1. **Key Information Deletion**: Crucial numerical or logical details are removed from the question, making it impossible to compute the answer.
2. **Ambiguous Key Information**: Problem statements are modified to include vague or underspecified details (e.g., ranges or indeterminate sets), preventing precise reasoning.
3. **Unrealistic Conditions**: Implausible or logically inconsistent premises are introduced (e.g., negative counts for physical items, impossible time values), invalidating the problem.
4. **Unrelated Objects**: Questions are altered to reference entities that are not introduced or defined in the original context.
5. **Question Deletion**: The problem statement retains background context but omits the actual question, making it unanswerable.

## Data Generation Process

We automatically generated unanswerable variants of DeepScaleR questions by prompting `o3-mini`. The model was given detailed instructions and examples so that it would make plausible edits that render the problems unanswerable. The generated questions were reviewed by expert annotators to ensure correctness and naturalness. See our paper for more details.

## Intended Use

SUM serves multiple goals:

- **Diagnose**: Evaluate LLMs' susceptibility to hallucination in the context of reinforcement finetuning (RFT).
- **Train**: Improve LLM trustworthiness by mixing SUM data into RFT.
- **Teach**: Help models develop a generalizable ability to leverage **inference-time compute** to reason about their own **uncertainty** and **knowledge boundaries**, abstaining when appropriate.

Incorporating just 10% SUM data during RFT has been shown to significantly increase refusal rates on unanswerable queries while preserving performance on answerable math tasks, and the effect even generalizes to domains such as factual QA.

## 📬 Contact

For questions or feedback, feel free to reach out to [**Taiwei Shi**](https://maksimstw.github.io/) at [taiweish@usc.edu](mailto:taiweish@usc.edu) or [**Linxin Song**](https://linxins.net/) at [linxinso@usc.edu](mailto:linxinso@usc.edu).

## 📚 Citations

If you find our dataset useful, please cite [The Hallucination Tax of Reinforcement Finetuning](https://arxiv.org/abs/2505.13988):

```bibtex
@misc{song2025hallucinationtaxreinforcementfinetuning,
  title={The Hallucination Tax of Reinforcement Finetuning},
  author={Linxin Song and Taiwei Shi and Jieyu Zhao},
  year={2025},
  eprint={2505.13988},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.13988},
}
```
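
The refusal format and the 10% mixing described under Intended Use can be illustrated with a small sketch. This is not the authors' training or evaluation code: the helper names (`extract_boxed`, `is_refusal`, `mix_with_sum`), the case-insensitive exact-match rule, and the reading of "10%" as the share of SUM items in the final mix are all assumptions made for illustration. Only the field names `answerable_question` and `unanswerable_question` and the `\boxed{I don't know}` convention come from the card.

```python
import random

REFUSAL = "I don't know"  # refusal string from the card's \boxed{I don't know} example


def extract_boxed(text):
    # Return the content of the last \boxed{...} in `text`, or None.
    # Braces are counted so nested LaTeX such as \frac{1}{2} is kept whole.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def is_refusal(response):
    # Assumed rule: a response counts as a refusal only when the boxed
    # answer equals the refusal string (case-insensitive).
    boxed = extract_boxed(response)
    return boxed is not None and boxed.lower() == REFUSAL.lower()


def mix_with_sum(records, sum_fraction=0.10, seed=0):
    # Keep every answerable question and add enough unanswerable ones so
    # that they make up roughly `sum_fraction` of the final mix:
    #   n_sum / (n_ans + n_sum) = f   =>   n_sum = f * n_ans / (1 - f)
    rng = random.Random(seed)
    answerable = [r["answerable_question"] for r in records]
    unanswerable = [r["unanswerable_question"] for r in records]
    n_sum = round(sum_fraction * len(answerable) / (1.0 - sum_fraction))
    n_sum = min(n_sum, len(unanswerable))
    mixed = [{"prompt": q, "answerable": True} for q in answerable]
    mixed += [{"prompt": q, "answerable": False}
              for q in rng.sample(unanswerable, n_sum)]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Made-up placeholder records, not real dataset entries.
    demo = [{"answerable_question": f"A{i}", "unanswerable_question": f"U{i}"}
            for i in range(90)]
    mix = mix_with_sum(demo)
    share = sum(not m["answerable"] for m in mix) / len(mix)
    print(len(mix), share)
    print(is_refusal("The rate is never given, so \\boxed{I don't know}"))
```

A reward function for RFT could, under the same assumptions, give credit when `is_refusal` is true on an item marked `answerable: False` and when the boxed answer matches the reference on an answerable item; the actual reward design is described in the paper, not here.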

Provider:
maas
Created:
2025-05-23