altered-riddles

Name: altered-riddles
Creator: maas
Published: 2026-01-09 04:18:11
License: No description provided

ModelScope community. Updated: 2026-01-09. Indexed: 2025-05-10.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/altered-riddles

Description:

<a href="https://github.com/bespokelabsai/curator/">
 <img src="https://huggingface.co/datasets/marcodsn/academic-chains-dev/resolve/main/made_with_curator.png" alt="Made with Curator" width=200px>
</a>

# Dataset Card for Altered Riddles Dataset

## Dataset Description

- **GitHub:** [https://github.com/marcodsn/altered-riddles](https://github.com/marcodsn/altered-riddles)
- **Dataset:** [https://huggingface.co/datasets/marcodsn/altered-riddles](https://huggingface.co/datasets/marcodsn/altered-riddles) (This page)

While working on the [academic-chains](https://huggingface.co/datasets/marcodsn/academic-chains) dataset, I tested a well-known alteration of a common riddle, "just for fun":

> *The surgeon, who is the boy's father, says, 'I cannot operate on this boy—he's my son!'. Who is the surgeon to the boy?*

(*Below is the original riddle for reference*)

> *A man and his son are in a terrible accident and are rushed to the hospital in critical condition. The doctor looks at the boy and exclaims, "I can't operate on this boy; he's my son!" How could this be?*

You likely immediately thought, *"The father!"*, but surprisingly, many powerful LLMs (including `gemini-2.5-pro`, `claude-sonnet-3.7`, and `qwen3-30b-a3b` in my tests) fail this simple variation. The classic riddle expects *"The mother"* as the answer, revealing societal biases. However, when the text *explicitly states* the father is the surgeon, why do models get it wrong?

My investigation, including analysis of token importance gradients, suggests **overfitting to the original pattern**. Models, especially large ones, have seen and memorized standard riddles so often that they appear to ignore crucial, altered details in the prompt.

*(Image below: Importance gradients for Llama-3-8B, which answers incorrectly, show low importance on "father")*

![Importance Gradients - Affected Model](gradient_importance_bad.png)

*(Image below: Importance gradients for Qwen-3-4B, which answers correctly, show focus on "father")*

![Importance Gradients - Unaffected Model](gradient_importance_good.png)

> [!IMPORTANT]
> Gradient-based token importance leverages backpropagation to measure how sensitive the model's prediction (specifically, the logit of the predicted token) is to changes in the embedding of each input token.

The "affected" model seemingly **ignored** the word "father" because the ingrained pattern of the original riddle overrides the actual input.

**This Dataset:**

The `altered-riddles` dataset is a curated collection designed to combat this specific type of reasoning failure. It contains familiar riddles where **key details have been deliberately changed**. Each entry includes:

1. The original riddle and its answer
2. The altered version with additional or modified details
3. The correct answer to the altered riddle
4. Detailed reasoning explaining the solution for both versions

This dataset is an experiment born from and submitted to the [Reasoning Datasets Competition](https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The central question: **Can fine-tuning on a relatively small dataset of these "trick" variations force models to pay closer attention to input details, overriding ingrained patterns and improving robustness?**

If so, we hypothesize this could also lead to downstream improvements in areas like RAG (better adherence to provided context) and reasoning.

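
To make the gradient-based importance measure described in the note above more concrete, here is a minimal sketch on a toy model. It is not the analysis code used for the figures: the attention-pooling "model", the 2-D embeddings, and the vectors `q` and `w` are all hypothetical values chosen for illustration, and the gradient is written out analytically instead of being obtained through a framework's backpropagation. The idea is the same, though: score each token by the norm of the gradient of the output logit with respect to that token's embedding.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logit(embeddings, q, w):
    """Toy stand-in for an LLM: attention-pool embeddings with q, score with w."""
    alphas = softmax([dot(q, e) for e in embeddings])
    return sum(a * dot(w, e) for a, e in zip(alphas, embeddings))


def logit_gradients(embeddings, q, w):
    """Analytic gradient of the logit with respect to each token embedding."""
    alphas = softmax([dot(q, e) for e in embeddings])
    logit = sum(a * dot(w, e) for a, e in zip(alphas, embeddings))
    grads = []
    for a, e in zip(alphas, embeddings):
        # d logit / d e_j = alpha_j * w + alpha_j * (w.e_j - logit) * q
        coeff = a * (dot(w, e) - logit)
        grads.append([a * wk + coeff * qk for wk, qk in zip(w, q)])
    return grads


def token_importance(embeddings, q, w):
    """Importance of each token = L2 norm of its embedding gradient."""
    return [math.sqrt(sum(g * g for g in grad))
            for grad in logit_gradients(embeddings, q, w)]


# Hypothetical tokens and embeddings, for illustration only.
tokens = ["the", "surgeon", "father", "son"]
embeddings = [[0.1, 0.0], [0.5, 0.2], [0.9, -0.4], [0.3, 0.8]]
q = [1.0, -0.5]  # hypothetical attention query
w = [0.7, 0.3]   # hypothetical output direction for the predicted token

for token, score in sorted(zip(tokens, token_importance(embeddings, q, w)),
                           key=lambda pair: pair[1], reverse=True):
    print(f"{token:>8s}  {score:.4f}")
```

With a real model, the same quantity would typically be obtained by running a backward pass from the predicted token's logit to the input embedding layer; a low score on a token such as "father" would indicate that small changes to that token barely move the prediction.
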
## Notable Examples

*(Image below: Failure by Gemini 2.5 Pro; the answer should be "sleep"!)*

![Gemini 2.5 Pro Failure Example 1](failed_riddle_1.png)

*(Image below: Failure by Gemini 2.5 Pro; the answer should be "A sperm cell or tadpole"!)*

![Gemini 2.5 Pro Failure Example 2](failed_riddle_2.png)

*(Image below: Failure by Gemini 2.5 Pro; the answer should be "A plant"!)*

![Gemini 2.5 Pro Failure Example 3](failed_riddle_3.png)

## Dataset Structure

Each example in this dataset includes the following features:

* `original_riddle`: The text of the original, unaltered riddle (e.g., *"I have keys but open no locks. I have space but no room. You can enter but not go inside. What am I?"*).
* `original_answer`: The correct answer to the original riddle (e.g., *"A keyboard."*).
* `original_reasoning`: The explanation for the original riddle's answer.
* `altered_riddle`: The modified version of the riddle with additional constraints or details (e.g., *"I have keys but open no locks. I have space but no room. You can explore but not physically enter. What am I?"*).
* `altered_answer`: The correct answer to the altered riddle (e.g., *"A map."*).
* `altered_reasoning`: The explanation for the altered riddle's answer, often explicitly noting the deviation from the expected pattern.
* `model`: The LLM used to *generate* this dataset entry.

## Dataset Creation

### Source Data

The base scenarios use common riddles found in LLM knowledge. The core creative step is identifying familiar riddles and adding key details that change the expected answer.

### Data Generation Pipeline

1. **Riddle Selection & Alteration:** Randomly select 2 base riddles from our extracted list and incorporate them into our [template prompt](https://github.com/marcodsn/altered-riddles/blob/main/prompts/altered_riddles_prompt.txt) (this is done to add uncertainty to the prompt).
2. **Answer & Reasoning Generation\*:** Use LLMs (e.g., `gemini-2.5-flash`, `gemini-2.5-pro`, `command-a-03-2025`, and more) prompted with few-shot examples.
   - **Prompting Strategy:** The prompt explicitly instructs the model to:
     - Create a slightly modified version of a single riddle so that it leads to a different correct answer.
     - Provide both the original and altered riddle solutions.
     - Pay extremely close attention to the additional details in the altered version.
     - Explain the reasoning for both the original and altered answers.
3. **Additional Verification:** Check the abilities of different models to solve the altered riddle to flag easy alterations. (TBD)
4. **Final Formatting:** Structure the data into the final JSONL format.

*\*Generated using [Bespoke Curator](https://github.com/bespokelabsai/curator/).*

### Splits

This repository currently contains a **`train` split (N=102 examples)**.

## Example Uses

Given that our altered riddles are difficult for current SOTA LLMs, the main uses of this dataset are to:

- **Test models** and study their behavior.
- With more samples, **investigate why** LLMs fail on this task and how to address it.

This dataset could also be used for fine-tuning LLMs to:

- **Improve Attention to Detail:** Train models to scrutinize input text more carefully, especially in familiar contexts.
- **Mitigate Pattern-Based Bias:** Reduce reliance on memorized patterns when the input explicitly contradicts them.
- **Enhance Robustness:** Make models less brittle and more adaptable to prompt variations.
- **Strengthen Chain-of-Thought:** Reinforce detailed reasoning processes.

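
As a rough illustration of the "test models" use case, the sketch below classifies a model's reply to an altered riddle as correct, as a fallback to the memorized original answer, or as something else. It is an assumption-laden heuristic, not part of the dataset tooling: `classify_response`, `normalize`, and the label names are hypothetical, matching is simple normalized substring search, and the sample record reuses the keyboard/map example from the field list with placeholder values for the reasoning and `model` fields.

```python
import re

# Field names as listed in the Dataset Structure section.
REQUIRED_FIELDS = (
    "original_riddle", "original_answer", "original_reasoning",
    "altered_riddle", "altered_answer", "altered_reasoning", "model",
)


def normalize(text):
    """Lowercase, strip punctuation, and drop articles for lenient matching."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(word for word in words if word not in {"a", "an", "the"})


def classify_response(record, response):
    """Return 'correct', 'pattern_override', or 'other' for a model reply.

    Crude heuristic: a reply that mentions both answers counts as 'correct'
    because the altered answer is checked first.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    reply = f" {normalize(response)} "
    if f" {normalize(record['altered_answer'])} " in reply:
        return "correct"
    if f" {normalize(record['original_answer'])} " in reply:
        return "pattern_override"  # model fell back to the memorized answer
    return "other"


# Illustrative record; reasoning and model values are placeholders.
record = {
    "original_riddle": "I have keys but open no locks. I have space but no room. "
                       "You can enter but not go inside. What am I?",
    "original_answer": "A keyboard.",
    "original_reasoning": "(placeholder)",
    "altered_riddle": "I have keys but open no locks. I have space but no room. "
                      "You can explore but not physically enter. What am I?",
    "altered_answer": "A map.",
    "altered_reasoning": "(placeholder)",
    "model": "(placeholder)",
}

for reply in ["It's a map!", "A keyboard.", "I don't know"]:
    print(reply, "->", classify_response(record, reply))
```

Counting `pattern_override` separately from `other` is the point of the design: it distinguishes a model that reverted to the original riddle's answer from one that was simply wrong. Free-form answers with several valid phrasings (such as "A sperm cell or tadpole") would need a more forgiving matcher or manual review.
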
We hypothesize that models trained on this dataset (potentially mixed with other high-quality instruction/reasoning data) might show improved performance on:

- RAG tasks (better grounding in provided documents)
- Instruction following with subtle nuances
- Reasoning tasks requiring careful attention to details

We hope this work will be useful to the reasoning (and LLM) ecosystem!

## Planned Evaluation

Evaluation of models fine-tuned on this dataset will focus on:

1. **Fine-tuning:** Using efficient methods (e.g., LoRA via [unsloth](https://unsloth.ai/)) on models known to exhibit the original issue.
2. **Direct Evaluation:** Testing performance on a held-out set of altered riddles.
3. **Pattern Bias Probes:** Testing the fine-tuned model on both altered and original riddle versions.
4. **Generalization Tests:** Evaluating performance on standard reasoning benchmarks to assess broader impacts.
5. **Qualitative Analysis:** Examining reasoning quality for clarity, logical consistency, and recognition of altered details.

Initial testing with Claude Sonnet 3.7 and Gemini 2.5 Pro showed consistent failures on these altered riddles, even with reasoning enabled, demonstrating the dataset's effectiveness as a challenge set.

## Limitations and Biases

- **Limited Scope:** The dataset currently focuses on a specific failure mode (pattern override) using a limited set of base riddles.
- **Generation Artifacts:** LLM-generated reasoning may contain errors or lack human-like awareness of alterations.
- **Experimental Nature:** This is an exploratory dataset targeting a specific hypothesis; its effectiveness requires empirical validation.

## Scaling Plan

If initial experiments show promise, future plans include:

1. **More Diverse Models:** Incorporate generations from additional LLMs to increase riddle and alteration diversity.
2. **More Complex Alterations:** Experiment with different types of modifications beyond simple additions.
3. **Increased Volume:** Scale up generation and refine the QC process.
4. **Testing Framework:** Develop a standardized evaluation procedure for model performance.
5. **Cross-lingual Exploration:** Investigate pattern-override issues in riddles from other languages (soon™️).

The dataset can be used as a challenging test set for reasoning or as a targeted training supplement to improve attention to detail.

## Acknowledgements

This experiment was inspired by the `academic-chains` dataset and the [Reasoning Datasets Competition](https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition). Thanks to [HuggingFace](https://huggingface.co/), [Bespoke Labs](https://www.bespokelabs.ai/), and [Together AI](https://together.ai/) for organizing the competition!

## Licensing Information

This dataset is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt).

## Citation Information

```bibtex
@misc{marcodsn_2025_alteredriddles,
  title = {Altered Riddles Dataset},
  author = {Marco De Santis},
  month = {May},
  year = {2025},
  url = {https://huggingface.co/datasets/marcodsn/altered-riddles}
}
```

## Development Updates

> [!Note]
> **[03/05/2025]** Initial dataset created!


Provider:

maas

Created:

2025-05-04
