unlearning-cleanslate/generations-15-qwen3-8b-rmu-baseline-target-100-checkpoint-1078
Source: Hugging Face | Updated: 2026-04-29 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/unlearning-cleanslate/generations-15-qwen3-8b-rmu-baseline-target-100-checkpoint-1078
Resource description:
This dataset contains multiple configurations for evaluating language model performance on a range of reasoning tasks. The main configurations are the ARC challenge task (arc_challenge) and the chain-of-thought few-shot versions of several Big-Bench Hard (BBH) tasks: boolean expressions, causal judgement, date understanding, disambiguation QA, Dyck languages, formal fallacies, geometric shapes, hyperbaton, logical deduction (with three, five, and seven objects), movie recommendation, multistep arithmetic, navigation, object counting, penguins in a table, reasoning about colored objects, and ruin names. Each configuration has the following features: document ID, document (e.g., question, input, target answer), target, generation arguments, model responses, filtered responses, filter method, evaluation metrics, hash values, and scores. The dataset is intended for testing model performance on complex reasoning under few-shot prompting.
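As a rough illustration of how the per-record features described above might be used, the sketch below averages the scores of a few records and lists the documents whose filtered response differs from the target. The field names (`doc_id`, `target`, `filtered_resps`, `score`) and the sample rows are assumptions made up for illustration; the actual column names and values in each configuration should be checked against the dataset itself.

```python
from statistics import mean

# Hypothetical rows: field names and values are illustrative assumptions,
# modelled on the feature list in the description, not taken from the dataset.
records = [
    {"doc_id": 0, "target": "A", "filtered_resps": ["A"], "score": 1.0},
    {"doc_id": 1, "target": "B", "filtered_resps": ["C"], "score": 0.0},
    {"doc_id": 2, "target": "(C)", "filtered_resps": ["(C)"], "score": 1.0},
]


def mean_score(rows, key="score"):
    """Average the per-document score; return None when there are no rows."""
    values = [float(row[key]) for row in rows]
    return mean(values) if values else None


def mismatches(rows):
    """Return the doc IDs whose first filtered response differs from the target."""
    return [
        row["doc_id"]
        for row in rows
        if row["filtered_resps"][0].strip() != str(row["target"]).strip()
    ]


if __name__ == "__main__":
    print("mean score:", mean_score(records))
    print("mismatched doc IDs:", mismatches(records))
```

The same two helpers could be applied to each configuration in turn, assuming the rows have first been loaded into a list of dictionaries with matching keys.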
Provider:
unlearning-cleanslate



