unlearning-cleanslate/generations-qwen3-8b-rmu-baseline
Source: Hugging Face. Updated: 2026-04-30. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/unlearning-cleanslate/generations-qwen3-8b-rmu-baseline
Description:
This dataset collection contains multiple configurations, used mainly to evaluate language model performance on reasoning tasks. The main configurations are: 1. The ARC Challenge dataset (arc_challenge), with 1172 training examples of multiple-choice questions, each with a question, choices, and an answer key. 2. Chain-of-thought (CoT) few-shot versions of Big-Bench Hard (BBH) tasks: boolean expressions (250 examples), causal judgement (187 examples), date understanding (250 examples), disambiguation QA (250 examples), Dyck languages (250 examples), formal fallacies (250 examples), geometric shapes (250 examples), hyperbaton (250 examples), logical deduction (three, five, and seven objects, 250 examples each), movie recommendation (250 examples), multistep arithmetic (250 examples), navigate (250 examples), object counting (250 examples), penguins in a table (146 examples), reasoning about colored objects (250 examples), and ruin names (example count not fully shown). Each task's features include the input text, target output, generation parameters (such as sampling settings), model responses, filtered responses, evaluation metrics, and scores. The dataset is designed for few-shot learning, supports chain-of-thought reasoning, and is suitable for language model evaluation and fine-tuning.
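As a rough illustration of the description above, the following Python sketch tallies the example counts it lists and shows one way the arc_challenge configuration might be fetched. It assumes the third-party `datasets` library is installed and that the repository is reachable on the Hugging Face Hub. The dictionary keys are human-readable task labels taken from the description, not the repository's actual configuration names, and the helper names are hypothetical.

```python
# Repository ID as given in the listing title and download link.
REPO_ID = "unlearning-cleanslate/generations-qwen3-8b-rmu-baseline"

# Example count for the arc_challenge configuration, from the description.
ARC_CHALLENGE_COUNT = 1172

# BBH CoT few-shot example counts, copied from the description.
# Keys are descriptive labels (an assumption), not real configuration names.
# "ruin names" is omitted because its count is not shown in the listing.
BBH_COT_FEWSHOT_COUNTS = {
    "boolean expressions": 250,
    "causal judgement": 187,
    "date understanding": 250,
    "disambiguation QA": 250,
    "Dyck languages": 250,
    "formal fallacies": 250,
    "geometric shapes": 250,
    "hyperbaton": 250,
    "logical deduction (three objects)": 250,
    "logical deduction (five objects)": 250,
    "logical deduction (seven objects)": 250,
    "movie recommendation": 250,
    "multistep arithmetic": 250,
    "navigate": 250,
    "object counting": 250,
    "penguins in a table": 146,
    "reasoning about colored objects": 250,
}


def total_listed_examples() -> int:
    """Sum every example count that the description states explicitly."""
    return ARC_CHALLENGE_COUNT + sum(BBH_COT_FEWSHOT_COUNTS.values())


def load_arc_challenge():
    """Download the arc_challenge configuration (requires network access)."""
    # Imported lazily so the tally above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, "arc_challenge")


if __name__ == "__main__":
    print("Examples listed in the description:", total_listed_examples())
```

Calling `load_arc_challenge()` returns whatever splits the repository defines for that configuration; the split layout is not stated in the listing, so inspect the returned object before indexing into it.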
Provider:
unlearning-cleanslate



