unlearning-cleanslate/generations-11-qwen3-8b-simnpo-gentle-bm25-6t-target-100-checkpoint-173

Name: unlearning-cleanslate/generations-11-qwen3-8b-simnpo-gentle-bm25-6t-target-100-checkpoint-173
Creator: unlearning-cleanslate
Published: 2026-04-29 12:51:40
License: No description available

Source: Hugging Face | Updated: 2026-04-29 | Indexed: 2026-05-03

Download link:

https://hf-mirror.com/datasets/unlearning-cleanslate/generations-11-qwen3-8b-simnpo-gentle-bm25-6t-target-100-checkpoint-173

Description:

This is a multi-task evaluation dataset comprising ARC Challenge (AI2 Reasoning Challenge) and BBH (BIG-Bench Hard) chain-of-thought few-shot tasks. It covers a range of reasoning and cognitive tasks, including boolean expressions, causal judgement, date understanding, disambiguation QA, Dyck languages, formal fallacies, geometric shapes, hyperbaton, logical deduction, movie recommendation, multistep arithmetic, navigation, object counting, penguins in a table, reasoning about colored objects, and ruin names. Each task configuration includes a train split whose fields include the question input, target answer, generation arguments, model responses, filtered responses, hash values, and scores, for evaluating language model performance in complex few-shot scenarios. Dataset sizes range from hundreds to thousands of examples, with a total download size of a few MB.
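
The description above suggests a simple workflow: load one task configuration, then average its per-example scores. The following is a minimal sketch of that idea, not the dataset authors' tooling. The field names (`target`, `filtered_resps`, `exact_match`) are assumptions and may not match the real schema, and the configuration name passed to `load_task` is left to the caller because the listing does not give the exact names. The loader relies on the Hugging Face `datasets` library and is not called here; the toy rows stand in for real records.

```python
from statistics import mean
from typing import Iterable, Mapping


def mean_score(rows: Iterable[Mapping], score_key: str = "exact_match") -> float:
    """Average a per-example score field over the rows that carry it.

    `score_key` is an assumed field name; check the actual column names
    of the configuration you load.
    """
    values = [float(row[score_key]) for row in rows if score_key in row]
    if not values:
        raise ValueError(f"no rows contain the field {score_key!r}")
    return mean(values)


def load_task(repo_id: str, config_name: str):
    """Load the train split of one task configuration from the Hub.

    Requires `pip install datasets` and network access. The import is
    done lazily so the rest of this sketch runs without the library.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, config_name, split="train")


# Toy rows standing in for real records (hypothetical field names and values).
sample_rows = [
    {"target": "True", "filtered_resps": ["True"], "exact_match": 1.0},
    {"target": "False", "filtered_resps": ["True"], "exact_match": 0.0},
]

print(mean_score(sample_rows))  # 0.5
```

With a real configuration, the rows returned by `load_task(repo_id, config_name)` could be passed to `mean_score` directly, after setting `score_key` to whichever score column that configuration actually exposes.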

Provided by:

unlearning-cleanslate