unlearning-cleanslate/generations-olmo-3-7b-simnpo-gentle-bm25-6t
Source: Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/unlearning-cleanslate/generations-olmo-3-7b-simnpo-gentle-bm25-6t
Description:
This is a multi-task evaluation dataset for testing how well large language models handle complex reasoning and problem solving. It contains multiple configurations covering two groups of tasks: 1) ARC Challenge: science question answering, with questions, answer choices, and answer keys; 2) BBH chain-of-thought few-shot tasks: based on the Big-Bench Hard benchmark, covering boolean expressions, causal judgement, date understanding, disambiguation QA, Dyck languages, formal fallacies, geometric shapes, hyperbaton, logical deduction (with 3, 5, and 7 objects), movie recommendation, multistep arithmetic, navigation, object counting, penguins in a table, reasoning about colored objects, and ruin names. Each BBH task evaluates chain-of-thought reasoning under few-shot prompting. Each record includes a document ID, the document content (input and target), arguments (e.g., generation parameters), model responses, filtered responses, filters, metrics, hash values, and scores, which makes the dataset suitable for model performance analysis and benchmarking.
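As a rough illustration of the kind of per-task analysis the record fields above allow, the following sketch computes an exact-match rate per task from the filtered responses and the targets. The key names (`task`, `doc`, `target`, `filtered_resps`) and the sample records are assumptions made for this example; they are hand-written and not taken from the dataset, whose actual column names may differ.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def exact_match_by_task(records: Iterable[Mapping]) -> dict:
    """Fraction of records per task whose first filtered response equals the target.

    Assumes each record carries hypothetical keys: "task", "doc" (with a
    "target" entry), and "filtered_resps" (a non-empty list).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        task = rec["task"]
        prediction = str(rec["filtered_resps"][0]).strip()
        target = str(rec["doc"]["target"]).strip()
        totals[task] += 1
        hits[task] += int(prediction == target)
    return {task: hits[task] / totals[task] for task in totals}


# Hand-written illustrative records (not real rows from the dataset).
sample_records = [
    {"task": "bbh_boolean_expressions", "doc_id": 0,
     "doc": {"input": "not ( True ) and ( True ) is", "target": "False"},
     "filtered_resps": ["False"]},
    {"task": "bbh_boolean_expressions", "doc_id": 1,
     "doc": {"input": "True and not False is", "target": "True"},
     "filtered_resps": ["False"]},
    {"task": "bbh_object_counting", "doc_id": 0,
     "doc": {"input": "I have a flute and a drum. How many instruments?", "target": "2"},
     "filtered_resps": ["2"]},
]

rates = exact_match_by_task(sample_records)
print(rates)
```

If the dataset already stores a per-record score, that column can be averaged directly; recomputing from the filtered responses is mainly useful for checking the stored metric or trying a different matching rule.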
Provider:
unlearning-cleanslate



