AtlaAI/counsel

Name: AtlaAI/counsel
Creator: AtlaAI
Published: 2026-01-28 18:02:46
License: Not specified

Source: Hugging Face. Updated: 2026-01-28. Indexed: 2026-02-07

Download link:

https://hf-mirror.com/datasets/AtlaAI/counsel

Description:

Counsel is the first dataset to provide human meta-evaluations of LLM-as-a-Judge (LLMJ) critiques of agentic task execution. It addresses a critical gap: LLMJ critiques are essential for diagnosing and improving agentic systems, yet their accuracy and reliability have rarely been evaluated systematically. By offering high-quality, human-validated annotations over real-world agentic traces, Counsel lets researchers and practitioners benchmark, refine, and trust LLMJ critiques.

The dataset includes 225 unique agent execution traces spanning two widely used real-world agentic benchmarks: TauBench (Retail) and DACode (code generation and debugging). The traces were generated by two agent models with distinct reasoning styles: GPT-OSS-20B (medium reasoning) and Qwen3-235B-A22B-Instruct-2507 (no reasoning).

The dataset contains process-level (span) judgements by three judge models (Qwen-3, GPT-OSS-2B:low, and GPT-OSS-20B:high), together with the corresponding human-annotated meta-judgements. Each judge critiques every span of a trace, but only the spans it flags as containing errors are meta-judged by humans. Each meta-judgement assesses both the location and the reasoning quality of the judge's critique and falls into one of three categories: "Spot On" (both location and reasoning are correct), "Poor Reasoning but Right Location" (correct location but incorrect reasoning), and "Should Not Have Flagged" (both location and reasoning are wrong). Along with the meta-evaluations, the dataset also provides the full 225 agent trajectories.
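
The three meta-judgement categories lend themselves to simple per-judge precision scores. The following is a minimal sketch of one way to compute them, not the dataset authors' evaluation code: the function name `summarize_judge` and the input format (a plain list of label strings for one judge) are assumptions made for illustration, and the example labels are made up rather than taken from the dataset. Only the three label strings come from the description above.

```python
from collections import Counter

# The three meta-judgement labels named in the dataset description.
SPOT_ON = "Spot On"
POOR_REASONING = "Poor Reasoning but Right Location"
SHOULD_NOT_FLAG = "Should Not Have Flagged"
LABELS = (SPOT_ON, POOR_REASONING, SHOULD_NOT_FLAG)


def summarize_judge(meta_labels):
    """Summarize one judge's flagged spans (hypothetical helper).

    meta_labels: iterable of human meta-judgement strings, one per span
    the judge flagged. Strings outside the three known labels are ignored.
    """
    counts = Counter(meta_labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {"flagged": 0, "location_precision": None, "full_precision": None}
    # A flag has the right location if it is "Spot On" or only the reasoning is poor.
    right_location = counts[SPOT_ON] + counts[POOR_REASONING]
    return {
        "flagged": total,
        "location_precision": right_location / total,
        "full_precision": counts[SPOT_ON] / total,
    }


# Toy, made-up labels for a hypothetical judge; not taken from the dataset.
example = [SPOT_ON, SPOT_ON, POOR_REASONING, SHOULD_NOT_FLAG]
summary = summarize_judge(example)
print(summary)
```

Because humans only meta-judge the spans a judge flagged, scores of this kind measure precision; they say nothing about errors the judge failed to flag.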

Provider:

AtlaAI
