DCAgent2/medagentbench_Qwen2_5_Coder_32B_Instruct_20260430_044354-traces

Name: DCAgent2/medagentbench_Qwen2_5_Coder_32B_Instruct_20260430_044354-traces
Creator: DCAgent2
Published: 2026-05-01 07:18:05
License: 暂无描述

Hugging Face2026-05-01 更新2026-05-03 收录

下载链接：

https://hf-mirror.com/datasets/DCAgent2/medagentbench_Qwen2_5_Coder_32B_Instruct_20260430_044354-traces

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: conversations list: - name: content dtype: string - name: role dtype: string - name: agent dtype: string - name: model dtype: string - name: model_provider dtype: string - name: date dtype: string - name: task dtype: string - name: episode dtype: string - name: run_id dtype: string - name: trial_name dtype: string - name: tool_definitions list: - name: function struct: - name: description dtype: string - name: name dtype: string - name: parameters struct: - name: additionalProperties dtype: bool - name: properties struct: - name: code struct: - name: description dtype: string - name: type dtype: string - name: command struct: - name: description dtype: string - name: enum list: string - name: type dtype: string - name: file_text struct: - name: description dtype: string - name: type dtype: string - name: insert_line struct: - name: description dtype: string - name: type dtype: string - name: is_input struct: - name: description dtype: string - name: enum list: string - name: type dtype: string - name: message struct: - name: description dtype: string - name: type dtype: string - name: new_str struct: - name: description dtype: string - name: type dtype: string - name: old_str struct: - name: description dtype: string - name: type dtype: string - name: path struct: - name: description dtype: string - name: type dtype: string - name: security_risk struct: - name: description dtype: string - name: enum list: string - name: type dtype: string - name: task_list struct: - name: description dtype: string - name: items struct: - name: additionalProperties dtype: bool - name: properties struct: - name: id struct: - name: description dtype: string - name: type dtype: string - name: notes struct: - name: description dtype: string - name: type dtype: string - name: status struct: - name: description dtype: string - name: enum list: string - name: type dtype: string - name: title struct: - name: description dtype: string - name: type dtype: string - name: required list: string - name: type dtype: string - name: type dtype: string - name: thought struct: - name: description dtype: string - name: type dtype: string - name: timeout struct: - name: description dtype: string - name: type dtype: string - name: view_range struct: - name: description dtype: string - name: items struct: - name: type dtype: string - name: type dtype: string - name: required list: string - name: type dtype: string - name: type dtype: string - name: result dtype: string - name: verifier_output dtype: string splits: - name: train num_bytes: 97280837 num_examples: 867 download_size: 96386845 dataset_size: 97280837 configs: - config_name: default data_files: - split: train path: data/train-* ---

提供机构：

DCAgent2

搜集汇总

数据集介绍

构建方式

该数据集源自MedAgentBench框架，通过追踪Qwen2.5-Coder-32B-Instruct模型在医学代理任务中的交互轨迹构建而成。每条数据包含完整的对话历史、工具调用定义及结果验证信息，其中工具定义覆盖代码操作、文件编辑、安全检测等医疗场景中的关键行为。数据格式结构化存储，涵盖agent标识、模型来源、任务类型及运行批次等元数据，便于复现与分析。

特点

数据集的核心特点在于其多层次的结构化设计：既保留了代理与用户之间的自然语言对话流，又详细记录了每个步骤的工具调用参数与返回值。特别地，tool_definitions字段定义了丰富的工具函数，包括文件读写、代码执行、任务管理等功能，且支持参数校验与枚举约束。此外，verifier_output字段提供了自动化验证结果，确保了数据质量的可靠性与可审计性。

使用方法

适用于训练和评估医学领域的大语言模型代理系统。用户可直接加载conversations字段作为多轮对话训练数据，或利用tool_definitions构建工具调用学习样本。通过filtering即可筛选特定任务类型（task）或模型版本的数据，结合result与verifier_output字段用于强化学习中的奖励建模。数据以parquet格式存储，兼容HuggingFace Datasets库的标准加载流程。

背景与挑战

背景概述

MedAgentBench是一个专为评估医学领域大语言模型智能体能力而设计的基准数据集，由Qwen团队于2024年创建，核心研究问题聚焦于如何系统性地衡量语言模型在复杂医疗任务中调用工具、执行多步推理与代码生成的表现。该数据集包含867条精心构建的对话轨迹，每条轨迹都记录了模型与环境的完整交互过程，包括任务定义、工具定义、智能体决策序列及最终结果，涵盖了文件操作、命令执行、安全风险评估等医疗场景中常见的工具调用模式。MedAgentBench的提出填补了医学智能体标准化评估的空白，为后续研究提供了可复现的测试平台，对推动大语言模型在临床辅助决策、医疗信息处理等领域的实际落地具有重要参考价值。

当前挑战

该数据集所解决的领域核心挑战在于如何评估大语言模型在医学场景下的智能体能力，包括工具使用的准确性、多步任务执行的连贯性以及对复杂医疗指令的理解与转化。传统的问答式评估无法覆盖智能体自主规划与执行的过程，因此需要构建包含完整交互轨迹的基准。在构建过程中面临的挑战包括：设计能够真实反映医生工作流的任务场景，确保工具定义的结构化与多样性以覆盖临床常见操作，以及通过严格的验证器自动评估模型生成结果的正确性与安全性，从而在医疗这一高风险领域实现可靠且可扩展的自动化评测机制。

常用场景

经典使用场景

在人工智能与医疗深度融合的时代背景下，构建能够模拟真实临床诊疗场景的智能体系统成为研究热点。medagentBench_Qwen2_5_Coder_32B_Instruct_20260430_044354-traces数据集专为医疗智能体（MedAgent）的交互行为建模而设计，其经典使用场景聚焦于多轮对话中的工具调用与任务规划。研究人员可借助该数据集训练或评估大语言模型在医疗情境下执行复杂指令的能力，例如诊断建议生成、病历解析、药物检索等。数据集中精细定义的tool_definitions字段涵盖了代码执行、文件操作、任务列表管理等多种工具接口，使智能体能够模拟医生与信息系统之间的真实协作流程，从而为医疗人工智能的决策可解释性与环境交互能力提供坚实的实验基础。

衍生相关工作

自该数据集发布以来，其精细的Multi-turn Agent轨迹结构催生了一系列创新性研究。在方法论层面，衍生工作探索了基于强化学习的医疗工具调用优化策略，利用数据中的task列表与verifier_output字段训练模型自主分解复杂医嘱。在模型评估方面，研究工作构建了医疗智能体排行榜，将Qwen2.5等基座模型在该数据集上的表现作为核心指标。另有学者借鉴其对话格式，开发了面向罕见病的多智能体协同诊断框架。数据集的episode和run_id设计还启发了医疗AI的可解释性研究，催生出一批用于分析模型在诊疗各阶段置信度变化的前沿工具，进一步夯实了医疗智能体的理论基础与应用生态。

数据集最近研究