veAgentBench

Name: veAgentBench
Creator: maas
Published: 2026-01-08 21:40:43
License: no description provided

ModelScope community · Updated 2026-01-08 · Indexed 2025-11-29

Download link:

https://modelscope.cn/datasets/bytedance-research/veAgentBench


Description:

# VeAgentBench Dataset

The VeAgentBench dataset is designed based on specific application scenarios of agents, aiming to test and evaluate the quality of agents generated by full-process agent development frameworks (such as veADK). It focuses on assessing agents' capabilities in tool calling, knowledge base retrieval, memory management, and overall performance.

## Updates

- 2025.11.25 First public release of the dataset, containing a total of 484 questions (145 publicly available this time)

## Advantages

- **Scenario-oriented design**: Simulates real-world agent behavior, enabling better evaluation of agent quality in practical applications.
- **Multi-dimensional assessment**: Comprehensively evaluates agent capabilities from tool calling, knowledge base retrieval, memory management, and other aspects.
- **Example agents provided**: Based on Volcengine veADK, allowing developers to directly invoke and extend.

## Project Structure

```
├── dataset/                          # Dataset files directory
│   ├── educational_tutoring.csv      # Educational tutoring domain dataset
│   ├── financial_analysis.csv        # Financial analysis domain dataset
│   ├── legal_aid.csv                 # Legal aid domain dataset
│   └── personal_assistant.csv        # Personal assistant domain dataset
├── agents/                           # Example agent implementations
│   ├── educational_tutoring.py       # Educational tutoring agent
│   ├── financial_analysis.py         # Financial analysis agent
│   ├── legal_aid.py                  # Legal aid agent
│   ├── personal_assistant.py         # Personal assistant agent
│   └── utils/                        # Utility functions directory
│       ├── data_loader.py            # Dataset loading tool
│       └── ...                       # Other utility functions
└── knowledge/                        # Knowledge base files directory
```

## Dataset Introduction

### Dataset Structure

The dataset is designed based on agent application scenarios and presented in CSV format, containing a total of 484 questions, with 145 questions publicly available.
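
Because the sub-datasets are plain CSV files under `dataset/`, they can be read with Python's standard `csv` module. The following is a minimal sketch, not the repository's own `agents/utils/data_loader.py`: the function names `load_subdataset` and `load_all` are hypothetical, and the UTF-8 encoding and the presence of a header row are assumptions about the files.

```python
import csv
from pathlib import Path


def load_subdataset(csv_path):
    """Read one sub-dataset CSV into a list of dicts keyed by the header row."""
    # Assumption: files are UTF-8; "utf-8-sig" also tolerates an optional BOM.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def load_all(dataset_dir):
    """Map each CSV file's stem (for example "legal_aid") to its rows."""
    return {
        path.stem: load_subdataset(path)
        for path in sorted(Path(dataset_dir).glob("*.csv"))
    }


if __name__ == "__main__":
    # Prints one line per CSV found in ./dataset; prints nothing if none exist.
    for name, rows in load_all("dataset").items():
        print(f"{name}: {len(rows)} rows")
```

Using `csv.DictReader` keeps the sketch independent of the exact column names, which are taken from whatever header each file carries.
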
It is divided into four sub-datasets according to application scenarios:

#### Legal Aid Sub-dataset (250 questions in total, 70 public)

- **Design Goal**: Designed around "hierarchical knowledge retrieval capabilities", covering both scenarios where the RAG (Retrieval-Augmented Generation) knowledge base fully covers the question and scenarios where its coverage is insufficient.
- **Data Source**: Public legal provisions and case databases. For knowledge base files, please refer to the `knowledge` directory.
- **Data Example**:

```
number: 1
input: What is the definition of legal aid?
expect_output: Legal aid is a system established by the state to provide free legal advice, agency, criminal defense, and other legal services to economically disadvantaged citizens and other parties that meet statutory conditions. It is part of the public legal service system.
expect_tools: load_knowledgebase
```

#### Financial Analysis Sub-dataset (57 questions in total, 20 public)

- **Design Goal**: Focuses on "multi-tool collaboration needs" in financial scenarios, verifying the agent's ability to select and call financial data tools and to output analysis conclusions. It also examines the agent's deep research capability (the agent needs to accurately find company information and announcement time based on clues).
- **Data Source**: Based on public financial data provided by the AKShare project (such as stock indices and financial statement indicators).
- **Data Example**:

```
number: 1
input: In April 2023, the founder of a leading domestic internet security company divorced and split nearly 9 billion yuan in equity. Query the daily line data for 3 trading days after the announcement.
tool_input:
  1. vesearch: In April 2023, the founder of a leading domestic internet security company divorced and split nearly 9 billion yuan in equity. Find the company name, stock code, and event date.
  2. stock_zh_a_hist: symbol="601360", period="daily", start_date="20230404", end_date="20230407"
  3. stock_individual_info_em: symbol="601360" (extract "industry")
  4. stock_board_industry_hist_em: symbol="软件开发", start_date="20230404", end_date="20230407"
tool_expect_output: ...
expect_output: ...
```

#### Educational Tutoring Sub-dataset (74 questions in total, 25 public)

- **Design Goal**: Tests the agent's "RAG information extraction accuracy" by increasing the amount of knowledge base data, while examining the agent's ability to obtain key information through memory.
- **Data Source**: Public educational textbooks and tutoring materials. For knowledge base files, please refer to the `knowledge` directory.
- **Data Example**:

```
number: 1
input: I'm in Grade 7 and want to practice basic problems on rational numbers systematically. Can you give me 5 questions?
expect_output:
  1. Practice Questions:
  (1) [Multiple Choice] The storage temperature indicated on the instruction manual of a certain medicine is (20±4)℃. Which of the following is the most suitable temperature range for storing the medicine?
      A. -4℃~4℃  B. 16℃~24℃  C. 20℃~24℃  D. 16℃~20℃
  (2) [Multiple Choice] If m and n are opposite numbers, which of the following groups is not opposite numbers?
      A. -m and -n  B. 5m and 5n  C. m+1 and n-1  D. m+1 and n+1
  ...
```

#### Personal Assistant Sub-dataset (103 questions in total, 30 public)

- **Design Goal**: Further examines the agent's tool calling in real task scenarios at different difficulty levels, from simple to complex, as well as its event summary and user profiling capabilities combined with memory.
- **Data Source**: Designed manually based on daily tasks and evaluation goals. MCP tools are from the public tools of [Volcengine MCP Marketplace](https://www.volcengine.com/mcp-marketplace).
- **Data Example**:

```
number: 1
input: Please record the Feishu ecosystem cooperation discussion with Manager Wang from JD Technology on September 2, 2025, in the "Work Docking" sheet of "EXCEL_PATH".
expect_tools: 1. excel_tool
expect_tools_detail: 1. excel_tool.excel_write_to_sheet
expect_memory_use:
level: Level 1
```

```
number: 2
input: I need to meet Manager Song from SenseTime in Pudong New Area, Shanghai on September 3, 2025. Please check the weather on that day.
expect_tools: 1. weather_tool
expect_tools_detail: 1. weather_tool.getChatResponse
expect_memory_use:
level: Level 1
```

## Usage

### Download the Dataset

```bash
git clone https://huggingface.co/datasets/bytedance-research/veAgentBench
```

### Configure Knowledge Base

Before using the legal aid and educational tutoring agents, configure the RAG knowledge base from the knowledge base files in the `knowledge` directory.

### MCP Tools

The MCP tools used by the personal assistant agent require API keys. Obtain them from the addresses mentioned in the tool script comments and set them as environment variables.

## Example Agents

All example agents are implemented based on veADK (Volcengine Agent Development Kit). veADK is a full-process agent development framework launched by Volcengine, with complete observability and fast planning capabilities, which helps users simplify the development process and improve efficiency.

### Install veADK

```bash
pip install veadk-python

# Install extensions (quoted so that shells such as zsh do not expand the brackets)
pip install "veadk-python[extensions]"
```

For more information, please visit: [veADK Official GitHub Repository](https://github.com/volcengine/veadk-python)

### Run Example Agents

```bash
python agents/financial_analysis.py
```

After execution, the script generates task trace files and eval_set files, which can be used with the VeAgentBench evaluation framework to complete the evaluation.

## Contribute

This dataset aims to evaluate the effectiveness of agent applications in combination with real scenarios and mainstream development frameworks. Developers are welcome to contribute more scenarios.

## Disclaimer

This dataset is for academic research purposes only.
Commercial use is strictly prohibited, including but not limited to commercial analysis, product development, paid services, investment decision support, and business cooperation negotiations. All legal liabilities, economic losses, and other related risks caused by any illegal use shall be borne by the user. Under no circumstances shall we be liable for any direct, special, indirect, incidental, consequential, punitive, or other losses, costs, expenses, or damages arising from the use of this dataset, regardless of any legal theory or other grounds. The above disclaimer and limitation of liability shall be interpreted to the maximum extent permitted by law to be as close as possible to absolute exemption from liability and immunity.

## License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). For details, please visit: https://creativecommons.org/licenses/by-nc/4.0/
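
Returning to the `expect_tools` and `expect_tools_detail` fields shown in the data examples: one simple way to compare them with the tools an agent actually called is a set-based check. This is only a sketch under stated assumptions, not the metric used by the VeAgentBench evaluation framework. The function names `parse_expected_tools` and `tool_call_scores` are hypothetical, and the parsing assumes expected tools are written as a bare name or as a numbered list such as `1. excel_tool`.

```python
import re


def parse_expected_tools(field):
    """Split a field such as "1. excel_tool 2. weather_tool" into tool names."""
    # Drop list numbering like "1. " (digits, a dot, then whitespace).
    cleaned = re.sub(r"\b\d+\.\s+", " ", field)
    return [name for name in re.split(r"[\s,]+", cleaned) if name]


def tool_call_scores(expected_field, called_tools):
    """Compare expected tool names with the names the agent actually called."""
    expected = set(parse_expected_tools(expected_field))
    called = set(called_tools)
    hits = expected & called
    # Recall: share of expected tools that were called.
    recall = len(hits) / len(expected) if expected else 1.0
    # Precision: share of called tools that were expected.
    if called:
        precision = len(hits) / len(called)
    else:
        precision = 1.0 if not expected else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "exact_match": expected == called,
    }


if __name__ == "__main__":
    # The agent called the expected tool plus one extra tool.
    print(tool_call_scores("1. excel_tool", ["excel_tool", "weather_tool"]))
```

The check ignores call order and arguments; a stricter comparison would also need the `tool_input` field, whose exact CSV encoding is not specified here.
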


Provider: maas

Created: 2025-11-13
