hud-evals/SheetBench-50
Hugging Face | Updated 2025-12-03 | Indexed 2025-09-13
Download link:
https://hf-mirror.com/datasets/hud-evals/SheetBench-50
Description:
SheetBench-50 is a benchmark dataset of 50 spreadsheet tasks designed to evaluate how well AI agents handle data lookup, calculation, filtering, transformation, and multi-step analysis. Each task includes a spreadsheet pre-populated with test data, a natural-language prompt describing the task, an evaluation tool that checks the success criteria, and a correctness-based reward score between 0 and 1.
Provided by:
hud-evals



