hyunjun1121/MacroBench
Source: Hugging Face | Updated: 2025-10-10 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/hyunjun1121/MacroBench
Description:
MacroBench is a code-first benchmark that evaluates whether large language models (LLMs) can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium code. The dataset comprises 681 distinct automation tasks across six synthetic websites that emulate real-world platforms (TikTok, Reddit, Instagram, Facebook, Discord, Threads), and it includes complete experimental results from evaluating four state-of-the-art LLMs across 2,636 model-task combinations.
Provider:
hyunjun1121



