carcinize/modylbench
Source: Hugging Face | Updated: 2026-04-23 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/carcinize/modylbench
Description:
ModylBench is a benchmark dataset for evaluating AI agents as meeting participants. It targets multimodal, multi-turn professional meetings and assesses an agent's joint performance across speech, artifact edits, intent signals, and other aspects. The dataset contains complete meeting simulations in six professional verticals. Each scenario includes a human persona, 48 to 52 scripted turns, expected work outputs, and adversarial edge cases. Evaluation covers both interaction quality (the "journey") and work-product quality (the "destination") across 11 scoring dimensions. The data is divided into a standard test split (6 scenarios) and a hard test split (29 edge cases), and supports programmatic verification and scoring by multiple judges.
Provided by:
carcinize
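The description above does not publish a record schema or the rule used to combine judges' scores. As a minimal sketch under stated assumptions, the Python below models one scenario with hypothetical field names (`Scenario`, `vertical`, `persona`, `turns`, `expected_outputs`, `edge_cases`), checks the stated 48 to 52 turn range, and shows one possible way to fold several judges' scores over the 11 dimensions into a single number (median across judges per dimension, then an unweighted mean across dimensions). Both the schema and the aggregation rule are illustrative, not ModylBench's actual format or method.

```python
from dataclasses import dataclass, field
from statistics import mean, median

# Figures taken from the dataset description; everything else is hypothetical.
NUM_DIMENSIONS = 11   # scoring dimensions
TURN_RANGE = (48, 52) # scripted turns per scenario


@dataclass
class Scenario:
    # Hypothetical fields; the real dataset schema may differ.
    vertical: str     # professional domain
    persona: str      # human persona description
    turns: list[str]  # scripted utterances, one per turn
    expected_outputs: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

    def has_valid_length(self) -> bool:
        """True if the turn count falls within the stated 48-52 range."""
        lo, hi = TURN_RANGE
        return lo <= len(self.turns) <= hi


def aggregate_scores(judge_scores: list[dict[str, float]]) -> float:
    """Combine per-dimension scores from several judges into one number.

    Assumption (not ModylBench's documented rule): take the median across
    judges for each dimension, then the unweighted mean across dimensions.
    """
    if not judge_scores:
        raise ValueError("need at least one judge")
    dims = set(judge_scores[0])
    if len(dims) != NUM_DIMENSIONS:
        raise ValueError(f"expected {NUM_DIMENSIONS} dimensions, got {len(dims)}")
    if any(set(scores) != dims for scores in judge_scores):
        raise ValueError("all judges must score the same dimensions")
    # Median damps a single outlier judge on each dimension.
    per_dim = {d: median(scores[d] for scores in judge_scores) for d in dims}
    return mean(per_dim.values())


if __name__ == "__main__":
    dims = [f"dim_{i}" for i in range(NUM_DIMENSIONS)]
    judges = [{d: 4.0 for d in dims}, {d: 5.0 for d in dims}, {d: 3.0 for d in dims}]
    print(aggregate_scores(judges))  # 4.0
```

Using the median per dimension keeps one unusually harsh or lenient judge from dominating. A weighted mean or separate "journey" and "destination" subtotals would be equally plausible choices.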



