WildBench

arXiv | Updated 2024-06-07 | Indexed 2024-06-21
Download link:
https://hf.co/spaces/allenai/WildBench
Overview:
WildBench is an automated evaluation framework developed by the Allen Institute for Artificial Intelligence. It comprises 1,024 tasks selected from real conversations between users and chatbots, covering task types such as programming, mathematics, and data analysis, and it uses these complex tasks to evaluate large language models (LLMs). During construction, models such as GPT-4-Turbo were used to annotate task difficulty, followed by manual review to ensure quality. WildBench is particularly suited to testing how models perform on complex real-world tasks, and it targets the challenge of making LLM evaluation both automated and cost-effective.
Provided by:
Allen Institute for Artificial Intelligence
Created:
2024-06-07