arena-hard-auto-v0.1

Name: arena-hard-auto-v0.1
Creator: maas
Published: 2026-05-16 21:47:56
License: No description available

ModelScope community · Updated 2026-05-16 · Indexed 2025-03-29

Download link:

https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1


Resource description:

## Arena-Hard-Auto

**Arena-Hard-Auto-v0.1** ([See Paper](https://arxiv.org/abs/2406.11939)) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as a judge to compare each model's responses against those of a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest *correlation* and *separability* to Chatbot Arena among popular open-ended LLM benchmarks ([See Paper](https://arxiv.org/abs/2406.11939)). If you are curious to see how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.

Please check out our GitHub repo to learn how to evaluate models using Arena-Hard-Auto and to find more information about the benchmark.

If you find this dataset useful, feel free to cite us!

```
@article{li2024crowdsourced,
  title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline},
  author={Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E and Stoica, Ion},
  journal={arXiv preprint arXiv:2406.11939},
  year={2024}
}
```

#### Download method

:modelscope-code[]{type="sdk"}

:modelscope-code[]{type="git"}
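The SDK download route can be sketched as follows. This is a minimal sketch, assuming the `modelscope` Python package is installed and that the dataset ID matches the path in the download link above. The helper name `load_arena_hard` is hypothetical, and the structure of the returned object (splits, column names) is not described on this page, so inspect it before relying on it.

```python
# Dataset ID taken from the download link on this page.
DATASET_ID = "AI-ModelScope/arena-hard-auto-v0.1"


def load_arena_hard(dataset_id: str = DATASET_ID):
    """Download the dataset from ModelScope and return the loaded object.

    Hypothetical helper: the page itself only exposes SDK and git widgets.
    """
    # Imported lazily so this module can be imported without modelscope installed.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load(dataset_id)


# Example usage (requires network access and `pip install modelscope`):
# dataset = load_arena_hard()
# print(dataset)
```

The call is left commented out because it needs network access. Printing the returned object is a reasonable first step to see which splits and fields are available.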


Provided by:

maas

Created:

2025-03-27

Collected summary

Dataset introduction

Background and challenges

Background overview

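Arena-Hard-Auto-v0.1 is an automatic evaluation dataset for instruction-tuned large language models. It contains 500 high-difficulty user queries from Chatbot Arena. GPT-4-Turbo acts as the judge and compares each model's responses against those of a baseline model. Among open-ended LLM benchmarks, it has the highest correlation and separability with respect to Chatbot Arena, and it can effectively predict a model's performance on Chatbot Arena.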
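To make the judge-versus-baseline comparison concrete, here is a minimal sketch of how per-query verdicts could be aggregated into a win rate against the baseline. The verdict labels (`"win"`, `"tie"`, `"loss"`) and the rule of counting a tie as half a win are assumptions for illustration. This page does not specify the judge's output format or the scoring procedure, so this is not the benchmark's own scoring code.

```python
from collections import Counter

# Hypothetical verdict labels; the source does not specify the judge's output format.
WIN, TIE, LOSS = "win", "tie", "loss"


def win_rate(verdicts):
    """Return the share of comparisons won against the baseline.

    Ties count as half a win (an assumption made for this sketch).
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    unknown = set(counts) - {WIN, TIE, LOSS}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return (counts[WIN] + 0.5 * counts[TIE]) / len(verdicts)


# One verdict per query; a full run would have 500 of them.
example_verdicts = ["win", "loss", "tie", "win"]
print(win_rate(example_verdicts))  # 0.625
```

Under this rule, a model that ties the baseline on every query scores 0.5, so values above 0.5 indicate that the judge prefers the evaluated model more often than the baseline.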

The content above was collected and summarized by 遇见数据集.
