WildBench
收藏魔搭社区2025-12-05 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/allenai/WildBench
下载链接
链接失效反馈官方服务:
资源简介:
<div style="display: flex; justify-content: flex-start;"><img src="https://allenai.github.io/WildBench/wildbench_logo.png" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 800px;"> </div>
# 🦁 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
## Loading
```python
from datasets import load_dataset
wb_data = load_dataset("allenai/WildBench", "v2", split="test")
```
## Quick Links:
- [HF Leaderboard](https://huggingface.co/spaces/allenai/WildBench)
- [HF Dataset](https://huggingface.co/datasets/allenai/WildBench)
- [Github](https://github.com/allenai/WildBench)
## Dataset Description
- **License:** [CC BY](https://creativecommons.org/licenses/by/4.0/)
- **Language(s) (NLP):** English
- **Point of Contact:** [Yuchen Lin](mailto:yuchenl@allenai.org)
WildBench is a subset of [WildChat](https://huggingface.co/datasets/allenai/WildChat). The use of WildChat data to cause harm is strictly prohibited.
## Data Fields
The dataset on Hugging Face is organized with several features, each of which is designed to capture specific information pertinent to the data being represented. Here is a descriptive breakdown of each feature:
- `id`: A unique identifier for each entry, represented as an integer (`int64`). Not often used.
- `session_id`: A string that uniquely identifies an example, which is usually used as id.
- `conversation_input`: A list structure that encompasses multiple attributes related to the input of the conversation:
- `content`: The actual text content of the conversation input, stored as a string.
- `language`: A string indicating the language used in the conversation input.
- `redacted`: A boolean flag (`bool`) to denote whether any part of the content has been redacted for privacy or other reasons.
- `role`: A string indicating the role of the party in the conversation (e.g., 'user', 'assistant').
- `toxic`: A boolean indicating whether the content contains any toxic elements.
- `references`: A list of dict items.
- `gpt-4`: The value is the gpt-4 generation as the assistant to the next turn.
- `checklist`: A sequence of strings that could represent a set of questions to evaluate the outputs.
- `length`: An integer (`int64`) representing the length of the conversation or content. Note that this is the number of messages.
- `primary_tag`: A string that labels the entry with a primary category.
- `secondary_tags`: A sequence of strings providing additional categorizations.
- `intent`: A string indicating the underlying intent of the conversation or the interaction instance.
- `appropriate`: A string that assesses or describes whether the conversation or content is considered appropriate, potentially in terms of content, context, or some other criteria.
### Introduction of the WildBench Leaderboard
<details open><summary style="font-size: 1.5em; font-weight: bold;"> What is WildBench? Why should I use it?</summary>
<div style="font-size: 1.2em; margin-top: 30px;">
🦁 <b>WildBench</b> is a benchmark for evaluating large language models (LLMs) on challenging tasks that are more representative of real-world applications. The examples are collected from real users by the <a href="https://wildchat.allen.ai/"><b>AI2 WildChat</b></a> project.</li>
<br>
<b>🆕 Motivation</b>: We aim to provide a more <strong>realistic</strong> and <strong>challenging</strong> benchmark for evaluating LLMs, as opposed to existing benchmarks that do not capture the <em>diversity</em> and <em>complexity</em> of <em>real-world</em> tasks.
<h2 style="color: purple">🌠 Key Features:</h2>
<ul>
<li><b style="color: purple">🌟 Fine-grained:</b>
We provide a fine-grained annotation for each example, including task types and <b>checklists</b> for evaluating the quality of responses. In addition, we use <b>length-penalized</b> Elo ratings to ensure that the quality of responses is not biased towards longer outputs.</li>
<li><b style="color: purple">🌟 Transparent & Fair: </b> We test all LLMs on the SAME set of examples, ensuring a fair evaluation. You can explore the data and see the difference between two models to analyze the concrete gap between any pair of LLMs. </li>
<li><b style="color: purple">🌟 Easy & Fast:</b> WildBench (v1.0) contains 1024 examples, and it is extremely easy to add your own LLMs to our leaderboard! 1️⃣ Let us know your model ID and suggested inference configs; 2️⃣ We'll run inference and evaluation for you; 3️⃣ Voilà! We'll notify you when your results are ready on the leaderboard.</li>
<li><b style="color: purple">🌟 Dynamic:</b> WildBench will not be a static dataset. We will continue adding new examples and updating evaluation methods. Our goal is to include new challenging examples from real users over time and provide fast yet reliable evaluations.</li>
<li><b style="color: purple">🌟 Human Verification (ongoing):</b> Although we currently use GPT-4 as the automatic evaluator, we are also collecting human preferences here (see the 🔍 🆚 Tab). We plan to update the leaderboard by incorporating human evaluations in the near future.</li>
<li><b style="color: purple">🌟 Community-driven:</b> In addition to collecting human preferences for improving our evaluation, we also welcome community users to contribute new examples they find challenging to top LLMs like GPT-4/Claude3. Any feedback and suggestions are welcome, and we'll do our best to upgrade our data and evaluation methods accordingly. </li>
</ul>
</div>
</details>
## Licensing Information
WildBench is made available under the [CC BY](https://creativecommons.org/licenses/by/4.0/) license. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
## Citation
```bibtex
@article{yuchen2024wildbench,
title={WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild},
author={Yuchen Lin, Bill and Deng, Yuntian and Chandu, Khyathi and Brahman, Faeze and Ravichander, Abhilasha and Pyatkin, Valentina and Dziri, Nouha and Le Bras, Ronan and Choi, Yejin},
journal={arXiv e-prints},
pages={arXiv--2406},
year={2024}
}
```
https://arxiv.org/abs/2406.04770
<div style="display: flex; justify-content: flex-start;"><img src="https://allenai.github.io/WildBench/wildbench_logo.png" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 800px;"> </div>
# 🦁 WildBench:基于真实野外用户挑战性任务的大语言模型(Large Language Model,LLM)评测基准
## 加载方式
python
from datasets import load_dataset
wb_data = load_dataset("allenai/WildBench", "v2", split="test")
## 快速链接:
- [Hugging Face(HF)排行榜](https://huggingface.co/spaces/allenai/WildBench)
- [Hugging Face(HF)数据集](https://huggingface.co/datasets/allenai/WildBench)
- [Github](https://github.com/allenai/WildBench)
## 数据集说明
- **许可协议**:[知识共享署名许可(CC BY)](https://creativecommons.org/licenses/by/4.0/)
- **自然语言处理所用语言**:英语
- **联系人**:[林宇辰](mailto:yuchenl@allenai.org)
WildBench是[WildChat](https://huggingface.co/datasets/allenai/WildChat)的子集。严禁使用WildChat数据从事危害性行为。
## 数据字段
Hugging Face平台上的该数据集包含多个特征字段,每个字段均用于捕获与数据相关的特定信息。以下对各特征字段进行详细说明:
- `id`: 每个条目的唯一标识符,以整数(`int64`)形式表示,较少使用。
- `session_id`: 用于唯一标识示例的字符串,通常可作为id使用。
- `conversation_input`: 包含与对话输入相关的多个属性的列表结构:
- `content`: 对话输入的实际文本内容,以字符串形式存储。
- `language`: 表示对话输入所用语言的字符串。
- `redacted`: 布尔标志(`bool`),用于指示是否出于隐私或其他原因对内容的部分信息进行了编辑隐去。
- `role`: 表示对话参与方角色的字符串(例如:'user'(用户)、'assistant'(助手))。
- `toxic`: 布尔值,用于指示内容是否包含不当攻击性内容。
- `references`: 由字典项组成的列表:
- `gpt-4`: 其值为GPT-4生成的作为下一回合助手回复的内容。
- `checklist`: 可代表一组用于评估回复质量的问题的字符串序列。
- `length`: 表示对话或内容长度的整数(`int64`),此处的长度指消息条数。
- `primary_tag`: 用于为条目标注主要类别的字符串。
- `secondary_tags`: 提供额外分类类别的字符串序列。
- `intent`: 表示对话或交互实例的潜在意图的字符串。
- `appropriate`: 用于评估或描述对话或内容是否合规的字符串,评估维度可能包括内容、上下文或其他标准。
### WildBench排行榜介绍
<details open><summary style="font-size: 1.5em; font-weight: bold;">什么是WildBench?为何选择该基准?</summary>
<div style="font-size: 1.2em; margin-top: 30px;">
🦁 <b>WildBench</b> 是一款用于评测大语言模型(Large Language Model,LLM)的基准数据集,其所用任务更贴合真实应用场景中的挑战性任务。所有示例均由<a href="https://wildchat.allen.ai/"><b>艾伦人工智能研究所(AI2)WildChat</b></a>项目从真实用户处收集。</li>
<br>
<b>🆕 设计动机</b>:相较于现有基准无法体现真实世界任务的多样性与复杂性,我们旨在打造一款更具<strong>真实性</strong>与<strong>挑战性</strong>的LLM评测基准。
<h2 style="color: purple">🌠 核心特性:</h2>
<ul>
<li><b style="color: purple">🌟 细粒度标注</b>:我们为每个示例提供细粒度标注,包括任务类型以及<b>checklist(评估清单)</b>用于评估回复质量。此外,我们采用<b>长度惩罚(length-penalized)</b> Elo评分,以确保回复质量不会因输出长度过长而产生偏差。</li>
<li><b style="color: purple">🌟 透明公平</b>:我们将所有大语言模型在完全相同的示例集上进行测试,确保评测公平。您可以浏览数据集,对比任意两款模型的表现差异,分析它们之间的具体差距。</li>
<li><b style="color: purple">🌟 便捷高效</b>:WildBench(v1.0)包含1024个示例,且您可以极轻松地将自有LLM接入我们的排行榜!步骤如下:1️⃣ 告知我们您的模型ID与推荐的推理配置;2️⃣ 我们将为您完成推理与评测;3️⃣ 大功告成!当您的模型结果在排行榜上更新时,我们会通知您。</li>
<li><b style="color: purple">🌟 动态更新</b>:WildBench并非静态数据集。我们将持续新增示例并更新评测方法。我们的目标是随时间推移不断纳入来自真实用户的挑战性新示例,并提供快速且可靠的评测服务。</li>
<li><b style="color: purple">🌟 人工验证(进行中)</b>:尽管目前我们采用GPT-4作为自动评测器,但我们也正在收集人工偏好评分(详见🔍 🆚 标签页)。我们计划在不久的将来纳入人工评测结果以更新排行榜。</li>
<li><b style="color: purple">🌟 社区共建</b>:除了收集人工偏好评分以优化我们的评测体系外,我们也欢迎社区用户贡献他们认为对顶级LLM(如GPT-4、Claude3)具有挑战性的新示例。我们欢迎任何反馈与建议,并将尽力据此优化我们的数据集与评测方法。</li>
</ul>
</div>
</details>
## 许可信息
WildBench采用[知识共享署名许可(CC BY)](https://creativecommons.org/licenses/by/4.0/)许可协议发布。其旨在遵循艾伦人工智能研究所(AI2)的[负责任使用指南](https://allenai.org/responsible-use),供研究与教育用途使用。
## 引用格式
bibtex
@article{yuchen2024wildbench,
title={WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild},
author={Yuchen Lin, Bill and Deng, Yuntian and Chandu, Khyathi and Brahman, Faeze and Ravichander, Abhilasha and Pyatkin, Valentina and Dziri, Nouha and Le Bras, Ronan and Choi, Yejin},
journal={arXiv e-prints},
pages={arXiv--2406},
year={2024}
}
https://arxiv.org/abs/2406.04770
提供机构:
maas
创建时间:
2025-05-27
搜集汇总
数据集介绍

背景与挑战
背景概述
WildBench是一个用于评估大语言模型在真实用户挑战性任务上表现的数据集,包含1024个来自WildChat的示例,强调任务多样性和复杂性。其特点包括细粒度注释、公平的评估方法(如长度惩罚Elo评分)以及动态更新机制,旨在提供更贴近实际应用的基准测试。
以上内容由遇见数据集搜集并总结生成



