---
license: apache-2.0
task_categories:
- text-generation
- conversational
language:
- zh
size_categories:
- n<1K
---
# 💬 MT-Bench-ZH
👻 [GitHub](https://github.com/GeneZC/MiniMA/tree/main/mt_bench_zh)
## 🎯 Motivation
MiniChat-1/1.5/2-3B are all instruction-following language models that can handle Chinese instructions. However, there is currently no instruction-following benchmark specialized for Chinese, so our previous evaluation has been limited to English-only benchmarks (namely AlpacaEval and MT-Bench).
MT-Bench-ZH is built to fill this gap. It is essentially MT-Bench translated into Chinese by GPT-4 and then checked by humans. We hope MT-Bench-ZH helps the community develop better instruction-following language models that can tackle Chinese instructions.
## 🚀 Quick Start
> [!NOTE]
> The code is copied or adapted from [FastChat](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge); only `single` mode judgment is currently supported.
> Please refer to FastChat for more details.
### Install FastChat
```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,webui]"
```
### Generate Responses
```bash
python gen_model_answer.py --model-path GeneZC/MiniChat-2-3B --model-id minichat --bench-name mt_bench_zh --max-new-token 1536
```
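For orientation, the generation step consumes a JSON Lines question file and produces answers for each question. The sketch below assumes MT-Bench-ZH keeps the MT-Bench question layout (a `question_id`, a `category`, and a list of `turns` per record), which this card does not confirm. The helper name `summarize_questions` and the sample records are hypothetical.

```python
import json
from collections import Counter


def summarize_questions(lines):
    """Count questions per category and collect the turn counts seen.

    `lines` is any iterable of JSON Lines strings. The field names
    (`category`, `turns`) are assumed to follow the MT-Bench layout.
    """
    per_category = Counter()
    turn_counts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        per_category[record["category"]] += 1
        turn_counts.add(len(record["turns"]))
    return dict(per_category), sorted(turn_counts)


# Hypothetical records, only to exercise the helper.
sample = [
    json.dumps({"question_id": 1, "category": "writing", "turns": ["指令一", "指令二"]}),
    json.dumps({"question_id": 2, "category": "writing", "turns": ["指令三", "指令四"]}),
    json.dumps({"question_id": 3, "category": "math", "turns": ["指令五", "指令六"]}),
]

print(summarize_questions(sample))
```

To run it on real data, pass an open file handle for the question file in your checkout instead of `sample`.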
### Evaluate Responses
```bash
export OPENAI_API_KEY=XXXXXX # Set the OpenAI API key.
python gen_judgment.py --model-list minichat --bench-name mt_bench_zh --judge-file data/judge_prompts_zh.jsonl --parallel 4
```
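In `single` mode, the judge grades each response on its own and the numeric rating has to be recovered from the judge's free-text reply. The sketch below assumes the prompts in `data/judge_prompts_zh.jsonl` ask for a rating wrapped in double brackets (for example `[[8]]`), as the MT-Bench judge prompts do. The function name `parse_rating` and the `-1.0` sentinel are illustrative choices, not the repository's confirmed code.

```python
import re

# Primary format assumed for the judge's rating, e.g. "[[8]]" or "[[7.5]]".
RATING_PATTERN = re.compile(r"\[\[(\d+\.?\d*)\]\]")
# Looser fallback for replies that use single brackets, e.g. "[8]".
RATING_PATTERN_BACKUP = re.compile(r"\[(\d+\.?\d*)\]")


def parse_rating(judgment: str) -> float:
    """Return the first bracketed rating in `judgment`, or -1.0 if none is found."""
    match = RATING_PATTERN.search(judgment) or RATING_PATTERN_BACKUP.search(judgment)
    if match is None:
        return -1.0  # sentinel (assumed): the reply contained no parsable rating
    return float(match.group(1))


print(parse_rating("回答准确且详细。评分:[[8]]"))
```

Returning a sentinel instead of raising keeps a single malformed reply from aborting a long judging run; such entries can be filtered out when scores are aggregated.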
### Display Results
```bash
python show_result.py --bench-name mt_bench_zh
```
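The reported number is presumably an average of the judge's per-response scores. The sketch below shows that aggregation under stated assumptions: each line of the judgment file is a JSON record with `model` and `score` fields, and a negative score marks a judgment that could not be parsed. Neither the field names nor the sentinel are confirmed by this card, and `average_scores` is a hypothetical helper.

```python
import json
from collections import defaultdict


def average_scores(lines):
    """Average judge scores per model, skipping unparsed (negative) scores."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum of scores, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        score = record["score"]
        if score < 0:
            continue  # assumed sentinel for a failed judgment
        entry = totals[record["model"]]
        entry[0] += score
        entry[1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Hypothetical judgment records, only to exercise the helper.
sample_judgments = [
    json.dumps({"model": "minichat", "turn": 1, "score": 7}),
    json.dumps({"model": "minichat", "turn": 2, "score": 5}),
    json.dumps({"model": "minichat", "turn": 2, "score": -1}),
    json.dumps({"model": "other-model", "turn": 1, "score": 8}),
]

print(average_scores(sample_judgments))
```

A model whose judgments all failed to parse is left out of the result rather than being reported with a score of zero.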
## 🏆 Leaderboard
|Model|MT-Bench-ZH score|
|--|--|
|🥇 GPT-4|8.96|
|🥈 Zephyr-7B-Beta|6.27<sup>#</sup>|
|🥉 Qwen-Chat-7B|6.24|
|MiniChat-2-3B|6.04|
|Qwen-Chat-1.8B|5.65|
|LLaMA-2-Chat-7B|5.43<sup>#</sup>|
|Vicuna-7B|5.22<sup>#</sup>|
|StableLM-Zephyr-3B|4.31<sup>#</sup>|
|Rocket-3B|4.07<sup>#</sup>|
|Phi-2-DPO|1.59<sup>#</sup><sup>$</sup>|
<sup>#</sup> specialized mainly for English.
<sup>$</sup> finetuned without multi-turn instruction data.
## 🙌 Contributions
To raise a question about the benchmark, open an issue. To add another model's results to the leaderboard, open a pull request and attach the related files so the results can be sanity-checked: upload the model response file as a separate file and update the GPT-4 judgment file.