HalloMTBench
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/AIDC-AI/HalloMTBench
# HalloMTBench: A Benchmark for Translation Hallucination in LLMs
**[ Leaderboard ](https://github.com/AIDC-AI/HalloMTBench#leaderboard) | [ Paper ](https://github.com/AIDC-AI) | [ GitHub ](https://github.com/AIDC-AI/HalloMTBench)**
---
## Dataset Summary
**HalloMTBench** is a new and challenging benchmark designed to evaluate the performance of Large Language Models (LLMs) against translation hallucinations.
It is a high-quality, expert-verified dataset of **5,435 challenging samples** that capture naturally occurring hallucinations, providing a cost-effective and robust tool for evaluating model safety and reliability in translation tasks.
## Supported Tasks and Leaderboards
The primary use of this dataset is **evaluating the robustness of LLMs against translation hallucinations**. Models can be prompted to translate the `source_text`, and their output can then be compared against the `target_text` and `halluc_type` to measure their susceptibility to hallucination.
An official leaderboard and an evaluation tool, **HalloMTDetector**, are available in the [repository](https://github.com/AIDC-AI/HalloMTBench).
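The prompting step described above can be sketched as follows. This is a minimal sketch and not the official HalloMTDetector: `build_prompt`, `collect_outputs`, and the `translate` callable are hypothetical names, the prompt wording is an assumption, and the step that judges whether an output is hallucinated is deliberately left out because this card does not specify it.

```python
from typing import Callable, Dict, Iterable, List


def build_prompt(record: Dict[str, str]) -> str:
    """Build a translation prompt from one record (the wording is an assumption)."""
    # `lang_pair` looks like "en-ja": source code, hyphen, target code.
    src, tgt = record["lang_pair"].split("-")
    return (
        f"Translate the following text from '{src}' to '{tgt}'. "
        f"Return only the translation.\n\n{record['source_text']}"
    )


def collect_outputs(
    records: Iterable[Dict[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run a model over the records and keep its output next to the reference fields."""
    results = []
    for record in records:
        output = translate(build_prompt(record))
        # Keep the original fields so the output can later be compared against
        # `target_text` and `halluc_type`.
        results.append({**record, "model_output": output})
    return results
```

Here `translate` stands in for whatever client calls the model under test; any function that maps a prompt string to an output string will do.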
## Languages
The dataset covers **11 high-resource language pairs**, with English (`en`) as the source language.
* **Source Language:** English (`en`)
* **Target Languages:** Spanish (`es`), French (`fr`), Italian (`it`), Portuguese (`pt`), German (`de`), Russian (`ru`), Arabic (`ar`), Vietnamese (`vi`), Chinese (`zh`), Japanese (`ja`), Korean (`ko`).
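As a small illustration, the language pairs listed above can be enumerated and checked in code. The sketch assumes that `lang_pair` values follow the `en-ja` pattern shown in the data instance; `TARGET_LANGS`, `LANG_PAIRS`, and `target_language` are hypothetical names introduced here.

```python
SOURCE_LANG = "en"

# Target language codes and names as listed in this card.
TARGET_LANGS = {
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "pt": "Portuguese",
    "de": "German",
    "ru": "Russian",
    "ar": "Arabic",
    "vi": "Vietnamese",
    "zh": "Chinese",
    "ja": "Japanese",
    "ko": "Korean",
}

# All 11 pairs, e.g. "en-es", "en-fr", ...
LANG_PAIRS = [f"{SOURCE_LANG}-{code}" for code in TARGET_LANGS]


def target_language(lang_pair: str) -> str:
    """Return the English name of the target language, or raise on an unknown pair."""
    src, _, tgt = lang_pair.partition("-")
    if src != SOURCE_LANG or tgt not in TARGET_LANGS:
        raise ValueError(f"unexpected lang_pair: {lang_pair!r}")
    return TARGET_LANGS[tgt]
```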
## Dataset Structure
### Data Distribution
The 5,435 samples in the dataset are distributed across the four hallucination types as follows. "Avg. Target Length" is the average character length of the `target_text`.
| Hallucination Type | Count | Avg. Target Length |
|-----------------------------|-------|--------------------|
| Incorrect Target Language | 2,836 | 184.9 |
| Extraneous Addition | 1,907 | 143.8 |
| Untranslated Content | 635 | 4.9 |
| Repetition | 57 | 119.5 |
| **Total** | **5,435** | **148.7** |
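The totals row can be cross-checked from the per-type rows. The sketch below recomputes the sample count and the count-weighted average target length from the table values; the variable names are illustrative only.

```python
# (count, average target length in characters) per hallucination type, from the table.
distribution = {
    "Incorrect Target Language": (2836, 184.9),
    "Extraneous Addition": (1907, 143.8),
    "Untranslated Content": (635, 4.9),
    "Repetition": (57, 119.5),
}

total_count = sum(count for count, _ in distribution.values())

# The overall average is weighted by how many samples each type contributes.
weighted_avg = sum(count * avg for count, avg in distribution.values()) / total_count

# Share of the dataset taken by each type, as a fraction of all samples.
shares = {name: count / total_count for name, (count, _) in distribution.items()}

print(total_count)
print(round(weighted_avg, 2))
```

The counts sum to 5,435, and the weighted average comes out at roughly 148.76, which is consistent with the reported 148.7 given that the per-type averages are themselves rounded to one decimal place.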
### Data Instances
Each instance in the dataset is a JSON object representing a single, expert-verified example of a translation hallucination.
```json
{
  "source_text": "Third Congress",
  "target_text": "第三回国会",
  "lang_pair": "en-ja",
  "model": "qwen-max",
  "halluc_type": "Incorrect Language"
}
```
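A record like this can be loaded into a typed structure before evaluation. The following is a minimal sketch: `HalloMTSample` and `parse_sample` are hypothetical names, and the field list is taken from the single instance shown here. The sketch does not validate `halluc_type` values, since the label in the example ("Incorrect Language") is worded differently from the table row ("Incorrect Target Language").

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class HalloMTSample:
    """One benchmark record, with the five fields shown in the example instance."""
    source_text: str
    target_text: str
    lang_pair: str
    model: str
    halluc_type: str


def parse_sample(raw: str) -> HalloMTSample:
    """Parse one JSON object; a missing or unexpected key raises TypeError."""
    return HalloMTSample(**json.loads(raw))


sample = parse_sample(
    '{"source_text": "Third Congress", "target_text": "第三回国会", '
    '"lang_pair": "en-ja", "model": "qwen-max", "halluc_type": "Incorrect Language"}'
)
print(sample.lang_pair, sample.halluc_type)
```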
## License
The dataset is licensed under [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0).
Provider:
maas
Created:
2025-10-29



