mu-shroom
Source: ModelScope community, updated 2025-12-05, indexed 2025-08-16
Download link: https://modelscope.cn/datasets/Helsinki-NLP/mu-shroom
# The **Mu-SHROOM** dataset for Multilingual Hallucination and Overgeneration detection.
Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
## Dataset Description
Mu-SHROOM is a multilingual dataset for detecting hallucination spans in LLM outputs across 14 languages. It was created for [SemEval-2025 Task 3](https://helsinki-nlp.github.io/shroom/2025).
**Disclaimer**: Mu-SHROOM is not strictly a fact-checking dataset, but we mark it as such until `hallucination detection` (or something more adequate) is added to the official list of task_ids.
### Features
- **14 languages**: Arabic, Basque, Catalan, Chinese, Czech, English, Farsi, Finnish, French, German, Hindi, Italian, Spanish, Swedish
- **Splits**: `train_unlabeled`, `validation`, and `test` sets
- **Rich annotations**: Character-level hallucination spans with hard and soft labels, and annotator IDs
- **Model outputs**: Includes output tokens and logits from various LLMs
- **Full transparency**: For full replicability, [the official git repo](https://github.com/Helsinki-NLP/mu-shroom) provides all the scripts used to generate the outputs, along with scripts to replicate the annotation and evaluation platforms, the evaluation scripts, the raw data, and the shared-task participant kit.
## Dataset Structure
Each language is available as a separate subset, and the `all` subset contains a concatenation of all data.
Each example contains the following fields:
### Data Fields
- `id`: Unique example identifier
- `lang`: Language code (ISO 639-1)
- `model_input`: The input prompt given to the LLM
- `model_output_text`: The generated output text
- `model_id`: Identifier of the LLM that generated the output
- `wikipedia_url`: Reference Wikipedia URL used for annotation
- `soft_labels`: Probabilistic character spans of hallucinations `[{"start": int, "end": int, "prob": float}]`
- `hard_labels`: Binary character spans of hallucinations `[[start, end]]` (marked as 1 when the majority of annotators marked it as a hallucination)
- `model_output_logits`: Logits from the LLM generation
- `model_output_tokens`: Tokenized output
- `annotations`: Raw annotations from multiple annotators `[{"annotator_id": str, "labels": [[start, end]]}]`
- `annotator_id`: Unique identifier for each annotator (useful for studying annotation trends, such as disagreement)
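The relationship between `annotations`, `soft_labels`, and `hard_labels` can be illustrated with a minimal sketch. It assumes that a character's soft probability is the fraction of annotators who marked it, and that a hard label requires a strict majority (probability above 0.5). The helper names `char_probabilities` and `majority_spans` are hypothetical and the annotations are toy data; the scripts in the official repo remain the reference for how the released labels were actually built.

```python
def char_probabilities(annotations, text_length):
    """Fraction of annotators who marked each character (assumed definition)."""
    counts = [0] * text_length
    for ann in annotations:
        marked = set()
        for start, end in ann["labels"]:
            # Spans are treated as end-exclusive, as in text[start:end].
            marked.update(range(start, min(end, text_length)))
        for i in marked:
            counts[i] += 1
    n = len(annotations)
    return [c / n for c in counts] if n else [0.0] * text_length


def majority_spans(probs, threshold=0.5):
    """Merge consecutive characters with prob > threshold into [start, end] spans."""
    spans, start = [], None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            spans.append([start, i])
            start = None
    if start is not None:
        spans.append([start, len(probs)])
    return spans


# Toy annotations in the documented format (not taken from the dataset).
toy_annotations = [
    {"annotator_id": "a", "labels": [[0, 4]]},
    {"annotator_id": "b", "labels": [[2, 6]]},
    {"annotator_id": "c", "labels": [[2, 4], [8, 10]]},
]
toy_probs = char_probabilities(toy_annotations, text_length=10)
toy_hard = majority_spans(toy_probs)
print(toy_hard)
```

Only characters 2 and 3 are marked by more than half of the three toy annotators, so they form the single majority span.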
### Data Splits
Each language has:
- `train_unlabeled`: Unlabeled training data (available for some languages)
- `validation`: Labeled validation set
- `test`: Labeled test set
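Because `train_unlabeled` is available only for some languages, indexing it directly can fail for the others. A small guard is sketched below, with a plain dictionary standing in for the object returned by `load_dataset`; the helper `get_split` is hypothetical and relies only on the split container supporting `in` and key lookup.

```python
def get_split(dataset, name):
    """Return the requested split, or None when the subset does not provide it."""
    return dataset[name] if name in dataset else None


# Stand-in for a language subset that ships without unlabeled training data.
fake_subset = {"validation": ["example-1"], "test": ["example-2"]}

train = get_split(fake_subset, "train_unlabeled")
val = get_split(fake_subset, "validation")
print(train is None, len(val))
```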
## How to Use
### Loading the Dataset
```python
from datasets import load_dataset
# Load a specific language (e.g., English)
dataset = load_dataset("Helsinki-NLP/mu-shroom", "en")
# Access splits
train = dataset["train_unlabeled"]
val = dataset["validation"]
test = dataset["test"]
```
### Load all languages combined
```python
full_dataset = load_dataset("Helsinki-NLP/mu-shroom", "all")
```
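Since every example carries a `lang` field, the combined subset can be broken down per language. The sketch below uses a few hand-written rows in place of a real split; `count_by_lang` is a hypothetical helper that accepts any iterable of example dictionaries.

```python
from collections import Counter


def count_by_lang(rows):
    """Count examples per language code using the `lang` field."""
    return Counter(row["lang"] for row in rows)


# Hand-written stand-in rows; real rows contain many more fields.
sample_rows = [
    {"id": "x1", "lang": "en"},
    {"id": "x2", "lang": "fi"},
    {"id": "x3", "lang": "en"},
]
lang_counts = count_by_lang(sample_rows)
print(lang_counts.most_common())
```

With the real data, `count_by_lang(full_dataset["validation"])` should give the per-language sizes, assuming the `all` subset exposes the same split names as the per-language subsets.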
### Example Usage
```python
# Get an example from validation set
example = dataset["validation"][0]
print(f"Language: {example['lang']}")
print(f"Input: {example['model_input']}")
print(f"Model Output: {example['model_output_text']}")
print(f"Hallucination spans: {example['hard_labels']}")
# Visualize hallucination spans: each hard label is a [start, end] character range
text = example["model_output_text"]
for span in example["hard_labels"]:
    start, end = span
    print(f"Hallucinated text: '{text[start:end]}'")
```
Expected output:
```text
Language: en
Input: What did Petra van Staveren win a gold medal for?
Model Output: Petra van Stoveren won a silver medal in the 2008 Summer Olympics in Beijing, China.
Hallucination spans: [[25, 31], [45, 49], [69, 83]]
Hallucinated text: 'silver'
Hallucinated text: '2008'
Hallucinated text: 'Beijing, China'
```
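To see the spans in context rather than as isolated substrings, the output text can be rendered with inline markers. This is a minimal sketch that assumes the spans do not overlap; `highlight` and its marker arguments are hypothetical names, and the text and spans are copied from the example output above.

```python
def highlight(text, spans, open_mark="[[", close_mark="]]"):
    """Wrap each [start, end] character span of `text` in marker strings."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(open_mark + text[start:end] + close_mark)
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last span
    return "".join(pieces)


output_text = (
    "Petra van Stoveren won a silver medal in the 2008 "
    "Summer Olympics in Beijing, China."
)
hard_labels = [[25, 31], [45, 49], [69, 83]]
marked = highlight(output_text, hard_labels)
print(marked)
```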
# Shared Task Information: Quick Overview
Mu-SHROOM is part of SemEval-2025 Task 3. Participants were asked to detect hallucination spans in LLM outputs.
They were evaluated using [this evaluation script](https://github.com/Helsinki-NLP/mu-shroom/blob/main/participant_kit/scorer.py) on two metrics:
- Intersection-over-Union (IoU) of hallucinated characters
- correlation between predicted and empirical probabilities
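The first metric can be sketched at character level as follows. This is an illustration only, not the official scorer: the function names are hypothetical, spans are treated as end-exclusive, and returning 1.0 when neither side marks any character is an assumption. The correlation metric is not reproduced here; the linked script defines it.

```python
def span_chars(spans):
    """Expand [start, end] spans into the set of character indices they cover."""
    return {i for start, end in spans for i in range(start, end)}


def char_iou(reference_spans, predicted_spans):
    """Intersection-over-Union of hallucinated character indices."""
    ref = span_chars(reference_spans)
    pred = span_chars(predicted_spans)
    if not ref and not pred:
        return 1.0  # assumption: agreeing that nothing is hallucinated is a perfect match
    return len(ref & pred) / len(ref | pred)


# Reference marks 'silver' and '2008'; the prediction only finds 'silver'.
score = char_iou([[25, 31], [45, 49]], [[25, 31]])
print(score)
```

Here 6 of the 10 reference characters are recovered and nothing extra is predicted, so the score is 6 / 10.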
# Citation
If you use this dataset, please cite the SemEval-2025 task proceedings *(citation information to be updated after the workshop)*:
```bib
@inproceedings{vazquez-etal-2025-mu-shroom,
author={Ra\'ul V\'azquez and Timothee Mickus and Elaine Zosa and Teemu Vahtola and J\"org Tiedemann and Aman Sinha and Vincent Segonne and Fernando S\'anchez-Vega and Alessandro Raganato and Jindřich Libovický and Jussi Karlgren and Shaoxiong Ji and Jindřich Helcl and Liane Guillou and Ona de Gibert and Jaione Bengoetxea and Joseph Attieh and Marianna Apidianaki},
title={Sem{E}val-2025 {T}ask 3: {Mu-SHROOM}, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes},
url={https://helsinki-nlp.github.io/shroom/2025},
booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
publisher = "Association for Computational Linguistics",
month = jul,
year = "2025",
address = "Vienna, Austria",
}
```
## Contact
For questions about the dataset, please contact the organizers:
- Raúl Vázquez (University of Helsinki)
- Timothee Mickus (University of Helsinki)
## 👥🙌🌐 Join the SHROOM Community
Whether you're interested in joining the next round, learning from past editions, or just staying informed about hallucination detection in NLG, we'd love to have you in the community.
- Check out the [**\*SHROOM** shared task series](https://helsinki-nlp.github.io/shroom/)
- Join the conversation on [Slack](https://join.slack.com/t/shroom-shared-task/shared_invite/zt-2mmn4i8h2-HvRBdK5f4550YHydj5lpnA)
- Check out the Google groups of past editions:
  - [Mu-SHROOM 2025](https://groups.google.com/g/semeval-2025-task-3-mu-shroom)
  - [SHROOM 2024](https://groups.google.com/g/semeval-2024-task-6-shroom)
Provider: maas
Created: 2025-08-15



