Helsinki-NLP/shroom-cap
Source: Hugging Face. Updated 2026-02-11; indexed 2026-04-05.
Download link: https://hf-mirror.com/datasets/Helsinki-NLP/shroom-cap
Resource description:
---
license: mpl-2.0
tags:
- multilingual
- hallucination-detection
- scientific-text
- cross-lingual
- classification
- factuality
- fluency
- LLM-evaluation
---
# SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications
## Dataset Summary
SHROOM-CAP is a multilingual dataset for hallucination detection in scientific text generated by large language models (LLMs). The dataset covers nine languages: five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource Indic languages (Bengali, Gujarati, Malayalam, and Telugu). Each instance consists of LLM-generated text, token sequences, logits, and metadata about the source scientific publication. The dataset provides binary labels for:
- **Factual mistakes:** whether the text contains hallucinated or factually incorrect content.
- **Fluency mistakes:** whether the text contains linguistic errors affecting readability.
The task frames hallucination detection as binary classification: for each LLM-generated answer, predict whether it contains factual mistakes and whether it contains fluency mistakes.
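Since each label is a binary `y`/`n` decision, predictions for either label can be checked with standard classification metrics. The sketch below computes F1 for the positive class `y`; it is only an illustration, not the shared task's official scorer, and the function name `binary_f1` is hypothetical.

```python
def binary_f1(gold, predicted, positive="y"):
    """F1 score of the positive class for two equal-length label lists.

    Illustrative only: this is NOT the official SHROOM-CAP scorer.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    false_pos = sum(g != positive and p == positive for g, p in pairs)
    false_neg = sum(g == positive and p != positive for g, p in pairs)

    # With no true positives, precision or recall is zero (or undefined),
    # so F1 is reported as 0.0.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

The same function can be applied once to the `has_factual_mistakes` predictions and once to the `has_fluency_mistakes` predictions, since the two labels are independent.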
## Dataset Structure
The dataset is organized into the following splits:
| Split | Examples | Description |
|-------|---------|------------|
| `train` | 1,755 | Training set batch 1 (en, hi, es, fr, it) |
| `validation` | 1,200 | Validation set (en, hi, es, fr, it) |
| `test` | 4,384 | Test set (all 9 languages, including the low-resource Indic languages bn, te, ml, gu). Labels are withheld to reduce the risk of leakage; contact the authors for more information. |
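A minimal sketch of loading the splits with the Hugging Face `datasets` library and comparing their sizes against the table above. It assumes the repository's default configuration exposes the splits under the names `train`, `validation`, and `test`; the helper names `load_shroom_cap` and `check_split_sizes` are hypothetical.

```python
# Split sizes as listed in the table above.
EXPECTED_SIZES = {"train": 1755, "validation": 1200, "test": 4384}


def load_shroom_cap():
    """Download the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access. Assumes the
    default configuration yields a DatasetDict keyed by split name.
    """
    from datasets import load_dataset

    return load_dataset("Helsinki-NLP/shroom-cap")


def check_split_sizes(actual_sizes, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs.

    A split missing from `actual_sizes` is reported with an actual size of None.
    """
    return {
        name: (size, actual_sizes.get(name))
        for name, size in expected.items()
        if actual_sizes.get(name) != size
    }
```

After `ds = load_shroom_cap()`, passing `{name: len(split) for name, split in ds.items()}` to `check_split_sizes` returns an empty dictionary when the downloaded data matches the table.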
Each example contains:
- `index`: unique identifier
- `title`, `abstract`, `doi`, `url`, `datafile`: source publication metadata
- `authors`: list of author names (`first` and `last`)
- `question`: question about the publication
- `model_id`: the LLM used for generation
- `model_config`: model configuration parameters
- `prompt`: prompt used for generation
- `output_text`: LLM-generated answer
- `output_tokens`: tokenized model output
- `output_logits`: token-level logits
- `has_fluency_mistakes`: binary label (`y`/`n`); `null` in the test split
- `has_factual_mistakes`: binary label (`y`/`n`); `null` in the test split
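The fields above can be handled roughly as follows. This is a sketch under two assumptions that the card does not confirm: that a record arrives as a plain dictionary with `null` decoded to `None`, and that `output_logits` holds one value per entry of `output_tokens`. The helper names are hypothetical.

```python
from typing import Optional


def label_to_bool(value: Optional[str]) -> Optional[bool]:
    """Map 'y'/'n' to True/False; keep None for unlabeled test records."""
    if value is None:
        return None
    normalized = value.strip().lower()
    if normalized == "y":
        return True
    if normalized == "n":
        return False
    raise ValueError(f"unexpected label: {value!r}")


def extract_labels(example: dict) -> dict:
    """Collect both binary labels of one record as booleans (or None)."""
    return {
        "factual": label_to_bool(example.get("has_factual_mistakes")),
        "fluency": label_to_bool(example.get("has_fluency_mistakes")),
    }


def pair_tokens_with_logits(example: dict) -> list:
    """Zip tokens with logits.

    ASSUMPTION: one logit per token. A length mismatch raises instead of
    silently truncating, so a wrong assumption surfaces early.
    """
    tokens = example["output_tokens"]
    logits = example["output_logits"]
    if len(tokens) != len(logits):
        raise ValueError("output_tokens and output_logits differ in length")
    return list(zip(tokens, logits))
```

Raising on unexpected label values and on length mismatches is a deliberate choice here: it keeps schema surprises visible rather than letting them leak into downstream training code.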
## Source
- Sinha, Aman et al. (2025). [SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications](https://aclanthology.org/2025.chomps-main.7/). *Proceedings of CHOMPS 2025*.
## Citation
```bibtex
@inproceedings{sinha-etal-2025-shroom,
    title = "{SHROOM}-{CAP}: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications",
    author = "Sinha, Aman and
      Gamba, Federica and
      V{\'a}zquez, Ra{\'u}l and
      Mickus, Timothee and
      Chattopadhyay, Ahana and
      Zanella, Laura and
      Arakkal Remesh, Binesh and
      Kankanampati, Yash and
      Chandramania, Aryan and
      Agarwal, Rohit",
    editor = {Sinha, Aman and
      V{\'a}zquez, Ra{\'u}l and
      Mickus, Timothee and
      Agarwal, Rohit and
      Buhnila, Ioana and
      Schmidtov{\'a}, Patr{\'i}cia and
      Gamba, Federica and
      Prasad, Dilip K. and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.chomps-main.7/",
    pages = "70--80",
    ISBN = "979-8-89176-308-1",
}
```
Provided by: Helsinki-NLP



