Helsinki-NLP/shroom-cap

Name: Helsinki-NLP/shroom-cap
Creator: Helsinki-NLP
Published: 2026-02-11 12:13:47
License: mpl-2.0

Hugging Face · Updated 2026-02-11 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/Helsinki-NLP/shroom-cap


Resource description:

---
license: mpl-2.0
tags:
- multilingual
- hallucination-detection
- scientific-text
- cross-lingual
- classification
- factuality
- fluency
- LLM-evaluation
---

# SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications

## Dataset Summary

SHROOM-CAP is a multilingual dataset for hallucination detection in scientific text generated by large language models (LLMs). The dataset covers nine languages: five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource Indic languages (Bengali, Gujarati, Malayalam, and Telugu).

Each instance consists of LLM-generated text, token sequences, logits, and metadata about the source scientific publication. The dataset provides binary labels for:

- **Factual mistakes:** whether the text contains hallucinated or factually incorrect content.
- **Fluency mistakes:** whether the text contains linguistic errors affecting readability.

The task frames hallucination detection as a binary classification problem, with LLMs required to predict factual and fluency mistakes.

## Dataset Structure

The dataset is organized into the following splits:

| Split | Examples | Description |
|-------|---------|------------|
| `train` | 1,755 | Training set batch 1 (en, hi, es, fr, it) |
| `validation` | 1,200 | Validation set (en, hi, es, fr, it) |
| `test` | 4,384 | Test set (all 9 languages, including the Indic languages bn, te, ml, gu); labels not included to help fight against leakage. Contact the authors for more info. |

Each example contains:

- `index`: unique identifier
- `title`, `abstract`, `doi`, `url`, `datafile`: source publication metadata
- `authors`: list of author names (`first` and `last`)
- `question`: question about the publication
- `model_id`: the LLM used for generation
- `model_config`: model configuration parameters
- `prompt`: prompt used for generation
- `output_text`: LLM-generated answer
- `output_tokens`: tokenized model output
- `output_logits`: token-level logits
- `has_fluency_mistakes`: binary label (`y`/`n`) or `null` for test
- `has_factual_mistakes`: binary label (`y`/`n`) or `null` for test

## Source

- Sinha, Aman et al. (2025). [SHROOM-CAP: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications](https://aclanthology.org/2025.chomps-main.7/). *Proceedings of CHOMPS 2025*.

## Citation

```bibtex
@inproceedings{sinha-etal-2025-shroom,
    title = "{SHROOM}-{CAP}: Shared Task on Hallucinations and Related Observable Overgeneration Mistakes in Crosslingual Analyses of Publications",
    author = "Sinha, Aman and Gamba, Federica and V{\'a}zquez, Ra{\'u}l and Mickus, Timothee and Chattopadhyay, Ahana and Zanella, Laura and Arakkal Remesh, Binesh and Kankanampati, Yash and Chandramania, Aryan and Agarwal, Rohit",
    editor = {Sinha, Aman and V{\'a}zquez, Ra{\'u}l and Mickus, Timothee and Agarwal, Rohit and Buhnila, Ioana and Schmidtov{\'a}, Patr{\'i}cia and Gamba, Federica and Prasad, Dilip K. and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.chomps-main.7/",
    pages = "70--80",
    ISBN = "979-8-89176-308-1",
}
```
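## Example: working with the label fields

The `has_factual_mistakes` and `has_fluency_mistakes` fields use `y`/`n` strings, with `null` for unlabeled test rows, so a first step in most pipelines is mapping them to integers and dropping unlabeled rows. The sketch below is illustrative only: the helper names (`label_to_int`, `binary_f1`), the toy records, and the choice of positive-class F1 as a metric are assumptions made here, not something this card specifies.

```python
from typing import Optional


def label_to_int(value: Optional[str]) -> Optional[int]:
    """Map the card's `y`/`n` labels to 1/0; keep None for unlabeled rows."""
    if value is None:
        return None
    mapping = {"y": 1, "n": 0}
    if value not in mapping:
        raise ValueError(f"unexpected label: {value!r}")
    return mapping[value]


def binary_f1(gold: list[int], pred: list[int]) -> float:
    """F1 for the positive class (1 = the text has the mistake)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical records that only mimic the two label fields; not real data.
toy_examples = [
    {"index": "a", "has_factual_mistakes": "y", "has_fluency_mistakes": "n"},
    {"index": "b", "has_factual_mistakes": "n", "has_fluency_mistakes": "n"},
    {"index": "c", "has_factual_mistakes": "y", "has_fluency_mistakes": "y"},
    {"index": "d", "has_factual_mistakes": None, "has_fluency_mistakes": None},
]

# Keep only rows that carry a factual label, then score made-up predictions.
labeled = [ex for ex in toy_examples if ex["has_factual_mistakes"] is not None]
gold_factual = [label_to_int(ex["has_factual_mistakes"]) for ex in labeled]
toy_predictions = [1, 1, 0]  # made-up system output for rows a, b, c
factual_f1 = binary_f1(gold_factual, toy_predictions)
```

Rows loaded with the Hugging Face `datasets` library (for example `load_dataset("Helsinki-NLP/shroom-cap", split="validation")`, assuming the default configuration loads as-is) could be passed through the same helpers, once for each of the two label fields.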

Provided by:

Helsinki-NLP
