drivel-hub
Source: ModelScope community. Updated 2026-01-08; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/extraordinarylab/drivel-hub
Resource description:
# Drivelology Multilingual Dataset
Paper: [Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth](https://huggingface.co/papers/2509.03867)
Code / Project Page: [https://github.com/ExtraOrdinaryLab/drivelology](https://github.com/ExtraOrdinaryLab/drivelology)
The DrivelHub Dataset is a curated collection of linguistic samples, characterized as "nonsense with depth" (utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive), designed to support research in humor detection and other forms of playful or deceptive language constructs.
Each entry contains a short "Drivelology" style text sample, categorised under one of five nuanced rhetorical types: inversion, misdirection, paradox, switchbait, wordplay.
The dataset covers six languages, with Mandarin provided in two scripts: Simplified Chinese (zh) and Traditional Chinese (zh-hant), plus Korean (ko), Japanese (ja), Spanish (es), English (en), and French (fr).
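As a rough illustration of the structure described above, the following Python sketch models one entry and checks it against the language codes and rhetorical types listed in this card. The `DrivelSample` class, its field names (`text`, `language`, `labels`), and the `make_sample` helper are hypothetical; the actual column names in the released files may differ, so treat this as a sketch under those assumptions rather than the dataset's real schema.

```python
from dataclasses import dataclass

# Language codes and rhetorical types exactly as listed in this card.
LANGUAGES = {"zh", "zh-hant", "ko", "ja", "es", "en", "fr"}
CATEGORIES = {"inversion", "misdirection", "paradox", "switchbait", "wordplay"}


@dataclass(frozen=True)
class DrivelSample:
    """Hypothetical in-memory record; the field names are assumptions."""
    text: str
    language: str
    labels: frozenset

    def __post_init__(self):
        # Reject language codes that the card does not list.
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language code: {self.language!r}")
        # Require at least one label, all drawn from the five types.
        unknown = set(self.labels) - CATEGORIES
        if not self.labels or unknown:
            raise ValueError(f"invalid labels: {sorted(self.labels)}")


def make_sample(text, language, labels):
    """Normalise case and whitespace, then build a validated sample."""
    return DrivelSample(
        text=text.strip(),
        language=language.strip().lower(),
        labels=frozenset(label.strip().lower() for label in labels),
    )
```

Normalising case before validation lets a label written as `Paradox` in one place and `paradox` in another resolve to the same category.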
### Tasks
The Drivelology benchmark evaluates models on four main tasks, as described in the accompanying paper and code repository:
1. **Multiple-Choice Question Answering (MCQA):** This task asks models to pick the correct narrative for a Drivelology sample from several options. It includes Easy and Hard versions.
2. **Detection:** This is a binary classification task where LLMs identify whether a text is Drivelology or not.
3. **Narrative Writing:** This task assesses the model's ability to generate a coherent and meaningful implicit narrative that underlies a given Drivelology sample.
4. **Multi-label Tagging:** Models are asked to assign one or more rhetorical categories (Misdirection, Paradox, Switchbait, Inversion, Wordplay) to each Drivelology sample.
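To make the two classification-style tasks more concrete, here is a minimal Python sketch of how predictions for Detection and Multi-label Tagging could be scored. The function names and the choice of metrics (plain accuracy, exact-match ratio, micro-averaged F1) are assumptions made for illustration; the paper and the official repository define the metrics actually used by the benchmark.

```python
def detection_accuracy(gold, pred):
    """Fraction of samples whose binary prediction matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def tagging_scores(gold, pred):
    """Exact-match ratio and micro-averaged F1 over per-sample label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = exact = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
        exact += int(g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    micro_f1 = 2 * precision * recall / denom if denom else 0.0
    exact_match = exact / len(gold) if gold else 0.0
    return {"exact_match": exact_match, "micro_f1": micro_f1}
```

A short usage example with made-up predictions:

```python
gold_tags = [{"paradox"}, {"wordplay", "inversion"}]
pred_tags = [{"paradox"}, {"wordplay"}]
print(tagging_scores(gold_tags, pred_tags))
print(detection_accuracy([True, False, True, True], [True, True, True, False]))
```

Micro-averaging pools label decisions across all samples, so a sample with several gold labels contributes more to the score than a single-label one; a per-sample or macro average would weight things differently.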
### Sample Usage
To run the evaluation tasks or interact with the dataset as described in the paper, please refer to the [official GitHub repository](https://github.com/ExtraOrdinaryLab/drivelology).
> **Update:** Drivelology is now officially supported by `evalscope`, which is the recommended way to run evaluations. See the [evalscope pull request](https://github.com/modelscope/evalscope/pull/927) for details. The original execution scripts are kept for legacy purposes.
# Citing
Accepted for an oral presentation at EMNLP 2025. Find our paper on [arXiv](https://www.arxiv.org/abs/2509.03867).
```bibtex
@misc{wang2025drivelologychallengingllmsinterpreting,
title={Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth},
author={Yang Wang and Chenghao Xiao and Chia-Yi Hsiao and Zi Yan Chang and Chi-Li Chen and Tyler Loakman and Chenghua Lin},
year={2025},
eprint={2509.03867},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.03867},
}
@inproceedings{wang-etal-2025-drivel,
title = "Drivel-ology: Challenging {LLM}s with Interpreting Nonsense with Depth",
author = "Wang, Yang and
Xiao, Chenghao and
Hsiao, Chia-Yi and
Chang, Zi Yan and
Chen, Chi-Li and
Loakman, Tyler and
Lin, Chenghua",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1177/",
doi = "10.18653/v1/2025.emnlp-main.1177",
pages = "23085--23107",
ISBN = "979-8-89176-332-6",
abstract = "We introduce Drivelology, a unique linguistic phenomenon characterised as ``nonsense with depth'' - utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of over 1,200+ meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence."
}
```
Provider:
maas
Created:
2025-10-27



