ColBERT dataset - 200k short texts for humor detection

Name: ColBERT dataset - 200k short texts for humor detection
Creator: Annamoradnejad, Issa
Published: 2021-03-09 00:00:00
License: no description available

Source: IEEE; updated 2021-03-09; indexed 2026-04-17

Download link:

https://ieee-dataport.org/documents/colbert-dataset-200k-short-texts-humor-detection

Description:

Automatic humor detection has interesting use cases in modern technologies, such as chatbots and virtual assistants. Existing humor detection datasets usually combine formal non-humorous texts and informal jokes with incompatible statistics (text length, word count, etc.). This makes it more likely that humor can be detected with simple analytical models, without understanding the underlying latent linguistic features and structures. We introduce a new combined dataset for the task of humor detection, entitled “ColBERT dataset”, which contains 200k labeled short texts, equally distributed between humor and non-humor. We reduced or completely removed the issues of existing datasets in the new dataset. The dataset is much larger than previous datasets, and it includes texts with similar textual features. The correlation between character count and the target is insignificant (+0.09), and there is no notable connection between the target value and sentiment features (correlation coefficients of -0.09 and +0.02 for polarity and subjectivity, respectively).
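The description reports a correlation coefficient between character count and the humor label. As a minimal sketch of how such a figure could be checked, the code below computes a Pearson correlation between text length and a binary label. The helper names (`pearson`, `length_label_correlation`) and the toy rows are assumptions made for illustration; they are not part of the dataset or of the authors' published analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_label_correlation(rows):
    """Correlate character count with a binary label.

    rows: iterable of (text, is_humor) pairs (hypothetical input shape).
    """
    rows = list(rows)
    lengths = [len(text) for text, _ in rows]
    labels = [1 if is_humor else 0 for _, is_humor in rows]
    return pearson(lengths, labels)


# Toy rows invented for illustration only; they are NOT taken from the dataset.
toy_rows = [
    ("aaaa", True),
    ("bb", False),
    ("cc", True),
    ("dddd", False),
]

r = length_label_correlation(toy_rows)
print(f"correlation between length and label: {r:+.2f}")
```

On the real data, the rows could be read with `csv.DictReader` and passed to `length_label_correlation`, assuming the downloaded file exposes one text column and one boolean label column; the exact column names should be checked against the file itself. A value near zero, like the reported +0.09, suggests that text length alone is a weak predictor of the label.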

Provider:

Annamoradnejad, Issa

Created:

2021-03-09
