CreativeLang/ColBERT_Humor_Detection
Source: Hugging Face | Updated: 2023-07-06 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/CreativeLang/ColBERT_Humor_Detection
Resource description:
---
license: cc-by-2.0
---
# ColBERT_Humor
## Dataset Description
- **Paper:** [ColBERT: Using BERT Sentence Embedding for Humor Detection](https://arxiv.org/abs/2004.12765)
## Dataset Summary
ColBERT Humor contains 200,000 labeled short texts, split evenly between humorous and non-humorous content. It was created to overcome a limitation of earlier humor detection datasets, whose texts were inconsistent in length, word count, and formality, which let simple models predict the labels without capturing the nuances of humor. The texts come from two sources: the News Category dataset, which contains 200k Huffington Post news headlines (2012-2018), and a collection of 231,657 Reddit jokes. The texts were preprocessed to be syntactically similar, so that models must rely on linguistic detail rather than surface features to distinguish humor, giving humor detection research a more demanding benchmark.
For full details, see the original [paper](https://arxiv.org/abs/2004.12765).
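As a rough illustration of the "equally distributed" claim, the sketch below computes the share of each label in a list of records. It rests on assumptions that this card does not state: the column names `text` and `humor`, the split name `train`, and loading via the Hugging Face `datasets` package. The sample rows are invented stand-ins, not rows from the dataset.

```python
from collections import Counter


def label_balance(records, label_key="humor"):
    """Return the fraction of records carrying each label value."""
    counts = Counter(bool(r[label_key]) for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def load_colbert_humor(repo_id="CreativeLang/ColBERT_Humor_Detection", split="train"):
    """Load the dataset from the Hugging Face Hub.

    Assumption: the repository exposes a split called "train".
    The import is lazy so label_balance works without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


# Invented toy rows that mimic the assumed schema (text + boolean humor label).
sample = [
    {"text": "Why did the chicken cross the road? To get to the other side.", "humor": True},
    {"text": "City council approves new budget for road repairs", "humor": False},
    {"text": "I told my plants a joke, and now they are rooting for me.", "humor": True},
    {"text": "Local library extends weekend opening hours this summer", "humor": False},
]

balance = label_balance(sample)
print(balance)
```

If the assumed column and split names hold, `label_balance(load_colbert_humor())` should report shares close to 0.5 for each label.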
Metadata in the Creative Language Toolkit ([CLTK](https://github.com/liyucheng09/cltk)):
- CL Type: Humor
- Task Type: detection
- Size: 200k
- Created time: 2020
### Citation Information
If you find this dataset helpful, please cite:
```
@article{annamoradnejad2020colbert,
title={Colbert: Using bert sentence embedding for humor detection},
author={Annamoradnejad, Issa and Zoghi, Gohar},
journal={arXiv preprint arXiv:2004.12765},
year={2020}
}
```
### Contributions
If you have any questions, please open an issue or email [yucheng.li@surrey.ac.uk](mailto:yucheng.li@surrey.ac.uk).
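Provider:
CreativeLang
Summary of Original Information
ColBERT_Humor Dataset Overview
Dataset Description
- Dataset name: ColBERT_Humor
- Sources: the News Category dataset (200k news headlines) and a collection of Reddit jokes (231,657 entries).
- Size: 200,000 labeled short texts, split evenly between humorous and non-humorous content.
- Purpose: to address the limitations of earlier humor detection datasets in text length, word count, and formality, and to provide a more complex and substantial platform for humor detection research.
- Preprocessing: rigorously preprocessed to ensure syntactic similarity, so that models must capture linguistic nuance to distinguish humor.
- Created: 2020
Citation Information
- Paper: ColBERT: Using BERT Sentence Embedding for Humor Detection
- Authors: Annamoradnejad, Issa and Zoghi, Gohar
- Year of publication: 2020
Dataset Metadata
- CL type: Humor
- Task type: detection
- Dataset size: 200k
Contributions and Queries
- Contact: yucheng.li@surrey.ac.uk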



