Telugu Humour dataset

Name: Telugu Humour dataset
Creator: figshare
Published: 2025-05-01 04:44:28
License: No description available

DataCite Commons: updated 2025-05-01, indexed 2024-08-18

Download link:

https://figshare.com/articles/dataset/Telugu_Humour_dataset/20219042/1

Description:

The growing use of online social media has produced enormous amounts of user-generated data. Social media sites have become platforms where users voice their opinions in real time. Sites such as Twitter limit the number of characters in a tweet, which encourages creative, humorous and ambiguous language to convey a message. As a result, automatic humor detection is a difficult task, especially for low-resource languages such as the Dravidian languages. Humor detection is well studied for resource-rich languages because rich and accurate data are available. In this paper, we address this gap for Telugu, a low-resource Dravidian language, by collecting and annotating Telugu tweets and performing automatic humor detection on the collected data. We experimented on the corpus with several transformer models (Multilingual BERT, Multilingual DistilBERT and XLM-RoBERTa) to establish a baseline classification system. XLM-RoBERTa was the best-performing model, achieving an F1-score of 0.82 with 81.5% accuracy. Cite our work if you use our data.
<pre><code>@inproceedings{bellamkonda-etal-2022-dataset,
  title = "A Dataset for Detecting Humor in {T}elugu Social Media Text",
  author = "Bellamkonda, Sriphani and Lohakare, Maithili and Patel, Shaswat",
  booktitle = "Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages",
  month = may,
  year = "2022"
}</code></pre>
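For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and F1 are computed for a binary humorous/non-humorous classifier. The label lists are toy values invented for illustration and are not drawn from the dataset. The listing does not say which F1 averaging the authors used, so this sketch assumes the positive-class (label 1) F1. The function name `binary_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): 1 = humorous, 0 = non-humorous.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts all correct predictions, while F1 balances precision and recall on the humorous class, which is why the two reported figures can differ.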

Provider:

figshare

Created:

2022-07-03
