MultiSoc-4D : Multi Task Learning (MTL) Dataset
Source: Mendeley Data (indexed 2026-05-21)
Download link:
https://data.mendeley.com/datasets/twv2kyggt8
Description:
MultiSoc-4D is a multi-task learning dataset for Bengali text classification, built from posts on social media platforms such as Facebook, Twitter (X), YouTube, TikTok, Likee, and Instagram. It aims to preserve the natural attributes of social media data, including imbalanced classes, diverse topics, and varied languages. The dataset is annotated along four dimensions: category, sentiment, hateful, and sarcasm. Each sample is assigned exactly one of eight mutually exclusive categories: International, National, Sports, Education, Entertainment, Economy, Technology, and Others. Others is a catch-all for samples that do not fit any of the defined topics. Sentiment is labeled Positive, Negative, or Neutral; the Neutral label is assigned to objectively neutral content as well as to sentiment-neutral samples. Hate speech is a binary label (Yes or No), where Yes applies to samples containing direct or indirect hate speech directed at individuals or groups. Sarcasm is likewise a binary label (Yes or No) and marks implicit expressions whose intended meaning diverges from the literal meaning. The annotation process uses four large language models, namely ChatGPT, Gemini, Claude, and Grok, chosen for their high proficiency in instruction following and text classification. All four models are given identical annotation instructions. In line with the closed-set labeling procedure, each dimension receives exactly one label from a predefined set.
Overall, this is synthetic data addressing the problem of instruction-based, closed-set labeling with LLMs used as annotators.
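The closed-set scheme described above can be sketched as a small validator. This is only an illustrative sketch: the label values are taken from the description, but the dictionary keys, the `LABEL_SETS` name, and the `validate_annotation` helper are hypothetical and may not match the column names or tooling of the actual dataset files.

```python
# Hypothetical sketch of closed-set validation for the four annotation
# dimensions. Label values come from the dataset description; the key names
# and this helper are assumptions, not part of the published dataset.

LABEL_SETS = {
    "category": {
        "International", "National", "Sports", "Education",
        "Entertainment", "Economy", "Technology", "Others",
    },
    "sentiment": {"Positive", "Negative", "Neutral"},
    "hateful": {"Yes", "No"},
    "sarcasm": {"Yes", "No"},
}


def validate_annotation(annotation):
    """Return a list of problems; an empty list means the annotation is valid.

    Closed-set labeling requires exactly one label per dimension, drawn
    from that dimension's predefined set.
    """
    errors = []
    for dimension, allowed in LABEL_SETS.items():
        if dimension not in annotation:
            errors.append(f"missing dimension: {dimension}")
        elif annotation[dimension] not in allowed:
            errors.append(
                f"label {annotation[dimension]!r} is not allowed for {dimension}"
            )
    # Reject dimensions outside the predefined schema.
    for dimension in annotation:
        if dimension not in LABEL_SETS:
            errors.append(f"unexpected dimension: {dimension}")
    return errors


if __name__ == "__main__":
    sample = {
        "category": "Sports",
        "sentiment": "Neutral",
        "hateful": "No",
        "sarcasm": "Yes",
    }
    print(validate_annotation(sample))
```

A check of this kind is one way to catch out-of-set responses when an LLM is used as annotator, since a model may occasionally return a label that is not in the predefined set.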
Created:
2026-05-18



