Code-Mixed Indic Languages with Emoticons for Sarcasm Detection
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/bdm2y2p3rc
Resource description:
This dataset consists of code-mixed multilingual text data designed for sentiment analysis research. It captures naturally occurring code-mixed patterns combining English with ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu and Urdu.
The dataset aims to support studies in multilingual NLP, sentiment classification, and language processing for real-world social media and conversational data.
Dataset Description
The dataset contains the following attributes:
• Text: The original code-mixed text sample.
• Sentiment: The corresponding sentiment label (positive, negative, or neutral).
• Translated_text: English translation of the original text.
• Cleaned_text: Text after preprocessing, including lowercasing, punctuation and stopword removal, and normalization.
• Tokens: Tokenized representation of the cleaned text.
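The attribute list above can be mirrored in a small loader. The following is a minimal sketch, assuming the data is distributed as a CSV file whose column headers match the attribute names exactly (Text, Sentiment, Translated_text, Cleaned_text, Tokens) and that Tokens is stored as a plain string. The file format, the `Sample` class, the `read_samples` helper, and the two example rows are illustrative assumptions, not taken from the dataset.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass

# Label set as stated in the attribute list.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


@dataclass
class Sample:
    """One row of the dataset (hypothetical container, one field per attribute)."""
    text: str
    sentiment: str
    translated_text: str
    cleaned_text: str
    tokens: str  # kept as a raw string; the on-disk token format is not documented here


def read_samples(handle):
    """Read rows from a CSV file object and check each sentiment label."""
    samples = []
    for row in csv.DictReader(handle):
        label = row["Sentiment"].strip().lower()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unexpected sentiment label: {row['Sentiment']!r}")
        samples.append(
            Sample(
                text=row["Text"],
                sentiment=label,
                translated_text=row["Translated_text"],
                cleaned_text=row["Cleaned_text"],
                tokens=row["Tokens"],
            )
        )
    return samples


# Made-up rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = (
    "Text,Sentiment,Translated_text,Cleaned_text,Tokens\n"
    "Yeh movie bahut acchi thi!,Positive,This movie was very good!,"
    "yeh movie bahut acchi thi,yeh movie bahut acchi thi\n"
    "Service bilkul bekar hai,negative,The service is totally useless,"
    "service bilkul bekar,service bilkul bekar\n"
)

samples = read_samples(io.StringIO(SAMPLE_CSV))
label_counts = Counter(s.sentiment for s in samples)
```

To use this on a real file, pass an open file object (for example `open(path, newline="", encoding="utf-8")`) to `read_samples` in place of the in-memory string. `label_counts` then gives a quick view of class balance.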
Preprocessing involved cleaning (removal of punctuation, URLs, and emojis), normalization of repeated characters, language-specific stopword removal, translation to English, and token formation for downstream NLP tasks.
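The cleaning steps described above can be sketched roughly as follows. This is not the dataset authors' pipeline. It is a minimal illustration under several assumptions: runs of three or more identical characters are collapsed to two, punctuation and emojis are detected through Unicode general categories (P and S), and the stopword set is a tiny hypothetical placeholder for the language-specific lists. The translation step is left out because the source does not say which tool was used.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Assumption: collapse 3+ repeats of a character to exactly two ("sooooo" -> "soo").
REPEAT_RE = re.compile(r"(.)\1{2,}")

# Hypothetical placeholder; real use would supply one list per language.
DEFAULT_STOPWORDS = frozenset({"the", "is", "a"})


def strip_punct_and_emoji(text):
    """Replace punctuation (category P*) and symbols (category S*, which
    covers most emojis) with spaces. Combining marks are kept, so Indic
    vowel signs survive. The emoji variation selector U+FE0F is dropped."""
    return "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") or ch == "\ufe0f" else ch
        for ch in text
    )


def preprocess(text, stopwords=DEFAULT_STOPWORDS):
    """Return (cleaned_text, tokens) for one raw sample."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs before punctuation is stripped
    text = strip_punct_and_emoji(text)
    text = REPEAT_RE.sub(r"\1\1", text)   # normalize elongated words
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens), tokens


cleaned, tokens = preprocess("The movie is sooooo good!!! 😂 https://example.com/x")
```

Stripping by Unicode category rather than with a `\w`-based regex is a deliberate choice here: in Python, `\w` does not match Indic combining vowel signs, so a pattern like `[^\w\s]` would damage native-script text.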
Created:
2025-10-10



