
PakDramaEcho: A DistilBERT-Labeled Urdu Sentiment Dataset

DataCite Commons | Updated: 2026-04-20 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/mh3mws5wns/2
Description:
📊 PakDramaEcho Dataset Description & Research Interpretation

🧠 Research Hypothesis
This study assumes that sentiment expressed in YouTube comments on Pakistani drama content can be effectively modeled using transformer-based NLP (DistilBERT). It further hypothesizes that these sentiments reflect meaningful emotional patterns linked to storylines, characters, and dramatic events in Urdu dramas. Specifically:
* Informal Urdu/Roman Urdu still contains strong sentiment signals despite noise
* DistilBERT can perform reliable sentiment labeling in low-resource settings
* Viewer comments reflect real audience emotional response to dramas

📦 Data Overview
PakDramaEcho is a sentiment analysis dataset created from YouTube comments on Pakistani drama videos.
* Source: YouTube drama comment sections
* Language: Urdu, Roman Urdu, mixed English-Urdu
* Domain: Pakistani TV dramas
* Task: Sentiment classification (Positive / Neutral / Negative)
* Labeling Method: DistilBERT-based automatic annotation
* Format: CSV file

🧹 Data Collection & Processing
Collection
* Scraped publicly available YouTube comments from drama-related videos
* Only public comments were included
Preprocessing
* Removed URLs, emojis, and special characters
* Cleaned repeated/noisy text
* Normalized Urdu text
* Filtered empty/irrelevant comments
* Optional deduplication applied
Labeling
* Sentiment labels generated using DistilBERT classifier
* Classes: Positive, Neutral, Negative

📈 Key Insights
1. Strong Emotional Engagement: Most comments are emotionally expressive, especially positive ones, showing strong audience connection with drama content.
2. Noisy & Informal Language: Includes Roman Urdu, spelling variations, and mixed-language text, reflecting real-world social media challenges.
3. Sentiment Imbalance: Positive sentiment dominates, likely due to fan engagement and selective commenting behavior.
4. Context-Dependent Emotion: Sentiment depends on characters, emotional scenes, and plot twists rather than isolated words.
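
The preprocessing steps listed under Data Collection & Processing can be sketched roughly as follows. This is an illustrative approximation, not the authors' pipeline: the function names `clean_comment` and `preprocess`, the choice of NFC normalization, and the rule of collapsing runs of three or more identical characters to two are all assumptions made for this example.

```python
import re
import unicodedata

# Assumed patterns: the dataset description does not specify exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # runs of 3+ identical characters


def clean_comment(text: str) -> str:
    """Apply a rough version of the listed cleaning steps to one comment."""
    text = URL_RE.sub(" ", text)               # remove URLs
    text = unicodedata.normalize("NFC", text)  # assumed normalization form
    # Keep letters (L*), combining marks (M*), digits (N*) and whitespace;
    # everything else (most emoji, punctuation, symbols) becomes a space.
    text = "".join(
        ch if ch.isspace() or unicodedata.category(ch)[0] in "LMN" else " "
        for ch in text
    )
    text = REPEAT_RE.sub(r"\1\1", text)        # "achaaaa" -> "achaa"
    return " ".join(text.split())              # collapse whitespace


def preprocess(comments: list[str]) -> list[str]:
    """Clean all comments, drop empty ones, and deduplicate in order."""
    cleaned = (clean_comment(c) for c in comments)
    return list(dict.fromkeys(c for c in cleaned if c))
```

Because the filter works on Unicode categories rather than an ASCII whitelist, Urdu script and Roman Urdu both pass through unchanged. Note that format characters such as the zero-width non-joiner are replaced by spaces here, which may split some Urdu words; a production pipeline would likely treat them separately.
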

🔍 Interpretation
This dataset should be treated as a real-world noisy corpus rather than a clean benchmark. It represents:
* Audience perception, not objective truth
* Emotional response to entertainment media
* A resource for low-resource Urdu NLP research
⚠️ Since labels are generated using DistilBERT, there may be classification noise and model bias.

🚀 Potential Use Cases
* Urdu sentiment classification benchmarking
* Transformer model training (BERT, RoBERTa, DistilBERT)
* Low-resource NLP and domain adaptation
* Social media opinion mining
* Audience behavior analysis

⚠️ Limitations
* Auto-generated labels (not human verified)
* Domain-specific (Pakistani dramas only)
* Noisy informal text
* Possible sentiment imbalance
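
Anyone training on these labels may want to measure the sentiment imbalance first and compensate for it. The sketch below computes the label distribution and inverse-frequency class weights (the common "balanced" heuristic, weight = N / (K × n_c)). The helper names and the example labels are hypothetical; the description does not state the CSV column names or the actual class counts.

```python
from collections import Counter


def label_distribution(labels: list[str]) -> dict[str, float]:
    """Return the share of each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Made-up labels for illustration only, not values from the dataset.
example = ["Positive"] * 6 + ["Neutral"] * 3 + ["Negative"] * 1
weights = balanced_class_weights(example)
```

Rarer classes receive larger weights, which can then be passed to a weighted loss during fine-tuning. Since the labels themselves are model-generated, reweighting also amplifies any labeling noise in the minority classes.
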
Provider:
Mendeley Data
Created:
2026-04-20