PakDramaEcho: A DistilBERT-Labeled Urdu Sentiment Dataset

Source: NIAID Data Ecosystem. Indexed on 2026-05-10.

Download link:

https://data.mendeley.com/datasets/mh3mws5wns

Description:

📊 PakDramaEcho Dataset Description & Research Interpretation

🧠 Research Hypothesis

This study assumes that sentiment expressed in YouTube comments on Pakistani drama content can be effectively modeled using transformer-based NLP (DistilBERT). It further hypothesizes that these sentiments reflect meaningful emotional patterns linked to storylines, characters, and dramatic events in Urdu dramas. Specifically:

* Informal Urdu/Roman Urdu still contains strong sentiment signals despite noise
* DistilBERT can perform reliable sentiment labeling in low-resource settings
* Viewer comments reflect real audience emotional response to dramas

📦 Data Overview

PakDramaEcho is a sentiment analysis dataset created from YouTube comments on Pakistani drama videos.

Source: YouTube drama comment sections
Language: Urdu, Roman Urdu, mixed English-Urdu
Domain: Pakistani TV dramas
Task: Sentiment classification (Positive / Neutral / Negative)
Labeling Method: DistilBERT-based automatic annotation
Format: CSV file

🧹 Data Collection & Processing

Collection

* Scraped publicly available YouTube comments from drama-related videos
* Only public comments were included

Preprocessing

* Removed URLs, emojis, and special characters
* Cleaned repeated/noisy text
* Normalized Urdu text
* Filtered empty/irrelevant comments
* Optional deduplication applied

Labeling

* Sentiment labels generated using DistilBERT classifier
* Classes: Positive, Neutral, Negative

📈 Key Insights

1. Strong Emotional Engagement
Most comments are emotionally expressive, especially positive ones, showing strong audience connection with drama content.

2. Noisy & Informal Language
Includes Roman Urdu, spelling variations, and mixed-language text, reflecting real-world social media challenges.

3. Sentiment Imbalance
Positive sentiment dominates, likely due to fan engagement and selective commenting behavior.

4. Context-Dependent Emotion
Sentiment depends on characters, emotional scenes, and plot twists rather than isolated words.
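The preprocessing steps listed under Data Collection & Processing could be approximated along the following lines. This is a minimal sketch, not the authors' pipeline: the function names `clean_comment` and `preprocess`, the specific regular expressions, the rule of collapsing runs of three or more identical letters, and the Arabic-to-Urdu character mapping are all assumptions introduced for illustration. The description does not say how "normalized Urdu text" or "irrelevant comments" were defined, so only empty comments are filtered here.

```python
import re
import unicodedata

# Assumed URL pattern; the dataset description does not specify one.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Emoji variation selectors and zero-width joiner, dropped before symbol removal.
EMOJI_GLUE_RE = re.compile("[\uFE00-\uFE0F\u200D]")
# Runs of 3+ identical letters ("sooooo"), collapsed to two (assumed rule).
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")
# Assumed Urdu normalization: map Arabic yeh/kaf to their usual Urdu forms.
URDU_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def _strip_symbols(text: str) -> str:
    """Replace everything except letters, marks, digits, and whitespace with a space."""
    kept = []
    for ch in text:
        is_text = unicodedata.category(ch)[0] in "LMN"
        # Keep the zero-width non-joiner, which is legitimate inside Urdu words.
        if is_text or ch.isspace() or ch == "\u200C":
            kept.append(ch)
        else:
            kept.append(" ")
    return "".join(kept)


def clean_comment(text: str) -> str:
    """Clean one comment: URLs, emojis, special characters, repeats, whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_GLUE_RE.sub("", text)
    text = text.translate(URDU_MAP)
    text = _strip_symbols(text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())


def preprocess(comments, deduplicate=True):
    """Clean all comments, drop empty ones, and optionally drop duplicates."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = clean_comment(raw)
        if not text:
            continue  # comment was empty or contained only symbols/emojis
        if deduplicate:
            key = text.casefold()
            if key in seen:
                continue
            seen.add(key)
        cleaned.append(text)
    return cleaned
```

Symbols are removed by Unicode category rather than with an ASCII-only pattern so that Urdu letters and diacritics survive alongside Roman Urdu text. Deduplication is a flag because the description calls it optional.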
🔍 Interpretation

This dataset should be treated as a real-world noisy corpus rather than a clean benchmark. It represents:

* Audience perception, not objective truth
* Emotional response to entertainment media
* A resource for low-resource Urdu NLP research

⚠️ Since labels are generated using DistilBERT, there may be classification noise and model bias.

🚀 Potential Use Cases

* Urdu sentiment classification benchmarking
* Transformer model training (BERT, RoBERTa, DistilBERT)
* Low-resource NLP and domain adaptation
* Social media opinion mining
* Audience behavior analysis

⚠️ Limitations

* Auto-generated labels (not human verified)
* Domain-specific (Pakistani dramas only)
* Noisy informal text
* Possible sentiment imbalance
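Because the description reports that positive sentiment dominates, it may be worth checking the label distribution before training on the CSV. The sketch below counts labels and derives inverse-frequency class weights using the common "balanced" heuristic, weight = N / (K * n_c), where N is the number of rows, K the number of classes, and n_c the count for class c. The column names `comment` and `label` are assumptions, since the description does not document the CSV schema, and the sample rows are made-up toy data, not taken from PakDramaEcho.

```python
import csv
import io
from collections import Counter


def label_distribution(rows, label_field="label"):
    """Return {label: (count, share)} for an iterable of dict rows."""
    counts = Counter(row[label_field].strip() for row in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def balanced_class_weights(rows, label_field="label"):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(row[label_field].strip() for row in rows)
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Toy rows for illustration only; column names are hypothetical.
SAMPLE_CSV = """comment,label
zabardast episode,Positive
best drama ever,Positive
bohat acha,Positive
love this couple,Positive
next episode kab aayega,Neutral
bakwas ending,Negative
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
distribution = label_distribution(rows)
weights = balanced_class_weights(rows)

for label, (count, share) in distribution.items():
    print(f"{label}: {count} rows ({share:.0%}), weight {weights[label]:.2f}")
```

To run this against the real file, the `io.StringIO` sample would be replaced by an opened CSV handle, with `label_field` adjusted to whatever the actual column is called. Weights like these only compensate for imbalance; they do not correct the noise in the auto-generated labels noted under Limitations.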

Created:

2026-04-17
