PakDramaEcho: A DistilBERT-Labeled Urdu Sentiment Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/mh3mws5wns
Resource description:
📊 PakDramaEcho Dataset Description & Research Interpretation
🧠 Research Hypothesis
This study assumes that sentiment expressed in YouTube comments on Pakistani drama content can be effectively modeled using transformer-based NLP (DistilBERT). It further hypothesizes that these sentiments reflect meaningful emotional patterns linked to storylines, characters, and dramatic events in Urdu dramas.
Specifically:
* Informal Urdu/Roman Urdu still contains strong sentiment signals despite noise
* DistilBERT can perform reliable sentiment labeling in low-resource settings
* Viewer comments reflect real audience emotional response to dramas
📦 Data Overview
PakDramaEcho is a sentiment analysis dataset created from YouTube comments on Pakistani drama videos.
Source: YouTube drama comment sections
Language: Urdu, Roman Urdu, mixed English-Urdu
Domain: Pakistani TV dramas
Task: Sentiment classification (Positive / Neutral / Negative)
Labeling Method: DistilBERT-based automatic annotation
Format: CSV file
🧹 Data Collection & Processing
Collection
* Scraped publicly available YouTube comments from drama-related videos
* Only public comments were included
Preprocessing
* Removed URLs, emojis, and special characters
* Cleaned repeated/noisy text
* Normalized Urdu text
* Filtered empty/irrelevant comments
* Optional deduplication applied
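The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions, the use of Unicode NFKC as a stand-in for "normalized Urdu text", and the function names `clean_comment` and `preprocess` are all assumptions.

```python
import re
import unicodedata

# Assumed patterns; the dataset description does not specify the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keeps Unicode letters, digits, underscore and whitespace. Emojis, punctuation
# and symbols are dropped, and so are combining marks such as Urdu diacritics.
NON_TEXT_RE = re.compile(r"[^\w\s]")
# A character repeated three or more times is shortened to two.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean_comment(text: str) -> str:
    """Clean one raw comment: normalize, strip URLs/emojis/symbols, trim repeats."""
    text = unicodedata.normalize("NFKC", text)  # assumed normalization step
    text = URL_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())  # collapse runs of whitespace


def preprocess(comments, deduplicate=True):
    """Clean comments, drop empty ones, and optionally remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = clean_comment(raw)
        if not text:
            continue  # nothing left after cleaning
        if deduplicate:
            if text in seen:
                continue
            seen.add(text)
        cleaned.append(text)
    return cleaned


print(clean_comment("Bohat achaaaa drama!!! 😍 https://youtu.be/xyz"))
# → Bohat achaa drama
```

Deduplication here compares cleaned text, so two comments that differ only in punctuation or emojis count as duplicates; whether the original pipeline behaved this way is not stated.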
Labeling
* Sentiment labels generated using a DistilBERT classifier
* Classes: Positive, Neutral, Negative
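As a rough sketch of how automatic annotation of this kind might be wired up, the snippet below runs comments through a classifier in batches and maps its raw output onto the three classes. The classifier is passed in as a callable, because the description does not name the DistilBERT checkpoint; `label_comments`, `LABEL_MAP`, and the toy `keyword_stub` are hypothetical names used only for illustration, and the stub is not a real model.

```python
from typing import Callable, Dict, List, Sequence

CLASSES = ("Positive", "Neutral", "Negative")

# Hypothetical mapping from raw model label strings to the dataset's classes.
# The label names of the actual checkpoint are not given in the description.
LABEL_MAP = {"positive": "Positive", "neutral": "Neutral", "negative": "Negative"}

Classifier = Callable[[List[str]], List[Dict[str, object]]]


def label_comments(texts: Sequence[str], classify: Classifier, batch_size: int = 32):
    """Label texts in batches; returns dicts with text, label and score."""
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for text, pred in zip(batch, classify(batch)):
            raw = str(pred["label"]).lower()
            rows.append({
                "text": text,
                "label": LABEL_MAP[raw],  # KeyError flags an unexpected label
                "score": float(pred["score"]),
            })
    return rows


def keyword_stub(batch: List[str]) -> List[Dict[str, object]]:
    """Toy stand-in for a DistilBERT classifier, used only to run the sketch."""
    out = []
    for text in batch:
        lowered = text.lower()
        if "acha" in lowered or "love" in lowered:
            out.append({"label": "POSITIVE", "score": 0.9})
        elif "bura" in lowered or "bakwas" in lowered:
            out.append({"label": "NEGATIVE", "score": 0.9})
        else:
            out.append({"label": "NEUTRAL", "score": 0.5})
    return out


rows = label_comments(["bohat acha drama", "bakwas ending"], keyword_stub)
print([r["label"] for r in rows])
# → ['Positive', 'Negative']
```

With the Hugging Face `transformers` library, a `pipeline("text-classification", model=...)` object called on a list of strings returns dicts with `label` and `score` keys, so it could be passed as `classify`. The model name would have to be supplied, since the source does not give one.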
📈 Key Insights
1. Strong Emotional Engagement
Most comments are emotionally expressive, especially positive ones, showing strong audience connection with drama content.
2. Noisy & Informal Language
Includes Roman Urdu, spelling variations, and mixed-language text, reflecting real-world social media challenges.
3. Sentiment Imbalance
Positive sentiment dominates, likely due to fan engagement and selective commenting behavior.
4. Context-Dependent Emotion
Sentiment depends on characters, emotional scenes, and plot twists rather than isolated words.
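The sentiment imbalance noted above can be quantified before training. The sketch below computes the class distribution and inverse-frequency class weights, using the same formula as scikit-learn's "balanced" class weights: n_samples / (n_classes * class_count). The label list is toy data made up for illustration and is not taken from the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Toy labels, invented for this example only.
toy_labels = ["Positive"] * 6 + ["Neutral"] * 3 + ["Negative"] * 1

print(class_distribution(toy_labels))
# → {'Positive': 0.6, 'Neutral': 0.3, 'Negative': 0.1}
weights = balanced_class_weights(toy_labels)
print(round(weights["Negative"], 2))
# → 3.33
```

Weights like these could be passed to a weighted loss so that the rarer classes are not ignored when the positive class dominates.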
🔍 Interpretation
This dataset should be treated as a real-world noisy corpus rather than a clean benchmark.
It represents:
* Audience perception, not objective truth
* Emotional response to entertainment media
* A resource for low-resource Urdu NLP research
⚠️ Since labels are generated using DistilBERT, there may be classification noise and model bias.
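One way to gauge this label noise is to draw a small stratified sample for manual review and measure how often a human annotator agrees with the automatic label. The sketch below assumes rows shaped as dicts with `text` and `label` keys; that layout, and the helper names `stratified_sample` and `agreement_rate`, are illustrative assumptions rather than part of the dataset.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_class, seed=0):
    """Pick up to `per_class` rows for each label, reproducibly."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


def agreement_rate(auto_labels, human_labels):
    """Fraction of items where the automatic and human labels match."""
    if len(auto_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        return 0.0
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


# Toy labels, invented for this example only.
auto = ["Positive", "Neutral", "Negative", "Positive"]
human = ["Positive", "Neutral", "Positive", "Positive"]
print(agreement_rate(auto, human))
# → 0.75
```

Sampling per class rather than uniformly keeps the rarer classes represented in the review set even when positive labels dominate.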
🚀 Potential Use Cases
* Urdu sentiment classification benchmarking
* Transformer model training (BERT, RoBERTa, DistilBERT)
* Low-resource NLP and domain adaptation
* Social media opinion mining
* Audience behavior analysis
⚠️ Limitations
* Auto-generated labels (not human verified)
* Domain-specific (Pakistani dramas only)
* Noisy informal text
* Sentiment imbalance (positive class dominates)
Created:
2026-04-17



