Name: RUEmoCorp
Creator: Harvard Dataverse
Published: 2026-05-11 14:23:22
License: No description available

Download link:

https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/BPWHOZ


Official services:

Resource description:

<h2>RUEmoCorp: A Large-Scale Roman Urdu Emotion Corpus</h2> <p> RUEmoCorp is a large-scale emotion classification corpus for Roman Urdu, the informal, phonetically transliterated writing style widely used across Pakistani social media, messaging applications, and online communities. Roman Urdu remains severely underrepresented in natural language processing research despite being one of the dominant written forms of Urdu in digital communication. Unlike standard Urdu written in Nastaliq script, Roman Urdu has no standardized orthography, exhibits substantial spelling variation, and frequently contains code-mixed English expressions, making emotion recognition particularly challenging. </p> <p> To address this gap, RUEmoCorp provides both a formally annotated benchmark dataset and a large-scale raw corpus designed to support research in low-resource multilingual NLP, affective computing, cross-lingual transfer learning, and Roman Urdu language understanding. </p> <h3>Dataset Components</h3> <p> <b>Training Corpus:</b> A curated, label-balanced subset of approximately 28,000 annotated samples used to train the companion transformer-based emotion classification model released alongside this dataset. </p> <p> The 162k dataset and ground truth will be released separately. </p>
<h3>Emotion Taxonomy</h3> <p> RUEmoCorp adopts Paul Ekman’s six basic emotion categories with an additional neutral category for emotionally ambiguous or non-affective utterances: </p> <ul> <li><b>joy</b> – happiness, excitement, delight</li> <li><b>anger</b> – frustration, hostility, rage</li> <li><b>sadness</b> – grief, disappointment, sorrow</li> <li><b>fear</b> – anxiety, uncertainty, dread</li> <li><b>disgust</b> – contempt, revulsion, strong dislike</li> <li><b>surprise</b> – astonishment, unexpected reactions</li> <li><b>none</b> – emotionally neutral or ambiguous utterances</li> </ul> <h3>Data Sources</h3> <p> The corpus was collected from naturally occurring Roman Urdu communication contexts, including: </p> <ul> <li>Public Pakistani social media posts, comments, and discussion threads</li> <li>Anonymized WhatsApp group conversations contributed by consenting participants</li> </ul> <p> All personally identifiable information including names, phone numbers, and URLs was removed or anonymized prior to inclusion in the dataset. </p> <h3>Annotation Methodology</h3> <p> The benchmark subset was independently labeled by four annotators from: </p> <ul> <li>Bahauddin Zakariya University (BZU), Multan</li> <li>COMSATS University Islamabad (CUI)</li> <li>Emerson University Multan (EUM)</li> </ul> <p> Annotators were native Urdu speakers and active users of Roman Urdu in digital communication. 
Annotation followed a structured protocol including: </p> <ul> <li>Detailed annotation guidelines with Roman Urdu examples</li> <li>Ground-truth-guided calibration sessions</li> <li>Independent single-label emotion annotation</li> <li>Confidence scoring and secondary-label recording</li> <li>Majority-vote conflict resolution</li> </ul> <p> Inter-annotator agreement was formally evaluated using Fleiss’ Kappa and pairwise Cohen’s Kappa: </p> <ul> <li><b>Fleiss’ Kappa:</b> κ = 0.6588 (Substantial Agreement)</li> <li><b>Mean Pairwise Cohen’s Kappa:</b> κ = 0.6597</li> <li><b>Total Annotated Samples:</b> 700</li> <li><b>Full Agreement (4/4):</b> 49.7%</li> <li><b>Majority Agreement (3/4):</b> 34.4%</li> <li><b>Ambiguous Samples (2–2 split):</b> 15.9%</li> </ul> <p> The observed agreement levels are considered strong for subjective affective annotation tasks and are comparable to established multilingual emotion datasets. </p> <h3>Companion Model</h3> <p> A transformer-based companion model, <b>khubaib01/roman-urdu-emotion-xlmr-v2</b>, is released alongside RUEmoCorp. The model extends XLM-RoBERTa with a custom two-layer MLP classification head for seven-class emotion classification. </p> <p> The model achieves: </p> <ul> <li><b>Macro F1:</b> 0.9896</li> <li><b>Weighted F1:</b> 0.9896</li> <li><b>Accuracy:</b> 0.9896</li> </ul> <p> Experimental comparisons against mBERT, TF-IDF + SVM, Logistic Regression, and FastText baselines demonstrate consistent improvements from the proposed architecture. </p> <h3>Intended Use</h3> <p> RUEmoCorp is intended to support research in: </p> <ul> <li>Emotion classification for Roman Urdu</li> <li>Low-resource multilingual NLP</li> <li>Cross-lingual transfer learning</li> <li>Affective computing</li> <li>South Asian social media analysis</li> <li>Code-mixed language understanding</li> </ul> <h3>Ethical Considerations</h3> <p> The dataset was curated with anonymization procedures to remove personally identifiable information. 
Researchers should not use this dataset for surveillance, profiling, or monitoring of individuals based on inferred emotional states. </p> <p> Users should also consider the following limitations: </p> <ul> <li>Predominantly Pakistani sociolinguistic context</li> <li>Natural class imbalance in the raw corpus</li> <li>Subjective ambiguity inherent in emotion annotation</li> <li>Temporal evolution of online language usage patterns</li> </ul> <h3>License</h3> <p> RUEmoCorp is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Users are free to share and adapt the material for any purpose provided appropriate attribution is given. </p> <h3>Citation</h3> <p> If you use RUEmoCorp in your research, please cite: </p> <p> Ahmad, M. K., &amp; Faisal, K. (2025). <i>RUEmoCorp: Roman Urdu Emotion Corpus</i> [Data set]. Harvard Dataverse. </p> <h3>Contributors</h3> <ul> <li><b>Muhammad Khubaib Ahmad</b> – Core Researcher, Lead Engineer, Project Administration, Model Development</li> <li><b>Khadija Faisal</b> – Data Manager, Annotation Coordination, Annotator</li> <li><b>Muzammil Shadab</b> – Annotator</li> <li><b>Sara</b> – Annotator</li> <li><b>Faiez Ahmad</b> – Annotator</li> </ul>
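<p> The agreement figures reported under Annotation Methodology (full 4/4 agreement, 3/4 majority, 2–2 splits, and Fleiss’ Kappa) can be illustrated with a minimal sketch. The toy ratings and the helper names <code>agreement_pattern</code>, <code>majority_label</code>, and <code>fleiss_kappa</code> are hypothetical and are not part of the RUEmoCorp release. The sketch assumes four labels per sample and leaves 2–2 splits unresolved, since the description does not say how such ties were handled. </p>

```python
from collections import Counter

# Hypothetical toy ratings: one row per sample, one label per annotator.
# These values are illustrative only and are NOT taken from RUEmoCorp.
ratings = [
    ["joy", "joy", "joy", "joy"],            # full agreement (4/4)
    ["anger", "anger", "anger", "disgust"],  # majority agreement (3/4)
    ["sadness", "sadness", "fear", "fear"],  # ambiguous 2-2 split
    ["none", "none", "none", "none"],        # full agreement (4/4)
]


def agreement_pattern(labels):
    """Return the vote counts in descending order, e.g. (4,), (3, 1) or (2, 2)."""
    return tuple(sorted(Counter(labels).values(), reverse=True))


def majority_label(labels):
    """Return the label held by more than half the annotators, else None.

    Assumption: ties such as 2-2 splits are left unresolved (None).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def fleiss_kappa(rows):
    """Compute Fleiss' kappa for rows that all have the same number of raters."""
    n_items = len(rows)
    n_raters = len(rows[0])
    label_totals = Counter()
    observed = 0.0
    for row in rows:
        counts = Counter(row)
        label_totals.update(counts)
        # Fraction of annotator pairs that agree on this sample.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = observed / n_items
    # Expected chance agreement from the overall label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    if p_e == 1.0:
        # Every rating is the same label; kappa is undefined, report 1.0.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


if __name__ == "__main__":
    for row in ratings:
        print(agreement_pattern(row), majority_label(row))
    print(round(fleiss_kappa(ratings), 4))
```

<p> Counting how often each pattern occurs over the annotated samples gives percentages of the kind listed above, and the samples for which <code>majority_label</code> returns <code>None</code> are the ones that need a separate adjudication step. </p>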

Application scenarios: