RUEmoCorp
Source: DataCite Commons. Updated 2026-05-11; indexed 2026-05-18.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/BPWHOZ
Resource description:
<h2>RUEmoCorp: A Large-Scale Roman Urdu Emotion Corpus</h2>
<p>
RUEmoCorp is a large-scale emotion classification corpus for Roman Urdu, the informal
phonetically transliterated writing style widely used across Pakistani social media,
messaging applications, and online communities. Roman Urdu remains severely
underrepresented in natural language processing research despite being one of the
dominant written forms of Urdu in digital communication. Unlike standard Urdu, which is
written in the Perso-Arabic script (typically in the Nastaliq style), Roman Urdu has no
standardized orthography, exhibits substantial
spelling variation, and frequently contains code-mixed English expressions, making
emotion recognition particularly challenging.
</p>
<p>
To address this gap, RUEmoCorp provides both a formally annotated benchmark dataset
and a large-scale raw corpus designed to support research in low-resource multilingual
NLP, affective computing, cross-lingual transfer learning, and Roman Urdu language
understanding.
</p>
<h3>Dataset Components</h3>
<ul>
<li><b>Training Corpus:</b> A curated, label-balanced subset of approximately 28,000
annotated samples used to train the companion transformer-based emotion classification
model released alongside this dataset.</li>
<li>The 162k dataset and its ground truth will be released separately.</li>
</ul>
<h3>Emotion Taxonomy</h3>
<p>
RUEmoCorp adopts Paul Ekman’s six basic emotion categories with an additional
neutral category for emotionally ambiguous or non-affective utterances:
</p>
<ul>
<li><b>joy</b> – happiness, excitement, delight</li>
<li><b>anger</b> – frustration, hostility, rage</li>
<li><b>sadness</b> – grief, disappointment, sorrow</li>
<li><b>fear</b> – anxiety, uncertainty, dread</li>
<li><b>disgust</b> – contempt, revulsion, strong dislike</li>
<li><b>surprise</b> – astonishment, unexpected reactions</li>
<li><b>none</b> – emotionally neutral or ambiguous utterances</li>
</ul>
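<p>
For illustration, the seven labels can be written down as a lookup table. This is a
minimal sketch: the label strings come from the list above, but the integer ids simply
follow list order and are an assumption, not the companion model's documented mapping.
The names <code>EMOTION_LABELS</code>, <code>LABEL2ID</code>, <code>ID2LABEL</code>, and
<code>encode_label</code> are hypothetical.
</p>

```python
# Label strings are taken from the taxonomy list; the integer ids follow list
# order, which is an assumption and may differ from the released model's config.
EMOTION_LABELS = ["joy", "anger", "sadness", "fear", "disgust", "surprise", "none"]

LABEL2ID = {label: idx for idx, label in enumerate(EMOTION_LABELS)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}


def encode_label(label: str) -> int:
    """Map a label string to its integer id, ignoring case and outer whitespace."""
    key = label.strip().lower()
    if key not in LABEL2ID:
        raise ValueError(f"unknown emotion label: {label!r}")
    return LABEL2ID[key]
```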
<h3>Data Sources</h3>
<p>
The corpus was collected from naturally occurring Roman Urdu communication contexts,
including:
</p>
<ul>
<li>Public Pakistani social media posts, comments, and discussion threads</li>
<li>Anonymized WhatsApp group conversations contributed by consenting participants</li>
</ul>
<p>
All personally identifiable information including names, phone numbers, and URLs
was removed or anonymized prior to inclusion in the dataset.
</p>
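<p>
The record does not describe how anonymization was implemented. As a rough sketch of
what pattern-based scrubbing of URLs and phone numbers might look like, the following
uses two hypothetical regular expressions and placeholder tokens. Personal names cannot
be caught reliably this way and would need a name list or manual review.
</p>

```python
import re

# Hypothetical patterns for illustration only; the authors' actual
# anonymization rules are not described in this record.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Loose match for phone-like digit runs: optional leading "+", then 10 to 15
# digits with optional single spaces or hyphens between them.
PHONE_RE = re.compile(r"(?<!\w)\+?\d(?:[\s-]?\d){9,14}(?!\w)")


def scrub(text: str) -> str:
    """Replace URLs and phone-like numbers with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```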
<h3>Annotation Methodology</h3>
<p>
The benchmark subset was independently labeled by four annotators from:
</p>
<ul>
<li>Bahauddin Zakariya University (BZU), Multan</li>
<li>COMSATS University Islamabad (CUI)</li>
<li>Emerson University Multan (EUM)</li>
</ul>
<p>
Annotators were native Urdu speakers and active users of Roman Urdu in digital
communication. Annotation followed a structured protocol including:
</p>
<ul>
<li>Detailed annotation guidelines with Roman Urdu examples</li>
<li>Ground-truth-guided calibration sessions</li>
<li>Independent single-label emotion annotation</li>
<li>Confidence scoring and secondary-label recording</li>
<li>Majority-vote conflict resolution</li>
</ul>
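<p>
The majority-vote step can be sketched as follows. This assumes a strict majority (more
than half of the votes) is required, so that with four annotators a label needs at least
three votes. How unresolved items were handled is not stated in this record, so the
hypothetical <code>majority_label</code> helper simply returns <code>None</code> for them.
</p>

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None.

    Assumption: a strict majority is required, so a 2-2 split among four
    annotators yields None.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    label, count = Counter(votes).most_common(1)[0]
    return label if 2 * count > len(votes) else None
```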
<p>
Inter-annotator agreement was formally evaluated using Fleiss’ Kappa and pairwise
Cohen’s Kappa:
</p>
<ul>
<li><b>Fleiss’ Kappa:</b> κ = 0.6588 (Substantial Agreement)</li>
<li><b>Mean Pairwise Cohen’s Kappa:</b> κ = 0.6597</li>
<li><b>Total Annotated Samples:</b> 700</li>
<li><b>Full Agreement (4/4):</b> 49.7%</li>
<li><b>Majority Agreement (3/4):</b> 34.4%</li>
<li><b>Ambiguous Samples (2–2 split):</b> 15.9%</li>
</ul>
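<p>
For readers who want to reproduce this kind of agreement figure on their own
annotations, the sketch below computes Fleiss' Kappa from per-item label lists using the
standard definition. It is not the authors' evaluation script, and the function name and
input layout (one list of labels per item, one label per annotator) are assumptions.
</p>

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of annotators."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # P_i: share of ordered annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = per_item_agreement / n_items
    total = n_items * n_raters
    # P_e: chance agreement from the overall label proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)
```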
<p>
Both kappa values fall in the 0.61 to 0.80 band that Landis and Koch label substantial
agreement, a solid result for a subjective affective annotation task and comparable to
established multilingual emotion datasets.
</p>
<h3>Companion Model</h3>
<p>
A transformer-based companion model,
<b>khubaib01/roman-urdu-emotion-xlmr-v2</b>,
is released alongside RUEmoCorp. The model extends XLM-RoBERTa with a custom
two-layer MLP classification head for seven-class emotion classification.
</p>
<p>
The model achieves:
</p>
<ul>
<li><b>Macro F1:</b> 0.9896</li>
<li><b>Weighted F1:</b> 0.9896</li>
<li><b>Accuracy:</b> 0.9896</li>
</ul>
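<p>
To make the three metrics concrete, here is a small dependency-free sketch of how
accuracy, macro F1, and support-weighted F1 are computed from gold and predicted labels.
The function name and return format are hypothetical, and this is not the evaluation
code used for the reported numbers.
</p>

```python
from collections import Counter
from typing import Dict, Sequence


def classification_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy, macro F1 and support-weighted F1 over the labels present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)      # gold count per label
    predicted = Counter(y_pred)    # predicted count per label
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    f1 = {}
    for label in support:
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label]
        denom = precision + recall
        f1[label] = 2 * precision * recall / denom if denom else 0.0

    n = len(y_true)
    return {
        "accuracy": sum(true_pos.values()) / n,
        # Macro F1: unweighted mean of per-label F1 scores.
        "macro_f1": sum(f1.values()) / len(f1),
        # Weighted F1: per-label F1 weighted by gold-label support.
        "weighted_f1": sum(f1[label] * support[label] for label in f1) / n,
    }
```

<p>
When every label has the same number of gold samples, the weighted mean and the
unweighted mean of the per-label F1 scores are the same number, so macro F1 and weighted
F1 coincide on a label-balanced evaluation split. Whether the reported split is balanced
is not stated in this record.
</p>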
<p>
In experimental comparisons, the proposed architecture consistently improves on mBERT,
TF-IDF + SVM, Logistic Regression, and FastText baselines.
</p>
<h3>Intended Use</h3>
<p>
RUEmoCorp is intended to support research in:
</p>
<ul>
<li>Emotion classification for Roman Urdu</li>
<li>Low-resource multilingual NLP</li>
<li>Cross-lingual transfer learning</li>
<li>Affective computing</li>
<li>South Asian social media analysis</li>
<li>Code-mixed language understanding</li>
</ul>
<h3>Ethical Considerations</h3>
<p>
The dataset was curated with anonymization procedures to remove personally
identifiable information. Researchers should not use this dataset for surveillance,
profiling, or monitoring of individuals based on inferred emotional states.
</p>
<p>
Users should also consider the following limitations:
</p>
<ul>
<li>Predominantly Pakistani sociolinguistic context</li>
<li>Natural class imbalance in the raw corpus</li>
<li>Subjective ambiguity inherent in emotion annotation</li>
<li>Temporal evolution of online language usage patterns</li>
</ul>
<h3>License</h3>
<p>
RUEmoCorp is released under the
Creative Commons Attribution 4.0 International License (CC BY 4.0).
Users are free to share and adapt the material for any purpose provided
appropriate attribution is given.
</p>
<h3>Citation</h3>
<p>
If you use RUEmoCorp in your research, please cite:
</p>
<p>
Ahmad, M. K., &amp; Faisal, K. (2025). <i>RUEmoCorp: Roman Urdu Emotion Corpus</i>
[Data set]. Harvard Dataverse.
</p>
<h3>Contributors</h3>
<ul>
<li><b>Muhammad Khubaib Ahmad</b> – Core Researcher, Lead Engineer, Project Administration, Model Development</li>
<li><b>Khadija Faisal</b> – Data Manager, Annotation Coordination, Annotator</li>
<li><b>Muzammil Shadab</b> – Annotator</li>
<li><b>Sara</b> – Annotator</li>
<li><b>Faiez Ahmad</b> – Annotator</li>
</ul>
Provider:
Harvard Dataverse
Created:
2026-05-07



