RUEmoCorp

DataCite Commons | Updated 2026-05-11 | Indexed 2026-05-18
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/BPWHOZ
Description:
<h2>RUEmoCorp: A Large-Scale Roman Urdu Emotion Corpus</h2> <p> RUEmoCorp is a large-scale emotion classification corpus for Roman Urdu, the informal, phonetically transliterated writing style widely used across Pakistani social media, messaging applications, and online communities. Roman Urdu remains severely underrepresented in natural language processing research despite being one of the dominant written forms of Urdu in digital communication. Unlike standard Urdu written in the Nastaliq script, Roman Urdu has no standardized orthography, exhibits substantial spelling variation, and frequently contains code-mixed English expressions, making emotion recognition particularly challenging. </p> <p> To address this gap, RUEmoCorp provides both a formally annotated benchmark dataset and a large-scale raw corpus designed to support research in low-resource multilingual NLP, affective computing, cross-lingual transfer learning, and Roman Urdu language understanding. </p> <h3>Dataset Components</h3> <ul> <li><b>Training Corpus:</b> A curated, label-balanced subset of approximately 28,000 annotated samples used to train the companion transformer-based emotion classification model released alongside this dataset.</li> <li>The 162k dataset and its ground truth will be released separately.</li> </ul>
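<p> The description does not say how the label-balanced training subset was drawn. As a minimal sketch, assuming simple per-class downsampling over the seven labels listed under Emotion Taxonomy, a balanced subset could be built as follows. The function name <code>balanced_subset</code>, the <code>(text, label)</code> pair layout, and the toy data are hypothetical and are not taken from the released files. If the classes were perfectly balanced, roughly 28,000 samples over seven labels would correspond to about 4,000 samples per label. </p>

```python
import random
from collections import Counter, defaultdict

# The seven labels named in the Emotion Taxonomy section.
EMOTIONS = ["joy", "anger", "sadness", "fear", "disgust", "surprise", "none"]


def balanced_subset(samples, per_class, seed=0):
    """Draw up to `per_class` items per label from (text, label) pairs.

    Hypothetical helper: plain random downsampling, which may differ from
    the curation actually used for the released training corpus.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    subset = []
    for label in EMOTIONS:
        pool = by_label.get(label, [])
        k = min(per_class, len(pool))  # a rare class contributes all it has
        subset.extend((text, label) for text in rng.sample(pool, k))

    rng.shuffle(subset)  # avoid label-ordered output
    return subset


# Toy, made-up pool with a deliberately skewed label distribution.
toy_pool = (
    [(f"joy text {i}", "joy") for i in range(10)]
    + [(f"anger text {i}", "anger") for i in range(3)]
    + [(f"sad text {i}", "sadness") for i in range(5)]
)
toy_subset = balanced_subset(toy_pool, per_class=3, seed=1)
print(Counter(label for _, label in toy_subset))
```

<p> Capping each label at the same count trades corpus size for balance, so the rarest label determines how large a fully balanced subset can be. </p>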
<h3>Emotion Taxonomy</h3> <p> RUEmoCorp adopts Paul Ekman’s six basic emotion categories with an additional neutral category for emotionally ambiguous or non-affective utterances: </p> <ul> <li><b>joy</b> – happiness, excitement, delight</li> <li><b>anger</b> – frustration, hostility, rage</li> <li><b>sadness</b> – grief, disappointment, sorrow</li> <li><b>fear</b> – anxiety, uncertainty, dread</li> <li><b>disgust</b> – contempt, revulsion, strong dislike</li> <li><b>surprise</b> – astonishment, unexpected reactions</li> <li><b>none</b> – emotionally neutral or ambiguous utterances</li> </ul> <h3>Data Sources</h3> <p> The corpus was collected from naturally occurring Roman Urdu communication contexts, including: </p> <ul> <li>Public Pakistani social media posts, comments, and discussion threads</li> <li>Anonymized WhatsApp group conversations contributed by consenting participants</li> </ul> <p> All personally identifiable information including names, phone numbers, and URLs was removed or anonymized prior to inclusion in the dataset. </p> <h3>Annotation Methodology</h3> <p> The benchmark subset was independently labeled by four annotators from: </p> <ul> <li>Bahauddin Zakariya University (BZU), Multan</li> <li>COMSATS University Islamabad (CUI)</li> <li>Emerson University Multan (EUM)</li> </ul> <p> Annotators were native Urdu speakers and active users of Roman Urdu in digital communication. 
Annotation followed a structured protocol including: </p> <ul> <li>Detailed annotation guidelines with Roman Urdu examples</li> <li>Ground-truth-guided calibration sessions</li> <li>Independent single-label emotion annotation</li> <li>Confidence scoring and secondary-label recording</li> <li>Majority-vote conflict resolution</li> </ul> <p> Inter-annotator agreement was formally evaluated using Fleiss’ Kappa and pairwise Cohen’s Kappa: </p> <ul> <li><b>Fleiss’ Kappa:</b> κ = 0.6588 (Substantial Agreement)</li> <li><b>Mean Pairwise Cohen’s Kappa:</b> κ = 0.6597</li> <li><b>Total Annotated Samples:</b> 700</li> <li><b>Full Agreement (4/4):</b> 49.7%</li> <li><b>Majority Agreement (3/4):</b> 34.4%</li> <li><b>Ambiguous Samples (2–2 split):</b> 15.9%</li> </ul> <p> The observed agreement levels are considered strong for subjective affective annotation tasks and are comparable to established multilingual emotion datasets. </p> <h3>Companion Model</h3> <p> A transformer-based companion model, <b>khubaib01/roman-urdu-emotion-xlmr-v2</b>, is released alongside RUEmoCorp. The model extends XLM-RoBERTa with a custom two-layer MLP classification head for seven-class emotion classification. </p> <p> The model achieves: </p> <ul> <li><b>Macro F1:</b> 0.9896</li> <li><b>Weighted F1:</b> 0.9896</li> <li><b>Accuracy:</b> 0.9896</li> </ul> <p> Experimental comparisons against mBERT, TF-IDF + SVM, Logistic Regression, and FastText baselines demonstrate consistent improvements from the proposed architecture. </p> <h3>Intended Use</h3> <p> RUEmoCorp is intended to support research in: </p> <ul> <li>Emotion classification for Roman Urdu</li> <li>Low-resource multilingual NLP</li> <li>Cross-lingual transfer learning</li> <li>Affective computing</li> <li>South Asian social media analysis</li> <li>Code-mixed language understanding</li> </ul> <h3>Ethical Considerations</h3> <p> The dataset was curated with anonymization procedures to remove personally identifiable information. 
Researchers should not use this dataset for surveillance, profiling, or monitoring of individuals based on inferred emotional states. </p> <p> Users should also consider the following limitations: </p> <ul> <li>Predominantly Pakistani sociolinguistic context</li> <li>Natural class imbalance in the raw corpus</li> <li>Subjective ambiguity inherent in emotion annotation</li> <li>Temporal evolution of online language usage patterns</li> </ul> <h3>License</h3> <p> RUEmoCorp is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Users are free to share and adapt the material for any purpose provided appropriate attribution is given. </p> <h3>Citation</h3> <p> If you use RUEmoCorp in your research, please cite: </p> <p> Ahmad, M. K., & Faisal, K. (2025). <i>RUEmoCorp: Roman Urdu Emotion Corpus</i> [Data set]. Harvard Dataverse. </p> <h3>Contributors</h3> <ul> <li><b>Muhammad Khubaib Ahmad</b> – Core Researcher, Lead Engineer, Project Administration, Model Development</li> <li><b>Khadija Faisal</b> – Data Manager, Annotation Coordination, Annotator</li> <li><b>Muzammil Shadab</b> – Annotator</li> <li><b>Sara</b> – Annotator</li> <li><b>Faiez Ahmad</b> – Annotator</li> </ul>
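<p> The agreement statistics and majority-vote resolution described under Annotation Methodology can be illustrated with a short sketch. It assumes each item carries exactly one label from each of the four annotators, and it treats any item without at least three matching votes as unresolved. That threshold is an assumption, since the description only names 4/4, 3/4, and 2–2 outcomes. The function names and the toy ratings are hypothetical and do not come from the dataset. </p>

```python
from collections import Counter


def majority_label(votes, min_votes=3):
    """Return the label with at least `min_votes` votes, else None.

    Hypothetical rule: with four annotators, 4/4 and 3/4 items resolve,
    while 2-2 splits stay unresolved.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    per_item_agreement = 0.0

    for votes in ratings:
        if len(votes) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        counts = Counter(votes)
        label_totals.update(counts)
        # P_i: share of rater pairs that agree on this item.
        per_item_agreement += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))

    p_bar = per_item_agreement / n_items  # mean observed agreement
    total_votes = n_items * n_raters
    p_e = sum((c / total_votes) ** 2 for c in label_totals.values())  # chance agreement
    if p_e == 1.0:
        return float("nan")  # only one label was ever used: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


# Toy, made-up ratings: one 4/4 item, one 3/4 item, one 2-2 split.
toy_ratings = [
    ["joy", "joy", "joy", "joy"],
    ["anger", "anger", "anger", "sadness"],
    ["fear", "fear", "none", "none"],
]
print([majority_label(v) for v in toy_ratings])  # ['joy', 'anger', None]
print(round(fleiss_kappa(toy_ratings), 4))       # 0.4909
```

<p> On the toy data, observed agreement is 11/18 and chance agreement is 34/144, giving κ = 54/110 ≈ 0.4909. The reported κ = 0.6588 would come from the same formula applied to the 700 annotated samples. </p>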
Provider:
Harvard Dataverse
Created:
2026-05-07