Khubaib01/RomanUrdu-NLP-Sentiment-Corpus
Source: Hugging Face. Updated 2026-03-02; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Khubaib01/RomanUrdu-NLP-Sentiment-Corpus
Resource summary:
---
license: apache-2.0
task_categories:
- text-classification
language:
- ur
tags:
- code
size_categories:
- 100K<n<1M
---
# RomanUrdu-NLP-Sentiment-Corpus
**Largest Open-Source Roman Urdu Sentiment Dataset with Slang Robustness**
---
## Overview
This repository presents the largest publicly available Roman Urdu sentiment analysis dataset, containing **134,052 labeled text samples** collected from chats and social media platforms. The dataset is designed to be:
- Robust to slang and informal Roman Urdu
- High-quality through LLM-assisted labeling and human validation
- Reasonably balanced across the three sentiment classes (28% to 40% each)
- Suitable for research and real-world NLP applications
This dataset supports research in:
- Sentiment Analysis
- Low-resource language NLP
- Code-mixed and slang-aware text modeling
- Social media opinion mining
---
## Dataset Design Goals
The dataset was created with the following objectives:
1. Robustness to slang, abbreviations, and spelling variations
2. Large-scale corpus for deep learning models
3. High annotation quality through hybrid labeling
4. Open-source accessibility under Apache 2.0
5. Future extensibility with emotion labels
---
## Dataset Structure
Each row contains two attributes:
| Column | Description |
|--------|-------------|
| `message` | Roman Urdu text |
| `label` | Sentiment class (`Positive`, `Neutral`, `Negative`) |
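A minimal sketch of how the two-column schema might be loaded and sanity-checked. It assumes the repository loads through the `datasets` library with its default configuration and exposes a `train` split, neither of which is confirmed by this card. The helper names `is_valid_row` and `load_corpus` are illustrative, and the check assumes labels are stored as the strings listed above rather than as integer class IDs.

```python
# Sketch: load the corpus and check rows against the documented schema.
VALID_LABELS = {"Positive", "Neutral", "Negative"}  # label strings from the table above


def is_valid_row(row: dict) -> bool:
    """Return True if a row has a string `message` and a known `label`."""
    message = row.get("message")
    label = row.get("label")
    return isinstance(message, str) and label in VALID_LABELS


def load_corpus(split: str = "train"):
    """Load the dataset from the Hugging Face Hub.

    Assumption: the default configuration and a `train` split exist.
    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily: third-party dependency

    return load_dataset("Khubaib01/RomanUrdu-NLP-Sentiment-Corpus", split=split)
```

Calling `load_corpus()` and counting the rows for which `is_valid_row` returns `False` gives a quick schema check before training.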
---
## Dataset Statistics
### General Statistics
- Total samples: **134,052**
- Unique messages: **109,409**
- Most frequent message: `"Good"` (24 occurrences)
- Labels: **3** (Positive, Neutral, Negative)
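The counts above imply that 134,052 − 109,409 = 24,643 rows (about 18.4%) repeat an earlier message. If the same message lands on both sides of a train/test split, evaluation scores can look better than they are, so it may be worth deduplicating before splitting. The sketch below uses exact string matching, which is an assumption: the card does not say how "unique" was computed. The sample rows are hypothetical.

```python
from collections import Counter


def deduplicate(rows):
    """Keep the first row for each exact `message` string, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["message"] not in seen:
            seen.add(row["message"])
            unique.append(row)
    return unique


def most_common_message(rows):
    """Return (message, count) for the most frequent message."""
    return Counter(row["message"] for row in rows).most_common(1)[0]


# Hypothetical toy rows, not taken from the corpus.
sample = [
    {"message": "Good", "label": "Positive"},
    {"message": "bura laga", "label": "Negative"},
    {"message": "Good", "label": "Positive"},
]
unique_rows = deduplicate(sample)

# Share of rows that duplicate an earlier message, from the figures above.
duplicate_share = (134_052 - 109_409) / 134_052
```

Keeping the first occurrence is an arbitrary choice. If duplicated messages carry conflicting labels, a majority vote per message may be a better fit.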
---
### Class Distribution
| Label | Percentage |
|-------|------------|
| Positive | 28% |
| Neutral | 32% |
| Negative | 40% |
This distribution reflects real-world social media sentiment skew.
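Because the classes are not equally represented, a weighted loss is one common way to compensate during training. The sketch below derives inverse-frequency weights from the rounded percentages in the table. The formula `1 / (num_classes * share)` is a widely used heuristic, not something the authors state they applied.

```python
# Class shares as reported in the table above (rounded percentages).
CLASS_SHARE = {"Positive": 0.28, "Neutral": 0.32, "Negative": 0.40}


def inverse_frequency_weights(shares):
    """Weight each class by 1 / (num_classes * share); a uniform split gives 1.0."""
    num_classes = len(shares)
    return {label: 1.0 / (num_classes * share) for label, share in shares.items()}


weights = inverse_frequency_weights(CLASS_SHARE)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these shares the weights come out to roughly 1.19 (Positive), 1.04 (Neutral), and 0.83 (Negative). Exact values should be recomputed from the actual label counts.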
---
## Message Length Statistics
### Word Count (per message)
```text
count 134052
mean 13.55 words
std 19.46
min 0
25% 5
50% 9
75% 16
max 3212
```
### Character Length (per message)
```text
count 134052
mean 66.62 chars
std 102.15
min 1
25% 22
50% 41
75% 81
max 19074
```
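The length statistics show a heavy right tail: three quarters of messages have 16 words or fewer, while the longest has 3,212 words, and at least one message has zero words. The sketch below flags empty and very long messages before tokenization. It assumes words are whitespace-separated tokens (the card does not state how words were counted), and the `max_words=128` cutoff is an arbitrary illustrative value.

```python
def word_count(message: str) -> int:
    """Count whitespace-separated tokens (assumed definition of a 'word')."""
    return len(message.split())


def split_by_length(messages, max_words=128):
    """Partition messages into (kept, empty, too_long) by word count."""
    kept, empty, too_long = [], [], []
    for message in messages:
        n = word_count(message)
        if n == 0:
            empty.append(message)
        elif n > max_words:
            too_long.append(message)
        else:
            kept.append(message)
    return kept, empty, too_long


# Hypothetical examples, not taken from the corpus.
examples = ["kya scene hai", " ", "acha " * 200]
kept, empty, too_long = split_by_length(examples)
```

Long messages do not have to be dropped. Truncating them at the tokenizer's maximum sequence length is an alternative that keeps the label.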
### Average Word Count by Label
| Label | Avg Words |
| -------- | --------- |
| Negative | 18.05 |
| Positive | 13.68 |
| Neutral | 7.87 |
Negative samples tend to be longer and more expressive, while neutral messages are shorter and concise.
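As a quick consistency check, weighting the per-label averages by the class shares gives 0.40 × 18.05 + 0.28 × 13.68 + 0.32 × 7.87 ≈ 13.57 words, close to the reported overall mean of 13.55. The small gap is what one would expect from rounded percentages. The sketch below reproduces that arithmetic and shows one way to recompute the per-label averages from raw rows, assuming whitespace tokenization. `mean_words_by_label` is an illustrative helper.

```python
from collections import defaultdict

# Figures reported in the tables of this card.
CLASS_SHARE = {"Negative": 0.40, "Positive": 0.28, "Neutral": 0.32}
AVG_WORDS = {"Negative": 18.05, "Positive": 13.68, "Neutral": 7.87}

# Share-weighted mean of the per-label averages.
implied_mean = sum(CLASS_SHARE[label] * AVG_WORDS[label] for label in AVG_WORDS)


def mean_words_by_label(rows):
    """Average whitespace-token count per label for rows with `message` and `label`."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        totals[row["label"]] += len(row["message"].split())
        counts[row["label"]] += 1
    return {label: totals[label] / counts[label] for label in counts}
```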
## Annotation Methodology
The dataset was created in two major phases:
### Phase 1: Initial Dataset (99K Samples)
- Labeled using LLM-assisted annotation
- Verified by human annotators and validators
- Released previously in the form of embeddings
- Used to train the baseline model:
`Khubaib01/roman-urdu-sentiment-xlm-r`
- Paper: [Read the paper](https://doi.org/10.5281/zenodo.18080524)
### Phase 2: Extended Dataset (134K Samples)
- Additional samples labeled using the trained model
- All newly labeled samples validated by human reviewers
This phase focused on including:
- Slang
- Informal expressions
- Local dialect usage
- Social media language patterns
This hybrid annotation pipeline ensures:
- Scalability
- Consistency
- High label reliability
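One way to quantify how well a model-assisted pass agrees with human validation is an agreement rate plus a count of which label pairs were corrected. The sketch below is illustrative only: it is not the authors' pipeline, the function name `agreement_report` is hypothetical, and the labels are made up.

```python
from collections import Counter


def agreement_report(model_labels, human_labels):
    """Return (agreement_rate, Counter of (model, human) disagreements)."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not model_labels:
        raise ValueError("label lists must not be empty")
    disagreements = Counter(
        (m, h) for m, h in zip(model_labels, human_labels) if m != h
    )
    agreed = len(model_labels) - sum(disagreements.values())
    return agreed / len(model_labels), disagreements


# Hypothetical labels for illustration only.
model = ["Positive", "Neutral", "Negative", "Negative"]
human = ["Positive", "Negative", "Negative", "Negative"]
rate, confusions = agreement_report(model, human)
```

A disagreement counter concentrated on one pair, such as model `Neutral` corrected to `Negative`, would point to a systematic bias in the model-proposed labels.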
## Benchmark Model
A sentiment classification model was trained on the initial 99K-sample dataset:
**Model Name:**
`Khubaib01/roman-urdu-sentiment-xlm-r`
**Performance:**
- Achieved 84% accuracy
- Ranked highest among available Roman Urdu sentiment models on HuggingFace at time of evaluation
- Benchmarked against multiple multilingual and Roman Urdu models
This model was also used to assist labeling for the extended dataset.
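A hedged sketch of how the benchmark checkpoint might be used for inference and scored against gold labels. It assumes the checkpoint loads as a standard sequence-classification model through `transformers.pipeline`, which this card does not confirm. `build_classifier` and `accuracy` are illustrative helper names.

```python
MODEL_ID = "Khubaib01/roman-urdu-sentiment-xlm-r"


def build_classifier(model_id: str = MODEL_ID):
    """Create a text-classification pipeline for the benchmark checkpoint.

    Assumption: the checkpoint loads as a standard sequence-classification
    model. Requires `pip install transformers torch` and network access.
    """
    from transformers import pipeline  # imported lazily: third-party dependency

    return pipeline("text-classification", model=model_id)


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Calling the pipeline on a list of strings returns one dictionary per input with `label` and `score` keys. The label strings the checkpoint emits may differ from the dataset's `Positive`, `Neutral`, and `Negative`, so a mapping step may be needed before computing accuracy.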
## Slang & Robustness Focus
Unlike many clean benchmark datasets, this dataset includes:
- Local slang
- Abbreviations (e.g., "bkl", "yr", "bhai", "scene off")
- Misspellings
- Mixed English + Roman Urdu
- Informal sentence structures
This makes the dataset suitable for:
- Real-world deployment
- Chatbots
- Social media analysis
- Low-resource language research
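Spelling variation in informal Roman Urdu often shows up as stretched characters and inconsistent casing. The sketch below is an optional light normalization step that a downstream user might try. It is not part of the dataset's preparation as described here, and `light_normalize` is an illustrative name.

```python
import re

_REPEATS = re.compile(r"(.)\1{2,}")  # three or more of the same character in a row
_SPACES = re.compile(r"\s+")


def light_normalize(message: str) -> str:
    """Lowercase, squeeze character runs of 3+ down to 2, and collapse whitespace."""
    text = message.lower()
    text = _REPEATS.sub(r"\1\1", text)
    return _SPACES.sub(" ", text).strip()
```

Squeezing runs to two characters rather than one keeps legitimate doubled letters intact. Since elongation can itself signal emphasis, it is worth comparing model accuracy with and without this step.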
## Future Work
Planned extensions include:
- Emotion labels (anger, joy, sadness, fear, etc.)
- Multi-label emotion classification
- Offensive and toxicity detection
- Language normalization benchmarks
## Core Author
**Muhammad Khubaib Ahmad**
Core Engineer & Researcher
Creator of:
- Roman Urdu Sentiment Dataset (134k)
- 99k Roman Urdu embeddings dataset
- `Khubaib01/roman-urdu-sentiment-xlm-r` model
## Contributors (Human Validation & Annotation)
The following contributors reviewed labels and worked as data validators and annotators:
- **Ayesha Khalid**
- **Faiez Ahmad**
- **Khadija Faysal**
Their role ensured quality control and reduced noise and labeling errors.
## License
This dataset is released under the **Apache License 2.0**.
You are free to:
- Use
- Modify
- Distribute
- Train models
- Use commercially
All of the above are permitted with proper attribution.
## Citation
If you use this dataset in your research, please cite:
```bibtex
@misc{muhammad_khubaib_ahmad_2026,
author = { Muhammad Khubaib Ahmad },
title = { RomanUrdu-NLP-Sentiment-Corpus (Revision 98d0169) },
year = 2026,
url = { https://huggingface.co/datasets/Khubaib01/RomanUrdu-NLP-Sentiment-Corpus },
doi = { 10.57967/hf/7931 },
publisher = { Hugging Face }
}
```
## Ethical Considerations
- All data has been anonymized.
- No personal identifiers are included.
- Data collected from public sources and chat-style corpora.
- Dataset intended for research and educational purposes only.
## Author Contact
**Email:** muhammadkhubaibahmad854@gmail.com
**LinkedIn:** [Muhammad Khubaib Ahmad](https://www.linkedin.com/in/muhammad-khubaib-ahmad-)
Provided by: Khubaib01



