Khubaib01/RomanUrdu-NLP-Sentiment-Corpus

Name: Khubaib01/RomanUrdu-NLP-Sentiment-Corpus
Creator: Khubaib01
Published: 2026-03-02 13:07:59
License: apache-2.0

Source: Hugging Face. Updated 2026-03-02; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Khubaib01/RomanUrdu-NLP-Sentiment-Corpus

Description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- ur
tags:
- code
size_categories:
- 100K<n<1M
---

# RomanUrdu-NLP-Sentiment-Corpus

**Largest Open-Source Roman Urdu Sentiment Dataset with Slang Robustness**

---

## Overview

This repository presents the largest publicly available Roman Urdu sentiment analysis dataset, containing **134,052 labeled text samples** collected from chats and social media platforms.

The dataset is designed to be:

- Robust to slang and informal Roman Urdu
- High-quality through LLM-assisted labeling and human validation
- Balanced across sentiment classes
- Suitable for research and real-world NLP applications

This dataset supports research in:

- Sentiment Analysis
- Low-resource language NLP
- Code-mixed and slang-aware text modeling
- Social media opinion mining

---

## Dataset Design Goals

The dataset was created with the following objectives:

1. Robustness to slang, abbreviations, and spelling variations
2. Large-scale corpus for deep learning models
3. High annotation quality through hybrid labeling
4. Open-source accessibility under Apache 2.0
5. Future extensibility with emotion labels

---

## Dataset Structure

Each row contains two attributes:

| Column | Description |
|--------|-------------|
| `message` | Roman Urdu text |
| `label` | Sentiment class (`Positive`, `Neutral`, `Negative`) |

---

## Dataset Statistics

### General Statistics

- Total samples: **134,052**
- Unique messages: **109,409**
- Most frequent message: `"Good"` (24 occurrences)
- Labels: **3** (Positive, Neutral, Negative)

---

### Class Distribution

| Label | Percentage |
|-------|------------|
| Positive | 28% |
| Neutral | 32% |
| Negative | 40% |

This distribution reflects real-world social media sentiment skew.
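The general statistics and class distribution above can be recomputed from any collection of rows that follows the two-column schema (`message`, `label`). The following is a minimal sketch, not the author's script: the `summarize` helper and the toy rows are hypothetical and exist only for illustration.

```python
from collections import Counter


def summarize(rows):
    """Return total rows, unique message count, and per-label share in percent."""
    total = len(rows)
    if total == 0:
        return {"total": 0, "unique": 0, "label_share": {}}
    unique_messages = len({row["message"] for row in rows})
    label_counts = Counter(row["label"] for row in rows)
    label_share = {
        label: round(100 * count / total, 1)
        for label, count in label_counts.items()
    }
    return {"total": total, "unique": unique_messages, "label_share": label_share}


# Hypothetical toy rows for illustration only; they are not taken from the corpus.
sample_rows = [
    {"message": "Good", "label": "Positive"},
    {"message": "Good", "label": "Positive"},
    {"message": "theek hai", "label": "Neutral"},
    {"message": "bilkul bekaar service", "label": "Negative"},
    {"message": "yr scene off hai", "label": "Negative"},
]

summary = summarize(sample_rows)
print(summary)
```

To run the same summary on the real corpus, the rows could presumably be obtained with the Hugging Face `datasets` library via `load_dataset("Khubaib01/RomanUrdu-NLP-Sentiment-Corpus")`. The split names are not stated in this card, so check them before indexing into the result.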
---

## Message Length Statistics

### Word Length (per message)

```text
count    134052
mean     13.55 words
std      19.46
min      0
25%      5
50%      9
75%      16
max      3212
```

### Character Length (per message)

```text
count    134052
mean     66.62 chars
std      102.15
min      1
25%      22
50%      41
75%      81
max      19074
```

### Average Word Length by Label

| Label | Avg Words |
| -------- | --------- |
| Negative | 18.05 |
| Positive | 13.68 |
| Neutral | 7.87 |

Negative samples tend to be longer and more expressive, while neutral messages are shorter and concise.

## Annotation Methodology

The dataset was created in two major phases:

### Phase 1: Initial Dataset (99K Samples)

- Labeled using LLM-assisted annotation
- Verified by human annotators and validators
- Released previously in the form of embeddings
- Used to train the baseline model: `Khubaib01/roman-urdu-sentiment-xlm-r`

> Read the paper here: [Paper](https://doi.org/10.5281/zenodo.18080524)

### Phase 2: Extended Dataset (134K Samples)

- Additional samples labeled using the trained model
- All newly labeled samples validated by human reviewers

Focused on including:

- Slang
- Informal expressions
- Local dialect usage
- Social media language patterns

This hybrid annotation pipeline ensures:

- Scalability
- Consistency
- High label reliability

## Benchmark Model

A sentiment classification model trained on the initial 99k dataset:

**Model Name:** `Khubaib01/roman-urdu-sentiment-xlm-r`

**Performance:**

- Achieved 84% accuracy
- Ranked highest among available Roman Urdu sentiment models on HuggingFace at time of evaluation
- Benchmarked against multiple multilingual and Roman Urdu models

This model was also used to assist labeling for the extended dataset.
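The Phase 2 workflow (a trained model proposes a label, then a human reviewer validates it) can be illustrated with a small sketch. This is an assumption about how such a loop might be organized, not the author's pipeline: `Proposal`, `propose_labels`, `record_review`, and the keyword-based `stub_classifier` are all hypothetical, and the stub stands in for the real model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

LABELS = ("Positive", "Neutral", "Negative")


@dataclass
class Proposal:
    message: str
    model_label: str
    reviewer_label: Optional[str] = None  # set once a human has reviewed it

    @property
    def final_label(self) -> Optional[str]:
        # A sample only receives a final label after human review.
        return self.reviewer_label


def propose_labels(messages: List[str], classify: Callable[[str], str]) -> List[Proposal]:
    """Run the classifier over new messages and queue them for review."""
    proposals = []
    for message in messages:
        label = classify(message)
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        proposals.append(Proposal(message, label))
    return proposals


def record_review(proposal: Proposal, reviewer_label: Optional[str] = None) -> Proposal:
    """Confirm the model label (no argument) or override it with a reviewer label."""
    chosen = reviewer_label if reviewer_label is not None else proposal.model_label
    if chosen not in LABELS:
        raise ValueError(f"unexpected label: {chosen!r}")
    proposal.reviewer_label = chosen
    return proposal


def stub_classifier(message: str) -> str:
    # Hypothetical keyword rule standing in for the trained model.
    text = message.lower()
    if "acha" in text:
        return "Positive"
    if "bekaar" in text:
        return "Negative"
    return "Neutral"


# Hypothetical messages for illustration only.
proposals = propose_labels(
    ["bohat acha kaam", "acha nahi laga", "kal milte hain"], stub_classifier
)
record_review(proposals[0])              # reviewer confirms
record_review(proposals[1], "Negative")  # reviewer overrides a wrong proposal
record_review(proposals[2])              # reviewer confirms

final_labels = [p.final_label for p in proposals]
agreement = sum(p.model_label == p.reviewer_label for p in proposals) / len(proposals)
print(final_labels, round(agreement, 2))
```

Tracking the agreement rate between model proposals and reviewer decisions is one simple way to monitor how much correction the human pass is actually contributing.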
## Slang & Robustness Focus

Unlike many clean benchmark datasets, this dataset includes:

- Local slang
- Abbreviations (e.g., "bkl", "yr", "bhai", "scene off")
- Misspellings
- Mixed English + Roman Urdu
- Informal sentence structures

This makes the dataset suitable for:

- Real-world deployment
- Chatbots
- Social media analysis
- Low-resource language research

## Future Work

Planned extensions include:

- Emotion labels (anger, joy, sadness, fear, etc.)
- Multi-label emotion classification
- Offensive and toxicity detection
- Language normalization benchmarks

## Core Author

**Muhammad Khubaib Ahmad**
Core Engineer & Researcher

Creator of:

- Roman Urdu Sentiment Dataset (134k)
- 99k Roman Urdu embeddings dataset
- `Khubaib01/roman-urdu-sentiment-xlm-r` model

## Contributors (Human Validation & Annotation)

The following contributors reviewed labels and worked as data validators and annotators:

- **Ayesha Khalid**
- **Faiez Ahmad**
- **Khadija Faysal**

Their role ensured quality control and reduced noise and labeling errors.

## License

This dataset is released under the **Apache License 2.0**.

You are free to:

- Use
- Modify
- Distribute
- Train models
- Use commercially

With proper attribution.

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{muhammad_khubaib_ahmad_2026,
  author    = { Muhammad Khubaib Ahmad },
  title     = { RomanUrdu-NLP-Sentiment-Corpus (Revision 98d0169) },
  year      = 2026,
  url       = { https://huggingface.co/datasets/Khubaib01/RomanUrdu-NLP-Sentiment-Corpus },
  doi       = { 10.57967/hf/7931 },
  publisher = { Hugging Face }
}
```

## Ethical Considerations

- All data has been anonymized.
- No personal identifiers are included.
- Data collected from public sources and chat-style corpora.
- Dataset intended for research and educational purposes only.

## Author Contact

**Email:** muhammadkhubaibahmad854@gmail.com

**LinkedIn:** [Muhammad Khubaib Ahmad](https://www.linkedin.com/in/muhammad-khubaib-ahmad-)

Provider:

Khubaib01