ReySajju742/shaistagi_clean
Source: Hugging Face. Updated: 2026-01-28. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ReySajju742/shaistagi_clean
Description:
---
license: apache-2.0
language:
- ur
pretty_name: Shaistagi (شائستگی) Clean Urdu Mega-Dataset
size_categories:
- 10M<n<100M
task_categories:
- text-generation
- translation
- text-classification
- token-classification
- question-answering
tags:
- urdu
- nlp
- cleaned
- instruction-tuning
- sentiment
- reasoning
- address-parsing
- poetry
- pretraining
- nmt
- ner
configs:
# ============================================================================
# SPECIALIZED DATASETS
# ============================================================================
- config_name: addresses
data_files: addresses/*.parquet
default: false
- config_name: english_urdu_translation
data_files: english_urdu_translation/*.parquet
default: false
- config_name: error_correction
data_files: error_correction/*.parquet
default: false
# ============================================================================
# PRETRAINING DATA
# ============================================================================
- config_name: fineweb_pretrain
data_files: fineweb_pretrain/*.parquet
default: false
- config_name: gemma_pretrain
data_files: gemma_pretrain/*.parquet
default: false
- config_name: generic_train_ur
data_files: generic_train_ur/*.parquet
default: false
- config_name: tiny_stories
data_files: tiny_stories/*.parquet
default: false
# ============================================================================
# WEB CRAWL DATA (Large Scale Pretraining)
# ============================================================================
- config_name: c4
data_files: c4/*.parquet
default: false
- config_name: cc100
data_files: cc100/*.parquet
default: false
- config_name: hplt
data_files: hplt/*.parquet
default: false
- config_name: cleaned_data
data_files: cleaned_data/*.parquet
default: true
# ============================================================================
# TRANSLATION / NMT DATA
# ============================================================================
- config_name: nmt
data_files: nmt/*.parquet
default: false
- config_name: nmt_parquet
data_files: nmt_parquet/*.parquet
default: false
- config_name: parliament_translation
data_files: parliament_translation/*.parquet
default: false
# ============================================================================
# SENTIMENT & CLASSIFICATION
# ============================================================================
- config_name: imdb_reviews_ur
data_files: imdb_reviews_ur/*.parquet
default: false
- config_name: sentiment
data_files: sentiment/*.parquet
default: false
- config_name: sentiment_v1_ur
data_files: sentiment_v1_ur/*.parquet
default: false
- config_name: urdu_sentiment_local
data_files: urdu_sentiment_local/*.parquet
default: false
- config_name: urdu_sarcasm
data_files: urdu_sarcasm/*.parquet
default: false
# ============================================================================
# POETRY DATA
# ============================================================================
- config_name: iqbal_poetry
data_files:
- split: train
path: iqbal_poetry/train-*.parquet
default: false
- config_name: organized_poetry_csv
data_files:
- split: train
path: organized_poetry_csv/train-*.parquet
default: false
- config_name: poetry_by_poet
data_files:
- split: train
path: poetry_by_poet/train-*.parquet
default: false
- config_name: poetry_csv_main
data_files:
- split: train
path: poetry_csv_main/train-*.parquet
default: false
- config_name: urdu_poetry_general
data_files:
- split: train
path: urdu_poetry_general/train-*.parquet
default: false
# ============================================================================
# REASONING & INSTRUCTION DATA
# ============================================================================
- config_name: urdu_reasoning
data_files: urdu_reasoning/*.parquet
default: false
- config_name: reasoning
data_files: reasoning/*.parquet
default: false
- config_name: reasoning_parquet
data_files: reasoning_parquet/*.parquet
default: false
- config_name: urdu_instruct_alpaca
data_files: urdu_instruct_alpaca/*.parquet
default: false
# ============================================================================
# ROMAN URDU & TRANSLITERATION
# ============================================================================
- config_name: roman_urdu
data_files: roman_urdu/*.parquet
default: false
- config_name: roman_urdu_toxicity
data_files: roman_urdu_toxicity/*.parquet
default: false
# ============================================================================
# SPECIALIZED / STRUCTURED DATA
# ============================================================================
- config_name: urdu_tts_transcription
data_files: urdu_tts_transcription/*.parquet
default: false
- config_name: wikiann_ur
data_files: wikiann_ur/*.parquet
default: false
- config_name: xnli_ipa
data_files: xnli_ipa/*.parquet
default: false
- config_name: urdu_dictionary
data_files: urdu_dictionary/*.parquet
default: false
- config_name: news_1m
data_files: news_1m/*.parquet
default: false
# ============================================================================
# LOCAL & EXTERNAL SOURCES
# ============================================================================
- config_name: local
data_files: local/*.parquet
default: false
- config_name: mendeley
data_files: mendeley/*.parquet
default: false
# ============================================================================
# DATASET INFO (Detailed Metadata)
# ============================================================================
dataset_info:
- config_name: addresses
features:
- name: urdu
dtype: string
- name: roman_urdu
dtype: string
splits:
- name: train
num_examples: 982837
- config_name: english_urdu_translation
features:
- name: english
dtype: string
- name: urdu
dtype: string
splits:
- name: train
num_examples: 7057673
- config_name: error_correction
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 600000
- config_name: fineweb_pretrain
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 100000
- config_name: gemma_pretrain
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 245153
- config_name: generic_train_ur
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 3731
- config_name: tiny_stories
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 357900
- config_name: imdb_reviews_ur
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 10000
- config_name: sentiment
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 83309
- config_name: sentiment_v1_ur
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 987
- config_name: urdu_sentiment_local
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 20834
- config_name: urdu_sarcasm
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_examples: 19949
- config_name: iqbal_poetry
features:
- name: text
dtype: string
- name: source
dtype: string
- name: original_index
dtype: int64
splits:
- name: train
num_bytes: 979894
num_examples: 10002
download_size: 416600
dataset_size: 979894
- config_name: organized_poetry_csv
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_bytes: 2575653
num_examples: 17609
download_size: 1139029
dataset_size: 2575653
- config_name: poetry_by_poet
features:
- name: poet
dtype: string
- name: poetry_ur
dtype: string
- name: poetry_en
dtype: string
splits:
- name: train
num_bytes: 2064728
num_examples: 1314
download_size: 1098688
dataset_size: 2064728
- config_name: poetry_csv_main
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
num_bytes: 2575653
num_examples: 17609
download_size: 1139029
dataset_size: 2575653
- config_name: urdu_poetry_general
features:
- name: title
dtype: string
- name: content
dtype: string
- name: source
dtype: string
- name: original_index
dtype: int64
splits:
- name: train
num_bytes: 1405946
num_examples: 1323
download_size: 664929
dataset_size: 1405946
- config_name: urdu_reasoning
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 800
- config_name: parliament_translation
features:
- name: urdu
dtype: string
- name: roman_urdu
dtype: string
splits:
- name: train
num_examples: 6374673
- config_name: urdu_tts_transcription
features:
- name: text
dtype: string
splits:
- name: train
num_examples: 4306
- config_name: wikiann_ur
features:
- name: tokens
sequence: string
- name: ner_tags
sequence: int64
splits:
- name: train
num_examples: 21972
- config_name: xnli_ipa
features:
- name: premise
dtype: string
- name: hypothesis
dtype: string
- name: label
dtype: int64
splits:
- name: train
num_examples: 400202
- config_name: cleaned_data
features:
- name: text
dtype: string
splits:
- name: train
- config_name: c4
features:
- name: text
dtype: string
splits:
- name: train
- config_name: cc100
features:
- name: text
dtype: string
splits:
- name: train
- config_name: hplt
features:
- name: text
dtype: string
splits:
- name: train
- config_name: nmt
features:
- name: source
dtype: string
- name: target
dtype: string
splits:
- name: train
- config_name: nmt_parquet
features:
- name: source
dtype: string
- name: target
dtype: string
splits:
- name: train
- config_name: reasoning
features:
- name: text
dtype: string
splits:
- name: train
- config_name: reasoning_parquet
features:
- name: text
dtype: string
splits:
- name: train
- config_name: roman_urdu
features:
- name: text
dtype: string
splits:
- name: train
- config_name: roman_urdu_toxicity
features:
- name: text
dtype: string
- name: label
dtype: string
splits:
- name: train
- config_name: urdu_instruct_alpaca
features:
- name: instruction
dtype: string
- name: input
dtype: string
- name: output
dtype: string
splits:
- name: train
- config_name: urdu_dictionary
features:
- name: word
dtype: string
- name: meaning
dtype: string
splits:
- name: train
- config_name: news_1m
features:
- name: text
dtype: string
splits:
- name: train
- config_name: local
features:
- name: text
dtype: string
splits:
- name: train
- config_name: mendeley
features:
- name: text
dtype: string
splits:
- name: train
---
# Shaistagi (شائستگی) Clean Urdu Mega-Dataset
<div align="center">
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Language: Urdu](https://en.wikipedia.org/wiki/Urdu)
[Dataset on Hugging Face](https://huggingface.co/datasets/ReySajju742/shaistagi_clean)
**A large, multi-task collection of cleaned Urdu NLP datasets**
</div>
---
## 📊 Dataset Overview
**Shaistagi Clean** is one of the most comprehensive, multi-task Urdu NLP collections available. It aggregates high-quality, cleaned data for pre-training, instruction-tuning, and specialized downstream tasks.
### Key Statistics
| Metric | Value |
|--------|-------|
| **Total Rows** | ~16.2 Million |
| **Total Tokens** | ~1.22 Billion |
| **Total Characters** | ~1.73 Billion |
| **Total Configurations** | 37 |
| **Total Parquet Files** | 287 |
---
## 📦 Dataset Composition & Percentages
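The configurations are grouped into six functional categories. The percentages are estimates and sum to slightly more than 100%; `roman_urdu`, `local`, and `mendeley` are not assigned to a category in this table.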
| Category | Estimated % | Configurations |
|:---------|:------------|:---------------|
| **LLM Pre-training** | 65% | `fineweb_pretrain`, `gemma_pretrain`, `generic_train_ur`, `tiny_stories`, `c4`, `cc100`, `hplt`, `cleaned_data`, `news_1m` |
| **Translation (NMT)** | 15% | `english_urdu_translation`, `parliament_translation`, `nmt`, `nmt_parquet` |
| **Classification** | 10% | `imdb_reviews_ur`, `sentiment`, `sentiment_v1_ur`, `urdu_sentiment_local`, `urdu_sarcasm`, `roman_urdu_toxicity` |
| **Specialized/Structured** | 7% | `addresses` (982k+ rows), `urdu_tts_transcription`, `wikiann_ur`, `xnli_ipa`, `urdu_dictionary` |
| **Reasoning & Instruction** | 3% | `urdu_reasoning`, `reasoning`, `reasoning_parquet`, `urdu_instruct_alpaca`, `error_correction` |
| **Poetry** | ~1% | `iqbal_poetry`, `organized_poetry_csv`, `poetry_by_poet`, `poetry_csv_main`, `urdu_poetry_general` |
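
Because the estimated shares above add up to 101 rather than 100, they need to be normalized before being used as sampling weights in a training mix. The sketch below shows that arithmetic only. The category keys and the `normalize` helper are illustrative names, not part of the dataset.

```python
# Estimated percentages copied from the table above (Poetry is listed as ~1%)
ESTIMATED_PCT = {
    "pretraining": 65,
    "translation": 15,
    "classification": 10,
    "specialized": 7,
    "reasoning_instruction": 3,
    "poetry": 1,
}


def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Rescale the weights so that they sum to 1.0."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


WEIGHTS = normalize(ESTIMATED_PCT)
print(sum(ESTIMATED_PCT.values()))       # raw shares sum to 101
print(round(WEIGHTS["pretraining"], 3))  # share of pre-training after rescaling
```

Whether these proportions suit a particular training run is a separate design decision; the table only describes how the collection is composed.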
---
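## 📋 Detailed Configuration Statistics
Row and token statistics are listed for 22 of the 37 configurations. The remaining 15 (for example `c4`, `cc100`, `hplt`, and `cleaned_data`) have no row counts in the card metadata. A `-` marks a value that was not reported.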
| Config | Rows | Tokens | Avg Tokens/Row | Description |
|--------|------|--------|----------------|-------------|
| `addresses` | 982,837 | 33.2M | 33.78 | Urdu/Roman Urdu address mappings |
| `english_urdu_translation` | 7,057,673 | 115.6M | 16.39 | Parallel EN-UR translations |
| `error_correction` | 600,000 | 108.3M | 180.43 | Text error correction pairs |
| `fineweb_pretrain` | 100,000 | 266.4M | 2664.03 | Long-form pretraining text |
| `gemma_pretrain` | 245,153 | 200.3M | 817.15 | Gemma-formatted instruction data |
| `generic_train_ur` | 3,731 | 253K | 67.85 | Urdu headlines with labels |
| `imdb_reviews_ur` | 10,000 | 12M | 1201.53 | IMDB reviews in Urdu |
| `iqbal_poetry` | 10,002 | 316K | 31.65 | Allama Iqbal poetry |
| `organized_poetry_csv` | 17,609 | 1.3M | 73.52 | Organized poetry with labels |
| `parliament_translation` | 6,374,673 | 363.7M | 57.06 | Urdu/Roman transliteration |
| `poetry_by_poet` | 1,314 | 968K | 737.22 | Poetry organized by poet |
| `poetry_csv_main` | 17,609 | 1.3M | 73.52 | Poetry collection |
| `sentiment` | 83,309 | - | - | Roman Urdu sentiment |
| `sentiment_v1_ur` | 987 | 78K | 79.61 | Urdu tweets sentiment |
| `tiny_stories` | 357,900 | 73.6M | 205.63 | Children's stories in Urdu |
| `urdu_poetry_general` | 1,323 | 679K | 513.23 | General Urdu poetry |
| `urdu_reasoning` | 800 | 110K | 137.92 | Math/reasoning problems |
| `urdu_sarcasm` | 19,949 | 1.7M | 85.76 | Sarcasm detection |
| `urdu_sentiment_local` | 20,834 | 4.4M | 213.09 | Sentiment/toxicity |
| `urdu_tts_transcription` | 4,306 | 314K | 73.06 | TTS transcription |
| `wikiann_ur` | 21,972 | 773K | 35.21 | Named Entity Recognition |
| `xnli_ipa` | 400,202 | 30.9M | 77.29 | Natural Language Inference |
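
The Tokens column agrees with Rows multiplied by Avg Tokens/Row. The sketch below recomputes three entries from the table as a sanity check. The card does not say which tokenizer produced these counts, so this reproduces the arithmetic only, not the tokenization. `estimated_tokens` is an illustrative helper name.

```python
# (rows, average tokens per row) copied from the table above
STATS = {
    "addresses": (982_837, 33.78),
    "fineweb_pretrain": (100_000, 2664.03),
    "parliament_translation": (6_374_673, 57.06),
}


def estimated_tokens(rows: int, avg_tokens_per_row: float) -> float:
    """Approximate the total token count as rows times average tokens per row."""
    return rows * avg_tokens_per_row


for name, (rows, avg) in STATS.items():
    millions = estimated_tokens(rows, avg) / 1e6
    print(f"{name}: ~{millions:.1f}M tokens")
```

The printed values can be compared directly against the Tokens column for the same configurations.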
---
## 🔍 What This Dataset Includes
### 1. 📚 Large-Scale Pre-training Data
Diverse Urdu web text from multiple sources (C4, CC100, HPLT, FineWeb) and synthetic data (Tiny Stories) to help models learn Urdu syntax and semantics.
### 2. 🏠 Structured Urdu Addresses
Nearly **1 million rows** (982,837) of Urdu to Roman Urdu address mappings, useful for logistics and geolocation models.
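
According to the card metadata, each `addresses` row has two string fields, `urdu` and `roman_urdu`. One possible use is a simple lookup table from Urdu script to Roman Urdu, sketched below. The sample rows are hypothetical placeholders in the documented schema, not records taken from the dataset, and `build_transliteration_table` is an illustrative helper.

```python
def build_transliteration_table(rows):
    """Map each Urdu address string to its Roman Urdu counterpart."""
    table = {}
    for row in rows:
        urdu = row["urdu"].strip()
        roman = row["roman_urdu"].strip()
        if urdu and roman:
            # Keep the first Roman spelling seen for a given Urdu string
            table.setdefault(urdu, roman)
    return table


# Hypothetical rows for illustration only
sample_rows = [
    {"urdu": "لاہور", "roman_urdu": "Lahore"},
    {"urdu": "کراچی", "roman_urdu": "Karachi"},
    {"urdu": "لاہور", "roman_urdu": "Lahor"},   # duplicate key, ignored
    {"urdu": " ", "roman_urdu": "unknown"},      # empty after stripping, skipped
]

table = build_transliteration_table(sample_rows)
print(len(table))
```

Keeping the first spelling is an arbitrary choice; Roman Urdu has no standard orthography, so a real pipeline may prefer to keep all variants.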
### 3. 💭 Sentiment & Nuance
Benchmark datasets including IMDB Urdu, Urdu Sarcasm, and multiple sentiment datasets for detecting emotional tone and figurative language.
### 4. 🌐 Cross-Lingual NLI (`xnli_ipa`)
Premises and hypotheses in Urdu for Natural Language Inference tasks (entailment, contradiction, neutral).
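
The card stores `label` as a plain `int64` and does not document the mapping to class names. The sketch below assumes the upstream XNLI convention (0 = entailment, 1 = neutral, 2 = contradiction); treat that mapping as an assumption and verify it against the data before relying on it. The sample rows are hypothetical.

```python
from collections import Counter

# Assumed mapping (upstream XNLI convention); not confirmed by this dataset card
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_distribution(rows):
    """Count how many rows fall under each NLI class name."""
    return Counter(LABEL_NAMES.get(row["label"], "unknown") for row in rows)


# Hypothetical rows in the documented schema (premise, hypothesis, label)
sample_rows = [
    {"premise": "وہ کتاب پڑھ رہا ہے", "hypothesis": "وہ پڑھ رہا ہے", "label": 0},
    {"premise": "وہ کتاب پڑھ رہا ہے", "hypothesis": "وہ سو رہا ہے", "label": 2},
    {"premise": "وہ کتاب پڑھ رہا ہے", "hypothesis": "کتاب نئی ہے", "label": 1},
    {"premise": "وہ کتاب پڑھ رہا ہے", "hypothesis": "وہ گھر پر ہے", "label": 1},
]

distribution = label_distribution(sample_rows)
print(distribution.most_common())
```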
### 5. 📜 Poetry Collections
Multiple poetry datasets including Allama Iqbal's works, organized by poet, and general Urdu poetry.
### 6. 🔤 Named Entity Recognition (`wikiann_ur`)
Token-level NER annotations for identifying persons, locations, and organizations.
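
Each `wikiann_ur` row pairs a `tokens` sequence with an integer `ner_tags` sequence. The sketch below groups BIO-tagged tokens into entity spans. The tag order in `TAG_NAMES` follows the upstream WikiANN convention and is an assumption, since this card does not list label names; the example sentence is hypothetical.

```python
# Assumed tag order (upstream WikiANN convention); not confirmed by this card
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def extract_entities(tokens, ner_tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        tag = TAG_NAMES[tag_id]
        starts_span = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_span:
            # Close any open span, then start a new one
            # (an I- tag without a matching B- tag is treated as a start)
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" tag: close any open span
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical example sentence, not taken from the dataset
tokens = ["علامہ", "اقبال", "لاہور", "میں", "رہتے", "تھے"]
ner_tags = [1, 2, 5, 0, 0, 0]
entities = extract_entities(tokens, ner_tags)
print(entities)
```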
### 7. 🧠 Reasoning & Instruction Data
Math problems, reasoning tasks, and Alpaca-format instruction data in Urdu.
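
The `urdu_instruct_alpaca` configuration exposes `instruction`, `input`, and `output` fields. A minimal sketch of turning one row into a prompt/completion pair is shown below. The section markers follow the commonly used Alpaca layout, which is an assumption here: the card does not prescribe a template, and `format_alpaca` is an illustrative helper.

```python
def format_alpaca(example):
    """Build a prompt/completion pair from one Alpaca-style row."""
    instruction = example["instruction"].strip()
    context = (example.get("input") or "").strip()
    output = example["output"].strip()
    if context:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    else:
        # Rows without extra context omit the Input section entirely
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {"prompt": prompt, "completion": output}


# Hypothetical row in the documented schema
row = {"instruction": "اس جملے کا انگریزی میں ترجمہ کریں۔", "input": "شکریہ", "output": "Thank you"}
pair = format_alpaca(row)
print(pair["prompt"] + pair["completion"])
```

With the `datasets` library, a function of this shape can be passed to `Dataset.map` to add the two new columns to every row.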
---
## 🚀 Quick Start
```python
from datasets import load_dataset

# Load the default configuration (cleaned_data - largest)
ds = load_dataset("ReySajju742/shaistagi_clean")
print(ds['train'][0])

# Load specific configurations
addresses = load_dataset("ReySajju742/shaistagi_clean", "addresses")
poetry = load_dataset("ReySajju742/shaistagi_clean", "iqbal_poetry")
sentiment = load_dataset("ReySajju742/shaistagi_clean", "sentiment")
translation = load_dataset("ReySajju742/shaistagi_clean", "english_urdu_translation")

# Load web crawl data for pretraining
c4_data = load_dataset("ReySajju742/shaistagi_clean", "c4")
hplt_data = load_dataset("ReySajju742/shaistagi_clean", "hplt")
```
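The web-crawl configurations span many Parquet files, so downloading them in full just to inspect a few rows can be slow. A possible alternative is the `streaming=True` option of `load_dataset`, sketched below. `take` and `stream_config` are illustrative helper names, and the usage lines are commented out because they need network access.

```python
from itertools import islice


def take(rows, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_config(config_name, split="train"):
    """Open one configuration lazily instead of downloading every Parquet file."""
    # Imported here so `take` stays usable without the `datasets` package installed
    from datasets import load_dataset

    return load_dataset(
        "ReySajju742/shaistagi_clean",
        config_name,
        split=split,
        streaming=True,
    )


# Usage (requires network access), left commented out on purpose:
# for row in take(stream_config("c4"), 3):
#     print(row["text"][:80])
```

Streaming yields rows one at a time, so random access such as `ds[0]` is not available on the returned object.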
---
## 📁 Available Configurations
<details>
<summary><b>Click to expand all 37 configurations</b></summary>
### Pre-training Data
- `fineweb_pretrain` - FineWeb Urdu subset
- `gemma_pretrain` - Gemma-formatted data
- `generic_train_ur` - Generic training data
- `tiny_stories` - Urdu children's stories
- `c4` - C4 Urdu subset (19 files)
- `cc100` - CC100 Urdu subset (12 files)
- `hplt` - HPLT web crawl (34 files)
- `cleaned_data` - Main cleaned data (139 files)
- `news_1m` - 1M news articles
### Translation
- `english_urdu_translation` - EN-UR parallel corpus
- `parliament_translation` - Parliamentary text as Urdu/Roman Urdu pairs
- `nmt` - Neural MT data
- `nmt_parquet` - NMT in parquet format
### Classification & Sentiment
- `imdb_reviews_ur` - IMDB reviews
- `sentiment` - Roman Urdu sentiment
- `sentiment_v1_ur` - Urdu tweets
- `urdu_sentiment_local` - Local sentiment
- `urdu_sarcasm` - Sarcasm detection
- `roman_urdu_toxicity` - Toxicity detection
### Poetry
- `iqbal_poetry` - Allama Iqbal
- `organized_poetry_csv` - Organized poetry
- `poetry_by_poet` - By poet name
- `poetry_csv_main` - Main poetry CSV
- `urdu_poetry_general` - General poetry
### Structured & Specialized
- `addresses` - Address mappings
- `urdu_tts_transcription` - TTS data
- `wikiann_ur` - NER annotations
- `xnli_ipa` - NLI data
- `urdu_dictionary` - Dictionary entries
- `urdu_instruct_alpaca` - Alpaca instructions
### Reasoning
- `urdu_reasoning` - Reasoning tasks
- `reasoning` - General reasoning
- `reasoning_parquet` - Reasoning parquet
- `error_correction` - Error correction
### Other
- `roman_urdu` - Roman Urdu text
- `local` - Local sources
- `mendeley` - Mendeley data
</details>
---
## 📄 License
This dataset is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).
---
## 🙏 Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{shaistagi_clean_2026,
author = {ReySajju742},
title = {Shaistagi Clean: Comprehensive Urdu NLP Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ReySajju742/shaistagi_clean}
}
```
---
## 📧 Contact
For questions or feedback, please open a discussion in the Community tab of the [dataset repository](https://huggingface.co/datasets/ReySajju742/shaistagi_clean).
Provider: ReySajju742



