
ReySajju742/shaistagi_clean

Hugging Face, updated 2026-01-28, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ReySajju742/shaistagi_clean
Resource overview:
---
license: apache-2.0
language:
  - ur
pretty_name: Shaistagi (شائستگی) Clean Urdu Mega-Dataset
size_categories:
  - 10M<n<100M
task_categories:
  - text-generation
  - translation
  - text-classification
  - token-classification
  - question-answering
tags:
  - urdu
  - nlp
  - cleaned
  - instruction-tuning
  - sentiment
  - reasoning
  - address-parsing
  - poetry
  - pretraining
  - nmt
  - ner
configs:
  # ============================================================================
  # SPECIALIZED DATASETS
  # ============================================================================
  - config_name: addresses
    data_files: addresses/*.parquet
    default: false
  - config_name: english_urdu_translation
    data_files: english_urdu_translation/*.parquet
    default: false
  - config_name: error_correction
    data_files: error_correction/*.parquet
    default: false
  # ============================================================================
  # PRETRAINING DATA
  # ============================================================================
  - config_name: fineweb_pretrain
    data_files: fineweb_pretrain/*.parquet
    default: false
  - config_name: gemma_pretrain
    data_files: gemma_pretrain/*.parquet
    default: false
  - config_name: generic_train_ur
    data_files: generic_train_ur/*.parquet
    default: false
  - config_name: tiny_stories
    data_files: tiny_stories/*.parquet
    default: false
  # ============================================================================
  # WEB CRAWL DATA (Large Scale Pretraining)
  # ============================================================================
  - config_name: c4
    data_files: c4/*.parquet
    default: false
  - config_name: cc100
    data_files: cc100/*.parquet
    default: false
  - config_name: hplt
    data_files: hplt/*.parquet
    default: false
  - config_name: cleaned_data
    data_files: cleaned_data/*.parquet
    default: true
  # ============================================================================
  # TRANSLATION / NMT DATA
  # ============================================================================
  - config_name: nmt
    data_files: nmt/*.parquet
    default: false
  - config_name: nmt_parquet
    data_files: nmt_parquet/*.parquet
    default: false
  - config_name: parliament_translation
    data_files: parliament_translation/*.parquet
    default: false
  # ============================================================================
  # SENTIMENT & CLASSIFICATION
  # ============================================================================
  - config_name: imdb_reviews_ur
    data_files: imdb_reviews_ur/*.parquet
    default: false
  - config_name: sentiment
    data_files: sentiment/*.parquet
    default: false
  - config_name: sentiment_v1_ur
    data_files: sentiment_v1_ur/*.parquet
    default: false
  - config_name: urdu_sentiment_local
    data_files: urdu_sentiment_local/*.parquet
    default: false
  - config_name: urdu_sarcasm
    data_files: urdu_sarcasm/*.parquet
    default: false
  # ============================================================================
  # POETRY DATA
  # ============================================================================
  - config_name: iqbal_poetry
    data_files:
      - split: train
        path: iqbal_poetry/train-*.parquet
    default: false
  - config_name: organized_poetry_csv
    data_files:
      - split: train
        path: organized_poetry_csv/train-*.parquet
    default: false
  - config_name: poetry_by_poet
    data_files:
      - split: train
        path: poetry_by_poet/train-*.parquet
    default: false
  - config_name: poetry_csv_main
    data_files:
      - split: train
        path: poetry_csv_main/train-*.parquet
    default: false
  - config_name: urdu_poetry_general
    data_files:
      - split: train
        path: urdu_poetry_general/train-*.parquet
    default: false
  # ============================================================================
  # REASONING & INSTRUCTION DATA
  # ============================================================================
  - config_name: urdu_reasoning
    data_files: urdu_reasoning/*.parquet
    default: false
  - config_name: reasoning
    data_files: reasoning/*.parquet
    default: false
  - config_name: reasoning_parquet
    data_files: reasoning_parquet/*.parquet
    default: false
  - config_name: urdu_instruct_alpaca
    data_files: urdu_instruct_alpaca/*.parquet
    default: false
  # ============================================================================
  # ROMAN URDU & TRANSLITERATION
  # ============================================================================
  - config_name: roman_urdu
    data_files: roman_urdu/*.parquet
    default: false
  - config_name: roman_urdu_toxicity
    data_files: roman_urdu_toxicity/*.parquet
    default: false
  # ============================================================================
  # SPECIALIZED / STRUCTURED DATA
  # ============================================================================
  - config_name: urdu_tts_transcription
    data_files: urdu_tts_transcription/*.parquet
    default: false
  - config_name: wikiann_ur
    data_files: wikiann_ur/*.parquet
    default: false
  - config_name: xnli_ipa
    data_files: xnli_ipa/*.parquet
    default: false
  - config_name: urdu_dictionary
    data_files: urdu_dictionary/*.parquet
    default: false
  - config_name: news_1m
    data_files: news_1m/*.parquet
    default: false
  # ============================================================================
  # LOCAL & EXTERNAL SOURCES
  # ============================================================================
  - config_name: local
    data_files: local/*.parquet
    default: false
  - config_name: mendeley
    data_files: mendeley/*.parquet
    default: false
# ============================================================================
# DATASET INFO (Detailed Metadata)
# ============================================================================
dataset_info:
  - config_name: addresses
    features:
      - name: urdu
        dtype: string
      - name: roman_urdu
        dtype: string
    splits:
      - name: train
        num_examples: 982837
  - config_name: english_urdu_translation
    features:
      - name: english
        dtype: string
      - name: urdu
        dtype: string
    splits:
      - name: train
        num_examples: 7057673
  - config_name: error_correction
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 600000
  - config_name: fineweb_pretrain
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 100000
  - config_name: gemma_pretrain
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 245153
  - config_name: generic_train_ur
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 3731
  - config_name: tiny_stories
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 357900
  - config_name: imdb_reviews_ur
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 10000
  - config_name: sentiment
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 83309
  - config_name: sentiment_v1_ur
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 987
  - config_name: urdu_sentiment_local
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 20834
  - config_name: urdu_sarcasm
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_examples: 19949
  - config_name: iqbal_poetry
    features:
      - name: text
        dtype: string
      - name: source
        dtype: string
      - name: original_index
        dtype: int64
    splits:
      - name: train
        num_bytes: 979894
        num_examples: 10002
    download_size: 416600
    dataset_size: 979894
  - config_name: organized_poetry_csv
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_bytes: 2575653
        num_examples: 17609
    download_size: 1139029
    dataset_size: 2575653
  - config_name: poetry_by_poet
    features:
      - name: poet
        dtype: string
      - name: poetry_ur
        dtype: string
      - name: poetry_en
        dtype: string
    splits:
      - name: train
        num_bytes: 2064728
        num_examples: 1314
    download_size: 1098688
    dataset_size: 2064728
  - config_name: poetry_csv_main
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
        num_bytes: 2575653
        num_examples: 17609
    download_size: 1139029
    dataset_size: 2575653
  - config_name: urdu_poetry_general
    features:
      - name: title
        dtype: string
      - name: content
        dtype: string
      - name: source
        dtype: string
      - name: original_index
        dtype: int64
    splits:
      - name: train
        num_bytes: 1405946
        num_examples: 1323
    download_size: 664929
    dataset_size: 1405946
  - config_name: urdu_reasoning
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 800
  - config_name: parliament_translation
    features:
      - name: urdu
        dtype: string
      - name: roman_urdu
        dtype: string
    splits:
      - name: train
        num_examples: 6374673
  - config_name: urdu_tts_transcription
    features:
      - name: text
        dtype: string
    splits:
      - name: train
        num_examples: 4306
  - config_name: wikiann_ur
    features:
      - name: tokens
        sequence: string
      - name: ner_tags
        sequence: int64
    splits:
      - name: train
        num_examples: 21972
  - config_name: xnli_ipa
    features:
      - name: premise
        dtype: string
      - name: hypothesis
        dtype: string
      - name: label
        dtype: int64
    splits:
      - name: train
        num_examples: 400202
  - config_name: cleaned_data
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: c4
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: cc100
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: hplt
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: nmt
    features:
      - name: source
        dtype: string
      - name: target
        dtype: string
    splits:
      - name: train
  - config_name: nmt_parquet
    features:
      - name: source
        dtype: string
      - name: target
        dtype: string
    splits:
      - name: train
  - config_name: reasoning
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: reasoning_parquet
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: roman_urdu
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: roman_urdu_toxicity
    features:
      - name: text
        dtype: string
      - name: label
        dtype: string
    splits:
      - name: train
  - config_name: urdu_instruct_alpaca
    features:
      - name: instruction
        dtype: string
      - name: input
        dtype: string
      - name: output
        dtype: string
    splits:
      - name: train
  - config_name: urdu_dictionary
    features:
      - name: word
        dtype: string
      - name: meaning
        dtype: string
    splits:
      - name: train
  - config_name: news_1m
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: local
    features:
      - name: text
        dtype: string
    splits:
      - name: train
  - config_name: mendeley
    features:
      - name: text
        dtype: string
    splits:
      - name: train
---

# Shaistagi (شائستگی) Clean Urdu Mega-Dataset

<div align="center">

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Language](https://img.shields.io/badge/Language-Urdu-green.svg)](https://en.wikipedia.org/wiki/Urdu)
[![HuggingFace](https://img.shields.io/badge/🤗-Datasets-yellow.svg)](https://huggingface.co/datasets/ReySajju742/shaistagi_clean)

**The largest and most comprehensive cleaned Urdu NLP dataset collection**

</div>

---

## 📊 Dataset Overview

**Shaistagi Clean** is one of the most comprehensive, multi-task Urdu NLP collections available. It aggregates high-quality, cleaned data for pre-training, instruction-tuning, and specialized downstream tasks.
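Because the configurations serve different purposes, their rows have different columns. The sketch below shows one way rows from three kinds of configuration might be flattened into plain training text. The column names (`text`; `instruction`, `input`, `output`; `english`, `urdu`) come from the card metadata, but the function name `to_training_text` and the text templates are hypothetical and not part of the dataset.

```python
from typing import Dict


def to_training_text(config_name: str, row: Dict[str, str]) -> str:
    """Flatten one row into a single training string.

    Column names follow the card metadata; the templates are hypothetical.
    """
    if config_name == "urdu_instruct_alpaca":
        # Alpaca-style rows: instruction, optional input, output.
        parts = [row["instruction"]]
        if row.get("input"):
            parts.append(row["input"])
        parts.append(row["output"])
        return "\n\n".join(parts)
    if config_name == "english_urdu_translation":
        # Parallel rows: one English column and one Urdu column.
        return f"English: {row['english']}\nUrdu: {row['urdu']}"
    # Pre-training configs such as tiny_stories expose a single text column.
    return row["text"]


if __name__ == "__main__":
    example = {"instruction": "Translate to Urdu.", "input": "Hello", "output": "ہیلو"}
    print(to_training_text("urdu_instruct_alpaca", example))
```

A dispatcher keyed on the configuration name keeps the per-task formatting in one place; configurations with other schemas (for example `wikiann_ur` or `xnli_ipa`) would need their own branches.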
### Key Statistics

| Metric | Value |
|--------|-------|
| **Total Rows** | ~16.2 Million |
| **Total Tokens** | ~1.22 Billion |
| **Total Characters** | ~1.73 Billion |
| **Total Configurations** | 37 |
| **Total Parquet Files** | 287 |

---

## 📦 Dataset Composition & Percentages

The dataset is categorized into functional blocks:

| Category | Estimated % | Configurations |
|:---------|:------------|:---------------|
| **LLM Pre-training** | 65% | `fineweb_pretrain`, `gemma_pretrain`, `generic_train_ur`, `tiny_stories`, `c4`, `cc100`, `hplt`, `cleaned_data`, `news_1m` |
| **Translation (NMT)** | 15% | `english_urdu_translation`, `parliament_translation`, `nmt`, `nmt_parquet` |
| **Classification** | 10% | `imdb_reviews_ur`, `sentiment`, `sentiment_v1_ur`, `urdu_sentiment_local`, `urdu_sarcasm`, `roman_urdu_toxicity` |
| **Specialized/Structured** | 7% | `addresses` (982k+ rows), `urdu_tts_transcription`, `wikiann_ur`, `xnli_ipa`, `urdu_dictionary` |
| **Reasoning & Instruction** | 3% | `urdu_reasoning`, `reasoning`, `reasoning_parquet`, `urdu_instruct_alpaca`, `error_correction` |
| **Poetry** | ~1% | `iqbal_poetry`, `organized_poetry_csv`, `poetry_by_poet`, `poetry_csv_main`, `urdu_poetry_general` |

---

## 📋 Detailed Configuration Statistics

| Config | Rows | Tokens | Avg Tokens/Row | Description |
|--------|------|--------|----------------|-------------|
| `addresses` | 982,837 | 33.2M | 33.78 | Urdu/Roman Urdu address mappings |
| `english_urdu_translation` | 7,057,673 | 115.6M | 16.39 | Parallel EN-UR translations |
| `error_correction` | 600,000 | 108.3M | 180.43 | Text error correction pairs |
| `fineweb_pretrain` | 100,000 | 266.4M | 2664.03 | Long-form pretraining text |
| `gemma_pretrain` | 245,153 | 200.3M | 817.15 | Gemma-formatted instruction data |
| `generic_train_ur` | 3,731 | 253K | 67.85 | Urdu headlines with labels |
| `imdb_reviews_ur` | 10,000 | 12M | 1201.53 | IMDB reviews in Urdu |
| `iqbal_poetry` | 10,002 | 316K | 31.65 | Allama Iqbal poetry |
| `organized_poetry_csv` | 17,609 | 1.3M | 73.52 | Organized poetry with labels |
| `parliament_translation` | 6,374,673 | 363.7M | 57.06 | Urdu/Roman transliteration |
| `poetry_by_poet` | 1,314 | 968K | 737.22 | Poetry organized by poet |
| `poetry_csv_main` | 17,609 | 1.3M | 73.52 | Poetry collection |
| `sentiment` | 83,309 | - | - | Roman Urdu sentiment |
| `sentiment_v1_ur` | 987 | 78K | 79.61 | Urdu tweets sentiment |
| `tiny_stories` | 357,900 | 73.6M | 205.63 | Children's stories in Urdu |
| `urdu_poetry_general` | 1,323 | 679K | 513.23 | General Urdu poetry |
| `urdu_reasoning` | 800 | 110K | 137.92 | Math/reasoning problems |
| `urdu_sarcasm` | 19,949 | 1.7M | 85.76 | Sarcasm detection |
| `urdu_sentiment_local` | 20,834 | 4.4M | 213.09 | Sentiment/toxicity |
| `urdu_tts_transcription` | 4,306 | 314K | 73.06 | TTS transcription |
| `wikiann_ur` | 21,972 | 773K | 35.21 | Named Entity Recognition |
| `xnli_ipa` | 400,202 | 30.9M | 77.29 | Natural Language Inference |

---

## 🔍 What This Dataset Includes

### 1. 📚 Large-Scale Pre-training Data

Diverse Urdu web text from multiple sources (C4, CC100, HPLT, FineWeb) and synthetic data (Tiny Stories) to help models learn Urdu syntax and semantics.

### 2. 🏠 Structured Urdu Addresses

Nearly **1 million rows** of Urdu-Roman Urdu address mappings, essential for logistics and geolocation models.

### 3. 💭 Sentiment & Nuance

Benchmark datasets including IMDB Urdu, Urdu Sarcasm, and multiple sentiment datasets for detecting emotional tone and figurative language.

### 4. 🌐 Cross-Lingual NLI (`xnli_ipa`)

Premises and hypotheses in Urdu for Natural Language Inference tasks (entailment, contradiction, neutral).

### 5. 📜 Poetry Collections

Multiple poetry datasets including Allama Iqbal's works, organized by poet, and general Urdu poetry.

### 6. 🔤 Named Entity Recognition (`wikiann_ur`)

Token-level NER annotations for identifying persons, locations, and organizations.

### 7. 🧠 Reasoning & Instruction Data

Math problems, reasoning tasks, and Alpaca-format instruction data in Urdu.

---

## 🚀 Quick Start

```python
from datasets import load_dataset

# Load the default configuration (cleaned_data - largest)
ds = load_dataset("ReySajju742/shaistagi_clean")
print(ds['train'][0])

# Load specific configurations
addresses = load_dataset("ReySajju742/shaistagi_clean", "addresses")
poetry = load_dataset("ReySajju742/shaistagi_clean", "iqbal_poetry")
sentiment = load_dataset("ReySajju742/shaistagi_clean", "sentiment")
translation = load_dataset("ReySajju742/shaistagi_clean", "english_urdu_translation")

# Load web crawl data for pretraining
c4_data = load_dataset("ReySajju742/shaistagi_clean", "c4")
hplt_data = load_dataset("ReySajju742/shaistagi_clean", "hplt")
```

---

## 📁 Available Configurations

<details>
<summary><b>Click to expand all 37 configurations</b></summary>

### Pre-training Data

- `fineweb_pretrain` - FineWeb Urdu subset
- `gemma_pretrain` - Gemma-formatted data
- `generic_train_ur` - Generic training data
- `tiny_stories` - Urdu children's stories
- `c4` - C4 Urdu subset (19 files)
- `cc100` - CC100 Urdu subset (12 files)
- `hplt` - HPLT web crawl (34 files)
- `cleaned_data` - Main cleaned data (139 files)
- `news_1m` - 1M news articles

### Translation

- `english_urdu_translation` - EN-UR parallel corpus
- `parliament_translation` - Parliamentary translations
- `nmt` - Neural MT data
- `nmt_parquet` - NMT in parquet format

### Classification & Sentiment

- `imdb_reviews_ur` - IMDB reviews
- `sentiment` - General sentiment
- `sentiment_v1_ur` - Urdu tweets
- `urdu_sentiment_local` - Local sentiment
- `urdu_sarcasm` - Sarcasm detection
- `roman_urdu_toxicity` - Toxicity detection

### Poetry

- `iqbal_poetry` - Allama Iqbal
- `organized_poetry_csv` - Organized poetry
- `poetry_by_poet` - By poet name
- `poetry_csv_main` - Main poetry CSV
- `urdu_poetry_general` - General poetry

### Structured & Specialized

- `addresses` - Address mappings
- `urdu_tts_transcription` - TTS data
- `wikiann_ur` - NER annotations
- `xnli_ipa` - NLI data
- `urdu_dictionary` - Dictionary entries
- `urdu_instruct_alpaca` - Alpaca instructions

### Reasoning

- `urdu_reasoning` - Reasoning tasks
- `reasoning` - General reasoning
- `reasoning_parquet` - Reasoning parquet
- `error_correction` - Error correction

### Other

- `roman_urdu` - Roman Urdu text
- `local` - Local sources
- `mendeley` - Mendeley data

</details>

---

## 📄 License

This dataset is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).

---

## 🙏 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{shaistagi_clean_2026,
  author = {ReySajju742},
  title = {Shaistagi Clean: Comprehensive Urdu NLP Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/ReySajju742/shaistagi_clean}
}
```

---

## 📧 Contact

For questions or feedback, please open an issue on the [dataset repository](https://huggingface.co/datasets/ReySajju742/shaistagi_clean).
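As a sanity check on the per-configuration statistics table in this card, the sketch below recomputes the totals and compares `rows * avg tokens/row` with the reported token count for each configuration. The figures are copied from that table; the helper names are hypothetical. Summing the Rows column gives roughly 16.3 million (about 16.2 million without `sentiment`, which has no token figures) and the Tokens column about 1.22 billion. This suggests the Key Statistics may cover only the configurations listed in that table and not the web-crawl ones; the card does not say so, so treat it as an inference.

```python
from typing import Dict, Optional, Tuple

# (rows, tokens, avg tokens per row), copied from the statistics table.
# `sentiment` has no token figures in the table, hence None.
Stat = Tuple[int, Optional[float], Optional[float]]
STATS: Dict[str, Stat] = {
    "addresses": (982_837, 33.2e6, 33.78),
    "english_urdu_translation": (7_057_673, 115.6e6, 16.39),
    "error_correction": (600_000, 108.3e6, 180.43),
    "fineweb_pretrain": (100_000, 266.4e6, 2664.03),
    "gemma_pretrain": (245_153, 200.3e6, 817.15),
    "generic_train_ur": (3_731, 253e3, 67.85),
    "imdb_reviews_ur": (10_000, 12e6, 1201.53),
    "iqbal_poetry": (10_002, 316e3, 31.65),
    "organized_poetry_csv": (17_609, 1.3e6, 73.52),
    "parliament_translation": (6_374_673, 363.7e6, 57.06),
    "poetry_by_poet": (1_314, 968e3, 737.22),
    "poetry_csv_main": (17_609, 1.3e6, 73.52),
    "sentiment": (83_309, None, None),
    "sentiment_v1_ur": (987, 78e3, 79.61),
    "tiny_stories": (357_900, 73.6e6, 205.63),
    "urdu_poetry_general": (1_323, 679e3, 513.23),
    "urdu_reasoning": (800, 110e3, 137.92),
    "urdu_sarcasm": (19_949, 1.7e6, 85.76),
    "urdu_sentiment_local": (20_834, 4.4e6, 213.09),
    "urdu_tts_transcription": (4_306, 314e3, 73.06),
    "wikiann_ur": (21_972, 773e3, 35.21),
    "xnli_ipa": (400_202, 30.9e6, 77.29),
}


def total_rows(stats: Dict[str, Stat]) -> int:
    """Sum of the Rows column."""
    return sum(rows for rows, _, _ in stats.values())


def total_tokens(stats: Dict[str, Stat]) -> float:
    """Sum of the Tokens column, skipping configs without a token figure."""
    return sum(tokens for _, tokens, _ in stats.values() if tokens is not None)


def max_relative_gap(stats: Dict[str, Stat]) -> float:
    """Largest relative gap between rows * avg and the reported tokens."""
    gaps = [
        abs(rows * avg - tokens) / tokens
        for rows, tokens, avg in stats.values()
        if tokens is not None and avg is not None
    ]
    return max(gaps)


if __name__ == "__main__":
    print(f"rows:   {total_rows(STATS):,}")
    print(f"tokens: {total_tokens(STATS) / 1e9:.2f}B")
    print(f"max gap between rows*avg and tokens: {max_relative_gap(STATS):.2%}")
```

The reported token counts are rounded (for example `4.4M`), so small gaps are expected; every row in the table agrees with `rows * avg` to within about 1%.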
Provider:
ReySajju742