aakashMeghwar01/sindhi-corpus-505m

Name: aakashMeghwar01/sindhi-corpus-505m
Creator: aakashMeghwar01
Published: 2026-03-17 16:46:15
License: Apache 2.0

Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/aakashMeghwar01/sindhi-corpus-505m


Description:

---
language:
- sd
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
tags:
- sindhi
- low-resource
- arabic-script
- pretraining
- computational-linguistics
- nlp
- language-model
pretty_name: "Sindhi Corpus 505M — Pretraining Dataset"
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 742000
---

# Sindhi Corpus 505M

**The largest open-source, deduplicated Sindhi language pretraining corpus.**

~505 million tokens across 742K documents, covering news, literature, legal, religious, encyclopedic, and web-crawled Sindhi text. Built for training Sindhi language models, tokenizers, and NLP tools.

## Dataset Summary

| Stat | Value |
|------|-------|
| **Documents** | ~742,000 |
| **Tokens (estimated)** | ~505 million |
| **Language** | Sindhi (sd), Arabic script |
| **Format** | Parquet (single `text` column) |
| **License** | Apache 2.0 |
| **Curator** | [Aakash Meghwar](https://huggingface.co/aakashMeghwar01) |

## How to Use

```python
from datasets import load_dataset

# Full dataset (downloads every Parquet shard before returning)
ds = load_dataset("aakashMeghwar01/sindhi-corpus-505m", split="train")
print(ds[0]["text"][:200])

# Streaming (recommended for Colab/Kaggle): examples are fetched lazily
ds = load_dataset("aakashMeghwar01/sindhi-corpus-505m", split="train", streaming=True)
for example in ds:
    print(example["text"][:200])
    break
```

## Source Datasets

This corpus was compiled from 11 publicly available Sindhi datasets spanning diverse genres:

| # | Source | Type | URL |
|---|--------|------|-----|
| 1 | **AMBILE Sindhi Mega Corpus** | Mixed (news, web, literature) | [Kaggle](https://www.kaggle.com/datasets/ambile/sindhi-mega-corpus-118-million-tokens) |
| 2 | **CC100-Sindhi** | Web crawl (Common Crawl) | [Metatext](https://metatext.io/datasets/cc100-sindhi) |
| 3 | **Daily Kawish Articles** | Newspaper articles | [Kaggle](https://www.kaggle.com/datasets/owaisraza009/sindhi-articles-dataset-from-daily-kawish) |
| 4 | **Sindhi News — Awami Awaz** | News articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-news-corpus-awami-awaz) |
| 5 | **Sindhi News — Sindh Express** | News articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-news-corpus-sindh-express) |
| 6 | **Encyclopedia Sindhiana** | Encyclopedic articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/encyclopedia-sindhiana-text-corpus) |
| 7 | **Sindhi Legal Dataset** | Legal documents | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-legal-dataset) |
| 8 | **Sindhi Religious Data** | Religious texts | [Kaggle](https://www.kaggle.com/datasets/nairsaanvi/sindhi-religious-data) |
| 9 | **Sindhi Language Corpus** | Mixed | [Kaggle](https://www.kaggle.com/datasets/majidshah123/sindhi-language-corpus) |
| 10 | **Sindhi Stopwords** | Linguistic resource | [Kaggle](https://www.kaggle.com/datasets/owaisraza009/sindhi-stopwords) |
| 11 | **Zenodo Sindhi Dataset** | Mixed | [Zenodo](https://zenodo.org/records/16976593) |

## Data Collection & Processing Pipeline

The corpus was assembled through a multi-stage pipeline:

### Stage 1: Collection

Raw text was extracted from all 11 sources. CSV files were parsed with automatic column detection (looking for `text`, `content`, `article` columns). TXT files were split on paragraph boundaries. All text was read as UTF-8.

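
The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the curator's actual code: the helper names (`detect_text_column`, `read_csv_texts`, `split_paragraphs`), the priority order of the candidate columns, the case-insensitive header match, and the reading of "paragraph boundaries" as blank lines are all assumptions. The card lists `pandas` for CSV handling; the sketch uses the standard-library `csv` module only to stay dependency-free.

```python
import csv
import io
import re

# Column names the card says were looked for; the priority order is an assumption.
CANDIDATE_COLUMNS = ("text", "content", "article")


def detect_text_column(fieldnames):
    """Return the first header matching a candidate name (case-insensitive), or None."""
    lowered = {name.strip().lower(): name for name in fieldnames if name}
    for candidate in CANDIDATE_COLUMNS:
        if candidate in lowered:
            return lowered[candidate]
    return None


def read_csv_texts(handle):
    """Yield non-empty text cells from an open CSV file object."""
    reader = csv.DictReader(handle)
    column = detect_text_column(reader.fieldnames or [])
    if column is None:
        return  # no recognizable text column: yield nothing
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            yield value


def split_paragraphs(raw_text):
    """Split plain text on blank lines (one possible reading of 'paragraph boundaries')."""
    return [p.strip() for p in re.split(r"\n\s*\n", raw_text) if p.strip()]


# Tiny in-memory examples standing in for real source files.
sample_csv = io.StringIO("id,Content\n1,سنڌي ٻولي\n2,\n")
csv_texts = list(read_csv_texts(sample_csv))
paragraphs = split_paragraphs("پهريون پيراگراف\n\nٻيو پيراگراف\n")
print(csv_texts, paragraphs)
```

For files on disk, opening them with `open(path, encoding="utf-8", newline="")` would match the card's statement that all text was read as UTF-8.
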
### Stage 2: Normalization

- **Unicode NFC normalization** to canonicalize Arabic-script characters
- **Encoding repair** via `ftfy` (fixes mojibake and broken encodings)
- **Character variant unification**: Arabic Yeh → Farsi Yeh, Arabic Kaf → Keheh, Arabic Heh → Heh Goal (standard Sindhi forms)
- **URL and email removal**
- **HTML tag stripping**
- **Whitespace normalization**

### Stage 3: Quality Filtering

Documents were filtered based on:

- **Minimum length**: 50 characters (removes fragments)
- **Maximum length**: 100,000 characters (removes data dumps)
- **Sindhi script ratio**: at least 30% of non-space characters must be Arabic-script (removes code-mixed noise and English-only documents)
- **Repetition check**: documents in which fewer than 10% of characters are unique were rejected (catches OCR errors and repetitive noise)

### Stage 4: Deduplication

Deduplication ran in two stages to remove exact and near-identical documents:

1. **Exact dedup**: MD5 hash of the full document text
2. **Near-duplicate detection**: MinHash LSH with 128 hash functions, 5-word shingles, and a Jaccard threshold of 0.85

### Stage 5: Final Assembly

- Documents shuffled randomly
- Exported as Parquet with a single `text` column
- Pushed to the Hugging Face Hub

### Processing Tools

- `pandas` for CSV/data handling
- `ftfy` for encoding repair
- `datasketch` (MinHash LSH) for near-duplicate detection
- `unicodedata` for Unicode normalization
- Custom `SindhiTextProcessor` class with Sindhi-specific regex patterns

## Dataset Composition

The corpus covers diverse domains to ensure broad linguistic coverage:

| Domain | Sources | Coverage |
|--------|---------|----------|
| **News** | Daily Kawish, Awami Awaz, Sindh Express | Current affairs, politics, sports, editorials |
| **Encyclopedia** | Encyclopedia Sindhiana | History, culture, geography, biography |
| **Web Crawl** | CC100-Sindhi | Diverse web text, blogs, forums |
| **Legal** | Sindhi Legal Dataset | Legal documents, court proceedings |
| **Religious** | Sindhi Religious Data | Religious texts and commentary |
| **Mixed** | AMBILE Mega Corpus, Sindhi Language Corpus, Zenodo | Literature, academic, general |

## Intended Use

This corpus is designed for:

- **Pretraining** Sindhi language models (GPT, BERT, etc.)
- **Tokenizer training** (BPE, Unigram, SentencePiece)
- **NLP tool development** (stemmer evaluation, stopword extraction, morphological analysis)
- **Linguistic research** on Sindhi text

### Models Trained on This Corpus

| Model | Type | Params | Link |
|-------|------|--------|------|
| SindhiLM | GPT-2 from scratch | 37.8M | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM) |
| SindhiLM-Qwen-0.5B | Qwen2.5-0.5B fine-tune | 0.5B | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Qwen-0.5B) |
| SindhiLM-Qwen-0.5B-v2 | Qwen2.5-0.5B SFT | 0.5B | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Qwen-0.5B-v2) |
| SindhiLM-Tokenizer-v2 | Morpheme-aware BPE | N/A | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Tokenizer-v2) |

### Associated Tools

| Tool | Description | Link |
|------|-------------|------|
| SindhiNLTK | Morphology-aware NLP toolkit for Sindhi | [PyPI](https://pypi.org/project/sindhinltk/) · [GitHub](https://github.com/AakashKumarMissrani/SindhiNLTK) |

## Tokenized Version

A pre-tokenized version of this corpus is also available:

- **[sindhi-tokenized-505m](https://huggingface.co/datasets/aakashMeghwar01/sindhi-tokenized-505m)**: the same corpus, tokenized with SindhiLM-Tokenizer-v1

## Limitations & Biases

- **News-heavy**: Newspaper sources (Kawish, Awami Awaz, Sindh Express) form a significant portion, which may bias the corpus toward formal journalistic Sindhi.
- **Script variants**: Despite normalization, some character variant inconsistencies may remain in web-crawled text.
- **Temporal bias**: News articles are concentrated around their publication dates; no uniform temporal sampling was applied.
- **Deduplication residual**: Near-duplicate detection at threshold 0.85 may leave some paraphrased duplicates.

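
To make the deduplication residual concrete, the sketch below computes the exact Jaccard similarity over 5-word shingles, the quantity that MinHash approximates. It is an illustration under stated assumptions rather than the pipeline's code: `word_shingles` and `jaccard` are hypothetical helpers, whitespace tokenization is assumed, and the placeholder "words" `w0 … w99` stand in for real Sindhi text.

```python
def word_shingles(text, k=5):
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    words = text.split()
    if len(words) < k:
        # Short texts become a single shingle (or an empty set if blank).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# A 100-word placeholder document.
words = [f"w{i}" for i in range(100)]

# Variant 1: a single word changed in the middle of the document.
one_edit = list(words)
one_edit[50] = "x"

# Variant 2: a light "paraphrase" that swaps every fifth word.
paraphrase = [f"x{i}" if i % 5 == 4 else w for i, w in enumerate(words)]

base_set = word_shingles(" ".join(words))
sim_one = jaccard(base_set, word_shingles(" ".join(one_edit)))
sim_para = jaccard(base_set, word_shingles(" ".join(paraphrase)))

print(f"one-word edit: {sim_one:.3f}")   # 91/101, about 0.90: above 0.85, treated as a duplicate
print(f"paraphrase:    {sim_para:.3f}")  # 0.0: every 5-word shingle contains a changed word
```

A one-word edit disturbs only the five shingles that contain it, so the pair stays above the 0.85 threshold. Swapping one word in five leaves no shingle intact, which is why paraphrased duplicates can pass through shingle-based detection.
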
## Citation

```bibtex
@dataset{meghwar2026sindhi505m,
  author    = {Aakash Meghwar},
  title     = {Sindhi Corpus 505M: A Deduplicated Pretraining Dataset for Sindhi},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/aakashMeghwar01/sindhi-corpus-505m}
}
```

## Contact

**Aakash Meghwar**, Computational Linguist

- HuggingFace: [aakashMeghwar01](https://huggingface.co/aakashMeghwar01)
- GitHub: [AakashKumarMissrani](https://github.com/AakashKumarMissrani)
