aakashMeghwar01/sindhi-corpus-505m
Source: Hugging Face. Updated 2026-03-17; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/aakashMeghwar01/sindhi-corpus-505m
---
language:
- sd
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- fill-mask
tags:
- sindhi
- low-resource
- arabic-script
- pretraining
- computational-linguistics
- nlp
- language-model
pretty_name: "Sindhi Corpus 505M — Pretraining Dataset"
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 742000
---
# Sindhi Corpus 505M
**The largest open-source, deduplicated Sindhi language pretraining corpus.**
~505 million tokens across 742K documents, covering news, literature, legal, religious, encyclopedic, and web-crawled Sindhi text. Built for training Sindhi language models, tokenizers, and NLP tools.
## Dataset Summary
| Stat | Value |
|------|-------|
| **Documents** | ~742,000 |
| **Tokens (estimated)** | ~505 million |
| **Language** | Sindhi (sd) — Arabic script |
| **Format** | Parquet (single `text` column) |
| **License** | Apache 2.0 |
| **Curator** | [Aakash Meghwar](https://huggingface.co/aakashMeghwar01) |
## How to Use
```python
from datasets import load_dataset
# Full dataset
ds = load_dataset("aakashMeghwar01/sindhi-corpus-505m", split="train")
print(ds[0]["text"][:200])
# Streaming (recommended for Colab/Kaggle)
ds = load_dataset("aakashMeghwar01/sindhi-corpus-505m", split="train", streaming=True)
for example in ds:
    print(example["text"][:200])
    break
```
## Source Datasets
This corpus was compiled from 11 publicly available Sindhi datasets spanning diverse genres:
| # | Source | Type | URL |
|---|--------|------|-----|
| 1 | **AMBILE Sindhi Mega Corpus** | Mixed (news, web, literature) | [Kaggle](https://www.kaggle.com/datasets/ambile/sindhi-mega-corpus-118-million-tokens) |
| 2 | **CC100-Sindhi** | Web crawl (Common Crawl) | [Metatext](https://metatext.io/datasets/cc100-sindhi) |
| 3 | **Daily Kawish Articles** | Newspaper articles | [Kaggle](https://www.kaggle.com/datasets/owaisraza009/sindhi-articles-dataset-from-daily-kawish) |
| 4 | **Sindhi News — Awami Awaz** | News articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-news-corpus-awami-awaz) |
| 5 | **Sindhi News — Sindh Express** | News articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-news-corpus-sindh-express) |
| 6 | **Encyclopedia Sindhiana** | Encyclopedic articles | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/encyclopedia-sindhiana-text-corpus) |
| 7 | **Sindhi Legal Dataset** | Legal documents | [Kaggle](https://www.kaggle.com/datasets/danishmahdi/sindhi-legal-dataset) |
| 8 | **Sindhi Religious Data** | Religious texts | [Kaggle](https://www.kaggle.com/datasets/nairsaanvi/sindhi-religious-data) |
| 9 | **Sindhi Language Corpus** | Mixed | [Kaggle](https://www.kaggle.com/datasets/majidshah123/sindhi-language-corpus) |
| 10 | **Sindhi Stopwords** | Linguistic resource | [Kaggle](https://www.kaggle.com/datasets/owaisraza009/sindhi-stopwords) |
| 11 | **Zenodo Sindhi Dataset** | Mixed | [Zenodo](https://zenodo.org/records/16976593) |
## Data Collection & Processing Pipeline
The corpus was assembled through a multi-stage pipeline:
### Stage 1: Collection
Raw text was extracted from all 11 sources. CSV files were parsed with automatic column detection (looking for `text`, `content`, `article` columns). TXT files were split on paragraph boundaries. All text was read as UTF-8.
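The collection step can be sketched roughly as follows. This is a minimal standard-library illustration, not the pipeline's actual code (which used `pandas`). The function names, the case-insensitive matching, the `text` > `content` > `article` priority order, and the blank-line paragraph split are all assumptions made for the example.

```python
import csv
import io
import re

# Candidate column names from the description above; the priority order is an assumption.
CANDIDATE_COLUMNS = ("text", "content", "article")


def detect_text_column(fieldnames):
    """Return the first header matching a candidate name (case-insensitive), or None."""
    lowered = {name.strip().lower(): name for name in fieldnames}
    for candidate in CANDIDATE_COLUMNS:
        if candidate in lowered:
            return lowered[candidate]
    return None


def read_csv_texts(csv_file):
    """Collect non-empty values from the detected text column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    column = detect_text_column(reader.fieldnames or [])
    if column is None:
        return []
    return [row[column] for row in reader if row.get(column)]


def split_paragraphs(raw_text):
    """Split plain text on blank lines (assumed meaning of 'paragraph boundaries')."""
    return [p.strip() for p in re.split(r"\n\s*\n", raw_text) if p.strip()]


# Usage with an in-memory CSV; real files would be opened with encoding="utf-8".
sample_csv = io.StringIO("id,Content\n1,hello\n2,\n3,world\n")
sample_texts = read_csv_texts(sample_csv)
```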
### Stage 2: Normalization
- **Unicode NFC normalization** to canonicalize Arabic-script characters
- **Encoding repair** via `ftfy` (fixes mojibake and broken encodings)
- **Character variant unification**: Arabic Yeh → Farsi Yeh, Arabic Kaf → Keheh, Arabic Heh → Heh Goal (standard Sindhi forms)
- **URL and email removal**
- **HTML tag stripping**
- **Whitespace normalization**
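A minimal sketch of these normalization steps, using only the standard library. The regular expressions, the step order, and the function name `normalize_sindhi` are illustrative assumptions, not the `SindhiTextProcessor` implementation. The `ftfy` encoding repair is left out so the example has no third-party dependency; in practice it would run first.

```python
import re
import unicodedata

# Character variant unification as listed above, written with explicit code points.
VARIANT_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH -> ARABIC LETTER HEH GOAL
})

# Simple patterns chosen for illustration; the real pipeline's patterns are not published here.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_sindhi(text):
    """Apply NFC, variant unification, tag/URL/email removal, and whitespace cleanup."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(VARIANT_MAP)
    text = TAG_RE.sub(" ", text)      # strip HTML tags before URLs so "</p>" is not glued to a link
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```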
### Stage 3: Quality Filtering
Documents were filtered based on:
- **Minimum length**: 50 characters (removes fragments)
- **Maximum length**: 100,000 characters (removes data dumps)
- **Sindhi script ratio**: At least 30% of non-space characters must be Arabic-script (removes code-mixed noise, English-only documents)
- **Repetition check**: Documents with fewer than 10% unique characters were rejected (catches OCR errors and repetitive noise)
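The length and script-ratio checks can be sketched as below. The set of Unicode blocks counted as "Arabic-script" and the function names are assumptions for illustration. The repetition check is omitted because the card does not say how the unique-character percentage is computed.

```python
MIN_CHARS = 50
MAX_CHARS = 100_000
MIN_ARABIC_RATIO = 0.30

# Unicode blocks treated as Arabic script in this sketch (an assumption; these
# blocks also contain Arabic punctuation and digits).
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_script(ch):
    """True if the character's code point falls in one of the ranges above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ARABIC_RANGES)


def arabic_ratio(text):
    """Share of non-space characters that are Arabic-script."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_arabic_script(ch) for ch in chars) / len(chars)


def passes_filters(text):
    """Apply the minimum/maximum length and script-ratio thresholds."""
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    return arabic_ratio(text) >= MIN_ARABIC_RATIO
```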
### Stage 4: Deduplication
Two-stage deduplication to remove identical and near-identical documents:
1. **Exact dedup**: MD5 hash of full document text
2. **Near-duplicate detection**: MinHash LSH with 128 hash functions, 5-word shingles, Jaccard threshold 0.85
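To make the two stages concrete, here is a small standard-library sketch. The exact stage hashes each document with MD5, as described. For the near-duplicate stage, the sketch computes the exact Jaccard similarity over 5-word shingles rather than MinHash LSH. MinHash (via `datasketch` in the actual pipeline) approximates this same quantity so that all-pairs comparison is not needed at corpus scale. Function names are hypothetical.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first occurrence of each document, keyed by the MD5 of its full text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text, k=5):
    """Set of k-word shingles; texts shorter than k words yield one shingle."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between the shingle sets of two texts."""
    set_a, set_b = shingles(a), shingles(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(a, b, threshold=0.85):
    """True if two texts meet the Jaccard threshold used in the pipeline."""
    return jaccard(a, b) >= threshold
```

With 5-word shingles the threshold is fairly strict for short texts: changing a single word in the middle of a 30-word document alters 5 of its 26 shingles, which gives a similarity of 21/31 (about 0.68) and falls below 0.85.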
### Stage 5: Final Assembly
- Documents shuffled randomly
- Exported as Parquet with a single `text` column
- Pushed to HuggingFace Hub
### Processing Tools
- `pandas` for CSV/data handling
- `ftfy` for encoding repair
- `datasketch` (MinHash LSH) for near-duplicate detection
- `unicodedata` for Unicode normalization
- Custom `SindhiTextProcessor` class with Sindhi-specific regex patterns
## Dataset Composition
The corpus covers diverse domains to ensure broad linguistic coverage:
| Domain | Sources | Coverage |
|--------|---------|----------|
| **News** | Daily Kawish, Awami Awaz, Sindh Express | Current affairs, politics, sports, editorials |
| **Encyclopedia** | Encyclopedia Sindhiana | History, culture, geography, biography |
| **Web Crawl** | CC100-Sindhi | Diverse web text, blogs, forums |
| **Legal** | Sindhi Legal Dataset | Legal documents, court proceedings |
| **Religious** | Sindhi Religious Data | Religious texts and commentary |
| **Mixed** | AMBILE Mega Corpus, Sindhi Language Corpus, Zenodo | Literature, academic, general |
## Intended Use
This corpus is designed for:
- **Pretraining** Sindhi language models (GPT, BERT, etc.)
- **Tokenizer training** (BPE, Unigram, SentencePiece)
- **NLP tool development** (stemmer evaluation, stopword extraction, morphological analysis)
- **Linguistic research** on Sindhi text
### Models Trained on This Corpus
| Model | Type | Params | Link |
|-------|------|--------|------|
| SindhiLM | GPT-2 from scratch | 37.8M | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM) |
| SindhiLM-Qwen-0.5B | Qwen2.5-0.5B fine-tune | 0.5B | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Qwen-0.5B) |
| SindhiLM-Qwen-0.5B-v2 | Qwen2.5-0.5B SFT | 0.5B | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Qwen-0.5B-v2) |
| SindhiLM-Tokenizer-v2 | Morpheme-aware BPE | — | [HF](https://huggingface.co/aakashMeghwar01/SindhiLM-Tokenizer-v2) |
### Associated Tools
| Tool | Description | Link |
|------|-------------|------|
| SindhiNLTK | Morphology-aware NLP toolkit for Sindhi | [PyPI](https://pypi.org/project/sindhinltk/) · [GitHub](https://github.com/AakashKumarMissrani/SindhiNLTK) |
## Tokenized Version
A pre-tokenized version of this corpus is also available:
- **[sindhi-tokenized-505m](https://huggingface.co/datasets/aakashMeghwar01/sindhi-tokenized-505m)** — Same corpus, tokenized with SindhiLM-Tokenizer-v1
## Limitations & Biases
- **News-heavy**: Newspaper sources (Kawish, Awami Awaz, Sindh Express) form a significant portion, which may bias toward formal journalistic Sindhi.
- **Script variants**: Despite normalization, some character variant inconsistencies may remain in web-crawled text.
- **Temporal bias**: News articles are concentrated around their publication dates; no uniform temporal sampling was applied.
- **Deduplication residual**: Near-duplicate detection at threshold 0.85 may leave some paraphrased duplicates.
## Citation
```bibtex
@dataset{meghwar2026sindhi505m,
author = {Aakash Meghwar},
title = {Sindhi Corpus 505M: A Deduplicated Pretraining Dataset for Sindhi},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/aakashMeghwar01/sindhi-corpus-505m}
}
```
## Contact
**Aakash Meghwar** — Computational Linguist
- HuggingFace: [aakashMeghwar01](https://huggingface.co/aakashMeghwar01)
- GitHub: [AakashKumarMissrani](https://github.com/AakashKumarMissrani)
Provider: aakashMeghwar01



