ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
Source: Hugging Face · Updated: 2025-11-10 · Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
Resource description:
---
license: lgpl-3.0
task_categories:
- token-classification
language:
- sd
tags:
- data
- sindhi
- nlp_data
size_categories:
- 100M<n<1B
---
# 🧠 Sindhi Language Mega Corpus – 118M Tokens (JSON + Tokenizer Model)
### 🏛️ Sources:
- Sindh Salamat
- Sindhi Wikipedia
- Altaf Shaikh Literary Works
- Sindhi Language Authority Publications
- Hamsari Akhbar
- Pahenji Akhbar
- Sindhi General Data
### 📤 Compiled & Uploaded by: **Abdul Majid Bhurgiri**, *Institute of Language Engineering*
### 🌐 Official Uploading Credit: [ambile.pk](https://ambile.pk/)
### 📅 Release Year: 2025
### 🌍 Language: Sindhi (سنڌي)
### 📦 Dataset Size: ~118 Million Tokens
### 🧠 Format: Structured JSON Files + Tokenizer Model
### 📁 Download Link: [📥 Google Drive – Sindhi Mega Corpus (118M Tokens)](https://drive.google.com/file/d/1v6g-GJr09BKvPcGgbRip3cOERmbN_4X1/view?usp=sharing)
---
## 📖 Overview
The **Sindhi Language Mega Corpus (118M Tokens)** is presented by its compilers as the largest open-source Sindhi dataset released to date for **Natural Language Processing (NLP)** and **AI model development**.
It compiles Sindhi text from **seven major sources**, spanning classical literature, journalism, education, history, and religion. The text has been cleaned, normalized, and structured into JSON format.
Each file represents a unique domain of Sindhi knowledge, with metadata (source, category, token count) included for flexibility and reproducibility.
A **custom-trained Sindhi tokenizer model** is also included, enabling consistent tokenization of Sindhi text for modern transformer-based LLMs.
---
## 📂 Dataset Structure
After extracting the ZIP file, you will find the top-level folders and files described below.
### Description of Folders and Files
| Folder/File | Description |
|--------------|-------------|
| **`data/`** | Contains all Sindhi text data as structured JSON files. Each file corresponds to a specific source or category. |
| **`tokenizer/`** | Contains the trained Sindhi tokenizer model, configuration, and vocabulary files. |
| **`README.md`** | This documentation explaining dataset details, structure, preprocessing, and usage. |
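
As a quick sanity check after extraction, the files under `data/` can be enumerated and their entries counted. The sketch below is illustrative only: it assumes the JSON files sit directly inside `data/` with a `.json` extension and that each one holds a single array of entries carrying a numeric `tokens` field, as in the sample shown in the JSON File Format section. The helper name `summarize_corpus` is hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def summarize_corpus(data_dir):
    """Return {file name: (entry count, sum of 'tokens' fields)} per JSON file."""
    summary = {}
    # Assumption: JSON files live directly under data_dir, not in subfolders.
    for path in sorted(Path(data_dir).glob("*.json")):
        # Assumption: each file is one JSON array of entry objects.
        with path.open(encoding="utf-8") as f:
            entries = json.load(f)
        # Entries without a 'tokens' field contribute 0 to the total.
        total = sum(int(entry.get("tokens", 0)) for entry in entries)
        summary[path.name] = (len(entries), total)
    return summary


# Example usage, run from the extracted archive root:
# for name, (n_entries, n_tokens) in summarize_corpus("data").items():
#     print(f"{name}: {n_entries} entries, {n_tokens} tokens")
```

This loads each file fully into memory, which is the simplest approach; for very large files a streaming JSON parser may be a better fit.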
---
## 📚 Dataset Categories & Descriptions
| Source | Description |
|---------|--------------|
| **Sindh Salamat** | Literary and cultural text, including poetry, Sufism, education, and philosophy from the Sindh Salamat website. |
| **Sindhi Wikipedia** | Encyclopedic and factual content covering science, technology, history, and world knowledge. |
| **Altaf Shaikh Literature** | Works of Altaf Shaikh focusing on Sindhi literature, travelogues, philosophy, and political essays. |
| **Sindhi Language Authority** | Educational and linguistic resources promoting Sindhi language learning and preservation. |
| **Hamsari Akhbar** | Columns, articles, and opinion pieces from the Hamsari newspaper. |
| **Pahenji Akhbar** | Sindhi newspaper data including daily news and editorials. |
| **General Sindhi Data** | Miscellaneous writings, essays, short stories, and social commentary from various sources. |
---
## 🧩 JSON File Format
Each JSON file contains an array of structured Sindhi text entries:
```json
[
  {
    "id": "salamat_00123",
    "source": "Sindh Salamat",
    "category": "Literature",
    "title": "ادب ۽ فڪر جو تجزيو",
    "text": "سنڌي ادب ۾ فڪر ۽ تخليق جو سفر هڪ ڊگهو ۽ گهرو عمل رهيو آهي...",
    "tokens": 243
  },
  {
    "id": "salamat_00124",
    "source": "Sindh Salamat",
    "category": "Poetry",
    "title": "شاعريءَ ۾ احساس جو اظهار",
    "text": "شاعريءَ ۾ احساس جو اظهار روحاني ۽ جمالياتي ٻنهي سطحن تي ٿيندو آهي...",
    "tokens": 132
  }
]
```
The bundled tokenizer can be loaded with the Hugging Face `transformers` library. The path below assumes the script runs from the root of the extracted archive:
```python
from transformers import AutoTokenizer

# Load the tokenizer files from the extracted `tokenizer/` folder.
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")

text = "سنڌي ٻوليءَ جو ادبي سرمايو انتهائي وسيع آهي."

# Split the sentence into tokens and print them.
tokens = tokenizer.tokenize(text)
print(tokens)
```
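
Before feeding entries into a training pipeline, it can help to confirm that each one has the expected shape. The following is a minimal sketch, assuming every entry carries the six fields seen in the sample above (`id`, `source`, `category`, `title`, `text`, `tokens`) with the same types. That field list is inferred from the two sample entries rather than from a published schema, and the names `REQUIRED_FIELDS` and `entry_problems` are hypothetical.

```python
# Field names and types inferred from the sample entries (an assumption,
# not an official schema for the dataset).
REQUIRED_FIELDS = {
    "id": str,
    "source": str,
    "category": str,
    "title": str,
    "text": str,
    "tokens": int,
}


def entry_problems(entry):
    """Return a list of problems found in one entry; empty means it looks fine."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```

Calling `entry_problems(entry)` on each item of a loaded file and collecting the non-empty results gives a short report of malformed entries without stopping at the first one.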
Provider:
ambile-official



