ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens

Name: ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
Creator: ambile-official
Published: 2025-11-10 09:49:19
License: lgpl-3.0

Source: Hugging Face. Updated 2025-11-10; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens

Resource description:

---
license: lgpl-3.0
task_categories:
- token-classification
language:
- sd
tags:
- data
- sindhi
- nlp_data
size_categories:
- 100M<n<1B
---

# 🧠 Sindhi Language Mega Corpus – 118M Tokens (JSON + Tokenizer Model)

### 🏛️ Sources:
- Sindh Salamat
- Sindhi Wikipedia
- Altaf Shaikh Literary Works
- Sindhi Language Authority Publications
- Hamsari Akhbar
- Pahenji Akhbar
- Sindhi General Data

### 📤 Compiled & Uploaded by:
**Abdul Majid Bhurgiri**, *Institute of Language Engineering*

### 🌐 Official Uploading Credit:
[ambile.pk](https://ambile.pk/)

### 📅 Release Year:
2025

### 🌍 Language:
Sindhi (سنڌي)

### 📦 Dataset Size:
~118 Million Tokens

### 🧠 Format:
Structured JSON Files + Tokenizer Model

### 📁 Download Link:
[📥 Google Drive – Sindhi Mega Corpus (118M Tokens)](https://drive.google.com/file/d/1v6g-GJr09BKvPcGgbRip3cOERmbN_4X1/view?usp=sharing)

---

## 📖 Overview

The **Sindhi Language Mega Corpus (118M Tokens)** is the largest open-source Sindhi dataset ever released for **Natural Language Processing (NLP)** and **AI model development**.

It compiles diverse and high-quality Sindhi text from **seven major sources**, ranging from classical literature and journalism to education, history, and religion. All text is fully cleaned, normalized, and structured into JSON format.

Each file represents a unique domain of Sindhi knowledge, with metadata (source, category, token count) included for flexibility and reproducibility.

A **custom-trained Sindhi tokenizer model** is also included, enabling consistent tokenization of Sindhi text for modern transformer-based LLMs.

---

## 📂 Dataset Structure

After extracting the ZIP file, you will find the folders and files described below.

### Description of Folders and Files

| Folder/File | Description |
|--------------|-------------|
| **`data/`** | Contains all Sindhi text data as structured JSON files. Each file corresponds to a specific source or category. |
| **`tokenizer/`** | Contains the trained Sindhi tokenizer model, configuration, and vocabulary files. |
| **`README.md`** | This documentation explaining dataset details, structure, preprocessing, and usage. |

---

## 📚 Dataset Categories & Descriptions

| Source | Description |
|---------|--------------|
| **Sindh Salamat** | Literary and cultural text, including poetry, Sufism, education, and philosophy from the Sindh Salamat website. |
| **Sindhi Wikipedia** | Encyclopedic and factual content covering science, technology, history, and world knowledge. |
| **Altaf Shaikh Literature** | Works of Altaf Shaikh focusing on Sindhi literature, travelogues, philosophy, and political essays. |
| **Sindhi Language Authority** | Educational and linguistic resources promoting Sindhi language learning and preservation. |
| **Hamsari Akhbar** | Columns, articles, and opinion pieces from the Hamsari newspaper. |
| **Pahenji Akhbar** | Sindhi newspaper data including daily news and editorials. |
| **General Sindhi Data** | Miscellaneous writings, essays, short stories, and social commentary from various sources. |

---

## 🧩 JSON File Format

Each JSON file contains an array of structured Sindhi text entries:

```json
[
  {
    "id": "salamat_00123",
    "source": "Sindh Salamat",
    "category": "Literature",
    "title": "ادب ۽ فڪر جو تجزيو",
    "text": "سنڌي ادب ۾ فڪر ۽ تخليق جو سفر هڪ ڊگهو ۽ گهرو عمل رهيو آهي...",
    "tokens": 243
  },
  {
    "id": "salamat_00124",
    "source": "Sindh Salamat",
    "category": "Poetry",
    "title": "شاعريءَ ۾ احساس جو اظهار",
    "text": "شاعريءَ ۾ احساس جو اظهار روحاني ۽ جمالياتي ٻنهي سطحن تي ٿيندو آهي...",
    "tokens": 132
  }
]
```

The included tokenizer can be loaded with `transformers`:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the extracted tokenizer/ folder
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")

text = "سنڌي ٻوليءَ جو ادبي سرمايو انتهائي وسيع آهي."
tokens = tokenizer.tokenize(text)  # split the sentence into tokens
print(tokens)
```
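
To illustrate how the per-entry metadata in the JSON format described above might be used, here is a minimal sketch that sums the `tokens` field per `source` for one corpus file. It assumes every entry carries the `source` and `tokens` keys shown in the sample. The helper name `tokens_per_source` and the file name `sample.json` are hypothetical and not part of the dataset; the demo writes the two sample entries to a temporary file instead of reading the real `data/` folder.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tokens_per_source(json_path):
    """Sum the per-entry "tokens" field for each "source" in one corpus file."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)  # each file is a JSON array of entry objects
    totals = Counter()
    for entry in entries:
        totals[entry["source"]] += entry["tokens"]
    return dict(totals)


# The two sample entries from the README, used here as stand-in data.
sample = [
    {
        "id": "salamat_00123",
        "source": "Sindh Salamat",
        "category": "Literature",
        "title": "ادب ۽ فڪر جو تجزيو",
        "text": "سنڌي ادب ۾ فڪر ۽ تخليق جو سفر هڪ ڊگهو ۽ گهرو عمل رهيو آهي...",
        "tokens": 243,
    },
    {
        "id": "salamat_00124",
        "source": "Sindh Salamat",
        "category": "Poetry",
        "title": "شاعريءَ ۾ احساس جو اظهار",
        "text": "شاعريءَ ۾ احساس جو اظهار روحاني ۽ جمالياتي ٻنهي سطحن تي ٿيندو آهي...",
        "tokens": 132,
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"  # hypothetical file name
    # ensure_ascii=False keeps the Sindhi text readable in the file
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    totals = tokens_per_source(path)

print(totals)  # {'Sindh Salamat': 375}
```

Running the same function over each file in the extracted `data/` folder and adding up the results would give a rough check against the stated corpus size of about 118 million tokens, provided the real files follow the sample layout.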

Provided by:

ambile-official
