ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens

Source: Hugging Face. Updated 2025-11-10. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/ambile-official/Sindhi_Mega_Corpus_118_Million_Tokens
Description:
---
license: lgpl-3.0
task_categories:
- token-classification
language:
- sd
tags:
- data
- sindhi
- nlp_data
size_categories:
- 100M<n<1B
---

# 🧠 Sindhi Language Mega Corpus – 118M Tokens (JSON + Tokenizer Model)

### 🏛️ Sources:
- Sindh Salamat
- Sindhi Wikipedia
- Altaf Shaikh Literary Works
- Sindhi Language Authority Publications
- Hamsari Akhbar
- Pahenji Akhbar
- Sindhi General Data

### 📤 Compiled & Uploaded by: **Abdul Majid Bhurgiri**, *Institute of Language Engineering*
### 🌐 Official Uploading Credit: [ambile.pk](https://ambile.pk/)
### 📅 Release Year: 2025
### 🌍 Language: Sindhi (سنڌي)
### 📦 Dataset Size: ~118 Million Tokens
### 🧠 Format: Structured JSON Files + Tokenizer Model
### 📁 Download Link: [📥 Google Drive – Sindhi Mega Corpus (118M Tokens)](https://drive.google.com/file/d/1v6g-GJr09BKvPcGgbRip3cOERmbN_4X1/view?usp=sharing)

---

## 📖 Overview

The **Sindhi Language Mega Corpus (118M Tokens)** is presented by its compilers as the largest open-source Sindhi dataset released to date for **Natural Language Processing (NLP)** and **AI model development**.
It compiles Sindhi text from **seven major sources**, ranging from classical literature and journalism to education, history, and religion. The text is cleaned, normalized, and structured into JSON format.

Each file represents one domain of Sindhi knowledge and includes metadata (source, category, token count) for flexibility and reproducibility.
A **custom-trained Sindhi tokenizer model** is also included, enabling consistent tokenization of Sindhi text for transformer-based LLMs.

---

## 📂 Dataset Structure

After extracting the ZIP file, you will find the following folders and files.

### Description of Folders and Files

| Folder/File | Description |
|--------------|-------------|
| **`data/`** | Contains all Sindhi text data as structured JSON files. Each file corresponds to a specific source or category. |
| **`tokenizer/`** | Contains the trained Sindhi tokenizer model, configuration, and vocabulary files. |
| **`README.md`** | This documentation, covering dataset details, structure, preprocessing, and usage. |

---

## 📚 Dataset Categories & Descriptions

| Source | Description |
|---------|--------------|
| **Sindh Salamat** | Literary and cultural text, including poetry, Sufism, education, and philosophy from the Sindh Salamat website. |
| **Sindhi Wikipedia** | Encyclopedic and factual content covering science, technology, history, and world knowledge. |
| **Altaf Shaikh Literature** | Works of Altaf Shaikh focusing on Sindhi literature, travelogues, philosophy, and political essays. |
| **Sindhi Language Authority** | Educational and linguistic resources promoting Sindhi language learning and preservation. |
| **Hamsari Akhbar** | Columns, articles, and opinion pieces from the Hamsari newspaper. |
| **Pahenji Akhbar** | Sindhi newspaper data including daily news and editorials. |
| **General Sindhi Data** | Miscellaneous writings, essays, short stories, and social commentary from various sources. |

---

## 🧩 JSON File Format

Each JSON file contains an array of structured Sindhi text entries:

```json
[
  {
    "id": "salamat_00123",
    "source": "Sindh Salamat",
    "category": "Literature",
    "title": "ادب ۽ فڪر جو تجزيو",
    "text": "سنڌي ادب ۾ فڪر ۽ تخليق جو سفر هڪ ڊگهو ۽ گهرو عمل رهيو آهي...",
    "tokens": 243
  },
  {
    "id": "salamat_00124",
    "source": "Sindh Salamat",
    "category": "Poetry",
    "title": "شاعريءَ ۾ احساس جو اظهار",
    "text": "شاعريءَ ۾ احساس جو اظهار روحاني ۽ جمالياتي ٻنهي سطحن تي ٿيندو آهي...",
    "tokens": 132
  }
]
```

The bundled tokenizer can be loaded from the extracted `tokenizer/` folder with `transformers`:

```python
from transformers import AutoTokenizer

# Load the tokenizer files shipped in the extracted `tokenizer/` folder.
tokenizer = AutoTokenizer.from_pretrained("tokenizer/")

text = "سنڌي ٻوليءَ جو ادبي سرمايو انتهائي وسيع آهي."
tokens = tokenizer.tokenize(text)  # split the sentence into tokens
print(tokens)
```
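Because every entry carries `source` and `tokens` metadata, per-source token totals can be computed directly from the files. The following is a minimal sketch, not part of the original dataset documentation. It assumes the files under `data/` end in `.json` and that each one holds a JSON array shaped like the sample entries above. The helper names (`iter_records`, `tokens_per_source`, `demo`) and the file name `sample.json` are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Field names taken from the sample entries in the README.
REQUIRED_FIELDS = ("id", "source", "category", "title", "text", "tokens")


def iter_records(data_dir):
    """Yield every entry from every *.json file in data_dir.

    Assumes each file contains a single JSON array of entries.
    """
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            for record in json.load(handle):
                yield record


def tokens_per_source(records):
    """Sum the `tokens` field per `source`, skipping incomplete entries."""
    totals = Counter()
    for record in records:
        if all(field in record for field in REQUIRED_FIELDS):
            totals[record["source"]] += int(record["tokens"])
    return dict(totals)


def demo():
    """Run the helpers on a throwaway file built from the two sample entries."""
    sample = [
        {"id": "salamat_00123", "source": "Sindh Salamat",
         "category": "Literature", "title": "ادب ۽ فڪر جو تجزيو",
         "text": "سنڌي ادب ۾ فڪر ۽ تخليق جو سفر...", "tokens": 243},
        {"id": "salamat_00124", "source": "Sindh Salamat",
         "category": "Poetry", "title": "شاعريءَ ۾ احساس جو اظهار",
         "text": "شاعريءَ ۾ احساس جو اظهار...", "tokens": 132},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        # "sample.json" is a made-up file name used only for this demo.
        sample_path = Path(tmp) / "sample.json"
        sample_path.write_text(
            json.dumps(sample, ensure_ascii=False), encoding="utf-8"
        )
        return tokens_per_source(iter_records(tmp))


if __name__ == "__main__":
    print(demo())
```

On the real corpus, `tokens_per_source(iter_records("data/"))` would give one total per source. `json.load` reads each file fully into memory, so a streaming parser may be preferable if individual files turn out to be very large.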
Provided by:
ambile-official