MWirelabs/assamese-monolingual-corpus

Name: MWirelabs/assamese-monolingual-corpus
Creator: MWirelabs
Published: 2025-11-13 20:43:26
License: CC BY-SA 4.0 (as declared in the dataset card)

Source: Hugging Face. Updated 2025-11-13; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/MWirelabs/assamese-monolingual-corpus


Description:

---
language:
- as
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1M+
task_categories:
- text-generation
- text-classification
- fill-mask
task_ids:
- language-modeling
pretty_name: Assamese Monolingual Corpus
tags:
- assamese
- northeast-india
- bengali-script
- low-resource
- monolingual
---

# Assamese Monolingual Corpus (2025)

![Type](https://img.shields.io/badge/Type-Monolingual%20Corpus-orange) ![Language](https://img.shields.io/badge/Language-Assamese-blue) ![License](https://img.shields.io/badge/License-CC%20BY--SA%204.0-green) ![Maintained By](https://img.shields.io/badge/Maintained%20By-MWire%20Labs-purple)

A high-quality, sentence-level Assamese monolingual dataset containing **1.61 million** cleaned, segmented, and deduplicated sentences in Bengali script. This corpus supports Assamese NLP development, language modeling, and public deployment for Northeast India.

---

## Dataset Summary

- **Language**: Assamese (Bengali script)
- **Size**: 1,613,879 sentences
- **Format**: Plain text CSV (`text` column)
- **Total tokens**: 77,427,585 (using IndicBERTv2 tokenizer)
- **License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) (attribution required)
- **Sources**:
  - IITB-IndicMonoDoc
  - Samanantar (Assamese side)
  - Assamese poetry and civic texts

---

## 📊 Token Statistics

- **Total tokens**: 77,427,585
- **Average sentence length**: 36.70 tokens
- **Median sentence length**: 25.00 tokens
- **Minimum sentence length**: 10 tokens
- **Maximum sentence length**: 26,987 tokens

*Tokenization performed using [ai4bharat/IndicBERTv2-MLM-only](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only).*

![sentence_length_distribution](https://cdn-uploads.huggingface.co/production/uploads/68b657f79486ab01fda55107/yTZWrtzVGFe5JOEuLdsPA.png)

---

## Quickstart: Load with 🤗 Datasets

```python
from datasets import load_dataset

# Download (or reuse the cached copy of) the train split.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train")

# Each row has a single `text` column holding one sentence.
print(ds[0]["text"])
```
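The length statistics reported in the card can be recomputed for any sample of the corpus. The sketch below is only an illustration, not the maintainers' script: `length_stats` is a hypothetical helper, and it defaults to whitespace splitting so that it runs offline. To approximate the published figures, one would presumably pass the `tokenize` method of the IndicBERTv2 tokenizer instead; that is an assumption about how the numbers were produced.

```python
from statistics import mean, median
from typing import Callable, Iterable, List


def length_stats(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # assumption: whitespace tokens
) -> dict:
    """Return total, mean, median, min, and max token counts for the sentences."""
    lengths = [len(tokenize(s)) for s in sentences]
    if not lengths:
        raise ValueError("no sentences given")
    return {
        "total": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Toy input with placeholder tokens; real use would pass ds["text"] instead.
stats = length_stats(["a b c", "a b c d e", "a"])
print(stats)
```

Taking the tokenizer as a parameter keeps the helper independent of any particular model, so the same function works with whitespace tokens, subword tokens, or characters.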
---

## Cleaning Pipeline

This corpus was processed using a modular, script-aware pipeline:

1. **Unicode normalization**
2. **Sentence segmentation** using Assamese punctuation (`।`, `!`, `?`, `|`)
3. **Length filtering** (≥10 tokens)
4. **Boilerplate removal** (e.g. news intros, repetitive closings)
5. **Script filtering** (removal of non-Bengali-script lines)
6. **Whitespace normalization**
7. **Deduplication**

---

## Filtering Constraints

- **Minimum length**: Sentences with fewer than 10 tokens were removed.
- **Maximum length**: No hard limit was applied, but extremely long lines (>200 tokens) were rare and retained for diversity.
- **Proportion removed**: Length filtering removed approximately 886,146 lines from the raw segmented corpus (~29.5%).

This ensures the corpus favors complete, meaningful sentences while preserving linguistic diversity.

---

## Quality Checks & Validation

- **Deduplication**: Applied exact match filtering after whitespace normalization. Removed ~496,008 duplicate lines.
- **Script validation**: Filtered lines with <50% Bengali-script characters to remove non-Assamese content.
- **Manual sampling**: Random samples were manually inspected to confirm removal of boilerplate, non-script lines, and punctuation clutter.
- **Final inspection**: Token statistics and sentence samples were reviewed to ensure linguistic and structural integrity.
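The cleaning steps and quality checks described above can be sketched roughly as follows. This is a minimal approximation under stated assumptions, not the maintainers' pipeline: `clean` and `bengali_ratio` are hypothetical names, NFC is assumed as the normalization form, tokens are counted by whitespace rather than with the IndicBERTv2 tokenizer, the script ratio is measured over non-whitespace characters in the Unicode Bengali block (U+0980 to U+09FF), and boilerplate removal is omitted because the card does not publish its patterns.

```python
import re
import unicodedata
from typing import Iterable, Iterator

# Sentence-ending marks listed in the card: danda, '!', '?', and '|'.
_SENTENCE_END = re.compile(r"(?<=[।!?|])\s*")
_WHITESPACE = re.compile(r"\s+")


def bengali_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Bengali block (U+0980-U+09FF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0980" <= c <= "\u09ff" for c in chars) / len(chars)


def clean(
    lines: Iterable[str],
    min_tokens: int = 10,           # card: sentences under 10 tokens removed
    min_script_ratio: float = 0.5,  # card: lines under 50% Bengali script removed
) -> Iterator[str]:
    """Yield normalized, segmented, filtered, and deduplicated sentences."""
    seen = set()
    for line in lines:
        line = unicodedata.normalize("NFC", line)       # Unicode normalization (NFC assumed)
        for sent in _SENTENCE_END.split(line):          # sentence segmentation
            sent = _WHITESPACE.sub(" ", sent).strip()   # whitespace normalization
            if len(sent.split()) < min_tokens:          # length filter (whitespace tokens)
                continue
            if bengali_ratio(sent) < min_script_ratio:  # script filter
                continue
            if sent in seen:                            # exact-match deduplication
                continue
            seen.add(sent)
            yield sent


# Toy run with a lowered threshold so the short example survives.
word = "\u0985\u09b8\u09ae"  # "অসম"
sentence = " ".join([word] * 3) + "\u0964"
print(list(clean([sentence + "  " + sentence, "plain english text here."], min_tokens=3)))
```

Whitespace is normalized before the duplicate check, which matches the card's note that exact-match deduplication was applied after whitespace normalization. Note that the danda (`।`, U+0964) sits in the Devanagari block, so it counts against the Bengali-script ratio in this sketch.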
---

## Intended Use

This dataset is designed for:

- Assamese language modeling and generation
- Research on low-resource, script-aware language processing

---

## Citation

If you use this dataset, please cite it as:

```bibtex
@misc{mwirelabs_assamese_2025,
  title        = {Assamese Monolingual Corpus (2025)},
  author       = {MWire Labs},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus}},
  note         = {Cleaned, segmented, and deduplicated Assamese sentences}
}
```

---

## About MWire Labs

MWire Labs builds ethical, region-first AI infrastructure for Northeast India, focusing on low-resource languages and public accessibility. Learn more at [www.mwirelabs.com](https://www.mwirelabs.com).

---

## Contributions & Feedback

We welcome feedback, contributions, and civic collaborations. Reach out via [Hugging Face](https://huggingface.co/MWirelabs).

Provider:

MWirelabs
