MWirelabs/assamese-monolingual-corpus
Source: Hugging Face · Updated 2025-11-13 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/MWirelabs/assamese-monolingual-corpus
Resource description:
---
language:
- as
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1M+
task_categories:
- text-generation
- text-classification
- fill-mask
task_ids:
- language-modeling
pretty_name: Assamese Monolingual Corpus
tags:
- assamese
- northeast-india
- bengali-script
- low-resource
- monolingual
---
# Assamese Monolingual Corpus (2025)




A high-quality, sentence-level Assamese monolingual dataset containing **1.61 million** cleaned, segmented, and deduplicated sentences in Bengali script. The corpus is intended to support Assamese NLP development, language modeling, and publicly deployed language technology for Northeast India.
---
## Dataset Summary
- **Language**: Assamese (Bengali script)
- **Size**: 1,613,879 sentences
- **Format**: Plain text CSV (`text` column)
- **Total tokens**: 77,427,585 (using IndicBERTv2 tokenizer)
- **License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) — attribution required
- **Sources**:
  - IITB-IndicMonoDoc
  - Samanantar (Assamese side)
  - Assamese poetry and civic texts
---
## 📊 Token Statistics
- **Total tokens**: 77,427,585
- **Average sentence length**: 36.70 tokens
- **Median sentence length**: 25.00 tokens
- **Minimum sentence length**: 10 tokens
- **Maximum sentence length**: 26,987 tokens

*Tokenization performed using [ai4bharat/IndicBERTv2-MLM-only](https://huggingface.co/ai4bharat/IndicBERTv2-MLM-only).*
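
Statistics of this kind can be recomputed from per-sentence token counts. The following is a minimal sketch, not the authors' script: `length_stats` is a hypothetical helper, and whitespace splitting stands in for the IndicBERTv2 tokenizer so the example runs without downloads. The card does not say whether its counts include special tokens, so results from the real tokenizer may differ from the figures above.

```python
import statistics
from typing import Callable, Iterable, List


def length_stats(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> dict:
    """Summarize per-sentence token counts for any tokenizer callable."""
    lengths = [len(tokenize(s)) for s in sentences]
    return {
        "total": sum(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Stand-in tokenizer (assumption): whitespace splitting via str.split.
# To approximate the card's setup, one could instead pass
# transformers.AutoTokenizer.from_pretrained("ai4bharat/IndicBERTv2-MLM-only").tokenize
sample = [
    "অসম ভাৰতৰ এখন ৰাজ্য ।",
    "মই ভাত খাওঁ ।",
    "তেওঁ আজি গুৱাহাটীলৈ গৈছে ।",
]
stats = length_stats(sample, str.split)
print(stats)
```

Passing the tokenizer as a callable keeps the summary logic independent of any particular tokenizer library.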

---
## Quickstart: Load with 🤗 Datasets
```python
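from datasets import load_dataset

# Load the "train" split; each row holds one sentence in the "text" column.
ds = load_dataset("MWirelabs/assamese-monolingual-corpus", split="train")
print(ds[0]["text"])  # print the first sentence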
```
---
## Cleaning Pipeline
This corpus was processed using a modular, script-aware pipeline:
1. **Unicode normalization**
2. **Sentence segmentation** using Assamese punctuation (`।`, `!`, `?`, `|`)
3. **Length filtering** (≥10 tokens)
4. **Boilerplate removal** (e.g. news intros, repetitive closings)
5. **Script filtering** (removal of non-Bengali-script lines)
6. **Whitespace normalization**
7. **Deduplication**
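
The steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the pipeline that produced the corpus: `clean_corpus` is a hypothetical name, NFC is an assumed normalization form (the card does not name one), token counts use whitespace splitting instead of the IndicBERTv2 tokenizer, and boilerplate removal and script filtering (steps 4 and 5) are left out because their rules are not published.

```python
import re
import unicodedata

# Split after the sentence-final marks listed in the card:
# danda, '!', '?', and ASCII '|'.
_SENTENCE_END = re.compile(r"(?<=[।!?|])\s*")


def clean_corpus(raw_lines, min_tokens=10):
    """Normalize, segment, length-filter, and deduplicate raw text lines."""
    seen = set()
    cleaned = []
    for line in raw_lines:
        # Step 1: Unicode normalization (NFC is an assumption).
        line = unicodedata.normalize("NFC", line)
        # Step 2: sentence segmentation on the listed punctuation.
        for sentence in _SENTENCE_END.split(line):
            # Step 3: length filter; whitespace tokens are a stand-in
            # for tokenizer-based counts.
            if len(sentence.split()) < min_tokens:
                continue
            # Step 6: whitespace normalization (collapse runs, trim ends).
            sentence = " ".join(sentence.split())
            # Step 7: exact-match deduplication, keeping first occurrences.
            if sentence in seen:
                continue
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned


demo = clean_corpus(
    ["মই ভাত খাওঁ।  তুমি   কি কৰা? ভাল!", "মই ভাত খাওঁ।"],
    min_tokens=3,  # lowered only so the short demo sentences survive
)
print(demo)
```

Deduplicating after whitespace normalization means sentences that differ only in spacing collapse to a single entry.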
---
## Filtering Constraints
- **Minimum length**: Sentences with fewer than 10 tokens were removed.
- **Maximum length**: No hard limit was applied, but extremely long lines (>200 tokens) were rare and retained for diversity.
- **Proportion removed**: Length filtering removed about 886,146 lines, roughly 29.5% of the raw segmented corpus.

These constraints favor complete, meaningful sentences while preserving linguistic diversity.
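
As a back-of-the-envelope check, the two figures quoted for length filtering imply a raw segmented corpus of roughly 3.0 million lines. The derived values below are estimates computed from the card's rounded percentage; they are not figures reported by the authors.

```python
# Figures quoted in the card.
removed_by_length = 886_146
removed_share = 0.295  # "~29.5%", so the result is only approximate

# Derived estimates (not stated in the card).
implied_raw = removed_by_length / removed_share
implied_after_length_filter = implied_raw - removed_by_length

print(f"implied raw segmented lines: ~{implied_raw:,.0f}")
print(f"implied lines after length filtering: ~{implied_after_length_filter:,.0f}")
```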
---
## Quality Checks & Validation
- **Deduplication**: Exact-match filtering, applied after whitespace normalization, removed ~496,008 duplicate lines.
- **Script validation**: Lines in which fewer than 50% of the characters are Bengali-script were removed to exclude non-Assamese content.
- **Manual sampling**: Random samples were manually inspected to confirm removal of boilerplate, non-script lines, and punctuation clutter.
- **Final inspection**: Token statistics and sentence samples were reviewed to ensure linguistic and structural integrity.
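
The script-validation threshold can be illustrated with a short sketch. It assumes that "Bengali-script characters" means code points in the Unicode Bengali block (U+0980 to U+09FF) and that the ratio is taken over non-whitespace characters; the card specifies neither, and `bengali_script_ratio` and `keep_line` are hypothetical names.

```python
def bengali_script_ratio(line: str) -> float:
    """Share of non-whitespace characters in the Unicode Bengali block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    # Note: the danda (U+0964) sits in the Devanagari block, so under this
    # assumption it counts as a non-Bengali character.
    in_block = sum(1 for c in chars if "\u0980" <= c <= "\u09FF")
    return in_block / len(chars)


def keep_line(line: str, threshold: float = 0.5) -> bool:
    """Keep lines whose Bengali-script share is at least the threshold."""
    return bengali_script_ratio(line) >= threshold


for text in ["অসম", "অসম abc", "অসম abcd", "hello"]:
    print(repr(text), round(bengali_script_ratio(text), 3), keep_line(text))
```

Because digits, Latin letters, and punctuation all count against the ratio here, a stricter or looser character set would shift which borderline lines survive.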
---
## Intended Use
This dataset is designed for:
- Assamese language modeling and generation
- Research on low-resource, script-aware language processing
---
## Citation
If you use this dataset, please cite it as:
```bibtex
@misc{mwirelabs_assamese_2025,
title = {Assamese Monolingual Corpus (2025)},
author = {MWire Labs},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus}},
note = {Cleaned, segmented, and deduplicated Assamese sentences}
}
```
---
## About MWire Labs
MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility.
Learn more at [www.mwirelabs.com](https://www.mwirelabs.com)
---
## Contributions & Feedback
We welcome feedback, contributions, and civic collaborations.
Reach out via [Hugging Face](https://huggingface.co/MWirelabs).
Provider:
MWirelabs



