MWirelabs/meitei-monolingual-corpus
Source: Hugging Face | Updated 2025-11-13 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/MWirelabs/meitei-monolingual-corpus
Description:
---
pretty_name: Meitei Monolingual Corpus (Bengali Script)
tags:
- meitei
- bengali-script
- monolingual
- low-resource
- nlp
- civic-tech
- northeast-india
- language-modeling
- deduplicated
language:
- mni
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 143000000 # Approximate total size
    num_examples: 1766810
  download_size: 143000000
  dataset_size: 143000000
description: |
  A clean, deduplicated, sentence-level Meitei corpus in Bengali script, curated for civic-first NLP research and public deployment.
  This dataset contains about 1.77M (1,766,810) Meitei sentences written in Bengali script. It was created by combining and cleaning data from the IndicNLP Monolingual Corpus and internal civic/language sources curated by MWire Labs.
  The corpus is suitable for language modeling, retrieval, summarization, and other NLP tasks focused on Meitei. It does not include text in Meitei Mayek script.
source_datasets:
- IndicNLP Monolingual Corpus (Meitei subset)
- Internal civic and linguistic data curated by MWire Labs
dataset_creation:
  cleaning_steps:
  - Removed boilerplate and repeated headers/footers
  - Filtered for Bengali-script Meitei content
  - Split long lines into sentences using Bengali punctuation (। ? !)
  - Deduplicated long and short sentences separately
  - Merged into a clean, sentence-level corpus
usage:
- Language modeling
- Civic NLP applications
- Retrieval and summarization
- Benchmarking Meitei tools
citation:
  title: Meitei Monolingual Corpus (Bengali Script)
  author: MWire Labs
  year: 2025
  url: https://huggingface.co/datasets/MWirelabs/meitei-monolingual-corpus
  license: CC BY-SA 4.0
---
# Meitei Monolingual Corpus (Bengali Script)
A clean, deduplicated, sentence-level Meitei corpus in Bengali script, curated for civic-first NLP research and public deployment.
---
## 📦 Dataset Overview
- **Language**: Meitei (written in Bengali script)
- **Size**: 1,766,810 sentences
- **Format**: Parquet (auto-converted from Hugging Face `Dataset`)
- **License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) — attribution required
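
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the ID from the citation URL. The helper names `load_corpus`, `preview`, and `length_summary` are illustrative and not part of the dataset; the `text` column and `train` split come from the card metadata.

```python
def load_corpus(repo_id="MWirelabs/meitei-monolingual-corpus", split="train", streaming=True):
    """Open the corpus from the Hugging Face Hub (needs network access and `datasets`)."""
    # Imported lazily so the offline helper below works without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=streaming)


def preview(rows, n=5):
    """Return the first n sentences from an iterable of {"text": ...} rows."""
    sentences = []
    for row in rows:
        if len(sentences) >= n:
            break
        sentences.append(row["text"])
    return sentences


def length_summary(rows):
    """Return (sentence count, mean characters per sentence) for an iterable of rows."""
    count = 0
    total_chars = 0
    for row in rows:
        count += 1
        total_chars += len(row["text"])
    mean = total_chars / count if count else 0.0
    return count, mean


# Example (requires network access):
#   corpus = load_corpus()
#   print(preview(corpus))
```

Streaming avoids downloading the whole corpus just to inspect a few rows; pass `streaming=False` to materialize the full split locally.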
---
## 🧭 Sources
This corpus is primarily derived from:
- **IndicNLP Monolingual Corpus** (Meitei subset in Bengali script)
- **Internal civic and linguistic data** curated by MWire Labs
---
## 🧹 Cleaning Pipeline
The dataset was processed using a modular, reproducible pipeline:
1. **Boilerplate removal** — stripped repeated headers, footers, and template artifacts
2. **Script validation** — retained only Bengali-script Meitei content
3. **Sentence splitting** — long lines split using Bengali punctuation (`।`, `?`, `!`)
4. **Exact deduplication** — applied separately to long and short sentences
5. **Final merge** — resulting in a clean, diverse, sentence-level corpus
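
Steps 2 to 5 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline MWire Labs actually ran: the 0.5 script-ratio cutoff, the 100-character boundary between short and long sentences, and all function names are hypothetical, since the card does not specify them.

```python
import re

# Unicode Bengali block. The danda "।" (U+0964) sits outside this block.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF

# A sentence is a run of non-terminators followed by any trailing terminators.
SENTENCE_RE = re.compile(r"[^।?!]+[।?!]*")


def bengali_ratio(text):
    """Fraction of non-whitespace characters that fall in the Bengali block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(BENGALI_START <= ord(ch) <= BENGALI_END for ch in chars)
    return hits / len(chars)


def split_sentences(line):
    """Split a line on the Bengali danda, question mark, and exclamation mark."""
    return [s.strip() for s in SENTENCE_RE.findall(line) if s.strip()]


def dedupe(sentences):
    """Exact deduplication that keeps the first occurrence and preserves order."""
    seen = set()
    unique = []
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique


def clean(lines, min_ratio=0.5, long_threshold=100):
    """Script filter, sentence split, per-bucket dedup, then merge (assumed parameters)."""
    kept = [line for line in lines if bengali_ratio(line) >= min_ratio]
    sentences = [s for line in kept for s in split_sentences(line)]
    short = dedupe(s for s in sentences if len(s) < long_threshold)
    long_ = dedupe(s for s in sentences if len(s) >= long_threshold)
    return short + long_
```

Because identical strings always have the same length, exact deduplication per length bucket removes the same duplicates as a single global pass; the buckets only affect ordering and memory use in this sketch.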
---
## ✅ Intended Use
This dataset is suitable for:
- Language modeling and fine-tuning
- Civic NLP applications (retrieval, summarization, classification)
- Benchmarking Meitei language tools
- Public deployment and linguistic research
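
The card metadata lists only a `train` split, so fine-tuning or benchmarking work will usually need a held-out set. One possible approach, offered as an assumption rather than a recommendation from the dataset authors, is a deterministic hash-based split; `assign_split` and its `validation_percent` parameter are hypothetical names.

```python
import hashlib


def assign_split(sentence, validation_percent=1):
    """Deterministically map a sentence to 'train' or 'validation' by hashing its text."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < validation_percent else "train"
```

Hashing the text instead of sampling at random keeps the assignment stable across runs and machines, and identical sentences always land in the same split.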
---
## 🔖 Citation
If you use this dataset, please cite:
@dataset{mwirelabs_meitei_2025,
title = {Meitei Monolingual Corpus (Bengali Script)},
author = {MWire Labs},
year = {2025},
url = {https://huggingface.co/datasets/MWirelabs/meitei-monolingual-corpus},
license = {CC BY-SA 4.0}
}
---
## 🗂️ Notes
- This corpus does **not** include Meitei Mayek script.
- Future versions may include Mayek-script data and parallel corpora.
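
To check the first note on your own copy, a small sketch like the one below flags any text containing Meitei Mayek code points. The ranges are the Unicode Meetei Mayek block (U+ABC0 to U+ABFF) and Meetei Mayek Extensions block (U+AAE0 to U+AAFF); the function names are illustrative.

```python
# Unicode blocks: Meetei Mayek and Meetei Mayek Extensions.
MAYEK_RANGES = ((0xABC0, 0xABFF), (0xAAE0, 0xAAFF))


def contains_meitei_mayek(text):
    """Return True if any character in text falls in a Meetei Mayek block."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in MAYEK_RANGES)


def count_mayek_sentences(sentences):
    """Count how many sentences contain at least one Meetei Mayek character."""
    return sum(1 for sentence in sentences if contains_meitei_mayek(sentence))
```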
---
Provider:
MWirelabs



