Minuri/sinhala-corpus-c-diverse-1m
Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-c-diverse-1m
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Diversity-Optimized Sinhala Corpus
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- diversity-optimized
- domain-classified
---
# Diversity-Optimized Sinhala Corpus
A diversity-optimized subset of 1M Sinhala sentences sampled from the `Minuri/diverse_sinhala_dataset` corpus. It was used for continual pretraining of LLaMA 3.2 1B (Model C) as part of a study on diversity-driven adaptation of language models to Sinhala.
> **Corpus variants in this series:**
> - `Minuri/sinhala-corpus-a-news-1m` - News-only subset (domain-homogeneous baseline)
> - `Minuri/sinhala-corpus-b-random-1m` - Random subset (random baseline)
> - `Minuri/sinhala-corpus-c-diverse-1m` - Diversity-optimized subset (this repo) ✅ Best perplexity
## Dataset Description
Corpus C was selected from the full 12.38M-sentence parent corpus using a combination of semantic and lexical diversity metrics to maximise coverage across domains, vocabulary and linguistic style. It is the most diverse of the three pretraining corpora, and the model trained on it (Model C) achieved the best (lowest) perplexity of the three models, **10.50**, on the Sinhala test set.
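For readers unfamiliar with the metric, the sketch below shows the common definition of perplexity as the exponential of the mean negative log-likelihood per token. It assumes natural-log token probabilities and token-level averaging; the card does not state how the reported figure was computed, so the function name and input format here are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log.

    `token_log_probs` is a hypothetical input: one log-probability per
    evaluated token, as produced by any language model scoring a test set.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options.
uniform_quarter = [math.log(0.25)] * 4
print(perplexity(uniform_quarter))
```

Under this definition, a perplexity of 10.50 corresponds to a mean negative log-likelihood of about 2.35 nats per token, and lower values indicate a better fit to the test set.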
### Diversity Selection Criteria
- **Semantic diversity** - embedding spread using `paraphrase-multilingual-mpnet-base-v2`
- **Lexical diversity** - Type-Token Ratio (TTR), Moving Average TTR (MATTR), hapax ratio
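
The lexical metrics listed above can be sketched in a few lines. This is a minimal illustration, not the authors' selection code: it assumes pre-tokenized input (for example, whitespace-split sentences), a MATTR window of 50 tokens, and a hapax ratio defined as hapax types divided by all types. None of these choices are specified in the card.

```python
from collections import Counter


def ttr(tokens):
    """Type-Token Ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving Average TTR: mean TTR over every sliding window of `window` tokens.

    Falls back to plain TTR when the text is shorter than one window.
    The default window size is an assumption, not taken from the card.
    """
    if len(tokens) < window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


def hapax_ratio(tokens):
    """Share of distinct tokens (types) that occur exactly once.

    Other definitions divide by the token count instead; this one is assumed.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Placeholder tokens for illustration only.
tokens = "a b a c".split()
print(ttr(tokens), mattr(tokens, window=3), hapax_ratio(tokens))
```

Plain TTR falls as texts get longer, which is why a windowed variant such as MATTR is usually preferred when comparing samples of different sizes.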
### Source Datasets (via parent corpus)
| Source | Description |
|---|---|
| `madlad` | MADLAD-400 multilingual dataset (Sinhala subset) |
| CulturaX | Multilingual web corpus (Sinhala subset) |
| NSINA | Sinhala news corpus |
| Wikipedia | Sinhala Wikipedia dump |
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `orig_index` | int | Original index in the parent corpus |
| `sentence` | string | Sinhala sentence text |
| `source` | string | Source dataset identifier (e.g. `madlad`) |
| `predicted_domain` | string | Domain label predicted by XLM-RoBERTa classifier |
| `confidence` | float | Classifier confidence score |
### Splits
| Split | Rows |
|---|---|
| train | 1,000,000 |
### Format
Available in both JSONL and CSV formats.
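
As a rough sketch of working with the JSONL form, the snippet below parses rows into a typed record matching the columns listed under Dataset Structure and counts rows per predicted domain above a confidence threshold. The sample rows, the domain labels `domain_a` and `domain_b`, and the 0.5 threshold are placeholders invented for illustration; they are not taken from the corpus.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class Row:
    orig_index: int
    sentence: str
    source: str
    predicted_domain: str
    confidence: float


def parse_jsonl(lines):
    """Yield one Row per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield Row(**json.loads(line))


def domain_counts(rows, min_confidence=0.0):
    """Count rows per predicted domain, keeping rows at or above the threshold."""
    return Counter(r.predicted_domain for r in rows if r.confidence >= min_confidence)


# Placeholder rows for illustration only; they are NOT real corpus entries.
sample = [
    '{"orig_index": 0, "sentence": "<sentence 1>", "source": "madlad", '
    '"predicted_domain": "domain_a", "confidence": 0.91}',
    '{"orig_index": 1, "sentence": "<sentence 2>", "source": "madlad", '
    '"predicted_domain": "domain_b", "confidence": 0.42}',
]
rows = list(parse_jsonl(sample))
print(domain_counts(rows, min_confidence=0.5))
```

With a downloaded file, the same `parse_jsonl` function accepts an open file handle (opened with `encoding="utf-8"`) in place of the `sample` list, since it only iterates over lines.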
## Intended Uses
- Continual pretraining of LLMs on Sinhala
- Diversity-controlled ablation studies
- Sinhala NLP benchmarking
## Associated Model
This corpus was used to train: `Minuri/sinhala-llama-1b-corpus-diverse`
## Sources & Licenses
This dataset contains sentences derived from the following source datasets. Users must comply with the license terms of each:
| Source | License | Notes |
|---|---|---|
| [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY | Attribution required |
| [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | mC4 + OSCAR licenses | Access on Hugging Face requires agreeing to share contact information |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | CC BY-SA 3.0 + GFDL | ShareAlike: derived works must carry the same license |
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 | ShareAlike: derived works must carry the same license |
This dataset is released under **CC BY-SA 4.0** in compliance with the ShareAlike terms of Wikipedia and NSINA.
Provider:
Minuri



