Minuri/sinhala-corpus-b-random-1m
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-b-random-1m
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Randomly Curated Sinhala Corpus
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- domain-classified
---
# Randomly Curated Sinhala Corpus
A randomly sampled subset of 1M Sinhala sentences from the `Minuri/diverse_sinhala_dataset` corpus, used for continual pretraining of LLaMA 3.2 1B (Model B) as part of a diversity-driven Sinhala language model adaptation study.
> **Corpus variants in this series:**
> - `Minuri/sinhala-corpus-a-news-1m` - News-only subset (domain-homogeneous baseline)
> - `Minuri/sinhala-corpus-b-random-1m` - Random subset (random baseline) - this repo
> - `Minuri/sinhala-corpus-c-diverse-1m` - Diversity-optimized subset ✅ Best perplexity
## Dataset Description
Corpus B serves as the **random sampling baseline**, comprising 1M sentences drawn randomly from the full parent corpus without any domain or diversity constraints. This allows comparison against the domain-controlled (A) and diversity-optimized (C) corpora. The model trained on this corpus (Model B) achieved a perplexity of **10.86** on the Sinhala test set.
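The unconstrained random draw described above can be illustrated with a minimal sketch. It assumes uniform sampling without replacement over parent-corpus row indices. The function name `sample_indices`, the seed, and the toy sizes are hypothetical; the card does not state how the sampling was actually implemented.

```python
import random


def sample_indices(parent_size: int, k: int, seed: int = 0) -> list[int]:
    """Draw k distinct row indices uniformly at random from range(parent_size)."""
    if not 0 <= k <= parent_size:
        raise ValueError("k must be between 0 and parent_size")
    rng = random.Random(seed)  # seeded so the draw is reproducible (assumed, not from the card)
    # random.sample draws without replacement, so no index repeats.
    return sorted(rng.sample(range(parent_size), k))


# Toy sizes for illustration only; the setting described in this card would be
# roughly parent_size=12_380_000 and k=1_000_000.
picked = sample_indices(parent_size=1_000, k=100, seed=42)
print(len(picked), len(set(picked)))
```

Because no domain or source information enters the draw, the expected domain mix of the sample mirrors that of the parent corpus, which is what makes it a useful baseline against the curated variants.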
### Source Datasets (via parent corpus)
| Source | Description |
|---|---|
| `culturax` | CulturaX multilingual web corpus (Sinhala subset) |
| `madlad` | MADLAD-400 multilingual dataset (Sinhala subset) |
| `nsina` | NSina Sinhala news corpus |
| `wikipedia` | Sinhala Wikipedia dump |
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `orig_index` | int | Original index in the parent corpus |
| `sentence` | string | Sinhala sentence text |
| `source` | string | Source dataset identifier |
| `predicted_domain` | string | Domain label predicted by XLM-RoBERTa classifier |
| `confidence` | float | Classifier confidence score |
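As a rough illustration of how these columns fit together, the sketch below types one row and counts domain labels above a confidence threshold. The rows, the sentence placeholders, the labels `domain_x` and `domain_y`, and the 0.8 threshold are all hypothetical; only the column names and the `source` identifiers come from this card.

```python
from collections import Counter
from typing import TypedDict


class Row(TypedDict):
    orig_index: int        # index in the parent corpus
    sentence: str          # Sinhala sentence text
    source: str            # e.g. "culturax", "madlad", "nsina", "wikipedia"
    predicted_domain: str  # classifier label
    confidence: float      # classifier confidence score


# Placeholder rows: sentences and domain labels are invented stand-ins,
# not real values from the dataset.
rows: list[Row] = [
    {"orig_index": 0, "sentence": "<sentence 0>", "source": "culturax",
     "predicted_domain": "domain_x", "confidence": 0.95},
    {"orig_index": 1, "sentence": "<sentence 1>", "source": "nsina",
     "predicted_domain": "domain_y", "confidence": 0.60},
    {"orig_index": 2, "sentence": "<sentence 2>", "source": "wikipedia",
     "predicted_domain": "domain_x", "confidence": 0.85},
]


def domain_counts(rows: list[Row], min_confidence: float = 0.0) -> Counter:
    """Count predicted domains among rows at or above min_confidence."""
    return Counter(
        r["predicted_domain"] for r in rows if r["confidence"] >= min_confidence
    )


all_counts = domain_counts(rows)
confident_counts = domain_counts(rows, min_confidence=0.8)
print(all_counts, confident_counts)
```

Filtering on `confidence` before counting is one way to check whether the domain distribution is sensitive to low-certainty classifier predictions.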
### Splits
| Split | Rows |
|---|---|
| train | 1,000,000 |
### Format
Available in both JSONL and CSV formats.
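A practical difference between the two formats is that JSONL preserves numeric types while CSV yields strings. The sketch below reads both into the same record shape using only the standard library. It assumes the CSV has a header row matching the column names, which this card does not confirm; the sample content uses placeholders and the helper names are hypothetical.

```python
import csv
import io
import json


def read_jsonl(fh) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in fh if line.strip()]


def read_csv(fh) -> list[dict]:
    """Parse CSV rows and cast the numeric columns, which arrive as strings."""
    rows = []
    for rec in csv.DictReader(fh):
        rec["orig_index"] = int(rec["orig_index"])
        rec["confidence"] = float(rec["confidence"])
        rows.append(rec)
    return rows


# Placeholder content standing in for the real files (assumed layout).
jsonl_text = (
    '{"orig_index": 7, "sentence": "<sentence>", "source": "nsina", '
    '"predicted_domain": "<domain>", "confidence": 0.5}\n'
)
csv_text = (
    "orig_index,sentence,source,predicted_domain,confidence\n"
    "7,<sentence>,nsina,<domain>,0.5\n"
)

from_jsonl = read_jsonl(io.StringIO(jsonl_text))
from_csv = read_csv(io.StringIO(csv_text))
print(from_jsonl == from_csv)
```

For the real files, `io.StringIO(...)` would be replaced by `open(path, encoding="utf-8", newline="")` so that Sinhala text and embedded newlines in quoted CSV fields are handled correctly.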
## Intended Uses
- Continual pretraining of LLMs on Sinhala (random baseline)
- Ablation studies on corpus diversity
- Sinhala NLP benchmarking
## Associated Model
This corpus was used to train: `Minuri/sinhala-llama-1b-corpus-random`
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/diverse_sinhala_dataset` | Full parent corpus (12.38M sentences) |
| `Minuri/sinhala-corpus-a-news-1m` | Corpus A - news-only (1M) |
| `Minuri/sinhala-corpus-c-diverse-1m` | Corpus C - diversity-optimized (1M) |
| `Minuri/sinhala-test-set-50k` | Test set (50K sentences) |
| `Minuri/sinhala-validation-set-10k` | Validation set (10K sentences) |
| `Minuri/sinhala-llama-3.2-1b-tokenizer` | Extended Sinhala tokenizer |
## Sources & Licenses
This dataset contains sentences derived from the following source datasets. Users must comply with the license terms of each:
| Source | License | Notes |
|---|---|---|
| [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY | Attribution required |
| [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | mC4 + OSCAR licenses | Access requires agreeing to share contact information on Hugging Face |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | CC BY-SA 3.0 + GFDL | ShareAlike - derived works must carry same license |
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 | ShareAlike - derived works must carry same license |
This dataset is released under **CC BY-SA 4.0** in compliance with the ShareAlike terms of Wikipedia and NSINA.
Provided by:
Minuri



