Minuri/sinhala-corpus-a-news-1m
Hugging Face · updated 2026-04-04 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-a-news-1m
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: News-Only Sinhala Corpus
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- news
- domain-classified
---
# News-Only Sinhala Corpus
A news-domain subset of 1M Sinhala sentences sampled from the `Minuri/diverse_sinhala_dataset` corpus. It was used for continual pretraining of LLaMA 3.2 1B (Model A) as part of a diversity-driven Sinhala language model adaptation study at the Informatics Institute of Technology (IIT), Colombo, affiliated with Robert Gordon University (RGU).
> **Corpus variants in this series:**
> - `Minuri/sinhala-corpus-a-news-1m` - News-only subset (domain-homogeneous baseline) - this repo
> - `Minuri/sinhala-corpus-b-random-1m` - Random subset (random baseline)
> - `Minuri/sinhala-corpus-c-diverse-1m` - Diversity-optimized subset ✅ Best perplexity
## Dataset Description
Corpus A serves as the **domain-homogeneous baseline**, comprising sentences drawn exclusively from the news domain of the parent corpus. This enables controlled comparison against the random (B) and diversity-optimized (C) corpora in downstream perplexity and evaluation experiments. The model trained on this corpus (Model A) achieved a perplexity of **14.68** on the Sinhala test set.
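The card does not state how the reported perplexity was computed (tokenizer, context length, or averaging scheme). As a minimal sketch of the standard definition only, perplexity is the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the example log-probabilities below are illustrative assumptions, not values from the study.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean log-probability) over the evaluated tokens.

    `token_log_probs` holds natural-log probabilities that a model
    assigned to each reference token (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up probabilities for four tokens, not real model output.
example_log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
example_ppl = perplexity(example_log_probs)
```

Under this definition a lower value means the model assigns higher probability to the test text, which is the sense in which the series compares Models A, B, and C.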
### Source Datasets (via parent corpus)
| Source | Description |
|---|---|
| `culturax` | CulturaX multilingual web corpus (Sinhala subset) |
| `nsina` | NSina Sinhala news corpus |
| `madlad` | MADLAD-400 multilingual dataset (Sinhala subset) |
| `wikipedia` | Sinhala Wikipedia dump |
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `orig_index` | int | Original index in the parent corpus |
| `sentence` | string | Sinhala sentence text |
| `source` | string | Source dataset identifier |
| `predicted_domain` | string | Domain label predicted by an XLM-RoBERTa classifier |
| `confidence` | float | Classifier confidence score |
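The column layout above can be written down as a typed record, which makes it easier to filter rows by classifier confidence. This is a sketch under assumptions: the `CorpusRow` type, the `filter_confident` helper, the `0.9` threshold, the `"news"` label string, and the sample rows are all hypothetical and not taken from the dataset.

```python
from typing import Iterable, Iterator, TypedDict


class CorpusRow(TypedDict):
    """One record, mirroring the five documented columns."""
    orig_index: int
    sentence: str
    source: str
    predicted_domain: str
    confidence: float


def filter_confident(rows: Iterable[CorpusRow], min_confidence: float = 0.9) -> Iterator[CorpusRow]:
    """Yield only rows whose classifier confidence meets the threshold."""
    for row in rows:
        if row["confidence"] >= min_confidence:
            yield row


# Placeholder rows for illustration; real sentences are Sinhala text.
sample_rows: list[CorpusRow] = [
    {"orig_index": 12, "sentence": "<sentence 1>", "source": "nsina",
     "predicted_domain": "news", "confidence": 0.98},
    {"orig_index": 57, "sentence": "<sentence 2>", "source": "culturax",
     "predicted_domain": "news", "confidence": 0.61},
]

kept = list(filter_confident(sample_rows))
```

Because every row in this corpus was selected by predicted domain, a confidence filter like this is one way to study how label noise affects the news-only baseline.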
### Splits
| Split | Rows |
|---|---|
| train | 1,000,000 |
### Format
Available in both JSONL and CSV formats.
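Either format can be read with the Python standard library alone. The sketch below assumes one JSON object per line in the JSONL file and a header row naming the five columns in the CSV file; neither detail is confirmed by the card, and the in-memory sample text stands in for the real files.

```python
import csv
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_csv(stream):
    """Yield one dict per CSV record, casting the numeric columns."""
    for record in csv.DictReader(stream):
        record["orig_index"] = int(record["orig_index"])
        record["confidence"] = float(record["confidence"])
        yield record


# Hypothetical single-row samples using the documented column names.
jsonl_text = (
    '{"orig_index": 7, "sentence": "placeholder", "source": "nsina", '
    '"predicted_domain": "news", "confidence": 0.98}\n'
)
csv_text = (
    "orig_index,sentence,source,predicted_domain,confidence\n"
    "7,placeholder,nsina,news,0.98\n"
)

jsonl_rows = list(read_jsonl(io.StringIO(jsonl_text)))
csv_rows = list(read_csv(io.StringIO(csv_text)))
```

With real files, pass `open(path, encoding="utf-8", newline="")` instead of `io.StringIO` so that Sinhala text and quoted CSV fields are decoded correctly. Streaming row by row avoids holding all 1,000,000 sentences in memory.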
## Intended Uses
- Continual pretraining of LLMs on Sinhala (domain-homogeneous baseline)
- Ablation studies on corpus diversity
- Sinhala NLP benchmarking
## Associated Model
This corpus was used to train: `Minuri/sinhala-llama-1b-corpus-news`
## Sources & Licenses
This dataset contains sentences derived from the following source datasets. Users must comply with the license terms of each:
| Source | License | Notes |
|---|---|---|
| [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY | Attribution required |
| [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | mC4 + OSCAR licenses | Access requires agreeing to share contact information on Hugging Face |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | CC BY-SA 3.0 + GFDL | ShareAlike: derived works must carry the same license |
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 | ShareAlike: derived works must carry the same license |
This dataset is released under **CC BY-SA 4.0** in compliance with the ShareAlike terms of Wikipedia and NSINA.
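When working with a filtered slice of the corpus, it can help to check which upstream licenses actually apply to the rows in hand. The sketch below assumes the `source` column uses the identifiers listed under Source Datasets (`culturax`, `nsina`, `madlad`, `wikipedia`); the `SOURCE_LICENSES` mapping restates the table above, and `licenses_in_use` is a hypothetical helper name.

```python
from collections import Counter

# License per source identifier, copied from the table in this card.
SOURCE_LICENSES = {
    "madlad": "ODC-BY",
    "culturax": "mC4 + OSCAR licenses",
    "wikipedia": "CC BY-SA 3.0 + GFDL",
    "nsina": "CC BY-SA 4.0",
}


def licenses_in_use(rows):
    """Map each source present in `rows` to (license, row count)."""
    counts = Counter(row["source"] for row in rows)
    return {
        source: (SOURCE_LICENSES.get(source, "unknown"), count)
        for source, count in counts.items()
    }


# Placeholder rows; only the `source` field matters here.
sample = [{"source": "nsina"}, {"source": "nsina"}, {"source": "wikipedia"}]
summary = licenses_in_use(sample)
```

Sources missing from the mapping are reported as `"unknown"` rather than raising, so an unexpected identifier shows up in the summary instead of stopping the check.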
Provider:
Minuri



