Minuri/diverse_sinhala_dataset
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/Minuri/diverse_sinhala_dataset
Description:
---
viewer: false
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Diverse Sinhala Dataset
size_categories:
- 10M<n<100M
tags:
- sinhala
- low-resource
- pretraining
- domain-classified
- deduplicated
---
# Diverse Sinhala Dataset
A large-scale, cleaned, deduplicated, and domain-classified Sinhala text corpus compiled for continual pretraining of large language models. It was constructed as part of a diversity-driven Sinhala language model adaptation study.
This repository serves as **pipeline storage** for the full corpus construction process, from merged cleaned sentences through to the final high-confidence domain-classified corpus.
## Files
| File | Rows | Columns | Description |
|---|---|---|---|
| `exact_deduplicated_corpus.csv` | 12,653,718 | `sentence`, `source` | Merged corpus after exact deduplication across all 4 sources |
| `near_deduplicated_corpus.csv` | 12,381,586 | `sentence`, `source` | After MinHash LSH near-duplicate removal |
| `full_dataset_classified.csv` | 12,381,586 | `sentence`, `source`, `predicted_domain`, `confidence` | After domain classification |
| `high_confidence_7_7M.csv` | 7,706,748 | `sentence`, `source`, `predicted_domain`, `confidence` | **Final corpus** — high-confidence domain classifications only |
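
To make the effect of each stage concrete, the short sketch below recomputes how many rows are dropped between consecutive files. The row counts are copied from the table above; the helper name `stage_drops` is hypothetical and only used for illustration.

```python
# Row counts copied from the Files table, in pipeline order.
rows = {
    "exact_deduplicated_corpus.csv": 12_653_718,
    "near_deduplicated_corpus.csv": 12_381_586,
    "full_dataset_classified.csv": 12_381_586,
    "high_confidence_7_7M.csv": 7_706_748,
}


def stage_drops(counts):
    """Return (file, rows_removed, fraction_kept) relative to the previous file."""
    names = list(counts)
    result = []
    for prev, cur in zip(names, names[1:]):
        removed = counts[prev] - counts[cur]
        result.append((cur, removed, counts[cur] / counts[prev]))
    return result


for name, removed, kept in stage_drops(rows):
    print(f"{name}: removed {removed:,} rows, kept {kept:.1%}")
```

Classification adds columns but removes no rows, so the only reductions come from near-deduplication and the high-confidence filter.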
## Pipeline
```
4 cleaned source datasets
↓
Exact deduplication (cross-source)
↓ exact_deduplicated_corpus.csv (12.65M)
MinHash LSH near-deduplication
↓ near_deduplicated_corpus.csv (12.38M)
XLM-RoBERTa domain classification (8 domains, 94% macro-F1)
↓ full_dataset_classified.csv (12.38M)
High-confidence filter
↓ high_confidence_7_7M.csv (7.71M) ← final usable corpus
```
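
The two deduplication stages can be illustrated with a small standard-library sketch. It is not the code used to build this corpus: the shingle size, the number of hash functions, the band count, and the similarity threshold are all assumptions chosen for illustration, and the `source` values in the demo are placeholders. A production pipeline would more likely use a dedicated MinHash LSH library.

```python
import hashlib
import re


def exact_dedup(rows):
    """Keep the first (sentence, source) pair for each distinct sentence."""
    seen, kept = set(), []
    for sentence, source in rows:
        if sentence not in seen:
            seen.add(sentence)
            kept.append((sentence, source))
    return kept


def shingles(text, n=3):
    """Character n-grams of whitespace-normalised text (n=3 is an assumption)."""
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text, num_perm=64):
    """MinHash signature: the minimum digest under each of num_perm salted hashes."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams)
        )
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(rows, num_perm=64, bands=16, threshold=0.8):
    """Drop rows whose estimated similarity to an earlier kept row >= threshold."""
    width = num_perm // bands
    buckets = {}  # (band index, band values) -> indices into `kept`
    kept = []     # list of (row, signature)
    for row in rows:
        sig = minhash(row[0], num_perm)
        keys = [(b, tuple(sig[b * width:(b + 1) * width])) for b in range(bands)]
        # LSH step: only rows sharing at least one band are compared.
        candidates = {i for key in keys for i in buckets.get(key, [])}
        if any(similarity(sig, kept[i][1]) >= threshold for i in candidates):
            continue
        for key in keys:
            buckets.setdefault(key, []).append(len(kept))
        kept.append((row, sig))
    return [row for row, _ in kept]


demo = [
    ("a b c d e", "wikipedia"),
    ("a b c d e", "nsina"),       # exact duplicate from another source
    ("a  b c d e", "madlad"),     # differs only in whitespace
    ("completely different text", "culturax"),
]
exact = exact_dedup(demo)
near = near_dedup(exact)
print(len(demo), len(exact), len(near))
```

Banding keeps the comparison cost manageable: a sentence is only checked against earlier sentences that share at least one band of the signature, rather than against the whole corpus.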
## Source Datasets
| Source | Repo | Cleaned sentences |
|---|---|---|
| MADLAD-400 | `Minuri/madlad_cleaned_version` | 5,033,732 |
| CulturaX | `Minuri/culturax_cleaned_version` | 3,684,137 |
| NSINA | `Minuri/nsina_cleaned_version` | 3,546,626 |
| Wikipedia | `Minuri/wikipedia_cleaned_version` | 389,223 |
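
As a quick consistency check, the per-source counts above sum to 12,653,718, which matches the row count listed for `exact_deduplicated_corpus.csv`. The sketch below recomputes that total and each source's share of it; the counts are copied from the table.

```python
# Cleaned sentence counts copied from the Source Datasets table.
source_counts = {
    "MADLAD-400": 5_033_732,
    "CulturaX": 3_684_137,
    "NSINA": 3_546_626,
    "Wikipedia": 389_223,
}

total = sum(source_counts.values())
shares = {name: count / total for name, count in source_counts.items()}

print(f"total: {total:,}")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<12}{share:6.1%}")
```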
## Domains (8 classes)
Each sentence was classified with a fine-tuned XLM-RoBERTa model (94% macro-F1) into one of eight domains: News, Education, Government / Legal, Health, Religion / Culture, Sports, Entertainment, and General / Other.
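
The high-confidence filter that produces the final corpus can be sketched as a threshold on the `confidence` column. The cutoff actually used is not stated in this card, so the `threshold` value below is a hypothetical placeholder. The label strings are taken from the list above and may not match the exact spelling in the CSV; the demo sentences are placeholders as well.

```python
from collections import Counter

# Domain names as listed in this card; the exact strings in the CSV are an assumption.
DOMAINS = [
    "News", "Education", "Government / Legal", "Health",
    "Religion / Culture", "Sports", "Entertainment", "General / Other",
]


def filter_high_confidence(rows, threshold):
    """Keep rows whose classifier confidence is at least `threshold`."""
    return [row for row in rows if float(row["confidence"]) >= threshold]


def domain_counts(rows):
    """Count rows per predicted domain."""
    return Counter(row["predicted_domain"] for row in rows)


demo_rows = [
    {"sentence": "placeholder 1", "predicted_domain": "News", "confidence": "0.98"},
    {"sentence": "placeholder 2", "predicted_domain": "News", "confidence": "0.55"},
    {"sentence": "placeholder 3", "predicted_domain": "Health", "confidence": "0.91"},
]

# 0.9 is a hypothetical cutoff, not the value used for this dataset.
kept = filter_high_confidence(demo_rows, threshold=0.9)
print(domain_counts(kept))
```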
## Usage
```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download (and cache) the final high-confidence corpus from the Hub
path = hf_hub_download(
    repo_id="Minuri/diverse_sinhala_dataset",
    filename="high_confidence_7_7M.csv",
    repo_type="dataset",
)

# Columns: sentence, source, predicted_domain, confidence
df = pd.read_csv(path)
print(df.head())
```
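
Loading all 7.7M rows at once may be memory-intensive. A minimal sketch of streaming the file row by row with the standard library is shown below. The helper name `iter_domain` is hypothetical, and the in-memory demo data (including its `source` values) is a placeholder rather than real corpus content.

```python
import csv
import io


def iter_domain(lines, domain, min_confidence=0.0):
    """Yield sentences of one predicted domain from an open CSV file object."""
    for row in csv.DictReader(lines):
        if (row["predicted_domain"] == domain
                and float(row["confidence"]) >= min_confidence):
            yield row["sentence"]


# Placeholder data with the same four columns as the classified CSV files.
demo = io.StringIO(
    "sentence,source,predicted_domain,confidence\n"
    "placeholder one,source_a,Sports,0.97\n"
    "placeholder two,source_b,News,0.99\n"
    "placeholder three,source_a,Sports,0.93\n"
)

print(list(iter_domain(demo, "Sports", min_confidence=0.95)))
```

For the real file, pass `open(path, newline="", encoding="utf-8")` in place of `demo`, where `path` is the value returned by `hf_hub_download` above.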
## Downstream Datasets
This corpus was used to derive the following pretraining subsets, evaluation splits, and tokenizer:
| Repo | Description |
|---|---|
| `Minuri/sinhala-corpus-a-news-1m` | News-only subset (1M sentences) |
| `Minuri/sinhala-corpus-b-random-1m` | Random subset (1M sentences) |
| `Minuri/sinhala-corpus-c-diverse-1m` | Diversity-optimised subset (1M sentences) |
| `Minuri/sinhala-test-set-50k` | Test set (50K sentences) |
| `Minuri/sinhala-validation-set-10k` | Validation set (10K sentences) |
| `Minuri/sinhala-llama-3.2-1b-tokenizer` | SentencePiece tokenizer trained on this corpus |
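
The three 1M subsets correspond to different sampling strategies. The sketch below illustrates the general idea on rows carrying a `predicted_domain` field. The selection procedures actually used for these repositories are not described in this card, so the equal-quota balancing in `sample_balanced` is only one plausible stand-in for "diversity-optimised", and all function names are hypothetical.

```python
import random
from collections import defaultdict


def sample_random(rows, k, seed=0):
    """Uniform random sample of k rows."""
    return random.Random(seed).sample(rows, k)


def sample_single_domain(rows, domain, k, seed=0):
    """Random sample of k rows restricted to one predicted domain."""
    pool = [row for row in rows if row["predicted_domain"] == domain]
    return random.Random(seed).sample(pool, k)


def sample_balanced(rows, k, seed=0):
    """Equal quota per domain (an assumed notion of diversity, not the authors')."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row["predicted_domain"]].append(row)
    quota = k // len(by_domain)
    sample = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


# Placeholder rows: 10 per domain across three domains.
demo_rows = [
    {"sentence": f"placeholder {d} {i}", "predicted_domain": d}
    for d in ("News", "Sports", "Health")
    for i in range(10)
]

print(len(sample_random(demo_rows, 5)))
print(len(sample_single_domain(demo_rows, "News", 4)))
print(len(sample_balanced(demo_rows, 6)))
```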
## Sources & Licenses
| Source | License |
|---|---|
| [allenai/MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY |
| [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | mC4 + OSCAR licenses |
| [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | CC BY-SA 3.0 + GFDL |
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 |
This dataset is released under **CC BY-SA 4.0** in compliance with the ShareAlike terms of Wikipedia and NSINA.
Provider: Minuri