Minuri/nsina_cleaned_version
Source: Hugging Face | Updated: 2026-04-04 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/nsina_cleaned_version
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Sinhala Cleaned Sentences - NSINA
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- news
- cleaned
- deduplicated
---
# Sinhala Cleaned Sentences - NSINA
Cleaned and deduplicated Sinhala sentences derived from `Minuri/nsina-sentences-raw` through a multi-stage cleaning pipeline. This repo also served as storage for the pipeline's intermediate files; the final output is `stage10_final_corpus_deduped.csv`.
## Final Output
| File | Rows | Description |
|---|---|---|
| `stage10_final_corpus_deduped.csv` | 3,546,626 | Final cleaned and deduplicated sentences |
## Dataset Structure (final output)
| Column | Description |
|---|---|
| `sentence` | Cleaned Sinhala sentence |
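
Since the final file has a single `sentence` column, it can be read row by row without loading all of it into memory. The following is a minimal sketch that assumes the CSV has already been downloaded locally, is UTF-8 encoded, and has a header row; the helper names `iter_sentences` and `count_sentences` are hypothetical and not part of this repo.

```python
import csv
from pathlib import Path
from typing import Iterator


def iter_sentences(csv_path: Path) -> Iterator[str]:
    """Yield non-empty values of the `sentence` column, one row at a time."""
    # newline="" lets the csv module handle quoted fields containing line breaks.
    with csv_path.open(encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            # Assumption: the header is exactly `sentence`, as in the table above.
            sentence = (row.get("sentence") or "").strip()
            if sentence:
                yield sentence


def count_sentences(csv_path: Path) -> int:
    """Count the non-empty sentences in the file without holding them in memory."""
    return sum(1 for _ in iter_sentences(csv_path))
```

For example, `count_sentences(Path("stage10_final_corpus_deduped.csv"))` would walk a local copy of the final output once.
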
## Pipeline Stages
This repo also stores the intermediate CSV files produced by the cleaning pipeline, together with cleaning logs and filter reports. The pipeline is described as having 12 stages; the files stored here cover stage2 through stage10.
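
The individual stages are not documented in this card. Purely as an illustration of what an exact-deduplication step could look like, the sketch below keeps the first occurrence of each sentence after Unicode NFC normalization and whitespace collapsing. Both the normalization choice and the function names are assumptions, not the pipeline's actual method.

```python
import unicodedata
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Hypothetical dedup key: NFC form with runs of whitespace collapsed."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def dedupe(sentences: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each normalized sentence, preserving order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = normalize(sentence)
        # Skip empty strings and anything whose normalized form was already seen.
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Near-duplicate detection (for example, sentences that differ only in punctuation) would need a looser key than the one assumed here.
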
## Pipeline Position
`Minuri/nsina-sentences-raw` → **this repo** → `Minuri/diverse_sinhala_dataset`
## Sources & Licenses
| Source | License |
|---|---|
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 - ShareAlike applies |
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/nsina-sentences-raw` | Raw sentences (5,415,583) |
| `Minuri/diverse_sinhala_dataset` | Final parent corpus |
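
The two row counts quoted in this card give a rough sense of how much the pipeline removed. The short calculation below assumes that rows correspond one-to-one to sentences in both files, which the card does not state explicitly.

```python
# Row counts as quoted in this card.
RAW_ROWS = 5_415_583    # Minuri/nsina-sentences-raw
FINAL_ROWS = 3_546_626  # stage10_final_corpus_deduped.csv

removed = RAW_ROWS - FINAL_ROWS
retained_fraction = FINAL_ROWS / RAW_ROWS

print(f"removed: {removed:,}")                # removed: 1,868,957
print(f"retained: {retained_fraction:.2%}")   # retained: 65.49%
```

Under that assumption, roughly a third of the raw sentences were dropped during cleaning and deduplication.
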
Provider:
Minuri



