Minuri/nsina-sentences-raw
Source: Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/nsina-sentences-raw
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Sinhala Raw Sentences - NSINA
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- news
- raw
---
# Sinhala Raw Sentences - NSINA
Raw Sinhala sentences extracted and sentence-split from the `sinhala-nlp/NSINA` news corpus using the SinLing sentence tokenizer. This is an intermediate dataset used in the construction of `Minuri/diverse_sinhala_dataset`.
## Dataset Structure
| Column | Description |
|---|---|
| `text` | Raw Sinhala sentence |
| `char_count` | Character count of the sentence |
| `word_count` | Word count of the sentence |

| Split | Rows |
|---|---|
| train | 5,415,583 |
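
The columns above can be inspected without downloading the full split. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed; the helper names `sentence_stats` and `preview` are hypothetical. The card does not say how `char_count` and `word_count` were computed, so the recomputation here (code-point length and whitespace splitting) is an assumption and may not match the stored values exactly.

```python
from itertools import islice

REPO_ID = "Minuri/nsina-sentences-raw"  # repository name taken from this card


def sentence_stats(text: str) -> dict:
    """Recompute the two count columns for one sentence.

    Assumption: char_count is len(text) (Unicode code points) and
    word_count is the number of whitespace-separated tokens.
    The card does not document the actual method.
    """
    return {"char_count": len(text), "word_count": len(text.split())}


def preview(n: int = 5) -> list:
    """Stream the first n rows of the train split instead of downloading it all."""
    # Imported lazily so sentence_stats stays usable without the library.
    from datasets import load_dataset  # third-party: pip install datasets

    rows = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(rows, n))
```

Calling `preview(3)` should return three dictionaries with the keys `text`, `char_count`, and `word_count`; comparing each row against `sentence_stats(row["text"])` is a quick way to check whether the assumed counting rules agree with the stored columns.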
## Pipeline Position
`sinhala-nlp/NSINA` → **this repo** → `Minuri/nsina_cleaned_version` → `Minuri/diverse_sinhala_dataset`
## Sources & Licenses
| Source | License |
|---|---|
| [sinhala-nlp/NSINA](https://huggingface.co/datasets/sinhala-nlp/NSINA) | CC BY-SA 4.0 - ShareAlike applies |
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/nsina_cleaned_version` | Cleaned version (3,546,626 sentences) |
| `Minuri/diverse_sinhala_dataset` | Final parent corpus |
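
To gauge how much the cleaning stage filters out, the row counts reported in this card can be compared directly. This is a small arithmetic sketch that uses only the two figures stated above; it assumes each cleaned sentence corresponds to one raw row, which the card does not confirm.

```python
RAW_ROWS = 5_415_583      # train split of this repo, from the table above
CLEANED_ROWS = 3_546_626  # Minuri/nsina_cleaned_version, from the table above

removed = RAW_ROWS - CLEANED_ROWS   # rows dropped between the two stages
retained = CLEANED_ROWS / RAW_ROWS  # fraction of raw rows that survive

print(f"removed:  {removed:,} rows")
print(f"retained: {retained:.1%} of the raw sentences")
```

Under that assumption, roughly two thirds of the raw sentences survive cleaning.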
Provider:
Minuri



