Minuri/sinhala-corpus-culturax
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-culturax
Description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Sinhala Raw Sentences - CulturaX
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- raw
---
# Sinhala Raw Sentences - CulturaX
Raw Sinhala sentences extracted and sentence-split from the `uonlp/CulturaX` dataset. This is an intermediate dataset used in the construction of `Minuri/diverse_sinhala_dataset`.
## Dataset Structure
| Column | Description |
|---|---|
| `text` | Raw Sinhala sentence |
| `source` | Source identifier (`culturax`) |

| Split | Rows |
|---|---|
| train | 4,707,451 |
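
The structure above can be checked on a small sample before committing to a full download. The following is a minimal sketch, assuming the repository is publicly loadable with the `datasets` library under its default configuration; the helper names `summarize_rows` and `load_sample` are hypothetical and not part of this repo.

```python
from collections import Counter
from typing import Iterable, Mapping

# Columns documented in the Dataset Structure table.
EXPECTED_COLUMNS = {"text", "source"}


def summarize_rows(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Check that each row has the documented columns; count rows per source."""
    counts: Counter = Counter()
    for row in rows:
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"row is missing columns: {sorted(missing)}")
        counts[row["source"]] += 1
    return counts


def load_sample(n: int = 1000) -> list:
    """Stream the first n rows of the train split (requires network access).

    Assumption: the repo loads with the default config and a `train` split.
    """
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(
        "Minuri/sinhala-corpus-culturax", split="train", streaming=True
    )
    return list(stream.take(n))
```

Calling `summarize_rows(load_sample(1000))` should, if the card is accurate, report every sampled row under the `culturax` source. Streaming avoids materializing all 4,707,451 rows just to inspect the schema.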
## Pipeline Position
`uonlp/CulturaX` → **this repo** → `Minuri/culturax_cleaned_version` → `Minuri/diverse_sinhala_dataset`
## Sources & Licenses
| Source | License |
|---|---|
| [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) | Inherits the mC4 and OSCAR licenses; access requires agreeing to share contact information on Hugging Face |
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/culturax_cleaned_version` | Cleaned version (3,684,137 sentences) |
| `Minuri/diverse_sinhala_dataset` | Final parent corpus |
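
The row counts on this card give a rough sense of how much the cleaning step filters out. The arithmetic below is a small sketch, assuming the cleaned repo is derived solely from this repo's `train` split, as the pipeline order suggests; the variable names are illustrative.

```python
RAW_ROWS = 4_707_451      # train split of this repo
CLEANED_ROWS = 3_684_137  # Minuri/culturax_cleaned_version

removed = RAW_ROWS - CLEANED_ROWS
retained_pct = 100 * CLEANED_ROWS / RAW_ROWS

print(f"removed:  {removed:,} sentences")   # removed:  1,023,314 sentences
print(f"retained: {retained_pct:.2f}%")     # retained: 78.26%
```

Under that assumption, cleaning drops roughly one sentence in five.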
Provider:
Minuri



