Minuri/sinhala-corpus-madlad400
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-corpus-madlad400
Resource description:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Sinhala Raw Sentences - MADLAD-400
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- raw
---
# Sinhala Raw Sentences - MADLAD-400
Raw Sinhala sentences extracted and sentence-split from the `allenai/MADLAD-400` dataset. This is an intermediate dataset used in the construction of `Minuri/diverse_sinhala_dataset`.
## Dataset Structure
| Column | Description |
|---|---|
| `text` | Raw Sinhala sentence |
| `source` | Source identifier (`madlad`) |

| Split | Rows |
|---|---|
| train | 7,281,026 |
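
The schema above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: the helper names `stream_rows` and `check_row` are hypothetical, and streaming mode is an assumption chosen to avoid downloading all 7,281,026 rows at once.

```python
from typing import Iterator

REPO_ID = "Minuri/sinhala-corpus-madlad400"
EXPECTED_COLUMNS = ("text", "source")  # columns documented in this card


def stream_rows(repo_id: str = REPO_ID, split: str = "train") -> Iterator[dict]:
    """Lazily yield rows from the Hub (hypothetical helper).

    Requires `pip install datasets` and network access.
    """
    # Imported here so the rest of the module works without the library.
    from datasets import load_dataset

    yield from load_dataset(repo_id, split=split, streaming=True)


def check_row(row: dict) -> bool:
    """Return True if a row matches the documented schema (hypothetical helper)."""
    return (
        all(col in row for col in EXPECTED_COLUMNS)
        and isinstance(row["text"], str)
        and row["source"] == "madlad"
    )
```

Calling `next(stream_rows())` should return a single dictionary with the `text` and `source` keys, which `check_row` can then validate before any further processing.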
## Pipeline Position
`allenai/MADLAD-400` → **this repo** → `Minuri/madlad_cleaned_version` → `Minuri/diverse_sinhala_dataset`
## Sources & Licenses
| Source | License |
|---|---|
| [allenai/MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY |
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/madlad_cleaned_version` | Cleaned version (5,033,732 sentences) |
| `Minuri/diverse_sinhala_dataset` | Final parent corpus |
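
The row counts in this card give a rough sense of how much the cleaning step removes. The arithmetic below is a sketch that assumes the cleaned repository was derived only from this one and that one row equals one sentence; neither assumption is confirmed beyond the pipeline description above.

```python
RAW_ROWS = 7_281_026      # train split of this repo
CLEANED_ROWS = 5_033_732  # sentences reported for Minuri/madlad_cleaned_version

# Assumption: every cleaned sentence originates from a row in this repo.
removed = RAW_ROWS - CLEANED_ROWS
retained_pct = 100 * CLEANED_ROWS / RAW_ROWS

print(f"removed:  {removed:,} rows")
print(f"retained: {retained_pct:.1f}% of raw rows")
```

Under these assumptions, cleaning drops 2,247,294 rows and keeps roughly 69% of the raw sentences.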
Provider: Minuri