Minuri/madlad_cleaned_version
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Minuri/madlad_cleaned_version
Resource overview:
---
language:
- si
license: cc-by-sa-4.0
task_categories:
- text-generation
pretty_name: Sinhala Cleaned Sentences - MADLAD-400
size_categories:
- 1M<n<10M
tags:
- sinhala
- low-resource
- pretraining
- cleaned
- deduplicated
---
# Sinhala Cleaned Sentences - MADLAD-400
Cleaned and deduplicated Sinhala sentences derived from `Minuri/sinhala-corpus-madlad400` through a multi-stage cleaning pipeline. This repo also served as storage for the intermediate pipeline outputs; the final output is `stage10_final_corpus_deduped.csv`.
## Final Output
| File | Rows | Description |
|---|---|---|
| `stage10_final_corpus_deduped.csv` | 5,033,732 | Final cleaned and deduplicated sentences |
## Dataset Structure (final output)
| Column | Description |
|---|---|
| `sentence` | Cleaned Sinhala sentence |
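Since the final file has a single `sentence` column, it can be read row by row without loading all rows into memory. The following is a minimal sketch, not the author's tooling: it assumes a local copy of the CSV, UTF-8 encoding, and a header row, and the helper names `iter_sentences` and `preview` are hypothetical.

```python
import csv
from itertools import islice
from typing import Iterator, List


def iter_sentences(csv_path: str, column: str = "sentence") -> Iterator[str]:
    """Yield non-empty values of `column` from a CSV file, one row at a time.

    Assumes the file is UTF-8 encoded and starts with a header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        for row in reader:
            text = (row[column] or "").strip()
            if text:  # skip rows whose sentence cell is empty
                yield text


def preview(csv_path: str, n: int = 5) -> List[str]:
    """Return the first `n` sentences as a list, for a quick sanity check."""
    return list(islice(iter_sentences(csv_path), n))
```

Calling `preview("stage10_final_corpus_deduped.csv")` on a downloaded copy would return the first five sentences. One way to obtain that copy is `huggingface_hub.hf_hub_download` with `repo_type="dataset"`; that step is left out here to keep the sketch dependency-free.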
## Pipeline Stages
This repo stores the intermediate CSV files for stage2 through stage10 of the 12-stage cleaning pipeline, along with cleaning logs and filter reports.
## Pipeline Position
`Minuri/sinhala-corpus-madlad400` → **this repo** → `Minuri/diverse_sinhala_dataset`
## Sources & Licenses
| Source | License |
|---|---|
| [allenai/MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) | ODC-BY |
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/sinhala-corpus-madlad400` | Raw sentences (7,281,026) |
| `Minuri/diverse_sinhala_dataset` | Final parent corpus |
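The raw count listed here and the row count of the final CSV give a rough sense of how much the pipeline filtered out. The sketch below only does that arithmetic. It assumes both figures count sentence rows and that the difference is a net figure (stages may have split or merged rows along the way); the function name `retention` is hypothetical.

```python
RAW_SENTENCES = 7_281_026    # Minuri/sinhala-corpus-madlad400, as listed in this card
FINAL_SENTENCES = 5_033_732  # rows in stage10_final_corpus_deduped.csv


def retention(raw: int, final: int):
    """Return (net rows removed, fraction of rows kept)."""
    removed = raw - final
    return removed, final / raw


removed, kept_ratio = retention(RAW_SENTENCES, FINAL_SENTENCES)
print(f"net rows removed: {removed:,}")  # net rows removed: 2,247,294
print(f"rows kept: {kept_ratio:.1%}")    # rows kept: 69.1%
```

Under those assumptions, cleaning and deduplication together removed roughly 31% of the raw sentences.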
Provider:
Minuri



