Minuri/sinhala-sft-dataset
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-sft-dataset
Resource description:
---
language:
- si
license: cc-by-sa-3.0
task_categories:
- text-generation
- question-answering
pretty_name: Sinhala SFT Dataset
size_categories:
- 100K<n<1M
tags:
- sinhala
- low-resource
- instruction-tuning
- sft
- alpaca
- dolly
---
# Sinhala Supervised Fine-Tuning Dataset
A merged Sinhala instruction-following dataset of 213,703 pairs, used for Supervised Fine-Tuning (SFT) of continually pretrained LLaMA 3.2 1B variants. Constructed as part of a diversity-driven Sinhala language model adaptation study.
## Dataset Description
This dataset merges three existing Sinhala instruction datasets into a unified resource for SFT. It follows the standard Alpaca-style instruction–input–output format and covers a range of tasks including question answering, summarization and general instruction following.
### Source Datasets
| Source (value in `source` column) | Original Dataset |
|---|---|
| `ihalage_alpaca` | `ihalage/sinhala-instruction-finetune-large` |
| `dolly_sinhala` | `Suchinthana/databricks-dolly-15k-sinhala` |
| `alpaca_sinhala` | `sahanruwantha/alpaca-sinhala` |
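Because every row carries one of the three `source` identifiers above, the merged data can be inspected or split back out per origin. The following is a minimal sketch, not the authors' tooling: the helper names (`count_by_source`, `filter_by_source`, `load_train_split`) are hypothetical, and `load_train_split` assumes the repository can be read with the Hugging Face `datasets` library and that network access is available.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# The three identifiers listed in the Source Datasets table.
KNOWN_SOURCES = ("ihalage_alpaca", "dolly_sinhala", "alpaca_sinhala")


def count_by_source(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many rows come from each value of the `source` column."""
    return Counter(row["source"] for row in rows)


def filter_by_source(rows: Iterable[Mapping[str, str]], source: str) -> List[Mapping[str, str]]:
    """Keep only the rows whose `source` column equals the given identifier."""
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    return [row for row in rows if row["source"] == source]


def load_train_split():
    """Assumption: the repo loads via the `datasets` library (needs network)."""
    from datasets import load_dataset  # imported lazily so the helpers above stay stdlib-only

    return load_dataset("Minuri/sinhala-sft-dataset", split="train")
```

Both helpers accept any iterable of dict-like rows, so they work on a loaded split as well as on a small hand-built sample, for example `count_by_source(load_train_split())`.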
### Dataset Structure
| Column | Type | Description |
|---|---|---|
| `instruction` | string | The instruction given to the model |
| `input` | string | Optional context or input for the instruction |
| `output` | string | The expected response |
| `source` | string | Source dataset identifier (3 values) |
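For SFT, the `instruction`, `input`, and `output` columns are typically rendered into a single training string. The sketch below shows one way to do that. The prompt template is an assumption (a common Alpaca-style layout); the card does not state which template was used in the study, and the names `build_prompt` and `build_training_text` are hypothetical.

```python
from typing import Mapping, Optional

# Assumed templates: a common Alpaca-style layout, not confirmed by the card.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(row: Mapping[str, Optional[str]]) -> str:
    """Render the prompt part of a row, omitting the Input section when it is empty."""
    instruction = (row["instruction"] or "").strip()
    context = (row.get("input") or "").strip()  # `input` is optional and may be empty or None
    if context:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def build_training_text(row: Mapping[str, Optional[str]]) -> str:
    """Concatenate the prompt with the expected response from the `output` column."""
    return build_prompt(row) + (row["output"] or "").strip()
```

Keeping the prompt and the response in separate functions makes it straightforward to mask the prompt tokens from the loss, should the training setup require that.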
### Splits
| Split | Rows |
|---|---|
| train | 203,000 |
| **Total** | **213,703** |
### Dataset Statistics
| Metric | Value |
|---|---|
| Total rows | 213,703 |
| Format | Parquet |
| Size | 167 MB |
| Language | Sinhala (`si`) |
| Sources | 3 |
## Intended Uses
- Supervised fine-tuning (SFT) of Sinhala LLMs
- Instruction-following research in Sinhala
- Low-resource multilingual SFT benchmarking
## Training Details
This dataset was used to fine-tune three LLaMA 3.2 1B model variants, each continually pretrained on a different Sinhala corpus.
## Related Repositories
| Repo | Description |
|---|---|
| `Minuri/sinhala-corpus-a-news-1m` | Pretraining corpus A (news-only) |
| `Minuri/sinhala-corpus-b-random-1m` | Pretraining corpus B (random) |
| `Minuri/sinhala-corpus-c-diverse-1m` | Pretraining corpus C (diversity-optimized) |
Provider:
Minuri



