Minuri/sinhala-sft-dataset

Name: Minuri/sinhala-sft-dataset
Creator: Minuri
Published: 2026-04-03 15:19:26
License: cc-by-sa-3.0

Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Minuri/sinhala-sft-dataset

Description:

---
language:
- si
license: cc-by-sa-3.0
task_categories:
- text-generation
- question-answering
pretty_name: Sinhala SFT Dataset
size_categories:
- 100K<n<1M
tags:
- sinhala
- low-resource
- instruction-tuning
- sft
- alpaca
- dolly
---

# Sinhala Supervised Fine-Tuning Dataset

A merged Sinhala instruction-following dataset of 213,703 pairs, used for Supervised Fine-Tuning (SFT) of continually pretrained LLaMA 3.2 1B variants. Constructed as part of a diversity-driven Sinhala language model adaptation study.

## Dataset Description

This dataset merges three existing Sinhala instruction datasets into a unified resource for SFT. It follows the standard Alpaca-style instruction–input–output format and covers a range of tasks, including question answering, summarization, and general instruction following.

### Source Datasets

| Source (value in `source` column) | Original Dataset |
|---|---|
| `ihalage_alpaca` | `ihalage/sinhala-instruction-finetune-large` |
| `dolly_sinhala` | `Suchinthana/databricks-dolly-15k-sinhala` |
| `alpaca_sinhala` | `sahanruwantha/alpaca-sinhala` |

### Dataset Structure

| Column | Type | Description |
|---|---|---|
| `instruction` | string | The instruction given to the model |
| `input` | string | Optional context or input for the instruction |
| `output` | string | The expected response |
| `source` | string | Source dataset identifier (3 values) |

### Splits

| Split | Rows |
|---|---|
| train | 203,000 |
| **Total** | **213,703** |

### Dataset Statistics

| Metric | Value |
|---|---|
| Total rows | 213,703 |
| Format | Parquet |
| Size | 167 MB |
| Language | Sinhala (`si`) |
| Sources | 3 |

## Intended Uses

- Supervised fine-tuning (SFT) of Sinhala LLMs
- Instruction-following research in Sinhala
- Low-resource multilingual SFT benchmarking

## Training Details

This dataset was used to fine-tune three LLaMA 3.2 1B model variants, each continually pretrained on a different Sinhala corpus.

## Related Repositories

| Repo | Description |
|---|---|
| `Minuri/sinhala-corpus-a-news-1m` | Pretraining corpus A (news-only) |
| `Minuri/sinhala-corpus-b-random-1m` | Pretraining corpus B (random) |
| `Minuri/sinhala-corpus-c-diverse-1m` | Pretraining corpus C (diversity-optimized) |
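
## Example: Building Prompts from Rows

The four columns described under Dataset Structure can be turned into SFT prompts roughly as sketched below. This is a minimal illustration, not the authors' training code: the prompt wording is the commonly used Alpaca template, which is an assumption (the card only says the format is "Alpaca-style"), and `build_prompt`, `count_by_source`, and `sample_rows` are hypothetical names introduced here. The sample rows are placeholders that mimic the schema, not real dataset content.

```python
from collections import Counter

# Assumption: the widely used Alpaca prompt wording. The dataset card does not
# state which template was actually used for fine-tuning.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(row):
    """Return the prompt text for one row; the `input` column is optional."""
    context = (row.get("input") or "").strip()
    if context:
        return PROMPT_WITH_INPUT.format(instruction=row["instruction"], input=context)
    return PROMPT_NO_INPUT.format(instruction=row["instruction"])


def count_by_source(rows):
    """Count rows per value of the `source` column."""
    return Counter(row["source"] for row in rows)


# Hypothetical rows that only mimic the documented schema (not real data).
sample_rows = [
    {"instruction": "Summarize the passage.", "input": "<Sinhala passage>",
     "output": "<Sinhala summary>", "source": "dolly_sinhala"},
    {"instruction": "Name a colour.", "input": "",
     "output": "<Sinhala answer>", "source": "alpaca_sinhala"},
]

for row in sample_rows:
    # The full training text is the prompt followed by the expected response.
    print(build_prompt(row) + row["output"])
print(count_by_source(sample_rows))
```

With the Hugging Face `datasets` library installed, the real rows could presumably be obtained with `load_dataset("Minuri/sinhala-sft-dataset", split="train")` and passed to the same two functions. Counting by `source` is a quick way to check how the three merged datasets are represented.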

Provided by:

Minuri
