Minuri/sinhala-sft-dataset

Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Minuri/sinhala-sft-dataset
Description:
---
language:
- si
license: cc-by-sa-3.0
task_categories:
- text-generation
- question-answering
pretty_name: Sinhala SFT Dataset
size_categories:
- 100K<n<1M
tags:
- sinhala
- low-resource
- instruction-tuning
- sft
- alpaca
- dolly
---

# Sinhala Supervised Fine-Tuning Dataset

A merged Sinhala instruction-following dataset of 213,703 pairs, used for Supervised Fine-Tuning (SFT) of continually pretrained LLaMA 3.2 1B variants. Constructed as part of a diversity-driven Sinhala language model adaptation study.

## Dataset Description

This dataset merges three existing Sinhala instruction datasets into a unified resource for SFT. It follows the standard Alpaca-style instruction–input–output format and covers a range of tasks, including question answering, summarization, and general instruction following.

### Source Datasets

| Source (value in `source` column) | Original Dataset |
|---|---|
| `ihalage_alpaca` | `ihalage/sinhala-instruction-finetune-large` |
| `dolly_sinhala` | `Suchinthana/databricks-dolly-15k-sinhala` |
| `alpaca_sinhala` | `sahanruwantha/alpaca-sinhala` |

### Dataset Structure

| Column | Type | Description |
|---|---|---|
| `instruction` | string | The instruction given to the model |
| `input` | string | Optional context or input for the instruction |
| `output` | string | The expected response |
| `source` | string | Source dataset identifier (3 values) |

### Splits

| Split | Rows |
|---|---|
| train | 203,000 |
| **Total** | **213,703** |

### Dataset Statistics

| Metric | Value |
|---|---|
| Total rows | 213,703 |
| Format | Parquet |
| Size | 167 MB |
| Language | Sinhala (`si`) |
| Sources | 3 |

## Intended Uses

- Supervised fine-tuning (SFT) of Sinhala LLMs
- Instruction-following research in Sinhala
- Low-resource multilingual SFT benchmarking

## Training Details

This dataset was used to fine-tune three LLaMA 3.2 1B model variants, each continually pretrained on a different Sinhala corpus.
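The card describes the rows as Alpaca-style instruction, input, and output fields but does not state the prompt template used for fine-tuning. The sketch below shows one common way such a row could be rendered into a single training string. The `### Instruction:` / `### Input:` / `### Response:` headers and the helper name `format_example` are assumptions for illustration, not the authors' confirmed template.

```python
def format_example(row: dict) -> str:
    """Render one dataset row as a single SFT training string.

    The section headers are an assumed Alpaca-like template; the dataset
    card does not specify the exact prompt format used for training.
    """
    instruction = row["instruction"].strip()
    # `input` is optional context, so treat a missing or None value as empty.
    context = (row.get("input") or "").strip()
    output = row["output"].strip()

    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{output}"
        )
    # Rows without context skip the Input section entirely.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
```

Dropping the input section for rows with empty context keeps the prompt from carrying an empty header, which is the usual convention for Alpaca-format data.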
## Related Repositories

| Repo | Description |
|---|---|
| `Minuri/sinhala-corpus-a-news-1m` | Pretraining corpus A (news-only) |
| `Minuri/sinhala-corpus-b-random-1m` | Pretraining corpus B (random) |
| `Minuri/sinhala-corpus-c-diverse-1m` | Pretraining corpus C (diversity-optimized) |
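To check how the three merged sources contribute to the data, the `source` column can be tallied after loading. This is a minimal sketch assuming the Hugging Face `datasets` library is installed and that the repository loads with its default configuration; the helper names `load_train_split` and `count_by_source` are hypothetical.

```python
from collections import Counter


def count_by_source(rows) -> Counter:
    """Count rows per value of the `source` column."""
    return Counter(row["source"] for row in rows)


def load_train_split():
    """Load the train split from the Hub (needs network and `pip install datasets`)."""
    # Imported lazily so count_by_source stays usable without the library.
    from datasets import load_dataset

    return load_dataset("Minuri/sinhala-sft-dataset", split="train")
```

Calling `count_by_source(load_train_split())` should return counts keyed by `ihalage_alpaca`, `dolly_sinhala`, and `alpaca_sinhala`, the three identifiers listed in the card.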
Provider:
Minuri