ngusadeep/Swahili-FineTome-20k
Source: Hugging Face; updated 2026-04-14; indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ngusadeep/Swahili-FineTome-20k
Resource description:
---
license: apache-2.0
language:
- sw
- en
tags:
- swahili
- kiswahili
- instruction-tuning
- alpaca
- sharegpt
- translation
- finetome
- gemma4
- unsloth
task_categories:
- text-generation
pretty_name: FineTome 20K Swahili (FineTome-20k-sw)
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_en
    dtype: string
  - name: output_en
    dtype: string
  - name: source
    dtype: string
  - name: lang
    dtype: string
  splits:
  - name: train
    num_bytes: 55371595
    num_examples: 17982
  download_size: 28754790
  dataset_size: 55371595
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# FineTome-20k-sw — Swahili Instruction Dataset
A Swahili instruction-following dataset of 17,982 instruction-response pairs, machine-translated from [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k) with GPT-4o-mini via the OpenAI Batch API and then quality-filtered. It is built for fine-tuning Swahili LLMs, particularly Gemma4 E2B and E24.
## Dataset Summary
| Property | Value |
|----------|-------|
| **Language** | Swahili (`sw`) + English originals (`en`) |
| **Size** | 17,982 instruction-response pairs |
| **Source** | `mlabonne/FineTome-100k` (20,000 rows sampled after filtering; 17,982 kept after the quality gate) |
| **Translation model** | GPT-4o-mini (OpenAI Batch API) |
| **License** | Apache 2.0 |
| **Task** | Instruction following, Q&A, summarization, creative writing |
## Dataset Creation
### Source Data
Rows from `mlabonne/FineTome-100k` were first filtered to remove:
- Code-heavy content (>30% code characters)
- Outputs under 20 words (too short)
- Outputs over 600 words (too long for translation quality)
79,664 rows passed filtering; 20,000 were sampled with even spacing for topic diversity.
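The selection step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the card does not say how "code characters" were measured, so the sketch assumes they are the characters inside triple-backtick fences, and it assumes the code check applies to the response text. The helper names (`code_fraction`, `passes_source_filter`, `evenly_spaced_sample`) are hypothetical.

```python
import re

# Assumption: "code characters" are approximated as characters inside
# triple-backtick fenced blocks; the card does not define the measure.
FENCE_RE = re.compile(r"```.*?```", re.DOTALL)


def code_fraction(text: str) -> float:
    """Share of characters that sit inside fenced code blocks."""
    if not text:
        return 0.0
    code_chars = sum(len(m.group(0)) for m in FENCE_RE.finditer(text))
    return code_chars / len(text)


def passes_source_filter(output: str) -> bool:
    """Apply the three stated thresholds to one English response."""
    n_words = len(output.split())
    return code_fraction(output) <= 0.30 and 20 <= n_words <= 600


def evenly_spaced_sample(rows: list, k: int) -> list:
    """Pick k rows at evenly spaced positions, preserving their order."""
    if k >= len(rows):
        return list(rows)
    step = len(rows) / k  # step > 1, so the chosen indices never repeat
    return [rows[int(i * step)] for i in range(k)]
```

With 79,664 filtered rows and `k=20000`, the stride is just under 4, so roughly every fourth row is kept across the whole filtered set.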
### Translation Pipeline
- **Model**: `gpt-4o-mini` via OpenAI Batch API (about 50% cheaper than synchronous requests)
- **System prompt**: Kiswahili sanifu — instructs the model to produce natural, fluent Swahili (not word-for-word translation)
- **Technical terms** (AI, model, data, algorithm) preserved in English
- **Response format**: JSON `{"instruction": "...", "output": "..."}`
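As a rough sketch of what one request in such a batch might look like: the Batch API takes a JSONL file in which each line carries a `custom_id`, `method`, `url`, and `body`. The system prompt text and the `row-<n>` ID scheme below are assumptions made for illustration; the card describes the prompt's intent but does not reproduce it.

```python
import json

# Hypothetical system prompt: the card describes its intent but does not
# reproduce the actual text.
SYSTEM_PROMPT = (
    "Translate the user's instruction and response into natural, fluent "
    "Kiswahili sanifu. Do not translate word for word. Keep technical terms "
    "such as AI, model, data and algorithm in English. Reply with JSON: "
    '{"instruction": "...", "output": "..."}'
)


def build_batch_request(row_id: int, instruction_en: str, output_en: str) -> str:
    """Return one JSONL line in the OpenAI Batch API request format."""
    user_payload = json.dumps(
        {"instruction": instruction_en, "output": output_en},
        ensure_ascii=False,
    )
    request = {
        "custom_id": f"row-{row_id}",  # hypothetical ID scheme
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_payload},
            ],
            # JSON mode; the prompt has to mention JSON for this to be accepted.
            "response_format": {"type": "json_object"},
        },
    }
    return json.dumps(request, ensure_ascii=False)


def parse_translation(content: str) -> dict:
    """Parse a model reply and keep only the two expected string fields."""
    data = json.loads(content)
    return {"instruction": str(data["instruction"]), "output": str(data["output"])}
```

Writing one such line per source row yields the input file for a batch job; `custom_id` is what lets each result be matched back to its English original.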
### Quality Filtering
After translation, each row was validated:
- Must contain ≥2 Swahili function word markers (`ni`, `na`, `kwa`, `katika`, etc.)
- Output length ratio vs English original must be in `[0.5, 2.5]`
- Must not be identical to the English source (untranslated)
**Result**: 17,982 / 20,000 rows passed (89.9% yield).
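The three checks above could be expressed along these lines. This is a sketch under stated assumptions rather than the validator actually used: the marker set contains only the four words named on the card (the full list is not given), markers are counted by occurrence rather than by distinct word, and the length ratio is measured in whitespace-separated words. The function names are hypothetical.

```python
import re

# Only the four markers named on the card; the full list ("etc.") is not given.
SWAHILI_MARKERS = {"ni", "na", "kwa", "katika"}
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def count_markers(text: str) -> int:
    """Count marker-word occurrences (assumption: occurrences, not distinct words)."""
    return sum(1 for w in WORD_RE.findall(text.lower()) if w in SWAHILI_MARKERS)


def passes_quality_gate(output_sw: str, output_en: str) -> bool:
    """Apply the marker, length-ratio, and untranslated checks to one row."""
    sw, en = output_sw.strip(), output_en.strip()
    if not sw or not en or sw == en:
        return False  # empty, or identical to the English source
    if count_markers(sw) < 2:
        return False
    # Assumption: the ratio is computed on word counts, not characters.
    ratio = len(sw.split()) / len(en.split())
    return 0.5 <= ratio <= 2.5
```

A cheap lexical gate like this catches untranslated or truncated rows, but it cannot judge fluency or accuracy of the Swahili itself.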
## Schema
```python
{
    "instruction": str,     # Swahili instruction
    "output": str,          # Swahili response
    "instruction_en": str,  # Original English instruction
    "output_en": str,       # Original English response
    "source": str,          # "FineTome-100k"
    "lang": str,            # "sw"
}
```
## Usage
### Load Dataset
```python
from datasets import load_dataset
ds = load_dataset("ngusadeep/FineTome-20k-sw", split="train")
print(ds[0])
```
### Fine-tune with Unsloth (ShareGPT format)
Use the companion ShareGPT dataset for direct Unsloth SFTTrainer compatibility:
```python
from datasets import load_dataset

ds = load_dataset("ngusadeep/FineTome-20k-sw-sharegpt", split="train")

# Each row:
# {
#     "conversations": [
#         {"from": "human", "value": "<Swahili instruction>"},
#         {"from": "gpt", "value": "<Swahili response>"},
#     ],
#     "lang": "sw",
#     "source": "FineTome-100k"
# }
```
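If you start from the instruction/output version instead, the same layout can be produced locally. The helper below is a hypothetical sketch of that mapping, not necessarily how the companion dataset was built:

```python
def to_sharegpt(row: dict) -> dict:
    """Map one instruction/output row onto the ShareGPT layout shown above."""
    return {
        "conversations": [
            {"from": "human", "value": row["instruction"]},
            {"from": "gpt", "value": row["output"]},
        ],
        "lang": row["lang"],
        "source": row["source"],
    }
```

With a loaded `datasets.Dataset`, this can be applied as `ds.map(to_sharegpt, remove_columns=ds.column_names)`, which drops the original columns and keeps only the three ShareGPT fields.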
### Example Row
```python
{
    "instruction": "Eleza jinsi Boolean operators zinavyofanya kazi katika programu.",
    "output": "Boolean operators ni waendeshaji wa kimantiki wanaotumika katika programu...",
    "instruction_en": "Explain what boolean operators are and how they work in programming.",
    "output_en": "Boolean operators are logical operators used in programming...",
    "source": "FineTome-100k",
    "lang": "sw"
}
```
## Intended Use
- **Fine-tuning Swahili LLMs**: Gemma4 E2B, Gemma4 E24, Qwen3.5, LLaMA3
- **Swahili NLP research**: instruction following, conversational AI
- **Benchmarking**: evaluating multilingual model Swahili capability
## Related Resources
| Resource | Link |
|----------|------|
| Fine-tuned Gemma4 E2B | [ngusadeep/gemma-4-2B-Swahili-llm](https://huggingface.co/ngusadeep/gemma-4-2B-Swahili-llm) |
| Fine-tuned Gemma4 E24 | [ngusadeep/gemma-4-24B-Swahili-llm](https://huggingface.co/ngusadeep/gemma-4-24B-Swahili-llm) |
| ShareGPT format | [ngusadeep/FineTome-20k-sw-sharegpt](https://huggingface.co/datasets/ngusadeep/FineTome-20k-sw-sharegpt) |
| Source dataset | [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) |
| Training code | [GitHub — Gemma4-Swahili](https://github.com/ngusadeep/Gemma4-Swahili) |
## Citation
```bibtex
@dataset{finetome_20k_sw_2026,
  author    = {Ngusa, Deep},
  title     = {FineTome-20k-sw: A Swahili Instruction Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ngusadeep/FineTome-20k-sw}
}
```
## Acknowledgements
- [mlabonne](https://huggingface.co/mlabonne) for the original FineTome-100k dataset
- OpenAI for GPT-4o-mini translation
- [Lengai AI Lab](https://huggingface.co/lengai-lab) — Swahili LLM Research
Provider:
ngusadeep



