
ngusadeep/Swahili-FineTome-20k

Hugging Face | Updated 2026-04-14 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ngusadeep/Swahili-FineTome-20k
Resource description:
---
license: apache-2.0
language:
- sw
- en
tags:
- swahili
- kiswahili
- instruction-tuning
- alpaca
- sharegpt
- translation
- finetome
- gemma4
- unsloth
task_categories:
- text-generation
pretty_name: FineTome 20K Swahili (FineTome-20k-sw)
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_en
    dtype: string
  - name: output_en
    dtype: string
  - name: source
    dtype: string
  - name: lang
    dtype: string
  splits:
  - name: train
    num_bytes: 55371595
    num_examples: 17982
  download_size: 28754790
  dataset_size: 55371595
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# FineTome-20k-sw: Swahili Instruction Dataset

A high-quality Swahili instruction-following dataset translated from [`mlabonne/FineTome-100k`](https://huggingface.co/datasets/mlabonne/FineTome-100k) using GPT-4o-mini via the OpenAI Batch API. Built for fine-tuning Swahili LLMs, particularly Gemma4 E2B and E24.

## Dataset Summary

| Property | Value |
|----------|-------|
| **Language** | Swahili (`sw`) + English originals (`en`) |
| **Size** | 17,982 instruction-response pairs |
| **Source** | `mlabonne/FineTome-100k` (best 20K filtered → 17,982 after quality gate) |
| **Translation model** | GPT-4o-mini (OpenAI Batch API) |
| **License** | Apache 2.0 |
| **Task** | Instruction following, Q&A, summarization, creative writing |

## Dataset Creation

### Source Data

Selected the best 20,000 rows from `mlabonne/FineTome-100k` by filtering out:

- Code-heavy content (>30% code characters)
- Outputs under 20 words (too short)
- Outputs over 600 words (too long for translation quality)

79,664 rows passed filtering; 20,000 were sampled with even spacing for topic diversity.
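
The selection rules above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the card does not say how "code characters" were measured or how the even spacing was computed, so `code_fraction` (share of characters inside fenced blocks), `keep_row`, and `sample_evenly` are all hypothetical helpers built on those assumptions.

```python
import re


def code_fraction(text: str) -> float:
    """Share of characters inside ``` fenced blocks (assumed definition of 'code characters')."""
    if not text:
        return 0.0
    fenced = re.findall(r"```.*?```", text, flags=re.DOTALL)
    return sum(len(block) for block in fenced) / len(text)


def keep_row(instruction: str, output: str) -> bool:
    """Apply the three source filters: output length bounds and the 30% code ceiling."""
    words = len(output.split())
    if words < 20 or words > 600:
        return False
    return code_fraction(instruction + output) <= 0.30


def sample_evenly(rows: list, k: int) -> list:
    """Pick k rows at evenly spaced indices (one assumed reading of 'even spacing')."""
    n = len(rows)
    if k >= n:
        return list(rows)
    return [rows[(i * n) // k] for i in range(k)]
```

With these helpers, `sample_evenly([r for r in rows if keep_row(r["instruction"], r["output"])], 20_000)` would mirror the described two-step selection, assuming each row is a dict with `instruction` and `output` keys.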

### Translation Pipeline

- **Model**: `gpt-4o-mini` via OpenAI Batch API (50% cost reduction)
- **System prompt**: Kiswahili sanifu; instructs the model to produce natural, fluent Swahili (not word-for-word translation)
- **Technical terms** (AI, model, data, algorithm) preserved in English
- **Response format**: JSON `{"instruction": "...", "output": "..."}`

### Quality Filtering

After translation, each row was validated:

- Must contain ≥2 Swahili function word markers (`ni`, `na`, `kwa`, `katika`, etc.)
- Output length ratio vs English original must be in `[0.5, 2.5]`
- Must not be identical to the English source (untranslated)

**Result**: 17,982 / 20,000 rows passed (89.9% yield).

## Schema

```python
{
    "instruction": str,     # Swahili instruction
    "output": str,          # Swahili response
    "instruction_en": str,  # Original English instruction
    "output_en": str,       # Original English response
    "source": str,          # "FineTome-100k"
    "lang": str,            # "sw"
}
```

## Usage

### Load Dataset

```python
from datasets import load_dataset

ds = load_dataset("ngusadeep/FineTome-20k-sw", split="train")
print(ds[0])
```

### Fine-tune with Unsloth (ShareGPT format)

Use the companion ShareGPT dataset for direct Unsloth SFTTrainer compatibility:

```python
from datasets import load_dataset

ds = load_dataset("ngusadeep/FineTome-20k-sw-sharegpt", split="train")

# Each row:
# {
#   "conversations": [
#     {"from": "human", "value": "<Swahili instruction>"},
#     {"from": "gpt", "value": "<Swahili response>"},
#   ],
#   "lang": "sw",
#   "source": "FineTome-100k"
# }
```

### Example Row

```python
{
    "instruction": "Eleza jinsi Boolean operators zinavyofanya kazi katika programu.",
    "output": "Boolean operators ni waendeshaji wa kimantiki wanaotumika katika programu...",
    "instruction_en": "Explain what boolean operators are and how they work in programming.",
    "output_en": "Boolean operators are logical operators used in programming...",
    "source": "FineTome-100k",
    "lang": "sw"
}
```

## Intended Use

- **Fine-tuning Swahili LLMs**: Gemma4 E2B, Gemma4 E24, Qwen3.5, LLaMA3
- **Swahili NLP research**: instruction following, conversational AI
- **Benchmarking**: evaluating multilingual model Swahili capability

## Related Resources

| Resource | Link |
|----------|------|
| Fine-tuned Gemma4 E2B | [ngusadeep/gemma-4-2B-Swahili-llm](https://huggingface.co/ngusadeep/gemma-4-2B-Swahili-llm) |
| Fine-tuned Gemma4 E24 | [ngusadeep/gemma-4-24B-Swahili-llm](https://huggingface.co/ngusadeep/gemma-4-24B-Swahili-llm) |
| ShareGPT format | [ngusadeep/FineTome-20k-sw-sharegpt](https://huggingface.co/datasets/ngusadeep/FineTome-20k-sw-sharegpt) |
| Source dataset | [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) |
| Training code | [GitHub: Gemma4-Swahili](https://github.com/ngusadeep/Gemma4-Swahili) |

## Citation

```bibtex
@dataset{finetome_20k_sw_2026,
  author    = {Ngusa, Deep},
  title     = {FineTome-20k-sw: A Swahili Instruction Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ngusadeep/FineTome-20k-sw}
}
```

## Acknowledgements

- [mlabonne](https://huggingface.co/mlabonne) for the original FineTome-100k dataset
- OpenAI for GPT-4o-mini translation
- [Lengai AI Lab](https://huggingface.co/lengai-lab): Swahili LLM Research
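
For readers who want to reproduce something like the post-translation quality gate described under Quality Filtering, here is a rough sketch. Several details are assumptions rather than facts from the card: the marker set contains only the four words the card names (the full list is not given), "≥2 markers" is read as two distinct markers, the length ratio is measured in whitespace-separated words, and only the `output` field is checked. `marker_count` and `passes_quality_gate` are hypothetical names.

```python
import re

# Only the four markers named in the card; the actual list is longer but unpublished here.
SWAHILI_MARKERS = {"ni", "na", "kwa", "katika"}


def marker_count(text: str) -> int:
    """Count distinct Swahili function-word markers present in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return len(tokens & SWAHILI_MARKERS)


def passes_quality_gate(output_sw: str, output_en: str) -> bool:
    """Return True if a translated output passes all three checks."""
    # Reject rows that came back untranslated.
    if output_sw.strip() == output_en.strip():
        return False
    # Require at least two distinct function-word markers (assumed interpretation).
    if marker_count(output_sw) < 2:
        return False
    # Length ratio in words (assumed unit) must fall within [0.5, 2.5].
    en_words = len(output_en.split())
    if en_words == 0:
        return False
    ratio = len(output_sw.split()) / en_words
    return 0.5 <= ratio <= 2.5
```

Because the published rows keep both `output` and `output_en`, a check like `passes_quality_gate(row["output"], row["output_en"])` can also serve as a sanity test on the released data, with the caveat that the shortened marker list may reject rows the original gate accepted.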
Provided by:
ngusadeep