NaolBM/Kiya-SFT
Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NaolBM/Kiya-SFT
Resource description:
---
task_categories:
- question-answering
language:
- am
- en
- om
- yo
- sw
- ti
- ha
size_categories:
- 100K<n<1M
license: mit
pretty_name: Kiya-SFT
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: text
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 1919438418
    num_examples: 747307
  download_size: 950374054
  dataset_size: 1919438418
tags:
- sft
- post-training
- llm
---
# Kiya-SFT
Kiya-SFT is a collection of single-turn and multi-turn conversations for Supervised Fine-Tuning (SFT) of large language models, covering six African languages alongside English. It is intended to train helpful, friendly assistants that can understand and respond in all of these languages.
## Dataset Description
Kiya-SFT combines several existing instruction-following and conversational datasets, processed into a unified schema: a `text` column holding the conversation turns and a `language` column giving the primary language of each conversation. A system prompt, "you are kiya, a helpful and friendly assistant", is prepended to each conversation to guide the model's persona during fine-tuning.
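The prepending step can be pictured with a minimal sketch. The helper name `with_system_prompt` is hypothetical and the logic is an assumption about how such a step might look, not the notebook's actual code; only the prompt wording is taken from this card.

```python
# Prompt wording quoted from this card; everything else is illustrative.
SYSTEM_PROMPT = "you are kiya, a helpful and friendly assistant"


def with_system_prompt(turns, prompt=SYSTEM_PROMPT):
    """Return a new turn list that starts with a system turn (hypothetical helper)."""
    # Leave conversations that already start with a system turn unchanged.
    if turns and turns[0].get("role") == "system":
        return list(turns)
    return [{"role": "system", "content": prompt}] + list(turns)


conversation = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
]
prepared = with_system_prompt(conversation)
print(prepared[0])
```

The helper returns a new list instead of mutating its input, so the same source conversation can be reused with a different prompt.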
### Languages
The dataset includes conversations in the following languages:
- English (`en`)
- Swahili (`sw`)
- Oromo (`om`)
- Yoruba (`yo`)
- Amharic (`am`)
- Tigrinya (`ti`)
- Hausa (`ha`)
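To check how rows are distributed over these codes, a small tally like the one below may help. It is a sketch that works on any iterable of rows shaped like this dataset; `count_by_language` and `sample_rows` are illustrative names, not part of the dataset or the `datasets` library.

```python
from collections import Counter

# Language codes listed on this card.
SUPPORTED_LANGUAGES = {"en", "sw", "om", "yo", "am", "ti", "ha"}


def count_by_language(rows):
    """Count rows per `language` value and report codes not listed on the card."""
    counts = Counter(row["language"] for row in rows)
    unexpected = set(counts) - SUPPORTED_LANGUAGES
    return counts, unexpected


# Tiny in-memory stand-in for real rows.
sample_rows = [
    {"text": [], "language": "am"},
    {"text": [], "language": "sw"},
    {"text": [], "language": "am"},
]
counts, unexpected = count_by_language(sample_rows)
print(counts.most_common(), unexpected)
```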
### Data Structure
Each entry in the dataset is a dictionary with two fields:
- `text`: A list of dictionaries, where each inner dictionary represents a turn in a conversation. Each turn has a `role` (e.g., "system", "user", "assistant") and `content` (the message).
Example:
```json
[
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing well, thank you for asking! How can I help you today?"}
]
```
- `language`: A string representing the ISO 639-1 language code of the conversation (e.g., "en", "sw", "am").
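The structure above can be checked with a short validation sketch. The function `validate_record` is hypothetical, and the set of allowed roles is an assumption, since the card gives "system", "user", and "assistant" only as examples.

```python
# Assumption: the card lists these roles as examples, not as an exhaustive set.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    turns = record.get("text")
    if not isinstance(turns, list) or not turns:
        problems.append("`text` must be a non-empty list of turns")
        turns = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index} is not a dictionary")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {index} has unexpected role {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            problems.append(f"turn {index} has non-string content")
    if not isinstance(record.get("language"), str):
        problems.append("`language` must be a string")
    return problems


good = {
    "text": [
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    ],
    "language": "en",
}
print(validate_record(good))
```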
### Dataset Statistics
- **Total Conversations**: 518,500 (based on the last execution of the notebook). The card metadata lists 747,307 examples in the `train` split, so this figure may be out of date.
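For a rough sense of scale, the numbers in the card metadata can be combined as in the sketch below. The constants are copied from the metadata block. Reading `download_size` as the size of the files fetched and `dataset_size` as the size of the prepared data is an assumption about the usual Hugging Face convention.

```python
# Values copied from the card metadata.
NUM_EXAMPLES = 747_307
DATASET_SIZE_BYTES = 1_919_438_418
DOWNLOAD_SIZE_BYTES = 950_374_054
# Value stated in the statistics section.
README_TOTAL = 518_500

avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES
count_gap = NUM_EXAMPLES - README_TOTAL

print(f"average size per example: {avg_bytes_per_example:.0f} bytes")
print(f"download size / dataset size: {download_ratio:.2f}")
print(f"metadata count minus stated total: {count_gap}")
```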
## Usage
You can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("NaolBM/Kiya-SFT")
# To access a specific split (e.g., 'train')
train_dataset = dataset["train"]
# To inspect an example
print(train_dataset[0])
```
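Because the download is close to 1 GB, it may be convenient to stream the data and keep only a few rows in one language. The sketch below uses a hypothetical helper, `take_language`, that works on any iterable of rows. The commented lines show how it could be combined with the `streaming=True` option of `load_dataset`, which needs network access.

```python
from itertools import islice


def take_language(rows, language, limit):
    """Return up to `limit` rows whose `language` field equals `language`."""
    matching = (row for row in rows if row["language"] == language)
    return list(islice(matching, limit))


# With the real dataset (requires network access and the `datasets` package):
#   from datasets import load_dataset
#   stream = load_dataset("NaolBM/Kiya-SFT", split="train", streaming=True)
#   first_amharic = take_language(stream, "am", limit=5)

# In-memory stand-in so the sketch runs offline.
demo_rows = [
    {"text": [{"role": "user", "content": "Habari?"}], "language": "sw"},
    {"text": [{"role": "user", "content": "Hello"}], "language": "en"},
    {"text": [{"role": "user", "content": "Asante"}], "language": "sw"},
]
swahili_rows = take_language(demo_rows, "sw", limit=1)
print(swahili_rows)
```

The helper stops consuming rows once `limit` matches are found, so a streamed dataset is not read to the end.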
Provider:
NaolBM



