NaolBM/Kiya-SFT

Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NaolBM/Kiya-SFT
Resource description:
---
task_categories:
- question-answering
language:
- am
- en
- om
- yo
- sw
- ti
- ha
size_categories:
- 100K<n<1M
license: mit
pretty_name: Kiya-SFT
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: text
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 1919438418
    num_examples: 747307
  download_size: 950374054
  dataset_size: 1919438418
tags:
- sft
- post-training
- llm
---

# Kiya-SFT

This dataset is a collection of single-turn and multi-turn conversational data designed for Supervised Fine-Tuning (SFT) of large language models, specifically focusing on supporting multiple African languages alongside English. It is intended to train models to be helpful and friendly assistants capable of understanding and generating responses across a diverse linguistic landscape.

## Dataset Description

Kiya-SFT combines several existing instruction-following and conversational datasets, meticulously processed to a unified `text` column containing conversational turns and a `language` column indicating the primary language of each conversation. A system prompt, "you are kiya, a helpful and friendly assistant", is prepended to each conversation to guide the model's persona during fine-tuning.

### Languages

The dataset includes conversations in the following languages:

- English (`en`)
- Swahili (`sw`)
- Oromo (`om`)
- Yoruba (`yo`)
- Amharic (`am`)
- Tigrinya (`ti`)
- Hausa (`ha`)

### Data Structure

Each entry in the dataset is a dictionary with two fields:

- `text`: A list of dictionaries, where each inner dictionary represents a turn in a conversation. Each turn has a `role` (e.g., "system", "user", "assistant") and `content` (the message). Example:
  ```json
  [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you for asking! How can I help you today?"}
  ]
  ```
- `language`: A string representing the ISO 639-1 language code of the conversation (e.g., "en", "sw", "am").

### Dataset Statistics

- **Total Conversations**: 518,500 (based on the last execution of the notebook)

## Usage

You can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("NaolBM/Kiya-SFT")

# To access a specific split (e.g., 'train')
train_dataset = dataset["train"]

# To inspect an example
print(train_dataset[0])
```
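### Checking Records and Counting Languages

The record layout described above can be sanity-checked before fine-tuning. The following is a minimal sketch, not part of the original card: the helper names (`is_well_formed`, `count_by_language`, `is_multi_turn`) and the sample records are hypothetical, and the allowed role set is an assumption based on the roles the card lists as examples. It runs on in-memory dictionaries so it does not need to download the dataset.

```python
from collections import Counter

# System prompt quoted in the dataset description.
SYSTEM_PROMPT = "you are kiya, a helpful and friendly assistant"

# Assumption: only the three roles named as examples in the card occur.
ALLOWED_ROLES = {"system", "user", "assistant"}

# Hypothetical records in the documented shape (not copied from the dataset).
sample_records = [
    {
        "text": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
        ],
        "language": "en",
    },
    {
        "text": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Habari yako?"},
            {"role": "assistant", "content": "Nzuri, asante!"},
        ],
        "language": "sw",
    },
]


def is_well_formed(record):
    """Return True if a record has a list of role/content turns and a language string."""
    turns = record.get("text")
    if not isinstance(turns, list) or not isinstance(record.get("language"), str):
        return False
    return all(
        isinstance(turn, dict)
        and turn.get("role") in ALLOWED_ROLES
        and isinstance(turn.get("content"), str)
        for turn in turns
    )


def count_by_language(records):
    """Count conversations per value of the `language` field."""
    return Counter(record["language"] for record in records)


def is_multi_turn(record):
    """Treat a conversation as multi-turn if it has more than one user message."""
    return sum(1 for turn in record["text"] if turn["role"] == "user") > 1


if __name__ == "__main__":
    print(all(is_well_formed(r) for r in sample_records))
    print(count_by_language(sample_records))
```

The same helpers should work on rows of the loaded split, since each row is a plain dictionary; for example, `train_dataset.filter(lambda row: row["language"] == "sw")` would keep only the Swahili conversations.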

Provider: NaolBM