Nadhari/Swahili-Thinking
Source: Hugging Face. Updated 2025-11-23, indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/Nadhari/Swahili-Thinking
Description:
---
dataset_info:
  features:
  - name: reasoning_language
    dtype: string
  - name: developer
    dtype: string
  - name: user
    dtype: string
  - name: analysis
    dtype: string
  - name: final
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
    - name: thinking
      dtype: string
  splits:
  - name: train
    num_bytes: 1281981
    num_examples: 166
  download_size: 741779
  dataset_size: 1281981
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- sw
size_categories:
- n<1K
---
# Swahili Thinking Dataset
**The first Swahili dataset for chain-of-thought reasoning.**
This dataset contains 166 examples of conversational AI responses with explicit chain-of-thought reasoning in Swahili. It is derived from the [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking) dataset, with English examples machine-translated into Swahili using GPT-5 Pro.
## Dataset Summary
Swahili-Thinking is a reasoning dataset in which both the chain-of-thought and the final responses have been translated from English to Swahili. It was created by sampling 200 English examples from the **Multilingual-Thinking** dataset and translating them with GPT-5 Pro. After 34 examples were excluded, 166 Swahili reasoning examples remained.
This dataset enables training language models to perform explicit reasoning in Swahili before generating responses, similar to how humans think through problems step-by-step before answering.
## Loading the Dataset
You can load the dataset using:
```python
from datasets import load_dataset
ds = load_dataset("Nadhari/Swahili-Thinking", split="train")
# Access first example
example = ds[0]
print(example['user']) # User query in Swahili
print(example['analysis']) # Chain-of-thought reasoning in Swahili
print(example['final']) # Final response in Swahili
```
## Dataset Structure
### Data Fields
Each example contains 6 fields following the Harmony response format:
| Field | Type | Description |
|-------|------|-------------|
| `reasoning_language` | string | Always "Swahili" |
| `developer` | string | System prompt in Swahili defining the assistant's role |
| `user` | string | User query in Swahili |
| `analysis` | string | Chain-of-thought reasoning process in Swahili (the "thinking") |
| `final` | string | Final response to user in Swahili |
| `messages` | list | Formatted conversation with 3 messages (system, user, assistant), where the assistant message includes a `thinking` field |
### Message Format
The `messages` field follows a structure similar to OpenAI's messages format, with one important addition: the `assistant` turn has a `thinking` field that holds the model's reasoning process in Swahili, alongside the `content` field that holds the final response to the user. On the system and user turns, `thinking` is null.
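A minimal sketch of reading that structure, assuming each message is a plain dict with `role`, `content`, and `thinking` keys as listed in the schema. The helper name `split_assistant_turn` and the shortened sample conversation are illustrative and not part of the dataset.

```python
from typing import Optional, Tuple


def split_assistant_turn(messages: list) -> Tuple[Optional[str], str]:
    """Return (thinking, content) from the first assistant message."""
    for message in messages:
        if message["role"] == "assistant":
            # `thinking` is None on system and user turns, so only the
            # assistant turn carries the reasoning text.
            return message.get("thinking"), message["content"]
    raise ValueError("no assistant message found")


# Hypothetical, shortened conversation in the dataset's message layout.
sample_messages = [
    {"role": "system", "content": "reasoning language: Swahili", "thinking": None},
    {"role": "user", "content": "Habari?", "thinking": None},
    {"role": "assistant", "content": "Nzuri!", "thinking": "Mtumiaji ananisalimu."},
]

thinking, content = split_assistant_turn(sample_messages)
print(thinking)
print(content)
```

With a loaded example, the same call would be `split_assistant_turn(example["messages"])`.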
### Example
```json
{
  "reasoning_language": "Swahili",
  "developer": "Wewe ni msaidizi mahiri anayeweza kujibu maswali ya huduma kwa wateja",
  "user": "Je, unaweza kunipa orodha ya mifululizo iliyokadiriwa juu kwa sasa kwenye Netflix?",
  "analysis": "Sawa, mtumiaji anauliza kuhusu mifululizo iliyokadiriwa juu kwa sasa kwenye Netflix...",
  "final": "Netflix hatoi hadharani orodha za wakati halisi za mifululizo yake iliyokadiriwa juu...",
  "messages": [
    {
      "role": "system",
      "content": "reasoning language: Swahili\n\nWewe ni msaidizi mahiri...",
      "thinking": null
    },
    {
      "role": "user",
      "content": "Je, unaweza kunipa orodha ya...",
      "thinking": null
    },
    {
      "role": "assistant",
      "content": "Netflix hatoi hadharani...",
      "thinking": "Sawa, mtumiaji anauliza..."
    }
  ]
}
```
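The example suggests that the flat fields and the `messages` list carry the same text. A small sketch of a consistency check follows, under the assumption (inferred only from the example above, not confirmed for every row) that the system turn ends with the `developer` prompt and that the assistant turn mirrors `analysis` and `final`. The function name `find_mismatches` and the sample record are hypothetical.

```python
def find_mismatches(example: dict) -> list:
    """Return the names of flat fields that disagree with `messages`."""
    by_role = {m["role"]: m for m in example["messages"]}
    problems = []
    # Assumption: the system turn is a "reasoning language: ..." prefix
    # followed by the developer prompt, as in the example shown above.
    if not by_role["system"]["content"].endswith(example["developer"]):
        problems.append("developer")
    if by_role["user"]["content"] != example["user"]:
        problems.append("user")
    if by_role["assistant"]["thinking"] != example["analysis"]:
        problems.append("analysis")
    if by_role["assistant"]["content"] != example["final"]:
        problems.append("final")
    return problems


# Hypothetical, shortened record in the dataset's layout.
sample = {
    "reasoning_language": "Swahili",
    "developer": "Wewe ni msaidizi.",
    "user": "Habari?",
    "analysis": "Mtumiaji ananisalimu.",
    "final": "Nzuri!",
    "messages": [
        {"role": "system",
         "content": "reasoning language: Swahili\n\nWewe ni msaidizi.",
         "thinking": None},
        {"role": "user", "content": "Habari?", "thinking": None},
        {"role": "assistant", "content": "Nzuri!",
         "thinking": "Mtumiaji ananisalimu."},
    ],
}

print(find_mismatches(sample))
```

An empty list means the two representations agree for that record. Any names returned point at fields worth inspecting by hand.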
## Use Cases
- **Fine-tuning**: Train Swahili language models with chain-of-thought reasoning capabilities
- **Prompt Engineering**: Learn how to structure reasoning prompts in Swahili
- **Research**: Study multilingual reasoning patterns and cross-lingual transfer
- **Low-resource Language AI**: Advance capabilities for African languages
## Translation Details
All content was translated from English to Swahili using **GPT-5 Pro (gpt-5-pro-2025-10-06)** with the following specifications:
- **Reasoning quality preserved**: High reasoning effort maintained throughout translation
- **Natural Swahili**: Idiomatic expressions and cultural context considered
- **Proper nouns preserved**: Company names (Netflix, IMDb), person names, URLs kept in original form
- **Technical accuracy**: Domain-specific terminology handled appropriately
- **Formatting preserved**: Markdown, lists, headers, and structure maintained
## Dataset Statistics
- **Total examples**: 166
- **Source dataset**: HuggingFaceH4/Multilingual-Thinking (English subset)
- **Translation model**: GPT-5 Pro (gpt-5-pro-2025-10-06)
- **Translation date**: November 2025
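The figures quoted in this card can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the card and its metadata (200 sampled, 34 excluded, 166 examples, `dataset_size` and `download_size` in bytes); the variable names are illustrative.

```python
# Figures quoted in the dataset card and its metadata block.
sampled = 200
excluded = 34
num_examples = 166
dataset_size_bytes = 1_281_981
download_size_bytes = 741_779

kept = sampled - excluded                      # examples surviving translation
retention = kept / sampled                     # share of sampled examples kept
avg_bytes_per_example = dataset_size_bytes / num_examples
download_ratio = download_size_bytes / dataset_size_bytes

print(f"kept: {kept}")
print(f"retention: {retention:.0%}")
print(f"average size per example: {avg_bytes_per_example:.0f} bytes")
print(f"download size / dataset size: {download_ratio:.2f}")
```

The kept count matches the 166 examples reported, which corresponds to an 83% retention rate from the original sample.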
## Training Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("Nadhari/Swahili-Thinking", split="train")


# Format one example as a single training string with an explicit thinking
# segment. Adapt the <|...|> tags to your model's own chat template.
def format_example(example):
    return f"""<|system|>
{example['developer']}
<|user|>
{example['user']}
<|thinking|>
{example['analysis']}
<|assistant|>
{example['final']}"""


# Add a `text` column to use with your training pipeline
formatted = dataset.map(lambda x: {'text': format_example(x)})
```
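The same layout can also be produced from the `messages` field rather than the flat fields. The sketch below is one way to do that, assuming the message keys described under Message Format. The `<|...|>` tags simply mirror the tags in the training example and are not any particular tokenizer's template, and `render_messages` is a hypothetical helper name.

```python
def render_messages(messages: list) -> str:
    """Render a conversation into tagged text, thinking before the answer."""
    parts = []
    for message in messages:
        if message["role"] == "assistant":
            # Emit the reasoning first, then the final response.
            if message.get("thinking"):
                parts.append(f"<|thinking|>\n{message['thinking']}")
            parts.append(f"<|assistant|>\n{message['content']}")
        else:
            parts.append(f"<|{message['role']}|>\n{message['content']}")
    return "\n".join(parts)


# Hypothetical, shortened conversation in the dataset's message layout.
sample_messages = [
    {"role": "system",
     "content": "reasoning language: Swahili\n\nWewe ni msaidizi.",
     "thinking": None},
    {"role": "user", "content": "Habari?", "thinking": None},
    {"role": "assistant", "content": "Nzuri!",
     "thinking": "Mtumiaji ananisalimu."},
]

text = render_messages(sample_messages)
print(text)
```

One difference from the flat-field version: in the example record shown earlier, the system message in `messages` begins with a `reasoning language: Swahili` line, so text rendered this way keeps that line while `developer` alone does not.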
## Limitations
- Dataset size is relatively small (166 examples)
- May not cover all Swahili dialects or regional variations (primarily Standard Swahili)
- Technical/specialized domains may have limited representation
- Some nuances from English may be lost in translation
- 34 of the 200 sampled examples were excluded due to content policy violations or API issues
## Citation
If you use this dataset, please cite:
```bibtex
@misc{swahili-thinking-dataset-2025,
title={Swahili Thinking Dataset},
author={Nadhari AI},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Nadhari/Swahili-Thinking}
}
```
Also cite the original dataset:
```bibtex
@misc{multilingual-thinking-2025,
title={Multilingual Thinking Dataset},
author={HuggingFace H4},
year={2025},
publisher={HuggingFace},
url={https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking}
}
```
## License
Apache 2.0
## Acknowledgments
- **Original dataset**: [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking)
- **Translation**: OpenAI's GPT-5 Pro
- **Created by**: [Nadhari AI](https://github.com/nadhari)
- **Support**: This work was supported by O'Shaughnessy Ventures Fellowships & Grants
## Contact
For questions or issues with this dataset, please open a discussion on the [dataset repository](https://huggingface.co/datasets/Nadhari/Swahili-Thinking/discussions).
---
**Note**: This is the first public dataset for chain-of-thought reasoning in Swahili, contributing to the advancement of AI capabilities for African languages.
Provider:
Nadhari



