Nadhari/Swahili-Thinking

Source: Hugging Face. Updated 2025-11-23. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/Nadhari/Swahili-Thinking
Resource description:
---
dataset_info:
  features:
  - name: reasoning_language
    dtype: string
  - name: developer
    dtype: string
  - name: user
    dtype: string
  - name: analysis
    dtype: string
  - name: final
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
    - name: thinking
      dtype: string
  splits:
  - name: train
    num_bytes: 1281981
    num_examples: 166
  download_size: 741779
  dataset_size: 1281981
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- sw
size_categories:
- n<1K
---

# Swahili Thinking Dataset

**The first Swahili dataset for chain-of-thought reasoning.**

This dataset contains 166 examples of conversational AI responses with explicit chain-of-thought reasoning in Swahili. It is derived from the [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking) dataset, with English examples professionally translated to Swahili using GPT-5 Pro.

## Dataset Summary

Swahili-Thinking is a reasoning dataset where both the chain-of-thought and final responses have been translated from English to Swahili. The dataset was created by sampling 200 English examples from the **Multilingual-Thinking** dataset and translating them with GPT-5 Pro, resulting in 166 high-quality Swahili reasoning examples.

This dataset enables training language models to perform explicit reasoning in Swahili before generating responses, similar to how humans think through problems step-by-step before answering.
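The feature list declared in the card metadata can be turned into a quick sanity check for individual records. The sketch below is illustrative only: `schema_problems` is a hypothetical helper (not part of the dataset or of the `datasets` library), and the sample record is an abbreviated copy of the example on this card so the snippet runs without downloading anything.

```python
# Field names and Python types mirroring the features in the card metadata.
EXPECTED_FIELDS = {
    "reasoning_language": str,
    "developer": str,
    "user": str,
    "analysis": str,
    "final": str,
    "messages": list,
}
MESSAGE_KEYS = {"content", "role", "thinking"}


def schema_problems(record):
    """Return a list of schema problems for one record (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")

    messages = record.get("messages")
    if isinstance(messages, list):
        for i, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"messages[{i}] should be a dict")
                continue
            for key in sorted(MESSAGE_KEYS.difference(message)):
                problems.append(f"messages[{i}] missing key: {key}")
    return problems


# Abbreviated sample record (strings truncated as on the card).
sample_record = {
    "reasoning_language": "Swahili",
    "developer": "Wewe ni msaidizi mahiri...",
    "user": "Je, unaweza kunipa orodha ya...",
    "analysis": "Sawa, mtumiaji anauliza...",
    "final": "Netflix hatoi hadharani...",
    "messages": [
        {"role": "system", "content": "reasoning language: Swahili\n\nWewe ni msaidizi mahiri...", "thinking": None},
        {"role": "user", "content": "Je, unaweza kunipa orodha ya...", "thinking": None},
        {"role": "assistant", "content": "Netflix hatoi hadharani...", "thinking": "Sawa, mtumiaji anauliza..."},
    ],
}

print(schema_problems(sample_record))  # prints []
```

The same function could be applied to each row after loading the real dataset; any non-empty result would point to a row that deviates from the declared schema.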

## Loading the Dataset

You can load the dataset using:

```python
from datasets import load_dataset

ds = load_dataset("Nadhari/Swahili-Thinking", split="train")

# Access first example
example = ds[0]
print(example['user'])      # User query in Swahili
print(example['analysis'])  # Chain-of-thought reasoning in Swahili
print(example['final'])     # Final response in Swahili
```

## Dataset Structure

### Data Fields

Each example contains 6 fields following the Harmony response format:

| Field | Type | Description |
|-------|------|-------------|
| `reasoning_language` | string | Always "Swahili" |
| `developer` | string | System prompt in Swahili defining the assistant's role |
| `user` | string | User query in Swahili |
| `analysis` | string | Chain-of-thought reasoning process in Swahili (the "thinking") |
| `final` | string | Final response to user in Swahili |
| `messages` | list | Formatted conversation with 3 messages (system, user, assistant) where assistant message includes `thinking` field |

### Message Format

The `messages` field follows a structure similar to OpenAI's messages format, with an important addition: the `assistant` turn contains a `thinking` field which contains the model's reasoning process in Swahili, and a `content` field which contains the final response to the user.
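To make the relationship between the flat fields and the `messages` list concrete, here is a minimal sketch that rebuilds the three turns from `developer`, `user`, `analysis`, and `final`. The helper name `build_messages` is hypothetical, and the system-turn layout ("reasoning language: ..." followed by a blank line and the developer prompt) is an assumption inferred from the single example record on this card; it may not hold for every row.

```python
def build_messages(example):
    """Rebuild the three-turn `messages` list from the flat fields.

    Assumption: the system turn is "reasoning language: <language>", a blank
    line, then the `developer` prompt, as in the card's example record.
    """
    system_content = (
        f"reasoning language: {example['reasoning_language']}\n\n"
        f"{example['developer']}"
    )
    return [
        {"role": "system", "content": system_content, "thinking": None},
        {"role": "user", "content": example["user"], "thinking": None},
        # Only the assistant turn carries reasoning in `thinking`.
        {
            "role": "assistant",
            "content": example["final"],
            "thinking": example["analysis"],
        },
    ]


# Flat fields abbreviated from the example record on this card.
flat_example = {
    "reasoning_language": "Swahili",
    "developer": "Wewe ni msaidizi mahiri anayeweza kujibu maswali ya huduma kwa wateja",
    "user": "Je, unaweza kunipa orodha ya...",
    "analysis": "Sawa, mtumiaji anauliza...",
    "final": "Netflix hatoi hadharani...",
}

for message in build_messages(flat_example):
    print(message["role"], "| has thinking:", message["thinking"] is not None)
```

Keeping the reasoning in a separate `thinking` key, rather than concatenating it into `content`, lets a chat template decide how (or whether) to render the reasoning during training.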

### Example

```json
{
  "reasoning_language": "Swahili",
  "developer": "Wewe ni msaidizi mahiri anayeweza kujibu maswali ya huduma kwa wateja",
  "user": "Je, unaweza kunipa orodha ya mifululizo iliyokadiriwa juu kwa sasa kwenye Netflix?",
  "analysis": "Sawa, mtumiaji anauliza kuhusu mifululizo iliyokadiriwa juu kwa sasa kwenye Netflix...",
  "final": "Netflix hatoi hadharani orodha za wakati halisi za mifululizo yake iliyokadiriwa juu...",
  "messages": [
    {
      "role": "system",
      "content": "reasoning language: Swahili\n\nWewe ni msaidizi mahiri...",
      "thinking": null
    },
    {
      "role": "user",
      "content": "Je, unaweza kunipa orodha ya...",
      "thinking": null
    },
    {
      "role": "assistant",
      "content": "Netflix hatoi hadharani...",
      "thinking": "Sawa, mtumiaji anauliza..."
    }
  ]
}
```

## Use Cases

- **Fine-tuning**: Train Swahili language models with chain-of-thought reasoning capabilities
- **Prompt Engineering**: Learn how to structure reasoning prompts in Swahili
- **Research**: Study multilingual reasoning patterns and cross-lingual transfer
- **Low-resource Language AI**: Advance capabilities for African languages

## Translation Details

All content was translated from English to Swahili using **GPT-5 Pro (gpt-5-pro-2025-10-06)** with the following specifications:

- **Reasoning quality preserved**: High reasoning effort maintained throughout translation
- **Natural Swahili**: Idiomatic expressions and cultural context considered
- **Proper nouns preserved**: Company names (Netflix, IMDb), person names, URLs kept in original form
- **Technical accuracy**: Domain-specific terminology handled appropriately
- **Formatting preserved**: Markdown, lists, headers, and structure maintained

## Dataset Statistics

- **Total examples**: 166
- **Source dataset**: HuggingFaceH4/Multilingual-Thinking (English subset)
- **Translation model**: GPT-5 Pro (gpt-5-pro-2025-10-06)
- **Translation date**: November 2025

## Training Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# Load dataset
dataset = load_dataset("Nadhari/Swahili-Thinking", split="train")

# Format for training with thinking
# (adapt the <|...|> markers to your model's own chat template)
def format_example(example):
    return f"""<|system|>
{example['developer']}
<|user|>
{example['user']}
<|thinking|>
{example['analysis']}
<|assistant|>
{example['final']}"""

# Use with your training pipeline
formatted = dataset.map(lambda x: {'text': format_example(x)})
```

## Limitations

- Dataset size is relatively small (166 examples)
- May not cover all Swahili dialects or regional variations (primarily Standard Swahili)
- Technical/specialized domains may have limited representation
- Some nuances from English may be lost in translation
- 34 of the 200 sampled examples were excluded due to content policy violations or API issues

## Citation

If you use this dataset, please cite:

```bibtex
@misc{swahili-thinking-dataset-2025,
  title={Swahili Thinking Dataset},
  author={Nadhari AI},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Nadhari/Swahili-Thinking}
}
```

Also cite the original dataset:

```bibtex
@misc{multilingual-thinking-2024,
  title={Multilingual Thinking Dataset},
  author={HuggingFace H4},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking}
}
```

## License

Apache 2.0

## Acknowledgments

- **Original dataset**: [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking)
- **Translation**: OpenAI's GPT-5 Pro
- **Created by**: [Nadhari AI](https://github.com/nadhari)
- **Support**: This work was supported by the O'Shaughnessy Ventures Fellowships & Grants

## Contact

For questions or issues with this dataset, please open a discussion on the [dataset repository](https://huggingface.co/datasets/Nadhari/Swahili-Thinking/discussions).

---

**Note**: This is the first public dataset for chain-of-thought reasoning in Swahili, contributing to the advancement of AI capabilities for African languages.
Provider:
Nadhari