
yen-av/tunix-stem-sft

Hugging Face | Updated 2025-11-26 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/yen-av/tunix-stem-sft
Resource description:
---
dataset_info:
- config_name: default
  features:
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  - name: task_id
    dtype: int32
  - name: text
    dtype: string
  - name: code
    dtype: string
  - name: test_list
    list: string
  - name: test_setup_code
    dtype: string
  - name: challenge_test_list
    list: string
  splits:
  - name: train
    num_bytes: 485450485.0
    num_examples: 429645
  download_size: 251905581
  dataset_size: 485450485.0
- config_name: gsm-textbook
  features:
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 472726095.0
    num_examples: 428681
  download_size: 250545416
  dataset_size: 472726095.0
- config_name: mbpp
  features:
  - name: task_id
    dtype: int32
  - name: text
    dtype: string
  - name: code
    dtype: string
  - name: test_list
    list: string
  - name: test_setup_code
    dtype: string
  - name: challenge_test_list
    list: string
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 2114530
    num_examples: 964
  download_size: 1043277
  dataset_size: 2114530
- config_name: sft-20k
  features:
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  - name: task_id
    dtype: int32
  - name: text
    dtype: string
  - name: code
    dtype: string
  - name: test_list
    list: string
  - name: test_setup_code
    dtype: string
  - name: challenge_test_list
    list: string
  splits:
  - name: train
    num_bytes: 22559496.812348995
    num_examples: 20000
  download_size: 12683986
  dataset_size: 22559496.812348995
- config_name: verifiable
  features:
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  - name: answer_type
    dtype: string
  - name: task_id
    dtype: int32
  - name: text
    dtype: string
  - name: code
    dtype: string
  - name: test_list
    list: string
  - name: test_setup_code
    dtype: string
  - name: challenge_test_list
    list: string
  splits:
  - name: train
    num_bytes: 213820869
    num_examples: 211163
  download_size: 110783444
  dataset_size: 213820869
- config_name: verifiable-20k
  features:
  - name: prompt
    dtype: string
  - name: reasoning
    dtype: string
  - name: answer
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: reasoning_length
    dtype: int64
  - name: answer_type
    dtype: string
  - name: test_list
    list: string
  - name: test_setup_code
    dtype: string
  - name: challenge_test_list
    list: string
  splits:
  - name: train
    num_bytes: 19923964.11431358
    num_examples: 20000
  download_size: 8961900
  dataset_size: 19923964.11431358
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: gsm-textbook
  data_files:
  - split: train
    path: gsm-textbook/train-*
- config_name: mbpp
  data_files:
  - split: train
    path: mbpp/train-*
- config_name: sft-20k
  data_files:
  - split: train
    path: sft-20k/train-*
- config_name: verifiable
  data_files:
  - split: train
    path: verifiable/train-*
- config_name: verifiable-20k
  data_files:
  - split: train
    path: verifiable-20k/train-*
---

# Reasoning Training Dataset for Tunix Competition

Reasoning dataset for training 1-2B thinking models on math, coding, and science problems.

## Sources

- **GSM8K**: Grade school math with human reasoning traces
- **TextbookReasoning**: STEM problems with step-by-step solutions
- **MBPP**: Basic Python Programming prompts, with reasoning traces generated by gpt-oss-20b

## Format

Each example contains:

- `prompt`: The problem statement
- `reasoning`: Step-by-step reasoning
- `answer`: Final answer
- `domain`: 'math' | 'physics' | 'cs' | 'chemistry' | 'biology' | 'code'
- `source`: Original dataset name

## Intended Use

Train with Tunix SFT for Gemma to learn concise reasoning traces matching competition format:

```
<reasoning>step-by-step thinking</reasoning>
<answer>final answer</answer>
```

## License

Inherits licenses from source datasets
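
The competition format described under Intended Use can be sketched as a small helper that turns one record into a training target. This is only an illustration, not the author's preprocessing code: the function name `format_target`, the sample record, and the newline between the two tags are all assumptions. Only the field names `reasoning` and `answer` and the tag names come from the card.

```python
def format_target(example: dict) -> str:
    """Render one record as a <reasoning>/<answer> completion string.

    Assumes the record has the `reasoning` and `answer` string fields
    listed in the dataset card. The newline separator is an assumption.
    """
    reasoning = example["reasoning"].strip()
    answer = example["answer"].strip()
    return f"<reasoning>{reasoning}</reasoning>\n<answer>{answer}</answer>"


# Hypothetical record that mimics the documented fields; not taken from the dataset.
sample = {
    "prompt": "What is 2 + 3?",
    "reasoning": "Add the two numbers: 2 + 3 = 5.",
    "answer": "5",
    "domain": "math",
    "source": "example",
}

print(format_target(sample))
```

With the Hugging Face `datasets` library, real records could presumably be fetched with `load_dataset("yen-av/tunix-stem-sft", "sft-20k", split="train")` and passed through the same helper. The config name is taken from the metadata above.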
Provider:
yen-av