
Pacific-i64/cot-dataset

Hugging Face | Updated: 2026-01-29 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Pacific-i64/cot-dataset
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- math
- reasoning
- chain-of-thought
- cot
- small-models
- moe
size_categories:
- 1M<n<10M
---

# CoT Dataset for Small Models (1.5B+)

A curated Chain-of-Thought dataset optimized for training small language models (1.5B parameters) with structured reasoning capabilities.

## Key Features

- **2.9M samples** of mathematical reasoning
- **Key-Value format** to prevent hallucinations and keep small models on track
- **Difficulty levels** (basic, intermediate, advanced) for curriculum learning
- **Multiple sources** merged and shuffled for diversity

## Format

```
Reasoning: {concise step-by-step reasoning}
Answer: {final answer}
```

This structured format helps small models:

- Stay focused on the problem
- Avoid rambling or hallucinating
- Produce consistent, parseable outputs

## Dataset Structure

| Column | Type | Description |
|--------|------|-------------|
| `question` | string | The math problem |
| `answer` | string | Structured response (Reasoning + Answer) |
| `final_answer` | string | Just the final answer |
| `source` | string | Original dataset source |
| `difficulty` | string | basic / intermediate / advanced |
| `answer_length` | int | Character count of answer |

## Sources & Statistics

| Source | Samples | Difficulty |
|--------|---------|------------|
| OpenMathInstruct-2 | 1,500,001 | intermediate |
| NuminaMath-CoT | 500,001 | advanced |
| MetaMathQA | 395,000 | intermediate |
| MathInstruct | 262,039 | intermediate |
| Orca-Math | 200,035 | basic |
| Competition MATH | 12,500 | advanced |
| GSM8K | 7,473 | basic |
| **Total** | **2,877,049** | |

## Usage

```python
from datasets import load_dataset

# Load full dataset
ds = load_dataset("Pacific-Prime/cot-dataset")

# Filter by difficulty
basic = ds["train"].filter(lambda x: x["difficulty"] == "basic")
advanced = ds["train"].filter(lambda x: x["difficulty"] == "advanced")

# Filter by source
gsm8k = ds["train"].filter(lambda x: x["source"] == "gsm8k")
```

## Training Configuration

```yaml
data:
  datasets:
    - name: "Pacific-Prime/cot-dataset"
      weight: 1.0
      format: "qa"
```

## Why Key-Value Format?

Small models (< 7B) struggle with free-form Chain-of-Thought:

- They tend to ramble and lose focus
- Long reasoning chains increase hallucination risk
- Unstructured outputs are hard to parse

The `Reasoning: ... Answer: ...` format:

- **Constrains** the model to stay on topic
- **Anchors** the response to the correct answer
- **Enables** easy answer extraction

## Recommended Model Sizes

| Model Size | CoT Capability |
|------------|----------------|
| 1.5B | With Key-Value format |
| 7B | Short CoT |
| 13B+ | Full CoT |
| 70B+ | Complex reasoning |

## License

Apache 2.0

## Citation

```bibtex
@dataset{pacific_prime_cot_2025,
  title={CoT Dataset for Small Models},
  author={Pacific Prime},
  year={2025},
  publisher={HuggingFace}
}
```

## Acknowledgments

Built from these excellent datasets:

- [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) (NVIDIA)
- [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) (AI-MO)
- [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) (Meta)
- [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct) (TIGER-Lab)
- [Orca-Math](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) (Microsoft)
- [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (OpenAI)
- [Competition MATH](https://huggingface.co/datasets/qwedsacf/competition_math)
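As a rough illustration of the "easy answer extraction" and curriculum-learning points made earlier in this card, the sketch below parses the `Reasoning: ... Answer: ...` layout with a regular expression and orders rows from basic to advanced. It is not the authors' tooling: `parse_answer`, `curriculum_sorted`, and the two sample rows are hypothetical, and the exact whitespace between the two keys in the real data is an assumption.

```python
import re

# Assumed layout: "Reasoning: <steps>" followed by "Answer: <final answer>".
# The first group is greedy, so the split happens at the last "Answer:" marker.
KV_PATTERN = re.compile(r"Reasoning:\s*(.*)Answer:\s*(.*)", re.DOTALL)

# Assumed curriculum order, based on the three difficulty labels in the card.
DIFFICULTY_ORDER = {"basic": 0, "intermediate": 1, "advanced": 2}


def parse_answer(text):
    """Return (reasoning, final_answer), or None if the text lacks the two keys."""
    match = KV_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def curriculum_sorted(rows):
    """Order rows from basic to advanced; rows of equal difficulty keep their order."""
    return sorted(rows, key=lambda row: DIFFICULTY_ORDER[row["difficulty"]])


# Hypothetical rows, made up for illustration; they are not taken from the dataset.
rows = [
    {
        "question": "Solve x^2 = 9 for x > 0.",
        "answer": "Reasoning: The positive square root of 9 is 3.\nAnswer: 3",
        "final_answer": "3",
        "difficulty": "advanced",
    },
    {
        "question": "What is 12 * 4 + 2?",
        "answer": "Reasoning: 12 * 4 = 48, and 48 + 2 = 50.\nAnswer: 50",
        "final_answer": "50",
        "difficulty": "basic",
    },
]

for row in curriculum_sorted(rows):
    reasoning, final = parse_answer(row["answer"])
    # Compare the parsed answer with the separate final_answer column.
    print(row["difficulty"], "|", final, "| matches final_answer:", final == row["final_answer"])
```

The same two functions could be applied to rows loaded with `load_dataset`, provided the `answer` column follows the layout assumed here.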
Provided by:
Pacific-i64