MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL-Research-Hub

Name: MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL-Research-Hub
Creator: MohammadRafiML
Published: 2026-04-16 20:59:30
License: not specified

Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL-Research-Hub


Description:

Mathematical reasoning is a long-standing benchmark for large language models (LLMs). A model that can autonomously invoke external tools (such as a calculator), verify intermediate results, and provide structured final answers represents the frontier of agentic behaviour. This capstone project investigates whether a modest-scale model (4B parameters) can be taught this behaviour through:

1. **Supervised Fine-Tuning (SFT)**: training on human-curated chain-of-thought solutions, teaching the model how to reason step-by-step and format answers correctly.
2. **Group Relative Policy Optimisation (GRPO)**: a reinforcement learning algorithm that trains the model via reward signals, encouraging correct answers, well-structured reasoning, and appropriate tool use without requiring reference completions.

The two stages are applied sequentially: the SFT model is used as the starting policy for GRPO. All three checkpoints (base, SFT, GRPO) are evaluated on the same held-out benchmark.

### Key Contributions

- A fully reproducible end-to-end pipeline (SFT + GRPO + evaluation) on the Tinker ML platform.
- Synthetic training datasets derived from three public math benchmarks with strict train/test separation.
- A multi-component GRPO reward function covering accuracy, format compliance, tool-use incentive, self-correction, and novelty.
- Comprehensive per-tier, per-source accuracy analysis with failure-mode taxonomy.
- Trained adapters publicly released on Hugging Face: MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL.

---

## Related Work

**Chain-of-Thought Reasoning.** Wei et al. (2022) showed that prompting LLMs with step-by-step solutions dramatically improves multi-step arithmetic reasoning. SFT on CoT data further internalises this behaviour (Chung et al., 2022).

**Tool-Augmented LLMs.** Toolformer (Schick et al., 2023) demonstrated that models can learn to call external APIs mid-generation.
NuminaMath-TIR (AI-MO, 2024) provides Python-annotated solutions that serve as natural supervision for tool-use behaviour.

**Reinforcement Learning for Reasoning.** PPO-based RLHF (Ouyang et al., 2022) is the dominant post-training paradigm. GRPO (Shao et al., 2024) removes the need for a value network by normalising rewards within a group of rollouts, which lowers the memory and compute cost of RL on mathematical reasoning tasks. DeepSeek-R1 (DeepSeek-AI, 2025) demonstrated that pure RL on math problems can produce long chain-of-thought reasoning without explicit CoT supervision.

---

## Dataset Construction

### Design Principles

- **Strict separation**: training and evaluation sets are drawn from disjoint splits of each source dataset. No question appears in more than one set.
- **Source diversity**: three public benchmarks cover different difficulty ranges and reasoning styles.
- **Tiered difficulty**: all items are labelled easy, intermediate, or hard to enable per-tier analysis.
- **Lean test set**: the evaluation benchmark contains only a question and the ground-truth answer, with no hints or reference solutions.

### Source Datasets

| Dataset | Used for | Tier |
|---------|----------|------|
| openai/gsm8k | SFT train, GRPO train, Eval | Easy |
| AI-MO/NuminaMath-CoT | SFT train, GRPO train, Eval | Intermediate / Hard |
| AI-MO/NuminaMath-TIR | GRPO train, Eval | Hard |

### SFT Training Set (sft_train_v1.jsonl)

Prepared by `2_prep_traindata_sft_v1.py`:

| Source | Tier | Count |
|--------|------|-------|
| GSM8K train | Easy | 100 |
| NuminaMath-CoT train | Intermediate | 150 |
| NuminaMath-CoT train | Hard | 250 |
| **Total** | | **500** |

Each item is formatted as a full CoT solution with a mandatory `####` answer line.
Example SFT training item:

```json
{
  "question": "What is the coefficient of x^2*y^6 in (3/5*x - y/2)^8?",
  "answer": "<think>\nBy the binomial theorem, the k=6 term gives C(8,6)*(3/5)^2*(-1/2)^6 = 28 * 9/25 * 1/64 = 63/400.\n</think>\n#### 63/400",
  "source": "numina_agentic",
  "tier": "hard"
}
```

### GRPO Training Set (grpo_train_enriched_v2.jsonl)

Prepared by `3_prep_traindata_grpo.py`:

| Source | Tier | Count |
|--------|------|-------|
| NuminaMath-TIR train | Hard (agentic) | 100 |
| GSM8K train | Easy | 200 |
| NuminaMath-CoT | Intermediate + Hard | 600 |
| **Total** | | **~900 (400 used with curriculum)** |

Each item stores the question, numeric answer, source, tier, and a `valid_formats` list of equivalent answer representations.

Example GRPO training item:

```json
{
  "question": "What is the coefficient of x^2*y^6 in (3/5*x - y/2)^8?",
  "answer": "63/400",
  "source": "numina_tir_agentic",
  "tier": "hard",
  "valid_formats": ["63/400", "0.1575", "\\frac{63}{400}"]
}
```

### Evaluation Benchmark (eval_numinamath_gsm8k_benchmark.json)

Prepared by `0_prep_eval_benchmarks_v1.py`:

| Source | Tier | Count |
|--------|------|-------|
| GSM8K test | Easy | 160 |
| NuminaMath-CoT test | Intermediate | 120 |
| NuminaMath-CoT test | Hard | 37 |
| NuminaMath-TIR test | Hard | 15 |
| **Total** | | **332** |

Items contain only question, answer, source, and tier, with no hints.

Example evaluation item:

```json
{
  "question": "In 1988, a person's age equalled the sum of digits of their birth year. How old was this person?",
  "answer": "22",
  "source": "numina",
  "tier": "hard"
}
```

---

## Model Architecture and Base Model

Base model: **Qwen/Qwen3-4B-Instruct-2507**, a 4-billion-parameter instruction-tuned transformer from Alibaba Cloud's Qwen team. It supports a 32k-token context window and was pre-trained with multilingual mathematical content.

All adaptations are implemented as **Low-Rank Adaptation (LoRA)** layers, which freeze the base model weights and learn rank-decomposed updates.
This reduces trainable parameters from ~4B to ~270M per adapter, enabling training on a single A100-80GB GPU. LoRA settings: rank r = 32, alpha = 32.

### Base Model System Prompt

```
You are a helpful assistant with access to a calculator.
If a question requires math, you can use the tool by writing:
<tool_call>{"name": "calculator", "arguments": {"expression": "..."}}
</tool_call>
After you get a <tool_response>, use that information to give your final answer.
Always end your final response with: The final answer is #### [number]
```

### Why the Base Model Fails on Many Problems

1. It frequently omits the mandatory `####` line, causing answer extraction to fall back to heuristics.
2. It almost never spontaneously emits a `<tool_call>` (only 8.4% of items).
3. On hard NuminaMath problems it over-generates long exploratory text, leading to hallucinated answers.

---

## Stage 1 — Supervised Fine-Tuning (SFT)

### Objective

SFT trains the model to imitate high-quality demonstrations. Given a dataset of (question, solution) pairs, we minimise the standard cross-entropy loss over token predictions.

### SFT System Prompt

```
You are a mathematical reasoning assistant with access to a calculator.

### STRICT OUTPUT FORMAT
Thought: [Analyse the problem step-by-step]
<tool_call>{"name": "calculator", "arguments": {"expression": "..."}}
</tool_call>
<tool_response>[result will appear here]</tool_response>
Thought: [Interpret result and confirm]
The final answer is #### [NUMERIC_ANSWER]

### RULES
1. Use the tool for any arithmetic: division, powers, square roots.
2. ALWAYS end with exactly: The final answer is #### [number]
5. The #### line is MANDATORY -- never omit it.
```

### SFT Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Training samples | 500 |
| Epochs | 2 |
| Total steps | 1,000 |
| Loss function | Cross-entropy |
| Optimizer | AdamW (Tinker default) |
| Max sequence length | 2,048 tokens |

### Chain-of-Thought Format

SFT answers follow a structured CoT wrapped in `<think>` tags, followed by the final `#### answer` line. The `<think>` wrapper teaches the model to internalise multi-step deliberation before committing to an answer.

---

## Stage 2 — GRPO Reinforcement Learning

### Algorithm Overview

GRPO (Group Relative Policy Optimisation) is an actor-only policy gradient algorithm that estimates advantages relative to a group of rollouts for the same question. For a group of G completions to a question, the advantage of each completion is computed as its reward minus the group mean, divided by the group standard deviation. This avoids the separate value network of PPO, making it suitable for parameter-efficient fine-tuning.

### GRPO System Prompt

```
You are an autonomous mathematical reasoning agent.
Solve problems accurately using step-by-step reasoning and a calculator tool.

### RESPONSE FORMAT (MANDATORY)
Thought: [Analyse the problem]
<tool_call>{"name": "calculator", "arguments": {"expression": "..."}}
</tool_call>
<tool_response>[result appears here automatically]</tool_response>
Thought: [Verify result; reason to conclusion]
The final answer is #### [NUMERIC_ANSWER]

### RULES
1. Always start with "Thought:" -- never skip it.
3. Self-correction is rewarded.
5. The #### line is MANDATORY.
```

### Reward Function

Total reward = (accuracy + format + tool + self-correction + efficiency + novelty) × tier multiplier

| Component | Value |
|-----------|-------|
| Accuracy | +1.0 if correct (2% float tolerance), else 0 |
| Format (partial) | +0.4 × (number of format tags present / 4) |
| Tool penalty (hard) | -0.7 if tier=hard and no tool_call used |
| Self-correction | +0.2 if correction keyword in response |
| Efficiency | -(response length / 8000) × 0.05 |
| Path novelty | +0.2 if correct and line count differs from SFT reference |
| Tier multiplier | 1.30 (hard), 1.15 (intermediate), 1.0 (easy) |

Format tags checked: `Thought:`, `<tool_call>`, `<tool_response>`, `####`.

### GRPO Hyperparameters

| Parameter | Value |
|-----------|-------|
| Starting policy | SFT-trained adapter |
| Training samples | 400 (curriculum: easy to intermediate to hard) |
| Group size | 8 rollouts per question |
| Learning rate | 3e-6 |
| Gradient substeps | 1 |
| Sampling temperature | 0.8 (rollout), 0.0 (eval) |
| Max new tokens | 1,024 (rollout), 512 (follow-up) |
| Loss function | Importance sampling |

### Curriculum Learning

GRPO training samples are sorted from easy to intermediate to hard before training begins. This gives the policy stable positive reward signals early (easy problems yield high accuracy rewards) before it encounters harder problems where all rollouts in a group may receive near-zero reward.

---

## Answer Extraction and Matching

The same answer extractor and matcher are used across all three evaluation stages and inside the GRPO reward function.

**Extraction priority:**

1. Text after the last `####` marker.
2. Last `\boxed{...}` expression.
3. Last three non-empty lines (fallback).

**Normalisation:** LaTeX fractions converted to decimal; comma stripping; whitespace trimming.

**Matching tolerance:** Two numeric answers match if their relative difference is less than 2%. This absorbs borderline rounding artefacts (for example, 36.36 matches 36, a relative difference of about 1%).
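The extraction priority, normalisation, and 2% tolerance described above can be sketched as follows. This is a minimal illustration rather than the project's actual extractor: the function names (`extract_answer`, `to_number`, `answers_match`) are hypothetical, the fallback simply takes the last number found in the final three non-empty lines, and the handling of a zero ground truth is an assumption, since the text does not specify either.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches \frac{a}{b} and \dfrac{a}{b} with integer numerator and denominator.
FRAC_RE = re.compile(r"\\d?frac\{(-?\d+)\}\{(\d+)\}")
# Integers, decimals, and simple a/b fractions (used only by the fallback).
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces


def extract_answer(response: str) -> Optional[str]:
    """Priority: last '####' marker, then last \\boxed{}, then last three lines."""
    if "####" in response:
        tail = response.rsplit("####", 1)[1].strip()
        if tail:
            return tail.splitlines()[0].strip()
    boxed = extract_boxed(response)
    if boxed is not None:
        return boxed.strip()
    lines = [ln for ln in response.splitlines() if ln.strip()]
    # Assumption: the fallback keeps the last number in those lines.
    numbers = NUMBER_RE.findall(" ".join(lines[-3:]))
    return numbers[-1] if numbers else None


def to_number(answer: str) -> Optional[float]:
    """Strip commas and whitespace, convert LaTeX fractions, parse as a number."""
    s = answer.strip().replace(",", "")
    m = FRAC_RE.fullmatch(s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    try:
        return float(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred: str, truth: str, rel_tol: float = 0.02) -> bool:
    """Numeric match within a relative tolerance, else exact string match."""
    p, t = to_number(pred), to_number(truth)
    if p is None or t is None:
        return pred.strip() == truth.strip()
    if t == 0:
        return abs(p) < 1e-9  # assumption: relative error is undefined at zero
    return abs(p - t) / abs(t) < rel_tol
```

Under these definitions `answers_match("36.36", "36")` is true (about 1% apart), while `answers_match("40", "36")` is false (about 11% apart).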
**valid_formats:** For GRPO training items, the matcher additionally checks every entry in the item's `valid_formats` list, preventing 0.1575 from being marked wrong when the canonical answer is 63/400.

---

## Evaluation Results

### Overall Accuracy (332-item benchmark, Run ID: 20260416_111731)

| Stage | Correct / Total | Accuracy | Change vs Baseline | Tool Use |
|-------|-----------------|----------|--------------------|----------|
| Baseline | 223 / 332 | 67.2% | --- | 8.4% |
| SFT | 226 / 332 | 68.1% | +0.9 pp | 66.6% |
| GRPO | 225 / 332 | 67.8% | +0.6 pp | 75.3% |
| SFT target | --- | 80% | -11.9 pp | --- |
| GRPO target | --- | 90% | -22.2 pp | --- |

In the two target rows, the fourth column gives the shortfall of the achieved accuracy against the target (68.1% vs 80% for SFT, 67.8% vs 90% for GRPO), not a change from the baseline.

### Per-Tier Breakdown

| Stage | Easy (160) Acc | Intermediate (120) Acc | Hard (52) Acc |
|-------|---------------|------------------------|---------------|
| Baseline | 92.5% (148/160) | 39.2% (47/120) | 53.8% (28/52) |
| SFT | 93.1% (149/160) | 37.5% (45/120) | 61.5% (32/52) |
| GRPO | 95.0% (152/160) | 37.5% (45/120) | 53.8% (28/52) |

### Per-Source Breakdown

| Stage | GSM8K (160) Acc | NuminaMath-CoT (157) Acc | NuminaMath-TIR (15) Acc |
|-------|----------------|--------------------------|-------------------------|
| Baseline | 92.5% (148/160) | 39.5% (62/157) | 86.7% (13/15) |
| SFT | 93.1% (149/160) | 40.8% (64/157) | 86.7% (13/15) |
| GRPO | 95.0% (152/160) | 38.9% (61/157) | 80.0% (12/15) |

### Status Distribution

| Stage | Success | Fail | Hallucinated | Total |
|-------|---------|------|--------------|-------|
| Baseline | 223 | 66 | 43 | 332 |
| SFT | 226 | 78 | 28 | 332 |
| GRPO | 225 | 74 | 33 | 332 |

Hallucinated: the model produced more than 2,500 tokens but the extracted answer was incorrect, which indicates uncontrolled chain-of-thought leading to a wrong conclusion.

---

## Qualitative Analysis

### Base Model: Success Example

Question (Q2, hard, numina): "What is the total number of natural-number factors of n if 24 * 13 = 312?"

Base model response:

```
The result of 24 * 13 is 312.
The final answer is #### 312
```

Result: success. The base model reasons correctly without needing the tool and emits the `####` line.

### Base Model Failure Recovered by SFT

Question (Q1, hard, numina): "In 1988, a person's age was equal to the sum of the digits of their birth year. How old were they?"

Base model: extracted answer 36, ground truth 22. Result: fail.

SFT model response:

```
Thought: Let birth year = 1900 + x. Age in 1988 = 88 - x. Sum of digits = 1+9+0+x = 10+x (for x < 10) or adjusted. Setting 88-x = 1+9+x+x: 88-x = 10+2x => x=26 => birth 1966, age = 1988-1966 = 22. Digits: 1+9+6+6 = 22. Confirmed.
<tool_call>{"name":"calculator","arguments":{"expression":"1+9+6+6"}}
</tool_call>
<tool_response>22</tool_response>
Thought: The calculation confirms 1966 -> age 22.
The final answer is #### 22
```

Result: success. The tool-use behaviour taught by SFT is also preserved through GRPO training.

### GRPO Evaluation: Clean Tool-Use Example

Question (Q4, hard, numina): "Find the smallest positive integer that is a multiple of both 4 and 14."

```
Thought: I need LCM(4,14).
<tool_call>{"name":"calculator","arguments":{"expression":"4*14//gcd(4,14)"}}
</tool_call>
<tool_response>28</tool_response>
Thought: LCM is 28.
The final answer is #### 28
```

All three checkpoints correctly answered this item.

### GRPO Evaluation: Hallucination Failure

Question (Q3, hard, numina_tir): ground truth is empty/unanswerable.

The GRPO model enters an extended reasoning loop (more than 1,800 tokens), considers multiple scenarios, and cannot resolve a perceived contradiction. Final extracted answer: None. Result: fail_hallucinated.

The reward function has no mechanism for detecting circular reasoning. A future fix is to add a length-proportional penalty for outputs exceeding 1,000 tokens without a `####` token.

---

## Analysis — Why Targets Were Not Met

The aspirational targets of 80% (SFT) and 90% (GRPO) were not achieved. Five root causes:

**R1.
Insufficient SFT data volume.** 500 samples over 2 epochs = 1,000 gradient steps. For a 4B model to generalise across the full distribution of NuminaMath (which spans advanced competition mathematics), the literature suggests around 10,000 diverse CoT demonstrations are needed. The intermediate-tier accuracy barely moves across the three stages (39.2%, 37.5%, 37.5%).

**R2. Insufficient GRPO iterations.** Only 400 GRPO steps with group size 8 yields 3,200 total rollouts. Published GRPO runs such as DeepSeekMath used a far larger pool of questions and rollouts. The rolling 20-step average reward fluctuated between 0.1 and 0.9 mid-training, indicating high variance and insufficient convergence.

**R3. Intermediate tier bottleneck.** All three models achieve only ~38% on the 120 intermediate NuminaMath items. These problems require multi-paragraph algebraic manipulation that is poorly covered by the 500-item SFT set (only ~150 intermediate examples) and the 400-item GRPO curriculum.

**R4. GRPO reward interference.** GRPO slightly reduces NuminaMath-TIR accuracy (86.7% to 80.0%) while improving GSM8K (92.5% to 95.0%). The tool-use penalty applies primarily to hard-tier items, but the calculator tool is inherently less useful for symbolic proofs in NuminaMath-TIR, so the policy over-uses the tool on problems where closed-form reasoning is more effective.

**R5. Distribution shift in answer formats.** NuminaMath-CoT answers often involve multi-token LaTeX expressions. The GRPO reward function receives a binary accuracy signal (correct or incorrect), which cannot distinguish almost-correct answers (off by a sign) from completely wrong ones. A partial-credit accuracy reward could provide a denser learning signal.

---

## Future Work

1. **Scale SFT data to 5,000 to 10,000 samples** across all three tiers, with emphasis on intermediate NuminaMath CoT. Oversampling intermediate items (the bottleneck at ~38%) is expected to be the single highest-leverage change.
2. **Increase GRPO iterations to 2,000 to 5,000 steps** with checkpoint evaluation every 100 steps to detect convergence. Use cosine learning-rate decay.
3. **Add a length penalty** that fires when token count exceeds 1,000 without a `####` token, discouraging the hallucinated over-generation failure mode observed in ~10% of GRPO outputs (33 of 332).
4. **Partial-credit accuracy reward**: assign +0.5 when the answer is numerically close (less than 10% error) but not within the 2% tolerance. This provides a denser gradient signal for hard algebraic problems.
5. **Larger base model**: upgrade to Qwen3-8B or Qwen3-14B. The intermediate plateau suggests that 4B parameters may be insufficient to internalise the reasoning patterns required for competition-level NuminaMath.
6. **Self-play data augmentation**: use the GRPO model to generate additional CoT trajectories on held-in problems, filter for correctness, and add them to a second SFT pass (expert iteration).

---

## Bias and Risk Assessment

1. **Dataset bias.** GSM8K focuses on English-language word problems from a US elementary-school context. The model performs much better on GSM8K (more than 92%) than on NuminaMath-CoT (~39%), suggesting it has not generalised to competition-level problems. Deployment in domains requiring such reasoning should be done with caution.
2. **Calculator sandboxing.** The CalcToolRunner executes Python expressions via eval(). Although builtins are restricted, adversarial inputs with deeply nested function calls could trigger denial-of-service. Recommendation: keep the existing hard 200-character limit and add a 1-second execution timeout.
3. **Hallucination risk.** Approximately 10% of GRPO outputs are classified as hallucinated (long responses with wrong answers). Production deployment requires a post-processing confidence filter.
4. **Overconfidence in format.** The model has been trained to always emit `####`.
For problems without a clean numeric answer (proof-based, multiple-choice text), it may emit a spurious number. Deployment scope should be restricted to well-defined quantitative problems.

---

## Compute and Cost Estimates

| Component | Estimate | Notes |
|-----------|----------|-------|
| Tinker GPU time | $10 | A100-80GB; SFT + GRPO train |
| Claude API (planning/review) | $10 | Code generation and iteration |
| Baseline eval (332 items) | ~2h | Avg 28s/item |
| SFT training (1,000 steps) | ~2h | No GPU idle time |
| SFT eval (332 items) | ~1.5h | Avg 18s/item |
| GRPO training (400 steps × 8) | ~5h | 3,200 rollouts total |
| GRPO eval (332 items) | ~1.5h | Avg 15s/item |
| **Total wall time** | **~14h** | Single machine |

Approximate token counts:

- SFT: 500 samples × ~800 tokens/sample = ~400,000 training tokens.
- GRPO rollouts: 400 steps × 8 rollouts × ~600 tokens = ~1.9M tokens generated.
- Eval (all 3 stages): 332 items × 3 × ~1,000 tokens = ~1.0M tokens.

---

## Environment and Reproducibility

### Software Versions

| Package | Version |
|---------|---------|
| Python | 3.11 |
| PyTorch | 2.5.1+cu121 |
| Transformers | 5.3.0 |
| PEFT | 0.18.1 |
| TRL | 0.24.0 |
| Tinker | 0.16.1 |
| tinker_cookbook | 0.2.2 |
| datasets | 4.3.0 |
| huggingface_hub | 1.9.2 |
| safetensors | 0.7.0 |
| CUDA | 12.1 |
| unsloth | git (bcf4fd6) |

### Reproduction Steps

**1. Clone or download the scripts.** The pipeline script is available in the research hub under `scripts/` as `capstone_sft_grpo_full_pipeline_v1.py` (renamed from `600_optimised_full_pipeline.py`; contents are identical).

**2. Create a virtual environment.**

```bash
python -m venv capstone_mathrl_env
# Windows:
capstone_mathrl_env\Scripts\activate
# Linux / macOS:
source capstone_mathrl_env/bin/activate
```

**3. Install dependencies.**

```bash
pip install -r requirements.txt
```

**4.
Set environment variables.** Create a `.env` file in the `script_new/` directory:

```
TINKER_API_KEY=your_tinker_key_here
HF_TOKEN=your_hf_token_here
```

**5. Prepare datasets.**

```bash
python scripts/0_prep_eval_benchmarks_v1.py
python scripts/2_prep_traindata_sft_v1.py
python scripts/3_prep_traindata_grpo.py
```

Expected outputs:

- `data/eval_numinamath_gsm8k_benchmark.json`: 332 items
- `data/sft-train/sft_train_v1.jsonl`: 500 items
- `data/grpo-train/grpo_train_enriched_v2.jsonl`: ~900 items

**6. Run the full pipeline.**

```bash
cd script_new
python capstone_sft_grpo_full_pipeline_v1.py
```

**7. Monitor progress.** The script writes a timestamped log to `run_600_<timestamp>.log`. Stage markers:

- `STAGE 1: BASE MODEL EVALUATION`: baseline running.
- `STAGE 2: SFT TRAINING`: LoRA fine-tuning in progress.
- `SFT FINAL ACCURACY: ...`: Stage 3 complete; compare to 68.1%.
- `GRPO [400/400]`: GRPO training complete.
- `PIPELINE COMPLETE`: all stages done; adapters saved locally.

**8. Verify outputs.**

- `evaluation_comparison/basemodel_metrics_<ts>.csv`: 332 rows
- `evaluation_comparison/sft_metrics_<ts>.csv`: 332 rows
- `evaluation_comparison/grpo_metrics_<ts>.csv`: 332 rows
- `models/sft_model_600_<ts>/adapter/adapter_model.safetensors`: ~271 MB
- `models/grpo_model_600_<ts>/adapter/adapter_model.safetensors`: ~271 MB

---

## Model Release

**Model repo:** MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL

- `sft_adapter/`: LoRA adapter after Stage 1 SFT (270.92 MB)
- `grpo_adapter/`: LoRA adapter after Stage 2 GRPO, recommended (270.92 MB)

**Research hub:** MohammadRafiML/Qwen3-4B-Instruct-2507-Capstone-MathRL-Research-Hub

- `scripts/`: all pipeline Python scripts
- `logs/`: run logs
- `metrics/`: CSV evaluation results
- `data/`: training and evaluation datasets

---

## Conclusion

This capstone project demonstrates a complete, reproducible two-stage training pipeline for mathematical reasoning on a 4B-parameter model.
Starting from a 67.2% baseline, SFT achieves 68.1% and GRPO achieves 67.8%. These marginal improvements fall well short of the 80% and 90% targets.

The primary limiting factor is data scale: 500 SFT samples and 400 GRPO steps are insufficient for a model of this size to generalise across intermediate-level competition mathematics (37.5% to 39.2% accuracy across the three checkpoints).

Nevertheless, the training produces three clearly observable improvements:

1. **Tool-use adoption**: tool invocation grows from 8.4% (base) to 66.6% (SFT) to 75.3% (GRPO).
2. **Hallucination reduction**: SFT cuts hallucinated outputs from 43 to 28.
3. **Hard-tier improvement**: SFT improves hard-tier accuracy from 53.8% to 61.5%, evidence that the CoT demonstrations are learned. GRPO does not retain this gain (53.8%).

The roadmap towards the 90% target is to scale data, increase GRPO steps, add length penalties, and consider a larger base model. The infrastructure built in this project (curriculum learning, valid-format-aware reward, inline prompt engineering, and a unified 5-stage evaluation pipeline) is ready to support those improvements.

---
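As a supplementary illustration, the reward table and the group-relative advantage described under Stage 2 can be sketched together as below. This is a rough sketch under stated assumptions, not the pipeline's code: the `Rollout` type and function names are hypothetical, the correction keyword list is invented for illustration, response length is measured in characters, and the population standard deviation with a small epsilon is one possible choice. Correctness and path novelty are passed in as flags rather than computed.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

FORMAT_TAGS = ("Thought:", "<tool_call>", "<tool_response>", "####")
TIER_MULTIPLIER = {"easy": 1.0, "intermediate": 1.15, "hard": 1.30}
# Hypothetical keyword list; the source does not say which keywords are used.
CORRECTION_KEYWORDS = ("wait", "let me re-check", "correction")


@dataclass
class Rollout:
    response: str
    is_correct: bool          # result of the answer matcher
    novel_path: bool = False  # line count differs from the SFT reference


def total_reward(r: Rollout, tier: str) -> float:
    """Sum the reward components from the table, then apply the tier multiplier."""
    accuracy = 1.0 if r.is_correct else 0.0
    fmt = 0.4 * sum(tag in r.response for tag in FORMAT_TAGS) / 4
    tool = -0.7 if tier == "hard" and "<tool_call>" not in r.response else 0.0
    lowered = r.response.lower()
    self_corr = 0.2 if any(k in lowered for k in CORRECTION_KEYWORDS) else 0.0
    # Assumption: "response length" is counted in characters.
    efficiency = -(len(r.response) / 8000) * 0.05
    novelty = 0.2 if r.is_correct and r.novel_path else 0.0
    return (accuracy + fmt + tool + self_corr + efficiency + novelty) * TIER_MULTIPLIER[tier]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage = (reward - group mean) / group std, computed per question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(x - mu) / (sigma + eps) for x in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is the situation the easy-first curriculum is meant to postpone.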

Provider:

MohammadRafiML
