AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation

Name: AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation
Creator: AmanPriyanshu
Published: 2026-04-02 18:30:27
License: 暂无描述

Hugging Face2026-04-02 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-4.0 --- # Regularizer 250K (from Reasoning SFT 3M) 250,000 deduplicated reasoning samples extracted from [AmanPriyanshu/reasoning-sft-3M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/reasoning-sft-3M-random-compilation). Designed as a regularization set during tool-use and agentic mid-training — preserves general reasoning, code, structured output, and instruction-following capabilities. ## Construction 1. MD5-hashed all 3M `input` fields to identify duplicates (471,296 duplicate rows, 15.7%) 2. Kept exactly one copy of each duplicated input (284,866 unique) 3. Dropped domains unsuitable for mid-training regularization: - `safety` — reserved for final alignment - `SYNTHETIC-2-SFT-Verified` — overrepresented (same source as Dolci Think Python Algorithms) - `N/A` — unlabeled, unverifiable - `Wildchat` — raw user chats, weak reasoning signal - `One-Shot-CFT-Data` — tiny dataset, 79.5% duplicated in parent - Misc noise (`nitpick`, `style`, `OpenAssistant`, `TableGPT`, `synthetic`, `java`, `python`, `c`) 4. Added back 36 samples from `SYNTHETIC-2-SFT-Verified` to hit exactly 250,000 ## Schema | Column | Type | Description | |--------|------|-------------| | `input` | `list[{role: str, content: str}]` | Conversation history with roles: `user`, `system`, `assistant` | | `response` | `str` | Model response following strict think template | | `domain` | `str` | Unified domain label | | `source_dataset` | `str` | HuggingFace source dataset identifier | | `dataset_license` | `str` | License of the source dataset | ## Response Template Every response follows exactly: ``` <think> {reasoning} </think> {answer} ``` ## Domain Distribution (56 domains) | Category | Samples | % | Key Domains | |----------|---------|---|-------------| | Code / SWE | ~75K | 30% | Dolci Think Python Algorithms (31.6K), SWE Repair (19K), code (8.6K), suggestion (11K), bug/refactor/perf (4.6K) | | Reasoning / Math / Science | ~45K | 18% | math (8.9K), science (9.7K), stem-reasoning (4.2K), analytical_reasoning (6.3K), fermi (6.3K), brain_teaser (6.3K) | | Structured Output / IF | ~43K | 17% | instruction_following (19.1K), structured_outputs (5K), text_classification/extraction/modification (19K) | | Agentic / Tool-use | ~46K | 18% | tool_use (6.6K), webagent_flow (6.3K), rag (6.3K), fs_cot_flow (6.3K), struct2text_flow (6.3K), follow_up (6.5K) | | General / Conversational | ~41K | 17% | chat (6.7K), creative_content (6.3K), rc (6.3K), mcq (6.2K), open_domain_qa (1.3K) | ## Sources | Source | License | |--------|---------| | AmanPriyanshu/reasoning-sft-CHIMERA | apache-2.0 | | AmanPriyanshu/reasoning-sft-IF_multi_constraints_upto5 | odc-by | | AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K | cc-by-4.0 | | AmanPriyanshu/reasoning-sft-Nemotron-Instruction-Following-Chat-v1 | cc-by-4.0 | | AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K | apache-2.0 | | AmanPriyanshu/reasoning-sft-Superior-Reasoning-SFT-gpt-oss-120b-434K | cc-by-4.0 | | AmanPriyanshu/reasoning-sft-dolci-think-sft-32b-1M | odc-by | | AmanPriyanshu/reasoning-sft-github-codereview | mit | | AmanPriyanshu/reasoning-sft-interstellarninja-json-mode-reasoning-160K | apache-2.0 | | AmanPriyanshu/reasoning-sft-minimax-microsoft-orca-agentinstruct-1M-v1 | cdla-permissive-2.0 | | AmanPriyanshu/reasoning-sft-minimax-stratified-kmeans-diverse-reasoning-842K-only | cc-by-4.0 | | AmanPriyanshu/reasoning-sft-poor-quality-reasoning-sample-mix | apache-2.0 | | AmanPriyanshu/reasoning-sft-stem-reasoning-complex-FineProofs-126K | apache-2.0 | ## Files Single parquet file, 250,000 rows, ~2.1GB. Rows sorted by domain then input hash for deterministic reproducibility.

提供机构：

AmanPriyanshu

5,000+

优质数据集

54 个

任务类型

进入经典数据集