AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation
Source: Hugging Face. Updated 2026-04-02. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation
Resource description:
---
license: cc-by-4.0
---
# Regularizer 250K (from Reasoning SFT 3M)
250,000 deduplicated reasoning samples extracted from [AmanPriyanshu/reasoning-sft-3M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/reasoning-sft-3M-random-compilation). The set is designed for regularization during tool-use and agentic mid-training: it is intended to preserve general reasoning, code, structured output, and instruction-following capabilities.
## Construction
1. MD5-hashed all 3M `input` fields to identify duplicates (471,296 duplicate rows, 15.7%)
2. Kept exactly one copy of each duplicated input (284,866 unique)
3. Dropped domains unsuitable for mid-training regularization:
   - `safety`: reserved for final alignment
   - `SYNTHETIC-2-SFT-Verified`: overrepresented (same source as Dolci Think Python Algorithms)
   - `N/A`: unlabeled, unverifiable
   - `Wildchat`: raw user chats, weak reasoning signal
   - `One-Shot-CFT-Data`: tiny dataset, 79.5% duplicated in parent
   - Misc noise (`nitpick`, `style`, `OpenAssistant`, `TableGPT`, `synthetic`, `java`, `python`, `c`)
4. Added back 36 samples from `SYNTHETIC-2-SFT-Verified` to hit exactly 250,000
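The deduplication and domain filtering in steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not the author's script: the card does not say how `input` was serialized before hashing or which copy of a duplicate was kept, so the JSON serialization in `input_hash` and the keep-first rule are assumptions, and the function names are hypothetical. Step 4 (adding back 36 samples) is not covered.

```python
import hashlib
import json

# Domains listed as dropped in step 3 of the card.
DROPPED_DOMAINS = {
    "safety", "SYNTHETIC-2-SFT-Verified", "N/A", "Wildchat",
    "One-Shot-CFT-Data", "nitpick", "style", "OpenAssistant",
    "TableGPT", "synthetic", "java", "python", "c",
}


def input_hash(messages):
    """MD5 hex digest of a conversation.

    ASSUMPTION: the conversation is serialized as JSON with sorted keys;
    the card only states that the `input` field was MD5-hashed.
    """
    payload = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def dedup_and_filter(rows, dropped=DROPPED_DOMAINS):
    """Keep one row per input hash, then drop rows from excluded domains.

    ASSUMPTION: the first occurrence of each input is the copy that is kept.
    """
    seen = set()
    kept = []
    for row in rows:
        key = input_hash(row["input"])
        if key in seen:
            continue  # duplicate input: one copy has already been seen
        seen.add(key)
        if row["domain"] not in dropped:
            kept.append(row)
    return kept
```

Deduplication runs before the domain filter here, matching the order of the steps above.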
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `input` | `list[{role: str, content: str}]` | Conversation history with roles: `user`, `system`, `assistant` |
| `response` | `str` | Model response following a strict think template |
| `domain` | `str` | Unified domain label |
| `source_dataset` | `str` | Hugging Face source dataset identifier |
| `dataset_license` | `str` | License of the source dataset |
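A row can be checked against this schema with a small validator. The sketch below is illustrative only: `schema_errors` is a hypothetical helper, and it assumes each message carries exactly the two keys `role` and `content`, which the table implies but does not state outright.

```python
ALLOWED_ROLES = {"user", "system", "assistant"}
STRING_COLUMNS = ("response", "domain", "source_dataset", "dataset_license")


def schema_errors(row):
    """Return a list of schema problems; an empty list means the row is valid."""
    errors = []
    messages = row.get("input")
    if not isinstance(messages, list):
        errors.append("input must be a list of messages")
    else:
        for i, msg in enumerate(messages):
            # ASSUMPTION: each message has exactly the keys role and content.
            if not isinstance(msg, dict) or set(msg) != {"role", "content"}:
                errors.append(f"input[{i}] must have exactly the keys role and content")
                continue
            role = msg["role"]
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                errors.append(f"input[{i}] has unknown role {role!r}")
            if not isinstance(msg["content"], str):
                errors.append(f"input[{i}].content must be a string")
    for col in STRING_COLUMNS:
        if not isinstance(row.get(col), str):
            errors.append(f"{col} must be a string")
    return errors
```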
## Response Template
Every response follows exactly:
```
<think>
{reasoning}
</think>
{answer}
```
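Because the template is fixed, the reasoning and the answer can be separated with a single regular expression. The following is a sketch under one assumption: the tags and the content are separated by single newlines, exactly as the template is printed above. `split_response` is a hypothetical helper name.

```python
import re

# Matches "<think>\n{reasoning}\n</think>\n{answer}" across the whole string.
# ASSUMPTION: single newlines around the tags, as shown in the template.
THINK_PATTERN = re.compile(r"\A<think>\n(.*?)\n</think>\n(.*)\Z", re.DOTALL)


def split_response(response):
    """Return (reasoning, answer), or None if the template does not match."""
    match = THINK_PATTERN.match(response)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Returning `None` rather than raising makes it easy to count rows that deviate from the template when scanning the full file.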
## Domain Distribution (56 domains)
| Category | Samples | % | Key Domains |
|----------|---------|---|-------------|
| Code / SWE | ~75K | 30% | Dolci Think Python Algorithms (31.6K), SWE Repair (19K), code (8.6K), suggestion (11K), bug/refactor/perf (4.6K) |
| Reasoning / Math / Science | ~45K | 18% | math (8.9K), science (9.7K), stem-reasoning (4.2K), analytical_reasoning (6.3K), fermi (6.3K), brain_teaser (6.3K) |
| Structured Output / IF | ~43K | 17% | instruction_following (19.1K), structured_outputs (5K), text_classification/extraction/modification (19K) |
| Agentic / Tool-use | ~46K | 18% | tool_use (6.6K), webagent_flow (6.3K), rag (6.3K), fs_cot_flow (6.3K), struct2text_flow (6.3K), follow_up (6.5K) |
| General / Conversational | ~41K | 17% | chat (6.7K), creative_content (6.3K), rc (6.3K), mcq (6.2K), open_domain_qa (1.3K) |
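Per-domain counts and shares like those in the table can be recomputed from the `domain` column. This is a minimal sketch with a hypothetical helper, `domain_shares`. It works at the level of individual domain labels, since the card does not give the full mapping from the 56 domains to the five categories.

```python
from collections import Counter


def domain_shares(domains):
    """Map each domain label to (count, percentage of all rows)."""
    counts = Counter(domains)
    total = sum(counts.values())
    return {
        domain: (n, round(100 * n / total, 1))
        for domain, n in counts.most_common()
    }
```

For example, passing the `domain` values of all 250,000 rows yields one entry per domain, ordered from most to least frequent.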
## Sources
| Source | License |
|--------|---------|
| AmanPriyanshu/reasoning-sft-CHIMERA | apache-2.0 |
| AmanPriyanshu/reasoning-sft-IF_multi_constraints_upto5 | odc-by |
| AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-Nemotron-Instruction-Following-Chat-v1 | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-Superior-Reasoning-SFT-gpt-oss-120b-434K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-dolci-think-sft-32b-1M | odc-by |
| AmanPriyanshu/reasoning-sft-github-codereview | mit |
| AmanPriyanshu/reasoning-sft-interstellarninja-json-mode-reasoning-160K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-microsoft-orca-agentinstruct-1M-v1 | cdla-permissive-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-stratified-kmeans-diverse-reasoning-842K-only | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-poor-quality-reasoning-sample-mix | apache-2.0 |
| AmanPriyanshu/reasoning-sft-stem-reasoning-complex-FineProofs-126K | apache-2.0 |
## Files
A single Parquet file with 250,000 rows (~2.1 GB). Rows are sorted by domain, then by input hash, so the row order is deterministic and reproducible.
Provider:
AmanPriyanshu



