AmanPriyanshu/regularizer-250K-from-reasoning-and-tool-use-sft-4M-random-compilation
Source: Hugging Face. Updated 2026-04-07; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-and-tool-use-sft-4M-random-compilation
Description:
---
license: cc-by-4.0
---
# Regularizer 250K (from Reasoning + Tool-Use SFT 4M)
250,000 samples combining tool-use agentic data and general reasoning, intended for use as a regularization set during domain-specific fine-tuning. The mix is meant to preserve tool-use, coding, research, and general reasoning capabilities.
## Construction
1. **Tool-reasoning subset (150K)**: Sampled 50K per category (TOOLS, CODING, RESEARCH) from [AmanPriyanshu/tool-reasoning-sft-1M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/tool-reasoning-sft-1M-random-compilation) (1M total rows). Random sampling with seed=42.
2. **Reasoning subset (100K)**: Sampled 100K single-turn examples from [AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation) (250K total rows).
   - Filtered for single-turn only: `len(input) <= 2` (system + user, no prior assistant turns)
   - 227,813 single-turn available out of 250K; sampled 100K
   - Response parsed: `<think>...</think>` → `reasoning` role, remainder → `answer` role
3. All 250K rows shuffled (seed=42). Messages stored as `json.dumps` strings (use `json.loads` to parse).
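The construction steps above can be sketched roughly as follows. This is a minimal illustration rather than the author's actual build script: the helper names (`single_turn_only`, `seeded_sample`, `shuffle_and_serialize`) and the toy rows are hypothetical, and reusing seed 42 for the reasoning subset is an assumption, since the card states the seed only for the tool-reasoning sampling and the final shuffle.

```python
import json
import random

SEED = 42  # seed stated in the construction notes


def single_turn_only(rows):
    """Keep rows whose `input` has at most two messages (system + user)."""
    return [row for row in rows if len(row["input"]) <= 2]


def seeded_sample(rows, k, seed=SEED):
    """Draw k rows without replacement, reproducibly for a fixed seed."""
    return random.Random(seed).sample(rows, k)


def shuffle_and_serialize(rows, seed=SEED):
    """Shuffle a copy of the rows and store `messages` as a JSON string."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [{**row, "messages": json.dumps(row["messages"])} for row in shuffled]


# Toy stand-in rows (hypothetical, not taken from the dataset).
toy_rows = [
    {
        "input": [{"role": "user", "content": f"q{i}"}] * (1 + i % 3),
        "messages": [{"role": "user", "content": f"q{i}"}],
    }
    for i in range(9)
]
kept = single_turn_only(toy_rows)         # drops rows with 3 input messages
picked = seeded_sample(kept, 4)           # same seed gives the same sample
final_rows = shuffle_and_serialize(picked)
```

Using a dedicated `random.Random(seed)` instance keeps the sampling reproducible without touching the global random state.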
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `messages` | `str` (JSON) | JSON string of conversation messages. Use `json.loads()` to parse into `list[{role, content}]` |
| `source_dataset` | `str` | Original source dataset identifier |
| `source_category` | `str` | One of: `TOOLS`, `CODING`, `RESEARCH`, `REASONING` |
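As a rough illustration of the schema, the sketch below decodes one row and checks it against the columns listed in the table. The `decode_row` helper and the example row are hypothetical and exist only for this illustration; the checks reflect the table above, not any validation performed by the dataset author.

```python
import json

CATEGORIES = {"TOOLS", "CODING", "RESEARCH", "REASONING"}


def decode_row(row):
    """Return the row with `messages` parsed; raise ValueError on schema issues."""
    if row["source_category"] not in CATEGORIES:
        raise ValueError(f"unexpected category: {row['source_category']!r}")
    messages = json.loads(row["messages"])
    if not isinstance(messages, list):
        raise ValueError("messages did not decode to a list")
    for message in messages:
        if not {"role", "content"} <= set(message):
            raise ValueError("message is missing 'role' or 'content'")
    return {**row, "messages": messages}


# Hypothetical example row, shaped like the schema table.
example = {
    "messages": json.dumps(
        [
            {"role": "user", "content": "hi"},
            {"role": "answer", "content": "hello"},
        ]
    ),
    "source_dataset": "example-source",
    "source_category": "REASONING",
}
decoded = decode_row(example)
```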
## Category Distribution
| Category | Count | % | Source |
|----------|-------|---|--------|
| TOOLS | 50,000 | 20% | tool-reasoning-sft-1M |
| CODING | 50,000 | 20% | tool-reasoning-sft-1M |
| RESEARCH | 50,000 | 20% | tool-reasoning-sft-1M |
| REASONING | 100,000 | 40% | regularizer-250K |
| **Total** | **250,000** | **100%** | |
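The percentages in the table follow directly from the counts. A small sketch of how one might recompute them from the `source_category` column is shown below; `category_shares` is a hypothetical helper, and the toy rows are a 1:10,000 scaled-down stand-in for the real data.

```python
from collections import Counter


def category_shares(rows):
    """Map each category to (count, percentage of all rows)."""
    counts = Counter(row["source_category"] for row in rows)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


# Counts as documented in the table above.
documented = {
    "TOOLS": 50_000,
    "CODING": 50_000,
    "RESEARCH": 50_000,
    "REASONING": 100_000,
}

# Scaled-down toy rows (one row per 10,000 documented rows).
toy_rows = [
    {"source_category": cat}
    for cat, n in documented.items()
    for _ in range(n // 10_000)
]
shares = category_shares(toy_rows)
```

On the loaded dataset, the same function can be applied to any iterable of rows that exposes `source_category`.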
## Source Dataset Distribution (38 unique)
### TOOLS (50K from 7 sources)
| Source Dataset | Sampled |
|----------------|---------|
| toucan-1.5m-sft-tool-use-data-cleaned-rectified-333k | 9,927 |
| hermes-reasoning-tool-style-data-cleaned-rectified-115k | 9,874 |
| ToolMind-data-cleaned-rectified | 9,747 |
| hermes_reasoning_tool_use-data-cleaned-rectified | 6,599 |
| toolace-sft-tool-use-agent-data-cleaned-rectified | 2,019 |
| mobile-actions-data-cleaned-rectified | 1,131 |
| toolmind-web-qa-sft-tool-use-data-cleaned-rectified-5.2k | 703 |
### CODING (50K from 9 sources)
| Source Dataset | Sampled |
|----------------|---------|
| nvidia-Nemotron-Agentic-v1 | 6,893 |
| Nemotron-Terminal-Corpus-data-cleaned-rectified | 6,881 |
| allenai-SERA-data-cleaned-rectified | 6,858 |
| text_to_terminal_v2-sft-tool-use-agent-data-cleaned-rectified | 6,828 |
| browsing-sft-tool-use-data-cleaned-rectified | 6,807 |
| CoderForge-Preview-data-cleaned-rectified | 6,797 |
| jupyter-agent-dataset-sft-tool-use-agent-data-cleaned-rectified | 6,762 |
| CoVe-12k-data-cleaned-rectified | 1,624 |
| MEnvData-SWE-Trajectory-data-cleaned-rectified | 550 |
### RESEARCH (50K from 8 sources)
| Source Dataset | Sampled |
|----------------|---------|
| OpenHands-CodeScout_Training_Rollouts | 8,960 |
| grill-lab-browsecomp-plus-runs-data-cleaned-rectified | 8,896 |
| explorations | 8,802 |
| rlvr-env-retrieval-source | 8,791 |
| openresearcher-dataset-sft-deep-research-agent-data-cleaned | 8,787 |
| dr-tulu-sft-deep-research-agent-data-cleaned-rectified | 2,392 |
| REDSearcher_SFT_10K | 1,808 |
| OpenSeeker-v1-Data | 1,564 |
### REASONING (100K from 14+ sources)
Sampled from single-turn entries of the [regularizer-250K](https://huggingface.co/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation) dataset. Sources include reasoning-sft-CHIMERA, reasoning-sft-OpenThoughts3, reasoning-sft-dolci-think-sft, reasoning-sft-Nemotron-Cascade-SFT-SWE, and others. See parent dataset for full source breakdown.
## Message Format
**All rows** use a unified `messages` format with roles: `system`, `user`, `reasoning`, `tool_call`, `tool_output`, `answer`.
**Tool-use rows (TOOLS/CODING/RESEARCH)** — multi-turn agentic:
```json
[
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "reasoning", "content": "<think>...</think>"},
{"role": "tool_call", "content": "..."},
{"role": "tool_output", "content": "..."},
...
{"role": "answer", "content": "..."}
]
```
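When inspecting a tool-use trajectory, it can help to pair each call with its result. The sketch below assumes that a `tool_output` message directly follows the `tool_call` it answers, which the example above suggests but the card does not guarantee. `pair_tool_steps` and the sample trajectory are hypothetical.

```python
def pair_tool_steps(messages):
    """Pair each tool_call with the tool_output that directly follows it."""
    pairs = []
    for current, following in zip(messages, messages[1:]):
        if current["role"] == "tool_call" and following["role"] == "tool_output":
            pairs.append((current["content"], following["content"]))
    return pairs


# Hypothetical trajectory in the unified role format.
trajectory = [
    {"role": "system", "content": "You can call tools."},
    {"role": "user", "content": "List the files."},
    {"role": "reasoning", "content": "<think>I should run ls.</think>"},
    {"role": "tool_call", "content": "ls"},
    {"role": "tool_output", "content": "README.md"},
    {"role": "answer", "content": "There is one file: README.md"},
]
steps = pair_tool_steps(trajectory)
```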
**Reasoning rows (REASONING)** — single-turn with think tags:
```json
[
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "reasoning", "content": "<think>...</think>"},
{"role": "answer", "content": "..."}
]
```
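The split of a raw response into `reasoning` and `answer` messages, as described under Construction, might look like the sketch below. This is an assumption about the parsing, not the author's code: `split_response` is a hypothetical helper, and keeping the `<think>` tags inside the reasoning content follows the example format shown above.

```python
import re

# Non-greedy match so only the first <think>...</think> span is captured.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def split_response(response):
    """Split a raw response into reasoning and answer messages."""
    match = THINK_RE.search(response)
    if match is None:
        # No think block: the whole response is treated as the answer.
        return [{"role": "answer", "content": response.strip()}]
    reasoning = match.group(0)
    answer = (response[: match.start()] + response[match.end():]).strip()
    return [
        {"role": "reasoning", "content": reasoning},
        {"role": "answer", "content": answer},
    ]


parts = split_response("<think>2 + 2 = 4</think>\nThe result is 4.")
```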
## Usage
```python
import json
from datasets import load_dataset
ds = load_dataset("AmanPriyanshu/regularizer-250K-from-reasoning-and-tool-use-sft-4M-random-compilation", split="train")
messages = json.loads(ds[0]["messages"]) # list of {role, content} dicts
```
## Purpose
This dataset serves as a **regularizer** during domain-specific fine-tuning (e.g., cybersecurity). Mixing in diverse tool-use and reasoning examples is intended to reduce catastrophic forgetting of general capabilities while the model specializes.
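One way to use the set this way is to blend a sample of it into the domain-specific training data at a fixed ratio. The sketch below is only an illustration: `mix_with_regularizer` is a hypothetical helper, and the 20% regularizer share is an arbitrary assumption, since the card does not recommend a mixing ratio.

```python
import random


def mix_with_regularizer(domain_rows, regularizer_rows, reg_fraction=0.2, seed=42):
    """Return a shuffled mixture where about `reg_fraction` of rows are regularizer rows."""
    if not 0 <= reg_fraction < 1:
        raise ValueError("reg_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_reg / (n_domain + n_reg) = reg_fraction for n_reg.
    wanted = round(len(domain_rows) * reg_fraction / (1 - reg_fraction))
    n_reg = min(wanted, len(regularizer_rows))
    mixed = list(domain_rows) + rng.sample(list(regularizer_rows), n_reg)
    rng.shuffle(mixed)
    return mixed


# Toy stand-in rows (hypothetical).
domain = [{"kind": "domain", "id": i} for i in range(80)]
regularizer = [{"kind": "regularizer", "id": i} for i in range(500)]
mixed = mix_with_regularizer(domain, regularizer, reg_fraction=0.2)
```

All domain rows are kept; only the regularizer side is subsampled, so the ratio is controlled without discarding domain data.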
## Parent Datasets
- [AmanPriyanshu/tool-reasoning-sft-1M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/tool-reasoning-sft-1M-random-compilation) — 1M multi-turn tool-use SFT samples across TOOLS, CODING, RESEARCH
- [AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation](https://huggingface.co/datasets/AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation) — 250K reasoning samples from the 3M reasoning SFT compilation
Provider: AmanPriyanshu



