Farseen0/opus-4.6-reasoning-sft-12k
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Farseen0/opus-4.6-reasoning-sft-12k
Resource description:
---
language:
- en
license: apache-2.0
tags:
- reasoning
- chain-of-thought
- distillation
- sft
- claude-opus
- math
- logic
size_categories:
- 10K<n<100K
task_categories:
- text-generation
pretty_name: Opus 4.6 Reasoning SFT 12k
---
# Opus 4.6 Reasoning SFT 12k
A unified, pre-cleaned reasoning dataset built from 4 Claude Opus 4.6 distillation sources. Ready for supervised fine-tuning — just load and train.
## Why This Dataset Exists
The source datasets have different schemas, null values, and reasoning stored in non-standard keys that `apply_chat_template()` silently drops. This dataset fixes all of that:
- Reasoning traces merged into assistant content using `<think>...</think>` tags
- Null/empty content handled (broken samples dropped)
- All schemas unified to standard `messages` format
- Single dataset load replaces 4 separate downloads + complex formatting logic
## Quick Start
```python
from datasets import load_dataset
dataset = load_dataset("Farseen0/opus-4.6-reasoning-sft-12k", split="train")
print(f"{len(dataset)} samples ready for training")
```
### With TRL / Unsloth
```python
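from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

# Placeholder: substitute the base model you want to fine-tune.
model_id = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("Farseen0/opus-4.6-reasoning-sft-12k", split="train")

# Apply your model's chat template to turn each `messages` list into one string
def format_to_text(example):
    text = tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_to_text, num_proc=8, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="output",
        num_train_epochs=2,
        dataset_text_field="text",  # recent TRL releases expect this on SFTConfig
    ),
)
trainer.train()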
```
## Dataset Details
| Stat | Value |
|------|-------|
| Total samples | 12,929 |
| With reasoning (`<think>` tags) | 12,611 (97.5%) |
| Without reasoning | 318 (2.5%) |
| Format | `messages` (list of `{role, content}` dicts) |
| Roles | `user`, `assistant` |
| Messages per sample | 2 (single-turn) |
| Content length p50 | 790 chars |
| Content length p90 | 4,310 chars |
| Content length p99 | 30,802 chars |
| Max content length | 62,560 chars |
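The long tail in the table (p99 around 30k characters, maximum above 62k) may exceed the context window of smaller models. The following is a minimal sketch for measuring lengths and dropping very long samples. It assumes "content length" means the total number of characters across all messages in a sample, which this card does not spell out; `content_chars`, `percentile`, and `drop_long` are illustrative helper names, not part of the dataset or its build script.

```python
import math


def content_chars(sample: dict) -> int:
    """Total characters across all message contents (assumed definition)."""
    return sum(len(m["content"]) for m in sample["messages"])


def percentile(values: list, pct: float) -> int:
    """Nearest-rank percentile of a non-empty list of integers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def drop_long(samples: list, max_chars: int) -> list:
    """Keep only samples whose total content length is within max_chars."""
    return [s for s in samples if content_chars(s) <= max_chars]
```

With the Hugging Face `datasets` library, the same check can be applied directly, for example `dataset.filter(lambda s: content_chars(s) <= 16000)`, where the 16,000-character cutoff is a hypothetical value to be tuned to your model's context length.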
## Sources
Built from 4 public datasets, all generated using Claude Opus 4.6:
| Source | Kept | Dropped | Reason for drops | License |
|--------|------|---------|------------------|---------|
| [Roman1111111/claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x) | 9,632 | 1 | Null content | MIT |
| [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | 2,151 | 9 | Problem text < 10 chars | Apache 2.0 |
| [TeichAI/Claude-Opus-4.6-Reasoning-887x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-887x) | 886 | 0 | — | Apache 2.0 |
| [crownelius/Opus4.6-No-Reasoning-260x](https://huggingface.co/datasets/crownelius/Opus4.6-No-Reasoning-260x) | 260 | 0 | — | Apache 2.0 |
### What was fixed per source
**Roman1111111** — Reasoning was stored as a separate `reasoning` key on message dicts (not in `content`). `apply_chat_template()` silently ignores this key, so all reasoning traces would be lost without preprocessing. Also had 1 sample with `content: null`. Generic system prompt removed.
**Crownelius/Reasoning** — Flat format (`problem`/`thinking`/`solution` columns, not `messages`). Converted to chat format with thinking embedded as `<think>` tags. 9 samples dropped for broken/truncated problems under 10 characters.
**TeichAI** — Messages had `thinking` and `name` keys alongside `content`. The `thinking` key is silently dropped by chat templates. 58 samples had `thinking: null` (direct answers without CoT) — kept as-is. System prompt removed.
**Crownelius/No-Reasoning** — Flat format (`original_question`/`response`). Converted to chat format. No reasoning traces by design — these provide general assistant balance to prevent over-reliance on chain-of-thought.
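The per-source fixes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `build_dataset.py`: the function names are hypothetical, the side keys are assumed to be `reasoning` or `thinking` as described above, and the `<think>` layout follows the example in the Schema section.

```python
from typing import Optional


def wrap_reasoning(reasoning: Optional[str], answer: str) -> str:
    """Embed a reasoning trace in front of the answer using <think> tags."""
    if reasoning is None or not reasoning.strip():
        return answer  # direct answer without chain-of-thought, kept as-is
    return f"<think>\n{reasoning.strip()}\n</think>\n\n{answer}"


def normalize_chat(messages: list) -> Optional[dict]:
    """Normalize a chat-style sample whose reasoning lives in a side key.

    Returns None when the sample should be dropped.
    """
    out = []
    has_reasoning = False
    for msg in messages:
        if msg["role"] == "system":
            continue  # system prompts are removed
        content = msg.get("content")
        if content is None or not content.strip():
            return None  # null or empty content: drop the whole sample
        if msg["role"] == "assistant":
            reasoning = msg.get("reasoning") or msg.get("thinking")
            has_reasoning = bool(reasoning and reasoning.strip())
            content = wrap_reasoning(reasoning, content)
        out.append({"role": msg["role"], "content": content})
    return {"messages": out, "has_reasoning": has_reasoning}


def normalize_flat(problem: Optional[str], solution: str,
                   thinking: Optional[str] = None,
                   min_problem_chars: int = 10) -> Optional[dict]:
    """Convert a flat problem/thinking/solution row to chat format."""
    if problem is None or len(problem.strip()) < min_problem_chars:
        return None  # broken or truncated problem text
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": wrap_reasoning(thinking, solution)},
        ],
        "has_reasoning": bool(thinking and thinking.strip()),
    }
```

Merging the trace into `content` is what keeps it visible to chat templates, which read only `role` and `content` from each message.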
## Schema
```json
{
"messages": [
{"role": "user", "content": "What is 3x + 7 = 22? Solve for x."},
{"role": "assistant", "content": "<think>\nI need to solve for x.\n3x + 7 = 22\n3x = 15\nx = 5\n</think>\n\nx = 5. Subtracting 7 from both sides gives 3x = 15, dividing by 3 gives x = 5."}
],
"source": "roman",
"has_reasoning": true
}
```
### Columns
| Column | Type | Description |
|--------|------|-------------|
| `messages` | `list[{role: str, content: str}]` | Standard chat format, ready for `apply_chat_template()` |
| `source` | `str` | Origin dataset: `roman`, `crownelius_reasoning`, `teichai`, `crownelius_no_reasoning` |
| `has_reasoning` | `bool` | Whether the assistant response contains `<think>` tags |
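If you need the reasoning trace and the final answer separately (for example, to train on answers only or to inspect trace lengths), a small parser is enough. The sketch below assumes the `<think>` block always leads the assistant content, as in the schema example; `split_reasoning` is a hypothetical helper, not something shipped with the dataset.

```python
import re

# Matches a leading <think>...</think> block and the whitespace that follows it.
_THINK_RE = re.compile(r"^\s*<think>\n?(.*?)\n?</think>\s*", re.DOTALL)


def split_reasoning(content: str):
    """Return (reasoning, answer); reasoning is None when no <think> block leads."""
    match = _THINK_RE.match(content)
    if match is None:
        return None, content
    return match.group(1), content[match.end():]
```

The `has_reasoning` column lets you skip parsing entirely when you only need to select rows, for example `dataset.filter(lambda s: s["has_reasoning"])`.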
## Content Categories
- **Mathematics** — algebra, calculus, number theory, geometry, GSM8K, MATH (majority of dataset)
- **Logic** — puzzles, deductive reasoning, constraint satisfaction
- **Programming** — algorithm design, debugging, code optimization
- **Science** — physics, chemistry, biology problems
- **Bullshit detection** — identifying false claims (from TeichAI's Bullshit Bench)
- **General knowledge** — graduate-level topics in topology, cryptography, astrophysics (from No-Reasoning set)
## Model Compatibility
This dataset is model-agnostic. It uses standard `messages` format that works with any model's chat template:
- **Gemma 4** — `tokenizer.apply_chat_template()` (auto-converts `assistant` to `model` role)
- **Qwen 3.5** — Direct compatibility
- **Llama 3** — Direct compatibility
- **Mistral** — Direct compatibility
## Preprocessing Script
The dataset was built using [`scripts/build_dataset.py`](https://github.com/farseenshaikh/gemma4/blob/main/scripts/build_dataset.py). To rebuild or modify:
```bash
python scripts/build_dataset.py --push Farseen0/opus-4.6-reasoning-sft-12k
```
## Acknowledgments
This dataset would not exist without the work of the original dataset creators:
- **Roman1111111** — [claude-opus-4.6-10000x](https://huggingface.co/datasets/Roman1111111/claude-opus-4.6-10000x)
- **Crownelius** — [Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) and [Opus4.6-No-Reasoning-260x](https://huggingface.co/datasets/crownelius/Opus4.6-No-Reasoning-260x)
- **TeichAI** — [Claude-Opus-4.6-Reasoning-887x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-887x)
## License
Apache 2.0, the license used by three of the four sources. The Roman1111111 source is MIT-licensed. Please review the source dataset pages for full terms.
Provider:
Farseen0



