j0no12/unified-reasoning-dataset
Hugging Face · updated 2026-04-08 · indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/j0no12/unified-reasoning-dataset
---
dataset_info:
  features:
  - name: thinking
    dtype: string
  - name: instruction
    dtype: string
  - name: response
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 106820084
    num_examples: 94860
  download_size: 53728052
  dataset_size: 106820084
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
- text-generation
language:
- en
license: apache-2.0
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- math
- distillation
- synthetic
size_categories:
- 10K<n<100K
---
# j0no12/unified-reasoning-dataset
## Dataset Details
### Dataset Description
A unified, normalized reasoning dataset assembled from four high-quality sources. Each source was cleaned, schema-harmonized, quality-filtered (`instruction` and `response` must each be at least 10 characters after stripping), and globally deduplicated on `(instruction, response)` pairs before merging. The result is a single 94,860-row dataset with a consistent `instruction / thinking / response` structure plus a `source` label, suitable for supervised fine-tuning of reasoning-capable language models.
- **Curated by:** j0no12
- **Language(s):** English
- **License:** Apache-2.0 (the most restrictive license among the sources that state one; see the table below)
### Dataset Sources
| Source | Repo | License | Approx. Rows |
|---|---|---|---|
| Opus-4.6-Reasoning (cleaned) | [Crownelius/Opus-4.6-Reasoning-3300x](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x) | Apache-2.0 | ~2,160 |
| Qwen3.5-Reasoning-700x | [Jackrong/Qwen3.5-reasoning-700x](https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x) | Apache-2.0 | ~700 |
| GPT-5.4-Step-by-Step-Reasoning | [Roman1111111/gpt-5.4-step-by-step-reasoning](https://huggingface.co/datasets/Roman1111111/gpt-5.4-step-by-step-reasoning) | MIT | ~1,500 |
| Crow-8B Training Data (cleaned) | [Crownelius/Crow-8B-Training-Data-Clean](https://huggingface.co/datasets/Crownelius/Crow-8B-Training-Data-Clean) | — | ~91,000 |
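As a rough consistency check, the approximate per-source counts in the table can be summed and compared with the 94,860 rows in the final dataset. The table counts are rounded, so the difference is only an estimate of how many rows the quality filter and deduplication removed.

```python
# Approximate row counts copied from the table above (rounded values).
approx_rows = {
    "Crownelius/Opus-4.6-Reasoning-3300x": 2_160,
    "Jackrong/Qwen3.5-reasoning-700x": 700,
    "Roman1111111/gpt-5.4-step-by-step-reasoning": 1_500,
    "Crownelius/Crow-8B-Training-Data-Clean": 91_000,
}

# Row count of the published train split (num_examples in the card metadata).
final_rows = 94_860

approx_total = sum(approx_rows.values())   # 95360
approx_removed = approx_total - final_rows  # 500
removed_fraction = approx_removed / approx_total

print(f"approx. rows before filtering/dedup: {approx_total}")
print(f"approx. rows removed: {approx_removed} ({removed_fraction:.1%})")
```

Under these rounded figures, filtering and deduplication removed roughly 500 rows, or about half a percent of the combined input.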
## Uses
### Direct Use
This dataset is intended for supervised fine-tuning (SFT) of language models to improve step-by-step reasoning, chain-of-thought generation, and instruction-following. It pairs naturally with models in the 1B–27B range such as Qwen3.5 (4B, 9B, 27B), GPT-OSS 20B, and similar reasoning-capable architectures.
### Out-of-Scope Use
This dataset should not be used for RLHF reward modeling without additional preference labels. It is not suitable for factual knowledge retrieval tasks, as responses are synthetically generated and may contain reasoning errors. Do not use for safety-critical applications without further human review.
## Dataset Structure
Each row contains four string fields:
| Field | Description |
|---|---|
| `instruction` | The problem, question, or prompt given to the model |
| `thinking` | The chain-of-thought / internal reasoning trace (may be empty for sources that did not include one) |
| `response` | The final answer or solution |
| `source` | Which source dataset the row originated from (`opus_reasoning`, `qwen_reasoning`, `gpt_reasoning`, `crow_8b`) |
The `thinking` field is populated for rows from `opus_reasoning` and any sources with explicit CoT traces. For rows where no reasoning trace was available in the source, `thinking` is an empty string.
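The schema above can be checked with a few lines of plain Python. The following is a minimal sketch, not part of the dataset tooling: the helper names `validate_row` and `thinking_coverage` and the sample rows are hypothetical, and the rows are assumed to be dictionaries with the four fields described in the table.

```python
from collections import Counter

FIELDS = ("instruction", "thinking", "response", "source")
SOURCES = {"opus_reasoning", "qwen_reasoning", "gpt_reasoning", "crow_8b"}


def validate_row(row):
    """Return a list of schema problems for one row (empty list means valid)."""
    problems = []
    for field in FIELDS:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected a string")
    source = row.get("source")
    if isinstance(source, str) and source not in SOURCES:
        problems.append(f"source: unknown label {source!r}")
    return problems


def thinking_coverage(rows):
    """Per-source fraction of rows whose `thinking` trace is non-empty."""
    total, filled = Counter(), Counter()
    for row in rows:
        total[row["source"]] += 1
        if row["thinking"].strip():
            filled[row["source"]] += 1
    return {src: filled[src] / total[src] for src in total}


# Hypothetical rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"instruction": "Solve 2x + 3 = 11.", "thinking": "2x = 8, so x = 4.",
     "response": "x = 4", "source": "opus_reasoning"},
    {"instruction": "Name a prime number.", "thinking": "",
     "response": "7 is a prime number.", "source": "crow_8b"},
    {"instruction": "Is 9 a prime number?", "thinking": "9 = 3 * 3.",
     "response": "No, 9 is divisible by 3.", "source": "crow_8b"},
]

coverage = thinking_coverage(sample_rows)
```

The same functions should work on the real data, since iterating over the result of `datasets.load_dataset("j0no12/unified-reasoning-dataset", split="train")` yields one dictionary per row.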
## Dataset Creation
### Curation Rationale
Each of the four source datasets provides a distinct flavor of reasoning data — from Claude Opus-style long-form problem solving, to Qwen3.5 teacher-distilled CoT, to GPT-style step-by-step logic, to a large general instruction corpus. Merging them creates broader coverage across reasoning styles, difficulty levels, and domains while keeping a consistent schema that eliminates preprocessing friction during fine-tuning.
### Data Collection and Processing
1. **Loading**: All four datasets were loaded via the Hugging Face `datasets` library, using the `train` split of each.
2. **Normalization**: Each dataset was mapped to a unified `{instruction, thinking, response, source}` schema. Fields were matched by priority lookup across common naming conventions (`problem`/`input`/`question` → `instruction`, `solution`/`output`/`answer` → `response`, etc.). Datasets with a `messages`/`conversations` list format were parsed by role.
3. **Quality filtering**: Rows with an `instruction` or `response` shorter than 10 characters after stripping were removed.
4. **Deduplication**: Rows were deduplicated globally on the `(instruction, response)` pair across all sources to remove cross-dataset overlap.
5. **Output**: The final dataset was serialized to Parquet and pushed via `push_to_hub`.
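The normalization, filtering, and deduplication steps above can be sketched in plain Python. This is an illustrative reconstruction, not the script used to build the dataset. The exact key priority order, the `THINKING_KEYS` names, exact-string matching for duplicates, and keeping the first occurrence of a duplicate are all assumptions. Parsing of `messages`/`conversations` formats is omitted.

```python
# Candidate source keys, highest priority first (order is an assumption).
INSTRUCTION_KEYS = ("instruction", "problem", "input", "question")
RESPONSE_KEYS = ("response", "solution", "output", "answer")
THINKING_KEYS = ("thinking", "reasoning")  # hypothetical key names
MIN_CHARS = 10


def first_present(record, keys):
    """Return the first non-empty string found under `keys`, else ''."""
    for key in keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return ""


def normalize(record, source):
    """Map a raw record to the unified four-field schema."""
    return {
        "instruction": first_present(record, INSTRUCTION_KEYS),
        "thinking": first_present(record, THINKING_KEYS),
        "response": first_present(record, RESPONSE_KEYS),
        "source": source,
    }


def passes_quality_filter(row):
    """Keep rows whose stripped instruction and response have >= 10 chars."""
    return (len(row["instruction"].strip()) >= MIN_CHARS
            and len(row["response"].strip()) >= MIN_CHARS)


def deduplicate(rows):
    """Drop later rows repeating an (instruction, response) pair."""
    seen, unique = set(), []
    for row in rows:
        key = (row["instruction"], row["response"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def build(sources):
    """`sources` maps a source label to an iterable of raw records."""
    rows = [normalize(rec, label)
            for label, recs in sources.items() for rec in recs]
    return deduplicate([r for r in rows if passes_quality_filter(r)])


# Hypothetical raw records for illustration only.
raw = {
    "gpt_reasoning": [
        {"question": "What is 12 * 12? Show steps.",
         "answer": "12 * 12 = 144, so the answer is 144."},
        {"question": "Hi", "answer": "Hello there, friend!"},  # too short
    ],
    "crow_8b": [
        {"input": "What is 12 * 12? Show steps.",
         "output": "12 * 12 = 144, so the answer is 144."},  # duplicate pair
    ],
}
merged = build(raw)
```

In this example only one row survives: the short `Hi` prompt fails the length filter, and the `crow_8b` record duplicates an `(instruction, response)` pair already seen.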
### Source Data Producers
- **Crownelius/Opus-4.6-Reasoning-3300x** — Synthetically generated using Claude Opus 4.6; pre-cleaned to remove refusals and low-quality completions.
- **Jackrong/Qwen3.5-reasoning-700x** — Distilled from Qwen3.5-27B (full-parameter) via Alibaba Cloud DashScope, seeded from Alibaba-Superior-Reasoning-Stage2 instructions. Covers math, logic, and general QA with long CoT traces.
- **Roman1111111/gpt-5.4-step-by-step-reasoning** — Ultra-high-density synthetic reasoning corpus generated with GPT-5.4 step-by-step prompting. Best suited for fine-tuning 2B–20B models.
- **Crownelius/Crow-8B-Training-Data-Clean** — Large general instruction dataset (~615K completion tokens, ~91K rows) generated via OpenRouter at ~$3 total cost; average 1 turn per example.
### Personal and Sensitive Information
All data is synthetically generated. No personally identifiable information (PII) is known to be present. Users should nonetheless apply their own PII screening if deploying in sensitive contexts.
## Bias, Risks, and Limitations
- **Synthetic origin** — All four sources are AI-generated. Reasoning chains may contain subtle logical errors, hallucinated facts, or stylistic artifacts from the teacher model.
- **English only** — All content is in English. Performance on multilingual fine-tuning is untested.
- **Math/logic skew** — Qwen and GPT sources skew toward mathematical and logical reasoning. General instruction coverage is primarily provided by the Crow-8B source.
- **Empty `thinking` fields** — A significant portion of rows (primarily from `crow_8b`) have no reasoning trace. Practitioners training with a thinking loss mask should filter or weight accordingly.
### Recommendations
When fine-tuning, consider masking the loss on the `thinking` field for rows where it is empty rather than training on empty strings. For reasoning-focused training, filter to rows where `len(thinking) > 50` to ensure the model learns from substantive CoT traces.
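One way to apply both recommendations is sketched below, under stated assumptions. The `<think>` tags and the segment layout are a hypothetical prompt template, not one prescribed by this dataset. The mask is expressed per text segment, and a real training loop would still need to translate it into token-level labels for its tokenizer.

```python
MIN_THINKING_CHARS = 50  # threshold taken from the recommendation above


def has_substantive_thinking(row, min_chars=MIN_THINKING_CHARS):
    """True when the reasoning trace is longer than `min_chars` characters."""
    return len(row["thinking"]) > min_chars


def build_segments(row):
    """Split a row into (text, train_on) segments.

    The prompt is never trained on, the response always is, and the
    thinking block contributes to the loss only when it is non-empty.
    """
    thinking_block = "<think>\n" + row["thinking"] + "\n</think>\n"
    return [
        (row["instruction"] + "\n", False),
        (thinking_block, bool(row["thinking"].strip())),
        (row["response"], True),
    ]


# Hypothetical rows for illustration only.
rows = [
    {"instruction": "Solve 2x + 3 = 11.",
     "thinking": "Subtract 3 from both sides to get 2x = 8, "
                 "then divide by 2 to get x = 4.",
     "response": "x = 4", "source": "opus_reasoning"},
    {"instruction": "Name a prime number.", "thinking": "",
     "response": "7 is a prime number.", "source": "crow_8b"},
]

# Reasoning-focused subset: keep only rows with a substantive trace.
reasoning_rows = [r for r in rows if has_substantive_thinking(r)]

# Loss flags per segment for a row without a trace.
flags_without_trace = [train_on for _, train_on in build_segments(rows[1])]
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` to build the reasoning-focused subset.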
## Citation
If you use this dataset, please also credit the original source datasets:

    @dataset{crownelius_opus_reasoning_2026,
      title  = {Opus-4.6-Reasoning-3300x (Cleaned)},
      author = {Crownelius},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-3300x}
    }

    @dataset{jackrong_qwen_reasoning_2026,
      title  = {Qwen3.5-reasoning-700x},
      author = {Jackrong},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x}
    }

    @dataset{roman_gpt_reasoning_2026,
      title  = {GPT-5.4-Step-by-Step-Reasoning},
      author = {Roman1111111},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Roman1111111/gpt-5.4-step-by-step-reasoning}
    }

    @dataset{crownelius_crow8b_2026,
      title  = {Crow-8B Training Data (Cleaned)},
      author = {Crownelius},
      year   = {2026},
      url    = {https://huggingface.co/datasets/Crownelius/Crow-8B-Training-Data-Clean}
    }

## Dataset Card Authors
j0no12
## Dataset Card Contact
https://huggingface.co/j0no12



