cetusian/GeneralThought-Qwen3-filtered
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/cetusian/GeneralThought-Qwen3-filtered
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- qwen3
- sft
- reasoning
- chain-of-thought
- cot
size_categories:
- 10K<n<100K
pretty_name: GeneralThought (Qwen3-filtered reasoning SFT)
---
# GeneralThought — Qwen3-filtered reasoning SFT
A drop-in SFT mix of general-reasoning chain-of-thought traces,
**pre-sanitized for Qwen3 models**. Records are drawn from the upstream
filter and re-shaped into the `messages` schema expected by
`trl.SFTTrainer` and the tokenizer's `apply_chat_template`, with one
critical fix that prevents a silent NaN failure when training Qwen3 with
control-token-aware recipes (e.g. FP8 / NVFP4 hybrid attention).
## What's in it
**27,776 records**, one line of JSONL each:
```json
{
  "messages": [
    {"role": "system", "content": "You are a careful reasoner..."},
    {"role": "user", "content": "<question>"},
    {"role": "assistant",
     "content": "<think>\n<reasoning>\n</think>\n\n<answer>"}
  ]
}
```
Every record renders to ≤ **4,096** tokens under the Qwen3 chat template
and tokenizer. Distribution: p50 = 1,108, p95 = 2,619, max = 4,094. Every
assistant turn contains exactly one `<think>` / `</think>` pair (records
with duplicated control tokens mid-reasoning were dropped).
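The record shape and the one-pair rule can be spot-checked on a local JSONL copy. The sketch below is illustrative only: `check_record` is a hypothetical helper, not part of this dataset's tooling, and it assumes every record is a single system / user / assistant exchange as in the example above.

```python
import json


def check_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    problems = []
    messages = json.loads(line).get("messages", [])

    # Assumption: one system / user / assistant exchange per record.
    roles = [m.get("role") for m in messages]
    if roles != ["system", "user", "assistant"]:
        problems.append(f"unexpected roles: {roles}")

    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content", "")
        opens = content.count("<think>")
        closes = content.count("</think>")
        if (opens, closes) != (1, 1):
            problems.append(f"expected one think pair, found {opens} open / {closes} close")
        elif content.index("<think>") > content.index("</think>"):
            problems.append("</think> appears before <think>")
    return problems
```

Running it over every line of the file and collecting the non-empty results gives a quick view of any record that deviates from the documented shape.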
**Pure reasoning only.** Zero records contain tool-calling artifacts
(`<tool_call>`, `<tools>`, or OpenAI `tool_calls` blocks). If you want to
teach tool use, do it in a later stage (e.g. RL); this dataset is for the
CoT prior.
## Why "Qwen3-filtered"
Qwen3's tokenizer treats the following as **dedicated single control-token
IDs** rather than multi-token text:
| Substring | Qwen3 token ID |
|---|---|
| `<think>` / `</think>` | 151667 / 151668 |
| `<tool_call>` / `</tool_call>` | 151657 / 151658 |
| `<tool_response>` / `</tool_response>` | 151665 / 151666 |
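The IDs in this table can be compared against whichever tokenizer you actually train with. A minimal sketch, assuming a Hugging Face tokenizer object (it relies only on the `convert_tokens_to_ids` method); `EXPECTED_IDS` and `mismatched_control_tokens` are names introduced here for illustration.

```python
# Token IDs copied from the table above.
EXPECTED_IDS = {
    "<think>": 151667,
    "</think>": 151668,
    "<tool_call>": 151657,
    "</tool_call>": 151658,
    "<tool_response>": 151665,
    "</tool_response>": 151666,
}


def mismatched_control_tokens(tokenizer, expected=EXPECTED_IDS):
    """Return {token: (expected_id, actual_id)} for every disagreement."""
    mismatches = {}
    for token, expected_id in expected.items():
        actual_id = tokenizer.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            mismatches[token] = (expected_id, actual_id)
    return mismatches
```

Passing the result of `AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")` should return an empty dict if the tokenizer agrees with the table; a non-empty result means the table does not apply to that tokenizer revision.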
If any of those strings appear **verbatim** inside a `system` or `user`
message (for instance, in an instruction like *"Reason inside
`<think>...</think>` tags"*), the tokenizer emits real control tokens
outside the assistant's completion span. Certain attention/normalisation
paths in FP8 and NVFP4 hybrid training recipes assume those IDs only
appear in assistant spans, and silently produce `NaN` loss from step 0.
This dataset's system prompt describes the reasoning wrapper in neutral
English — only the **assistant's completion** contains `<think>...</think>`
— so the data is safe across recipes. Records longer than 4,096 tokens are
also dropped, since mid-`<think>` truncation leaves unpaired control tokens
with the same symptom.
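If you mix this data with your own prompts, the same hazard can be screened for before training. The sketch below scans non-assistant turns for the literal strings from the table above; `leaked_control_substrings` is a hypothetical helper and a plain substring search, not the filter used to build this dataset.

```python
CONTROL_SUBSTRINGS = (
    "<think>", "</think>",
    "<tool_call>", "</tool_call>",
    "<tool_response>", "</tool_response>",
)


def leaked_control_substrings(messages):
    """Return (role, substring) pairs for control strings outside assistant turns."""
    leaks = []
    for message in messages:
        if message["role"] == "assistant":
            continue  # the assistant completion is allowed to carry <think> tags
        for substring in CONTROL_SUBSTRINGS:
            if substring in message["content"]:
                leaks.append((message["role"], substring))
    return leaks
```

A system prompt such as "Reason inside <think>...</think> tags" is reported as a leak, while a neutral-English description of the wrapper is not.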
Works out of the box with:
- `transformers` + `trl.SFTTrainer` (`apply_chat_template`)
- Surogate's `type: conversation`, `messages_field: messages`
- Any framework that tokenizes via the Qwen3 chat template
Smoke-verified on Qwen3-8B + LoRA (r=32): clean finite loss from step 0
with both `bf16` and `fp8-hybrid` recipes.
## Quick start
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the train split and a Qwen3 tokenizer.
ds = load_dataset("cetusian/GeneralThought-Qwen3-filtered", split="train")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Render the first record through the chat template as plain text.
rendered = tok.apply_chat_template(ds[0]["messages"], tokenize=False)
print(rendered[:400])
```
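The 4,096-token cap can be re-checked with the same tokenizer. This is a sketch under one assumption: that length is measured by rendering the chat template to text and tokenizing the result without extra special tokens. The card does not state the exact procedure, so counts may differ slightly from the published distribution. `rendered_token_count` and `within_budget` are illustrative names.

```python
MAX_TOKENS = 4096


def rendered_token_count(tokenizer, messages):
    """Render messages through the chat template and count the resulting tokens."""
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    # The template already inserts its own markers, so add no extra special tokens.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


def within_budget(tokenizer, messages, max_tokens=MAX_TOKENS):
    """True if the rendered conversation fits inside the token budget."""
    return rendered_token_count(tokenizer, messages) <= max_tokens
```

With the objects from the snippet above, `within_budget(tok, ds[0]["messages"])` checks a single record, and `ds.filter(lambda r: within_budget(tok, r["messages"]))` applies the same check to the whole split.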
## License and attribution
This mix: **Apache-2.0**.
It is a derivative of
[**natolambert/GeneralThought-430K-filtered**](https://huggingface.co/datasets/natolambert/GeneralThought-430K-filtered)
(Apache-2.0), which is itself a commercial-safe subset of
[**GeneralReasoning/GeneralThought-430K**](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
(MIT, per-question licenses vary; commercial-incompatible rows were already
removed in the `natolambert` filter).
Please credit the upstream sources:
```bibtex
@misc{generalthought_430k,
title = {GeneralThought-430K},
author = {GeneralReasoning},
year = {2025},
url = {https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K}
}
@misc{generalthought_filtered,
title = {GeneralThought-430K-filtered},
author = {Nathan Lambert},
year = {2025},
url = {https://huggingface.co/datasets/natolambert/GeneralThought-430K-filtered}
}
```
## Changes vs. the upstream filter
- Re-shaped into the `messages` list-of-dict format (system / user /
assistant) used by `trl.SFTTrainer`.
- Stripped prepended instruction boilerplate from the user question.
- Rewrote any literal `<think>…</think>` references in the system prompt as
neutral English — only the assistant's completion carries the actual
`<think>…</think>` control tokens.
- Dropped records whose rendered chat-template length exceeds 4,096 Qwen3
tokens.
- Removed any tool-calling / `<tools>` records — this release is reasoning-only.
- Dropped records where an assistant turn contained more than one `<think>`
or `</think>` substring (stray control-token emissions mid-reasoning).
- Dropped records whose message content contained any literal Qwen3 control-token
substring (`<|im_start|>`, `<|im_end|>`, `<|endoftext|>`, `<tool_call>`,
`<tool_response>`, `<tools>` and their closers). The tokenizer emits these as
dedicated single-ID control tokens inside the text itself, which is the same
NaN / mask-confusion hazard class as mid-`<think>` truncation.
No questions, reasoning traces, or answers were altered beyond that.
Provider:
cetusian



