cetusian/GeneralThought-Qwen3-filtered

Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/cetusian/GeneralThought-Qwen3-filtered
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- qwen3
- sft
- reasoning
- chain-of-thought
- cot
size_categories:
- 10K<n<100K
pretty_name: GeneralThought (Qwen3-filtered reasoning SFT)
---

# GeneralThought: Qwen3-filtered reasoning SFT

A drop-in SFT mix of general-reasoning chain-of-thought traces, **pre-sanitized for Qwen3 models**. The records come from the upstream filter, re-shaped into the `messages` schema expected by `trl.SFTTrainer` / `transformers.apply_chat_template`, with one critical fix that prevents a silent NaN failure on Qwen3 training with control-token-aware recipes (e.g. FP8 / NVFP4 hybrid attention).

## What's in it

**27,776 records**, one line of JSONL each:

```json
{
  "messages": [
    {"role": "system", "content": "You are a careful reasoner..."},
    {"role": "user", "content": "<question>"},
    {"role": "assistant", "content": "<think>\n<reasoning>\n</think>\n\n<answer>"}
  ]
}
```

All rendered chat-template lengths are ≤ **4,096** tokens under the Qwen3 tokenizer. Distribution: p50 = 1,108, p95 = 2,619, max = 4,094.

Every assistant turn contains exactly one `<think>` / `</think>` pair (records with duplicated control tokens mid-reasoning were dropped).

**Pure reasoning only.** Zero records contain tool-calling artifacts (`<tool_call>`, `<tools>`, or OpenAI `tool_calls` blocks). If you want to teach tool use, do it in a later stage (e.g. RL); this dataset is for the CoT prior.
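
The structural guarantees listed above (system / user / assistant roles, exactly one `<think>` / `</think>` pair per assistant turn, no `tool_calls` blocks) can be spot-checked on a local copy. The following is a minimal sketch, not the script used to build this dataset; `check_record`, `EXPECTED_ROLES`, and the sample record are hypothetical names and data introduced for illustration.

```python
# Assumed role order, taken from the schema shown in this card.
EXPECTED_ROLES = ["system", "user", "assistant"]


def check_record(record):
    """Return a list of human-readable problems for one {"messages": [...]} record."""
    problems = []
    messages = record["messages"]

    roles = [m["role"] for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role sequence: {roles}")

    for m in messages:
        # OpenAI-style tool calls live in a separate key, not in the content string.
        if "tool_calls" in m:
            problems.append("OpenAI-style tool_calls block present")
        if m["role"] != "assistant":
            continue
        text = m["content"]
        opens, closes = text.count("<think>"), text.count("</think>")
        if (opens, closes) != (1, 1):
            problems.append(
                f"expected one <think> pair, found {opens} open / {closes} close"
            )
        elif text.index("<think>") > text.index("</think>"):
            problems.append("</think> appears before <think>")
    return problems


# Illustrative record in the card's schema (not a real row from the dataset).
sample = {
    "messages": [
        {"role": "system", "content": "You are a careful reasoner."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "<think>\n2 + 2 = 4\n</think>\n\n4"},
    ]
}
print(check_record(sample))  # an empty list means the record passed
```

To scan the whole dataset, the same function can be applied to each row returned by `load_dataset`, collecting the rows for which the returned list is non-empty.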

## Why "Qwen3-filtered"

Qwen3's tokenizer treats the following as **dedicated single control-token IDs** rather than multi-token text:

| Substring | Qwen3 token ID |
|---|---|
| `<think>` / `</think>` | 151667 / 151668 |
| `<tool_call>` / `</tool_call>` | 151657 / 151658 |
| `<tool_response>` / `</tool_response>` | 151665 / 151666 |

If any of those strings appear **verbatim** inside a `system` or `user` message (for instance, in an instruction like *"Reason inside `<think>...</think>` tags"*), the tokenizer emits real control tokens outside the assistant's completion span. Certain attention/normalisation paths in FP8 and NVFP4 hybrid training recipes assume those IDs only appear in assistant spans, and silently produce `NaN` loss from step 0.

This dataset's system prompt describes the reasoning wrapper in neutral English, and only the **assistant's completion** contains `<think>...</think>`, so the data is safe across recipes. Records longer than 4,096 tokens are also dropped, since mid-`<think>` truncation leaves unpaired control tokens with the same symptom.

Works out of the box with:

- `transformers` + `trl.SFTTrainer` (`apply_chat_template`)
- Surogate's `type: conversation`, `messages_field: messages`
- Any framework that tokenizes via the Qwen3 chat template

Smoke-verified on Qwen3-8B + LoRA (r=32): clean finite loss from step 0 with both `bf16` and `fp8-hybrid` recipes.

## Quick start

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the train split and the Qwen3 tokenizer.
ds = load_dataset("cetusian/GeneralThought-Qwen3-filtered", split="train")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Render the first record through the Qwen3 chat template as plain text.
rendered = tok.apply_chat_template(ds[0]["messages"], tokenize=False)
print(rendered[:400])
```

## License and attribution

This mix: **Apache-2.0**.
It is a derivative of [**natolambert/GeneralThought-430K-filtered**](https://huggingface.co/datasets/natolambert/GeneralThought-430K-filtered) (Apache-2.0), which is itself a commercial-safe subset of [**GeneralReasoning/GeneralThought-430K**](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K) (MIT, per-question licenses vary; commercial-incompatible rows were already removed in the `natolambert` filter).

Please credit the upstream sources:

```bibtex
@misc{generalthought_430k,
  title  = {GeneralThought-430K},
  author = {GeneralReasoning},
  year   = {2025},
  url    = {https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K}
}

@misc{generalthought_filtered,
  title  = {GeneralThought-430K-filtered},
  author = {Nathan Lambert},
  year   = {2025},
  url    = {https://huggingface.co/datasets/natolambert/GeneralThought-430K-filtered}
}
```

## Changes vs. the upstream filter

- Re-shaped into the `messages` list-of-dict format (system / user / assistant) used by `trl.SFTTrainer`.
- Stripped prepended instruction boilerplate from the user question.
- Rewrote any literal `<think>…</think>` references in the system prompt as neutral English, so only the assistant's completion carries the actual `<think>…</think>` control tokens.
- Dropped records whose rendered chat-template length exceeds 4,096 Qwen3 tokens.
- Removed any tool-calling / `<tools>` records; this release is reasoning-only.
- Dropped records where an assistant turn contained more than one `<think>` or `</think>` substring (stray control-token emissions mid-reasoning).
- Dropped records whose message content contained any literal Qwen3 control-token substring (`<|im_start|>`, `<|im_end|>`, `<|endoftext|>`, `<tool_call>`, `<tool_response>`, `<tools>` and their closers). The tokenizer emits these as dedicated single-ID control tokens inside the text itself, which is the same NaN / mask-confusion hazard class as mid-`<think>` truncation.

No questions, reasoning traces, or answers were altered beyond that.
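
The control-token rule described in this card (no literal Qwen3 control-token substrings in any message, and `<think>` / `</think>` only inside the assistant completion) can be expressed as a plain substring scan. The sketch below is an assumption about how such a check might look, not the filter actually used for this release; `find_control_substrings` and the two constant names are hypothetical, and `</tools>` is included on the assumption that it is one of the "closers" the card mentions.

```python
# Literal substrings the card lists as hazardous in any message content.
CONTROL_SUBSTRINGS = (
    "<|im_start|>", "<|im_end|>", "<|endoftext|>",
    "<tool_call>", "</tool_call>",
    "<tool_response>", "</tool_response>",
    "<tools>", "</tools>",
)
# Reasoning tags are allowed only inside the assistant completion.
THINK_TAGS = ("<think>", "</think>")


def find_control_substrings(record):
    """Return (message_index, role, substring) for every disallowed occurrence."""
    hits = []
    for index, msg in enumerate(record["messages"]):
        banned = CONTROL_SUBSTRINGS
        if msg["role"] != "assistant":
            banned = banned + THINK_TAGS
        for needle in banned:
            if needle in msg["content"]:
                hits.append((index, msg["role"], needle))
    return hits


# Illustrative record: the user prompt mentions the tags verbatim,
# which is the hazard case described in this card.
risky = {
    "messages": [
        {"role": "system", "content": "You are a careful reasoner."},
        {"role": "user", "content": "Reason inside <think>...</think> tags."},
        {"role": "assistant", "content": "<think>\nok\n</think>\n\nDone."},
    ]
}
for hit in find_control_substrings(risky):
    print(hit)
```

A record is safe under this rule when the function returns an empty list. Running the scan on your own data before mixing it with this dataset is one way to avoid reintroducing the hazard.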

Provider:
cetusian