ansulev/GLM-5.1-Reasoning-1M-Cleaned

Name: ansulev/GLM-5.1-Reasoning-1M-Cleaned
Creator: ansulev
Published: 2026-04-19 23:06:31
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2026-04-19. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/ansulev/GLM-5.1-Reasoning-1M-Cleaned


Description:

---
license: apache-2.0
language:
- en
- zh
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- sft
- distillation
- glm
- glm-5.1
- cleaned
configs:
- config_name: main
  default: true
  data_files:
  - split: train
    path: "main.jsonl"
- config_name: PHD-Science
  data_files:
  - split: train
    path: "PHD-Science.jsonl"
- config_name: Multilingual-STEM
  data_files:
  - split: train
    path: "Multilingual-STEM.jsonl"
- config_name: Math
  data_files:
  - split: train
    path: "Math.jsonl"
---

# GLM-5.1-Reasoning-1M-Cleaned

![GLM-5.1](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/Bn6WT4WoRayEe8l-D-_TL.jpeg)

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.

This release was prepared from the original dataset published by **Kassadin88**.

## Summary

![bench_51](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/GtLN3BAse6QsnpvZfBdHv.png)

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`

## Included Content

- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.
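As a quick sanity check on the Summary counts, the short sketch below confirms that kept plus removed equals processed and derives the retention percentages. The function name `retention_report` is hypothetical and used only for illustration; the three counts are copied from this card.

```python
# Counts copied from the Summary section of this card.
PROCESSED = 766_535
KEPT = 746_321
REMOVED = 20_214


def retention_report(processed: int, kept: int, removed: int) -> dict:
    """Return kept/removed percentages after checking the counts add up.

    Hypothetical helper for illustration; it is not part of the dataset tooling.
    """
    if kept + removed != processed:
        raise ValueError("kept + removed does not equal processed")
    return {
        "kept_pct": round(100 * kept / processed, 2),
        "removed_pct": round(100 * removed / processed, 2),
    }


report = retention_report(PROCESSED, KEPT, REMOVED)
print(report)
```

The same check can be repeated per subset using the rows of the Subset Statistics table.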
## Cleaning and Reformatting

The raw source dataset mixed two answer layouts:

1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.

This cleaned release normalizes both styles into a single output format:

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```

### Removed data

The cleaning pipeline removed records with:

- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.

## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |

## Data Structure

Each example is a single-turn reasoning distillation sample:

- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
- `output`: the tagged response (reasoning plus final answer) as a flat completion field, for training pipelines that prefer flat completion fields.
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.

## Usage

```python
from datasets import load_dataset

# Each call selects one config (subset) by name; every config has a single "train" split.
main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
# Named math_ds rather than math to avoid shadowing the standard-library module.
math_ds = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```

The repository ID above is the one given in the original card. For a copy published under a different namespace, replace `Jackrong/GLM5.1-Reasoning-1M-Cleaned` with that copy's ID; this listing, for example, is `ansulev/GLM-5.1-Reasoning-1M-Cleaned`.
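The `output` layout described under Data Structure can be split back into its reasoning and answer parts with a small helper. The following is a minimal sketch, not the cleaning pipeline's own code: `split_output`, `is_consistent`, and `ParsedOutput` are hypothetical names, and the check that `input` and `output` mirror the two conversation turns is an assumption inferred from the schema example in this card.

```python
import re
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    reasoning: str
    answer: str


# Matches a leading <think>...</think> block followed by the final answer.
# The lazy group stops at the first closing tag, so a literal "</think>"
# inside the reasoning text would split early.
_THINK_RE = re.compile(r"\A\s*<think>\n?(.*?)\n?</think>\s*(.*)\Z", re.DOTALL)


def split_output(output: str) -> ParsedOutput:
    """Split a tagged response into (reasoning, answer). Hypothetical helper."""
    match = _THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not start with a <think>...</think> block")
    return ParsedOutput(match.group(1).strip(), match.group(2).strip())


def is_consistent(record: dict) -> bool:
    """Check that the flat fields mirror the two conversation turns.

    Assumption: input == conversations[0].value and
    output == conversations[1].value, as in the card's schema example.
    """
    conv = record["conversations"]
    return (
        len(conv) == 2
        and conv[0]["from"] == "human"
        and conv[1]["from"] == "gpt"
        and conv[0]["value"] == record["input"]
        and conv[1]["value"] == record["output"]
    )


# Placeholder record mirroring the schema example; not real data.
example = {
    "conversations": [
        {"from": "human", "value": "user prompt"},
        {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"},
    ],
    "input": "user prompt",
    "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
}

parsed = split_output(example["output"])
print(parsed.reasoning, "|", parsed.answer, "|", is_consistent(example))
```

Separating the two parts this way is useful when a training setup masks or weights the reasoning trace differently from the final answer.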
## Provenance

This dataset is derived from:

- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**

## Citation

Please cite the original dataset first:

```bibtex
@misc{glm51-1000000x,
  title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
  author={Kassadin88},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```

You can additionally cite this cleaned derivative release as:

```bibtex
@misc{glm51_reasoning_1m_cleaned,
  title={GLM-5.1-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```

