
ansulev/GLM-5.1-Reasoning-1M-Cleaned

Source: Hugging Face. Updated 2026-04-19. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ansulev/GLM-5.1-Reasoning-1M-Cleaned
Resource description:
---
license: apache-2.0
language:
- en
- zh
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
tags:
- reasoning
- chain-of-thought
- instruction-tuning
- sft
- distillation
- glm
- glm-5.1
- cleaned
configs:
- config_name: main
  default: true
  data_files:
  - split: train
    path: "main.jsonl"
- config_name: PHD-Science
  data_files:
  - split: train
    path: "PHD-Science.jsonl"
- config_name: Multilingual-STEM
  data_files:
  - split: train
    path: "Multilingual-STEM.jsonl"
- config_name: Math
  data_files:
  - split: train
    path: "Math.jsonl"
---

# GLM-5.1-Reasoning-1M-Cleaned

![GLM-5.1](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/Bn6WT4WoRayEe8l-D-_TL.jpeg)

**GLM-5.1-Reasoning-1M-Cleaned** is a cleaned and reformatted derivative of [Kassadin88/GLM-5.1-1000000x](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x). It preserves the original four-subset layout (`main`, `PHD-Science`, `Multilingual-STEM`, `Math`) while converting every example into a unified SFT-ready schema with explicit `conversations`, `input`, `output`, `domain`, and `meta` fields.

This release was prepared from the original dataset published by **Kassadin88**.

## Summary

![bench_51](https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/GtLN3BAse6QsnpvZfBdHv.png)

- Teacher model in the data: **GLM-5.1**
- Total processed records: **766,535**
- Total kept records: **746,321**
- Total removed records: **20,214**
- Subset layout preserved exactly as the source dataset: `main`, `PHD-Science`, `Multilingual-STEM`, `Math`

## Included Content

- `main`: general reasoning and instruction-following data.
- `PHD-Science`: graduate-level physics, chemistry, and biology reasoning traces.
- `Multilingual-STEM`: multilingual STEM reasoning data, including Chinese, English, and other languages present in the source release.
- `Math`: mathematics-heavy reasoning and proof-style responses.
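As a quick sanity check on the Summary figures, the short sketch below confirms that the kept and removed counts add up to the processed count and derives the overall retention rate. The variable names are illustrative only; the three numbers are copied from the Summary list.

```python
# Totals copied from the Summary section of this card.
processed = 766_535
kept = 746_321
removed = 20_214

# Every processed record is either kept or removed.
assert kept + removed == processed

# Share of the processed records that survived cleaning.
retention = kept / processed
removed_share = removed / processed

print(f"retention: {retention:.2%}, removed: {removed_share:.2%}")
```

In other words, the cleaning pass dropped a bit under 3% of the processed records.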
## Cleaning and Reformatting

The raw source dataset mixed two answer layouts:

1. Standard `<think>...</think>` reasoning tags.
2. A non-standard short-dash wrapper around the reasoning section.

This cleaned release normalizes both styles into a single output format:

```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset/domain name from the original _id prefix",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "GLM-5.1"
  }
}
```

### Removed data

The cleaning pipeline removed records with:

- incomplete or obviously truncated answers,
- repeated reasoning paragraphs or duplicated answer segments,
- refusal-style answers,
- unparseable reasoning/answer boundaries,
- exact duplicate records after normalization.

## Subset Statistics

| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| main | 547,292 | 527,737 | 19,555 | 18.28 GB | 50 | 3008 |
| PHD-Science | 103,759 | 103,706 | 53 | 3.72 GB | 45 | 3387 |
| Multilingual-STEM | 93,032 | 92,781 | 251 | 3.53 GB | 66 | 4084 |
| Math | 22,452 | 22,097 | 355 | 4.03 GB | 59 | 24498 |

## Filter Statistics

| Subset | Issue | Removed |
| --- | --- | ---: |
| main | incomplete_output | 12,726 |
| main | repeated_paragraph | 6,152 |
| main | refusal_answer | 638 |
| main | unparseable_output | 38 |
| main | duplicate_record | 1 |
| PHD-Science | incomplete_output | 33 |
| PHD-Science | repeated_paragraph | 19 |
| PHD-Science | refusal_answer | 1 |
| Multilingual-STEM | repeated_paragraph | 116 |
| Multilingual-STEM | incomplete_output | 88 |
| Multilingual-STEM | refusal_answer | 28 |
| Multilingual-STEM | unparseable_output | 19 |
| Math | repeated_paragraph | 303 |
| Math | incomplete_output | 26 |
| Math | unparseable_output | 26 |

## Additional Token Statistics

| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| main | 118.62 | 515 | 4482.35 | 15041 |
| PHD-Science | 45.05 | 56 | 4387.73 | 10447 |
| Multilingual-STEM | 79.91 | 192 | 5461.45 | 11034 |
| Math | 62.01 | 89 | 28133.4 | 64633 |

## Data Structure

Each example is a single-turn reasoning distillation sample:

- `conversations[0]`: the user prompt.
- `conversations[1]`: the model response with reasoning wrapped in `<think>...</think>` and the final answer after the closing tag.
- `input`: prompt-only view for training pipelines that prefer flat prompt fields.
- `output`: tagged answer-only view for training pipelines that prefer flat completion fields.
- `domain`: original subset/domain name extracted from the source record ID.
- `meta`: lightweight per-example metadata.

## Usage

```python
from datasets import load_dataset

main = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "main")
science = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "PHD-Science")
stem = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Multilingual-STEM")
math = load_dataset("Jackrong/GLM5.1-Reasoning-1M-Cleaned", "Math")
```

If you publish under a different namespace, replace `Jackrong/` with your actual Hugging Face username or org.
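Once a record is loaded, a training pipeline often needs the reasoning trace and the final answer separately, or a role/content message list instead of the `from`/`value` pairs. The following is a minimal sketch of both conversions, assuming every `output` follows the normalized `<think>...</think>` layout described in this card. The helper names `split_output` and `to_messages`, the `ROLE_MAP` mapping, and the sample record are hypothetical and are not part of the dataset or of any library.

```python
import re

# Assumed layout: "<think>\n{reasoning}\n</think>\n\n{answer}" as documented above.
THINK_RE = re.compile(r"\A<think>\n?(.*?)\n?</think>\s*(.*)\Z", re.DOTALL)


def split_output(output: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a tagged `output` string."""
    match = THINK_RE.match(output)
    if match is None:
        raise ValueError("output does not follow the <think>...</think> layout")
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical mapping from the `from` values in this schema to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(conversations: list[dict]) -> list[dict]:
    """Convert `conversations` turns into role/content message dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversations
    ]


# Hand-written sample in the documented schema (not an actual dataset record).
sample = {
    "conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "<think>\n2 + 2 = 4\n</think>\n\n4"},
    ],
    "input": "What is 2 + 2?",
    "output": "<think>\n2 + 2 = 4\n</think>\n\n4",
}

reasoning, answer = split_output(sample["output"])
messages = to_messages(sample["conversations"])
```

Raising on a non-matching `output` is a deliberate choice in this sketch: the card states that unparseable records were removed, so a failure here would point to a mismatch between the assumed layout and the actual data rather than something to skip silently.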
## Provenance

This dataset is derived from:

- Original dataset: [`Kassadin88/GLM-5.1-1000000x`](https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x)
- Original author: **Kassadin88**

## Citation

Please cite the original dataset first:

```bibtex
@misc{glm51-1000000x,
  title={GLM-5.1-1000000x: One Million Reasoning Traces Distilled from GLM-5.1},
  author={Kassadin88},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Kassadin88/GLM-5.1-1000000x}
}
```

You can additionally cite this cleaned derivative release as:

```bibtex
@misc{glm51_reasoning_1m_cleaned,
  title={GLM-5.1-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Jackrong/GLM5.1-Reasoning-1M-Cleaned}
}
```

Provider:
ansulev