Bas95/GLM-5.1-Reasoning-1M-Cleaned
收藏Hugging Face2026-04-29 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/Bas95/GLM-5.1-Reasoning-1M-Cleaned
下载链接
链接失效反馈官方服务:
资源简介:
GLM-5.1-Reasoning-1M-Cleaned是一个经过清理和重新格式化的数据集,源自Kassadin88/GLM-5.1-1000000x。它保留了原始的四部分布局(main、PHD-Science、Multilingual-STEM、Math),并将每个示例转换为统一的SFT-ready模式,包含明确的conversations、input、output、domain和meta字段。数据集主要用于文本生成和问答任务,特别强调推理和思维链(chain-of-thought)能力。数据集包含746,321条记录,分为四个子集:main(通用推理和指令遵循数据)、PHD-Science(研究生级别的物理、化学和生物推理数据)、Multilingual-STEM(多语言STEM推理数据,包括中文、英文等)和Math(数学密集型推理和证明式回答)。数据集经过清理,移除了不完整、重复或无法解析的记录。
GLM-5.1-Reasoning-1M-Cleaned is a cleaned and reformatted derivative of Kassadin88/GLM-5.1-1000000x. It preserves the original four-subset layout (main, PHD-Science, Multilingual-STEM, Math) while converting every example into a unified SFT-ready schema with explicit conversations, input, output, domain, and meta fields. The dataset is primarily used for text-generation and question-answering tasks, with a focus on reasoning and chain-of-thought capabilities. It contains 746,321 records, divided into four subsets: main (general reasoning and instruction-following data), PHD-Science (graduate-level physics, chemistry, and biology reasoning traces), Multilingual-STEM (multilingual STEM reasoning data, including Chinese, English, and other languages), and Math (mathematics-heavy reasoning and proof-style responses). The dataset has been cleaned to remove incomplete, repeated, or unparseable records.
提供机构:
Bas95



