Kimi-K2.5-Reasoning-1M-Cleaned
ModelScope community · Updated 2026-05-11 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned
Resource description:
# 🪐 Kimi-K2.5-Reasoning-1M-Cleaned
**Kimi-K2.5-Reasoning-1M-Cleaned** is a cleaned derivative of [ianncity/KIMI-K2.5-1000000x](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x). It preserves the original four-config layout from the source dataset and rewrites each record into a unified reasoning-SFT schema with `id`, `conversations`, `input`, `output`, `domain`, and `meta`.

## Summary
- Source dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Source author: **ianncity**
- Teacher model recorded in `meta.teacher_model`: `KIMI-K2.5`
- Token lengths computed with tokenizer: `moonshotai/Kimi-K2.5`
- Total processed records: **1,003,589**
- Total kept records: **844,388**
- Total removed records: **159,201**
- Original source configs preserved: `General-Distillation`, `PHD-Science`, `General-Math`, `MultilingualSTEM`
## What This Release Fixes
The source JSONL files expose each example as a two-turn `messages` conversation only. This cleaned release standardizes that raw structure into a training-ready schema and removes records with quality issues.
### Transformations applied
1. Renamed the source `messages` field to `conversations`.
2. Split each record into a flat `input` (the prompt) and a `<think>`-tagged `output`.
3. Normalized `output` into `<think>...</think>` followed by the final answer.
4. Rebuilt `id` as a deterministic MD5 hash over `domain + input + reasoning + answer`.
5. Wrote subset-level provenance into the `domain` field because the source data does not provide a finer per-example domain label.
6. Added `meta.input_tokens`, `meta.output_tokens`, and `meta.teacher_model`.
7. Preserved the original four-config subset boundaries instead of merging everything into one file.
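The steps above can be sketched as a small record builder. This is a minimal illustration, not the release's actual pipeline: it assumes the prompt, reasoning, and answer have already been extracted from the source `messages`, and the function name `build_record` and the newline separator used before hashing are hypothetical. The release only states that the hash covers `domain + input + reasoning + answer`.

```python
import hashlib

TEACHER_MODEL = "KIMI-K2.5"


def build_record(prompt: str, reasoning: str, answer: str, domain: str) -> dict:
    """Assemble one record in the unified reasoning-SFT schema."""
    # Step 3: reasoning goes inside <think> tags, followed by the final answer.
    output = f"<think>\n{reasoning}\n</think>\n\n{answer}"

    # Step 4: deterministic MD5 id. ASSUMPTION: the four parts are joined
    # with a newline; the real concatenation rule is not documented.
    payload = "\n".join([domain, prompt, reasoning, answer])
    record_id = hashlib.md5(payload.encode("utf-8")).hexdigest()

    return {
        "id": record_id,
        # Step 1: the conversation lives under `conversations`.
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": output},
        ],
        # Step 2: flat prompt and completion views.
        "input": prompt,
        "output": output,
        # Step 5: the source config name doubles as the domain label.
        "domain": domain,
        # Step 6: token counts are omitted in this sketch; the release
        # computes them with the moonshotai/Kimi-K2.5 tokenizer.
        "meta": {"teacher_model": TEACHER_MODEL},
    }
```

Because the id depends only on the record content, building the same record twice yields the same id, which is what makes exact-duplicate detection by id possible.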
### Removed data
The cleaning pipeline filters records with:
- malformed or unparseable reasoning / answer boundaries,
- incomplete or obviously truncated answers,
- refusal-style answers,
- repeated reasoning or duplicated answer segments,
- exact duplicate records after normalization.
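The exact heuristics behind these filters are not published with the release. As a rough sketch of what three of the checks could look like, the helpers below are hypothetical: the regular expression, the `min_chars` threshold, and all function names are assumptions made for illustration.

```python
import re

# ASSUMPTION: a well-formed output is exactly one <think> block followed
# by a non-empty final answer.
THINK_RE = re.compile(r"^<think>\n(.*?)\n</think>\n\n(.+)$", re.DOTALL)


def has_parseable_boundary(output: str) -> bool:
    """True if the reasoning and answer parts can both be recovered."""
    match = THINK_RE.match(output)
    return bool(match and match.group(1).strip() and match.group(2).strip())


def has_repeated_paragraph(text: str, min_chars: int = 40) -> bool:
    """True if any sufficiently long paragraph appears more than once."""
    seen = set()
    for paragraph in text.split("\n\n"):
        key = paragraph.strip()
        if len(key) < min_chars:
            continue  # ignore short lines such as "Step 1:" that repeat naturally
        if key in seen:
            return True
        seen.add(key)
    return False


def drop_exact_duplicates(records: list) -> list:
    """Keep the first record for each id, preserving order."""
    seen, kept = set(), []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            kept.append(record)
    return kept
```

Detecting truncated answers and refusals would need further heuristics that are not sketched here.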
## Dataset Structure
```json
{
  "id": "md5-hash-of-domain-input-reasoning-answer",
  "conversations": [
    {"from": "human", "value": "user prompt"},
    {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"}
  ],
  "input": "user prompt",
  "output": "<think>\nreasoning trace\n</think>\n\nfinal answer",
  "domain": "subset-derived label such as General-Math",
  "meta": {
    "input_tokens": 123,
    "output_tokens": 456,
    "teacher_model": "KIMI-K2.5"
  }
}
```
### Field notes
- `conversations[0]`: the user prompt.
- `conversations[1]`: the cleaned assistant response with `<think>` tags.
- `input`: flat prompt view.
- `output`: flat completion view containing reasoning plus final answer.
- `domain`: subset-derived label. The source repository does not include an explicit per-example domain field, so this release uses the source config name as the domain value.
- `meta`: lightweight token-length metadata and teacher model provenance.
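Since `output` stores the reasoning and the final answer in one string, a consumer that wants them separately (for example, to evaluate only the final answer) has to split on the `</think>` tag. A minimal sketch follows; `split_output` is a hypothetical helper, not part of the release.

```python
def split_output(output: str) -> tuple:
    """Split a `<think>...</think>` completion into (reasoning, answer)."""
    head, sep, tail = output.partition("</think>")
    if not sep or not head.startswith("<think>"):
        raise ValueError("output does not follow the <think>...</think> layout")
    reasoning = head[len("<think>"):].strip()
    answer = tail.strip()
    return reasoning, answer
```

Applied to the example record above, this returns `("reasoning trace", "final answer")`.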
## Subset Statistics
| Subset | Processed | Kept | Removed | File Size | Median Input Tokens | Median Output Tokens |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| General-Distillation | 598,366 | 553,313 | 45,053 | 14.20 GB | 49 | 2972 |
| PHD-Science | 103,759 | 103,307 | 452 | 2.80 GB | 45 | 3021 |
| General-Math | 208,426 | 97,771 | 110,655 | 6.16 GB | 56 | 9616 |
| MultilingualSTEM | 93,038 | 89,997 | 3,041 | 2.90 GB | 76 | 3956 |
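The retention rate per subset follows directly from the Processed and Kept columns. The short calculation below uses only the numbers in the table; the variable names are illustrative.

```python
# (processed, kept) per subset, copied from the table above.
subsets = {
    "General-Distillation": (598_366, 553_313),
    "PHD-Science": (103_759, 103_307),
    "General-Math": (208_426, 97_771),
    "MultilingualSTEM": (93_038, 89_997),
}

# Fraction of processed records that survived cleaning.
rates = {name: kept / processed for name, (processed, kept) in subsets.items()}

total_processed = sum(processed for processed, _ in subsets.values())
total_kept = sum(kept for _, kept in subsets.values())
overall_rate = total_kept / total_processed

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} kept")
print(f"Overall: {overall_rate:.1%} kept")
```

The per-subset totals add up to the 1,003,589 processed and 844,388 kept records reported in the summary. `General-Math` is the outlier: fewer than half of its records are kept, while `PHD-Science` loses under 1%.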
## Filter Statistics
| Subset | Issue | Removed |
| --- | --- | ---: |
| General-Distillation | repeated_paragraph | 38,925 |
| General-Distillation | incomplete_output | 4,849 |
| General-Distillation | unparseable_output | 748 |
| General-Distillation | refusal_answer | 531 |
| PHD-Science | incomplete_output | 311 |
| PHD-Science | unparseable_output | 101 |
| PHD-Science | repeated_paragraph | 37 |
| PHD-Science | refusal_answer | 3 |
| General-Math | unparseable_output | 99,375 |
| General-Math | repeated_paragraph | 7,448 |
| General-Math | incomplete_output | 3,832 |
| MultilingualSTEM | unparseable_output | 1,841 |
| MultilingualSTEM | incomplete_output | 677 |
| MultilingualSTEM | repeated_paragraph | 522 |
| MultilingualSTEM | refusal_answer | 1 |
## Additional Token Statistics
| Subset | Mean Input Tokens | P95 Input Tokens | Mean Output Tokens | P95 Output Tokens |
| --- | ---: | ---: | ---: | ---: |
| General-Distillation | 115.94 | 506 | 3189.8 | 6761 |
| PHD-Science | 44.98 | 56 | 3213.31 | 5107 |
| General-Math | 57.76 | 81 | 9402.39 | 12485 |
| MultilingualSTEM | 79.22 | 123 | 4555.73 | 9082 |
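These figures matter when choosing a training sequence length. `General-Math` has a median of 9,616 output tokens, so a context of 8,192 tokens would cut off more than half of its completions. One way to handle this is to drop records that exceed a budget using the stored `meta` counts. The sketch below is an illustration: `fits_budget` and the 8,192 default are hypothetical, and it assumes the stored counts exclude any chat-template overhead.

```python
def fits_budget(record: dict, max_tokens: int = 8192) -> bool:
    """True if prompt plus completion fit within `max_tokens`.

    ASSUMPTION: meta.input_tokens and meta.output_tokens do not include
    chat-template tokens, so leave some headroom in practice.
    """
    meta = record["meta"]
    return meta["input_tokens"] + meta["output_tokens"] <= max_tokens


# Tiny in-memory stand-ins for real records.
sample = [
    {"meta": {"input_tokens": 56, "output_tokens": 9616}},
    {"meta": {"input_tokens": 49, "output_tokens": 2972}},
]
short_enough = [r for r in sample if fits_budget(r)]
```

With a loaded `datasets.Dataset`, the same predicate can be passed to `Dataset.filter`.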
## Included Content
- `General-Distillation`: the broad mixed-domain reasoning split from the source release.
- `PHD-Science`: science-heavy reasoning traces.
- `General-Math`: math-focused reasoning traces.
- `MultilingualSTEM`: multilingual STEM reasoning traces.
## Usage
```python
from datasets import load_dataset

# Each source config is loaded separately by name.
general = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Distillation")
science = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "PHD-Science")
math_ds = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "General-Math")
multi = load_dataset("Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned", "MultilingualSTEM")
```
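Many chat templates expect `role`/`content` messages rather than the `from`/`value` turns stored in `conversations`. A minimal conversion sketch is shown below; the `human` to `user` and `gpt` to `assistant` mapping is a common convention assumed here, and `to_messages` is a hypothetical helper rather than something the release provides.

```python
# ASSUMPTION: conventional role names for chat-template input.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_messages(record: dict) -> list:
    """Convert `conversations` turns into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


# In-memory example shaped like one record of this release.
example = {
    "conversations": [
        {"from": "human", "value": "user prompt"},
        {"from": "gpt", "value": "<think>\nreasoning trace\n</think>\n\nfinal answer"},
    ]
}
messages = to_messages(example)
```

The resulting list can then be handed to whatever chat template the training framework uses.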
## Provenance
- Original dataset: [`ianncity/KIMI-K2.5-1000000x`](https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x)
- Original author: **ianncity**
- This release is a cleaned derivative and should not be treated as the original source dataset.
## Citation
Please cite the original dataset:
```bibtex
@misc{kimi_k25_1000000x,
  title={KIMI-K2.5-1000000x},
  author={ianncity},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/ianncity/KIMI-K2.5-1000000x}
}
```
You can additionally cite this cleaned derivative release:
```bibtex
@misc{kimi_k25_reasoning_1m_cleaned,
  title={Kimi-K2.5-Reasoning-1M-Cleaned},
  author={Jackrong},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Jackrong/Kimi-K2.5-Reasoning-1M-Cleaned}
}
```
Provider:
maas
Created:
2026-04-18



