
GetWetter/KIMI-K2.5-1000000x

Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/GetWetter/KIMI-K2.5-1000000x
Description:
KIMI-K2.5-1000000x is a dataset of 1,000,000 reasoning traces distilled from KIMI-K2.5 set to high reasoning. The traces are distributed across Coding (50%, covering web development, Python, C++, Java, and other languages), Science (20%, covering physics, chemistry, and biology, with an additional 100k completions in the PHD-Science subset), Math (15%, covering algebra, calculus, and probability, with an additional 200k completions in kimiMath200k.jsonl), Computer Science (5%), Logical Questions (5%), Creative Writing (5%), and MultilingualSTEM (100k completions in MultilingualSTEM.jsonl).

The total token count is 5B. The dataset is suited to text-generation and question-answering tasks, in particular reasoning, chain-of-thought, instruction tuning, and supervised fine-tuning (SFT). The data was collected over about 80 hours using a modified version of TeichAI's Datagen.
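
The stated percentages can be turned into approximate per-category counts with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the description does not confirm: that the percentages apply to the 1,000,000 main traces (excluding the additional subsets), and that the 5B token total refers to those same traces. The function and variable names are hypothetical.

```python
# Assumption: the percentages describe the 1,000,000 main traces only,
# not the additional PHD-Science, kimiMath200k, or MultilingualSTEM subsets.
TOTAL_TRACES = 1_000_000

# Assumption: the "5B" token total covers the same 1,000,000 traces.
TOTAL_TOKENS = 5_000_000_000

# Category shares in percent, copied from the description.
CATEGORY_SHARE_PERCENT = {
    "Coding": 50,
    "Science": 20,
    "Math": 15,
    "Computer Science": 5,
    "Logical Questions": 5,
    "Creative Writing": 5,
}


def traces_per_category(total, shares):
    """Convert integer percentage shares into trace counts."""
    if sum(shares.values()) != 100:
        raise ValueError("category shares must sum to 100 percent")
    return {name: total * pct // 100 for name, pct in shares.items()}


counts = traces_per_category(TOTAL_TRACES, CATEGORY_SHARE_PERCENT)

# Rough average trace length under the assumptions above.
avg_tokens_per_trace = TOTAL_TOKENS / TOTAL_TRACES

for name, count in counts.items():
    print(f"{name}: {count:,} traces")
print(f"Average tokens per trace: {avg_tokens_per_trace:,.0f}")
```

Under these assumptions, Coding accounts for roughly 500,000 traces and the average trace is on the order of 5,000 tokens. If the token total also includes the additional subsets, the per-trace average would be lower.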
Provider:
GetWetter