ahmad21omar/SFT-Collection-v2
Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ahmad21omar/SFT-Collection-v2
Description:
SFT-Collection-v2 is a large-scale, curated corpus for supervised fine-tuning (SFT) of reasoning-oriented language models, titled Unified Reasoning Corpus. It combines, filters, deduplicates, and language-extends a broad set of public reasoning datasets into a single, consistent schema. The collection focuses on chain-of-thought reasoning traces across math, code, science, and general reasoning, covering English and five additional languages (German, French, Italian, Spanish, Japanese). The dataset is processed through a four-stage pipeline: dataset-level elimination, row-level filtering, cross-dataset deduplication, and post-processing with multilingual extension. The final dataset comprises 23,896,757 rows organized into 11 configs, each corresponding to an upstream dataset source, with splits per language. It supports various reasoning tasks, including text generation and question-answering, and is suitable for research and applications in multilingual, math, code, and science domains.
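The row-level filtering and cross-dataset deduplication stages described above can be illustrated with a minimal sketch. The listing does not say how either stage is implemented, so everything here is an assumption: the field names (`source`, `prompt`, `response`), the minimum-length filter rule, and the choice of exact-match deduplication on a normalized prompt hash are all hypothetical stand-ins, not the dataset author's method.

```python
import hashlib
import re
from typing import Iterable, Iterator

# Hypothetical row layout; the real schema is not given in this listing.
Row = dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def row_filter(row: Row, min_chars: int = 20) -> bool:
    """Assumed row-level rule: keep rows whose response is not trivially short."""
    return len(row.get("response", "").strip()) >= min_chars


def dedup_across_sources(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield the first row seen for each normalized prompt, across all sources."""
    seen: set[str] = set()
    for row in rows:
        key = hashlib.sha256(normalize(row["prompt"]).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # same prompt already kept from an earlier source
        seen.add(key)
        yield row


def run_pipeline(rows: Iterable[Row]) -> list[Row]:
    """Apply the row filter first, then deduplicate what survives."""
    return list(dedup_across_sources(r for r in rows if row_filter(r)))


# Made-up sample rows from two hypothetical upstream sources.
sample_rows = [
    {"source": "a", "prompt": "What is 2 + 2?",
     "response": "Step by step: 2 + 2 equals 4."},
    {"source": "b", "prompt": "what is  2 + 2?",
     "response": "Adding two and two gives four."},
    {"source": "b", "prompt": "Sort a list.", "response": "ok"},
]

kept = run_pipeline(sample_rows)
```

In this sketch the third row is dropped by the filter and the second is dropped as a duplicate of the first, so earlier sources win ties. A real pipeline at this scale would more likely use near-duplicate detection and a streaming or sharded store for the seen keys, which this example does not attempt.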
Provider:
ahmad21omar



