inclusionAI/Ling-Coder-SFT

Name: inclusionAI/Ling-Coder-SFT
Creator: inclusionAI
Published: 2025-03-27 12:40:12
License: No description available

Hugging Face. Updated 2025-03-27; indexed 2025-04-08.

Download link:

https://hf-mirror.com/datasets/inclusionAI/Ling-Coder-SFT
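As a minimal sketch of how the dataset behind this link might be read, the snippet below assumes the Hugging Face `datasets` library is installed and that the mirror is selected through the `HF_ENDPOINT` environment variable. The helper names `dataset_url` and `stream_first_rows` are hypothetical, and the snippet also assumes the repository loads without an explicit configuration name, which this page does not confirm.

```python
import os

REPO_ID = "inclusionAI/Ling-Coder-SFT"   # repository id from this listing
MIRROR = "https://hf-mirror.com"         # mirror host from the download link


def dataset_url(repo_id: str = REPO_ID, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given endpoint (hypothetical helper)."""
    return f"{endpoint}/datasets/{repo_id}"


def stream_first_rows(n: int = 3, use_mirror: bool = True) -> list:
    """Stream the first n rows without downloading the full dataset.

    Assumptions: `datasets` is installed, network access is available, and
    the repository loads without a configuration name.
    """
    if use_mirror:
        # Must be set before huggingface_hub is first imported to take effect.
        os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from datasets import load_dataset  # imported lazily, after HF_ENDPOINT is set

    splits = load_dataset(REPO_ID, streaming=True)  # dict of iterable splits
    first_split = next(iter(splits.values()))       # split names are not assumed
    return list(first_split.take(n))
```

Streaming is used here because the description mentions over 5 million samples, so inspecting a few rows first is likely cheaper than a full download.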

Resource description:

The Ling-Coder Dataset consists of three subsets: Ling-Coder-SFT, Ling-Coder-DPO, and Ling-Coder-SyntheticQA. Ling-Coder-SFT is the subset of SFT data used to fine-tune the Ling-Coder Lite model. It contains over 5 million English and Chinese samples covering more than 20 programming languages and topics such as text-to-code, code completion, code execution reasoning, complex algorithm question answering, and the use of popular Python libraries. The dataset was synthesized with methods similar to OSS-Instruct and Evol-Instruct: an LLM extracts key points and elaborates on them, the key points are expanded into seven different key-point combination sets, and 10 unique programming-related questions are generated for each set. An LLM then answers each question, and the results pass through a series of filtering and quality assurance steps.
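The synthesis steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `LLM` callable, the prompt wording, the random-subset scheme for building the seven combination sets, and the placeholder filter are all hypothetical. Only the counts (seven sets, 10 questions per set) come from the description.

```python
import random
from typing import Callable, Dict, List

LLM = Callable[[str], str]   # hypothetical interface: prompt in, text out

NUM_COMBINATIONS = 7         # from the description
QUESTIONS_PER_SET = 10       # from the description


def extract_key_points(seed: str, llm: LLM) -> List[str]:
    """Ask the LLM for key points, one per line (prompt wording is assumed)."""
    reply = llm(f"List and briefly explain the key points of:\n{seed}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def build_combinations(points: List[str], rng: random.Random) -> List[List[str]]:
    """Form seven key-point sets. Random subsets are an assumption; the
    description does not say how the sets are built."""
    combos = []
    for _ in range(NUM_COMBINATIONS):
        size = rng.randint(1, len(points))
        combos.append(rng.sample(points, size))
    return combos


def synthesize(seed: str, llm: LLM, rng_seed: int = 0) -> List[Dict[str, str]]:
    """Run extraction, question generation, answering, and a placeholder filter."""
    rng = random.Random(rng_seed)
    points = extract_key_points(seed, llm)
    if not points:
        return []
    samples, seen = [], set()
    for combo in build_combinations(points, rng):
        topic = "; ".join(combo)
        for i in range(QUESTIONS_PER_SET):
            question = llm(f"Write programming question #{i + 1} about: {topic}")
            answer = llm(f"Answer the question:\n{question}")
            # Placeholder filter (assumed): drop empty answers and repeated questions.
            if not answer.strip() or question in seen:
                continue
            seen.add(question)
            samples.append({"question": question, "answer": answer})
    return samples


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if prompt.startswith("List"):
        return "recursion\nmemoization\nbig-O analysis"
    if prompt.startswith("Write"):
        return "Q: " + prompt
    return "A: " + prompt
```

Calling `synthesize("some seed text", fake_llm)` yields at most 7 × 10 = 70 question-answer pairs per seed, and fewer when the filter drops duplicates. A real pipeline would replace `fake_llm` with a model call and the placeholder filter with the actual quality checks, which this page does not detail.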

Provided by:

inclusionAI
