CIIRC-NLP/alquistcoder2025_SFT_dataset

Name: CIIRC-NLP/alquistcoder2025_SFT_dataset
Creator: CIIRC-NLP
Published: 2025-12-16 16:33:41
License: 暂无描述

Hugging Face2025-12-16 更新2025-12-20 收录

下载链接：

https://hf-mirror.com/datasets/CIIRC-NLP/alquistcoder2025_SFT_dataset

下载链接

链接失效反馈

官方服务：

资源简介：

AlquistCoder SFT数据集（安全编码+对话）是一个用于监督微调（SFT）的数据集，旨在训练紧凑型编码模型，以生成安全的Python代码并保持对话能力。数据集整合了合成任务系列和精选的公共数据，包括三个主要部分：通用和对话式编码（多轮对话、迭代开发）、恶意编码请求与原则性拒绝、安全编码（围绕易受攻击的代码片段构建的提示，助手目标提供修补后的无漏洞代码）。数据集还包含少量公共补充数据，如OpenCode-Instruct的子样本和精选的安全问答资源。数据集的结构为JSONL格式，每条记录都是一个聊天风格的SFT样本，包含id、split、family、messages和meta等字段。数据集的安全措施包括恶意请求与拒绝配对、静态分析确保安全目标等。

The AlquistCoder SFT Dataset (Secure Coding + Conversations) is a supervised fine-tuning (SFT) dataset designed to train a compact coding model to produce secure, helpful Python code while maintaining conversational competence. The dataset integrates synthetic task families and curated public data, including three main components: general and conversational coding (multi-turn dialogues, iterative development), malicious coding requests paired with principled, educational refusals, and secure coding (prompts built around vulnerable snippets with assistant targets that provide patched, vulnerability-free code). The dataset also includes small complements of public data, such as a subsample of OpenCode-Instruct and selected security QA sources. The dataset is structured in JSONL format, with each record being a chat-style SFT sample containing fields like id, split, family, messages, and meta. Safety measures include pairing malicious requests with refusals and static analysis to ensure secure targets.

提供机构：

CIIRC-NLP

5,000+

优质数据集

54 个

任务类型

进入经典数据集