five

CIIRC-NLP/alquistcoder2025_SFT_dataset

收藏
Hugging Face2025-12-16 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/CIIRC-NLP/alquistcoder2025_SFT_dataset
下载链接
链接失效反馈
官方服务:
资源简介:
AlquistCoder SFT数据集(安全编码+对话)是一个用于监督微调(SFT)的数据集,旨在训练紧凑型编码模型,以生成安全的Python代码并保持对话能力。数据集整合了合成任务系列和精选的公共数据,包括三个主要部分:通用和对话式编码(多轮对话、迭代开发)、恶意编码请求与原则性拒绝、安全编码(围绕易受攻击的代码片段构建的提示,助手目标提供修补后的无漏洞代码)。数据集还包含少量公共补充数据,如OpenCode-Instruct的子样本和精选的安全问答资源。数据集的结构为JSONL格式,每条记录都是一个聊天风格的SFT样本,包含id、split、family、messages和meta等字段。数据集的安全措施包括恶意请求与拒绝配对、静态分析确保安全目标等。

The AlquistCoder SFT Dataset (Secure Coding + Conversations) is a supervised fine-tuning (SFT) dataset designed to train a compact coding model to produce secure, helpful Python code while maintaining conversational competence. The dataset integrates synthetic task families and curated public data, including three main components: general and conversational coding (multi-turn dialogues, iterative development), malicious coding requests paired with principled, educational refusals, and secure coding (prompts built around vulnerable snippets with assistant targets that provide patched, vulnerability-free code). The dataset also includes small complements of public data, such as a subsample of OpenCode-Instruct and selected security QA sources. The dataset is structured in JSONL format, with each record being a chat-style SFT sample containing fields like id, split, family, messages, and meta. Safety measures include pairing malicious requests with refusals and static analysis to ensure secure targets.
提供机构:
CIIRC-NLP
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作