ivitopow/SecuCoder

Name: ivitopow/SecuCoder
Creator: ivitopow
Published: 2026-03-31 16:07:20
License: no description provided

Hugging Face, updated 2026-03-31, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/ivitopow/SecuCoder

Description:

---
license: cc-by-nc-sa-4.0
language:
- en
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- code
- security
- python
- vulnerability
- cwe
- sft
- cybersecurity
- secure-coding
- fine-tuning
- unsloth
- axolotl
pretty_name: SecuCoder Messages Corpus
size_categories:
- 1K<n<10K
---

# SecuCoder: Messages Corpus

SecuCoder is a supervised fine-tuning (SFT) dataset for training Python code generation models with a focus on **security**. It contains instruction-following examples in `messages` (chat) format covering vulnerability remediation, secure code generation, and vulnerability classification in Python.

This dataset is part of the **SecuCoder** project, whose goal is to build language models capable of producing secure, production-ready Python code and identifying common security weaknesses.

---

## Dataset Summary

| Split | Examples |
|-------|----------|
| train | 5,708 |
| validation | 317 |
| test | 317 |
| **Total** | **6,342** |

### Task distribution

| Task | Examples | Description |
|------|----------|-------------|
| `fix` | 4,037 | Fix a security vulnerability in a Python snippet |
| `conversation` | 2,210 | Multi-turn conversation about secure coding practices |
| `classify` | 52 | Classify a snippet as SECURE or VULNERABLE |
| `prompt_to_code` | 43 | Generate secure Python code from a natural language prompt |

---

## Format

Each example follows the **messages** format, compatible with SFTTrainer, Unsloth, and Axolotl:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a secure Python assistant. Help identify, explain, and fix security issues in Python code. Prefer safe, practical, and production-ready solutions."
    },
    {
      "role": "user",
      "content": "Fix the security vulnerability in this Python code.\n\n```python\nname = request.args.get('name')\nresp = make_response(\"Your name is \" + name)\n```\n\nCWE: CWE-079"
    },
    {
      "role": "assistant",
      "content": "```python\nname = request.args.get('name')\nresp = make_response(\"Your name is {}\".format(name))\n```"
    }
  ],
  "meta": {
    "task": "fix",
    "language": "python",
    "cwe": ["CWE-079"],
    "syntax_ok": true
  }
}
```

Each record also includes a `meta` field with: `task`, `language`, `source`, `dataset_style`, `cwe` (when applicable), and `syntax_ok` (the result of Python syntax validation on the assistant output).

---

## CWE Coverage

The dataset covers a wide range of Common Weakness Enumeration (CWE) categories. The most represented are:

| CWE | Description | Examples |
|-----|-------------|----------|
| CWE-020 | Improper Input Validation | 263 |
| CWE-079 | Cross-site Scripting (XSS) | 250 |
| CWE-601 | Open Redirect | 240 |
| CWE-022 | Path Traversal | 239 |
| CWE-502 | Deserialization of Untrusted Data | 211 |
| CWE-611 | XML External Entity (XXE) | 195 |
| CWE-117 | Improper Output Neutralization for Logs | 181 |
| CWE-089 | SQL Injection | 128 |
| CWE-094 | Code Injection | 126 |
| CWE-078 | OS Command Injection | 120 |

---

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("ivitopow/secucoder")

# Access a training example
example = dataset["train"][0]
for msg in example["messages"]:
    print(f"[{msg['role']}]: {msg['content'][:100]}...")
```

### Fine-tuning with Unsloth / Axolotl

This dataset is directly compatible with the `messages` format expected by Unsloth and Axolotl for SFT training. No preprocessing is needed.

```python
# With TRL / SFTTrainer
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    ...
)
```

---

## Construction

The corpus was built using a custom pipeline (`01_data`) that:

1. Ingests heterogeneous security datasets from multiple sources.
2. Normalises schemas, mapping source fields to the canonical `messages` format.
3. Deduplicates using SHA-1 (exact) and SimHash (near-duplicate) strategies.
4. Validates Python syntax on assistant outputs.
5. Splits into train / val / test (90 / 5 / 5).

### Source datasets

This corpus was compiled and derived from the following publicly available datasets:

- [CodeLLMExp](https://huggingface.co/datasets/CodeLLMExp): vulnerability fix examples
- [scthornton/securecode-mlai](https://huggingface.co/datasets/scthornton/securecode-mlai): secure coding conversations
- [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web): web security conversations
- [cmonplz/Python_Vulnerability_Remediation](https://huggingface.co/datasets/cmonplz/Python_Vulnerability_Remediation): vulnerability remediation pairs
- [CyberNative/Code_Vulnerability_Security_SFT](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_SFT): secure programming examples
- [darkknight25/vulerable_codes_programming_languages_dataset](https://huggingface.co/datasets/darkknight25/vulerable_codes_programming_languages_dataset): vulnerable code samples
- [codelmsec/prompt_code_pairs](https://huggingface.co/datasets/codelmsec/prompt_code_pairs): prompt-to-code pairs

> If you are the author of one of these datasets and have concerns about its inclusion, please open an issue.

---

## Limitations

- All examples are in **English** and cover **Python** only.
- The `conversation` subset is less structured and may contain off-topic turns.
- CWE labels come from source datasets and have not been independently verified.
- The `classify` and `prompt_to_code` tasks are underrepresented compared to `fix`.

---

## License

This dataset is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
You are free to share and adapt this dataset for **non-commercial purposes**, as long as you give appropriate credit and distribute any derivatives under the same license.

Note that individual source datasets may carry their own licenses. Please review them before use.

---

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{secucoder_dataset,
  title   = {SecuCoder Messages Corpus},
  author  = {SecuCoder Project},
  year    = {2025},
  license = {CC-BY-NC-SA-4.0},
  url     = {https://huggingface.co/datasets/ivitopow/secucoder}
}
```

---

## Related

- 🤖 **SecuCoder Model**: a fine-tuned model trained on this corpus, `ivitopow/secucoder`
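
---

## Appendix: checking a record against the documented format

The Format section describes each record as a `messages` list plus a `meta` object. The sketch below shows one way such a record could be sanity-checked before training. The function name `check_record` and the exact set of checks are illustrative assumptions, not part of the SecuCoder pipeline; the role names and task names are taken from the card.

```python
import json

# Roles and task names as listed in the dataset card.
ALLOWED_ROLES = {"system", "user", "assistant"}
KNOWN_TASKS = {"fix", "conversation", "classify", "prompt_to_code"}


def check_record(record):
    """Return a list of problems found in one record; an empty list means it looks well-formed.

    Hypothetical helper for illustration only.
    """
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' is missing or empty"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")

    meta = record.get("meta")
    if not isinstance(meta, dict):
        problems.append("'meta' is missing")
    elif meta.get("task") not in KNOWN_TASKS:
        problems.append(f"meta.task: unexpected value {meta.get('task')!r}")
    return problems


# A shortened record in the same shape as the card's example (contents abbreviated).
sample = json.loads("""
{
  "messages": [
    {"role": "system", "content": "You are a secure Python assistant."},
    {"role": "user", "content": "Fix the security vulnerability in this Python code."},
    {"role": "assistant", "content": "resp = make_response('ok')"}
  ],
  "meta": {"task": "fix", "language": "python", "cwe": ["CWE-079"], "syntax_ok": true}
}
""")

print(check_record(sample))  # an empty list means no problems were found
```

The same function can be mapped over `dataset["train"]` once the dataset is loaded; it only inspects structure and makes no judgement about whether the assistant's fix is actually secure.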
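
---

## Appendix: sketch of exact and near-duplicate removal

Step 3 of the Construction section names SHA-1 for exact duplicates and SimHash for near-duplicates, but the card does not give the implementation. The following is a minimal sketch of how those two strategies are commonly combined. Everything specific here is an assumption: the whitespace normalisation, the word-level tokenisation, the 64-bit fingerprint, and the Hamming-distance threshold of 3 are illustrative choices, and the linear scan over kept fingerprints is only practical for small corpora.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; the card does not state it)


def sha1_key(text):
    """Exact-duplicate key: SHA-1 of the text with whitespace runs collapsed."""
    normalised = " ".join(text.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def simhash(text):
    """64-bit SimHash over lowercase word tokens, each token weighted equally."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 8 bytes = 64 bits
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Keep the first occurrence of each text, dropping exact and near duplicates."""
    seen_exact = set()
    kept_fingerprints = []
    kept = []
    for text in texts:
        key = sha1_key(text)
        if key in seen_exact:
            continue  # exact duplicate after whitespace normalisation
        fingerprint = simhash(text)
        if any(hamming(fingerprint, other) <= max_distance for other in kept_fingerprints):
            continue  # near duplicate of something already kept
        seen_exact.add(key)
        kept_fingerprints.append(fingerprint)
        kept.append(text)
    return kept


# The second item differs only in whitespace, so the SHA-1 pass drops it.
print(deduplicate(["x = 1", "x  =  1"]))
```

Running the exact pass first is cheap and removes most repeats before the costlier fingerprint comparison; a production pipeline would typically index fingerprints by bit bands instead of scanning them all.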
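
---

## Appendix: sketch of syntax validation and the 90 / 5 / 5 split

Steps 4 and 5 of the Construction section validate Python syntax on assistant outputs and split the corpus 90 / 5 / 5. The sketch below illustrates both under stated assumptions: it treats fenced `python` blocks as the code to check and falls back to the whole reply when no fence is found, and it derives split sizes by rounding the 5% shares and assigning the remainder to train. The names `syntax_ok` and `split_records`, the fallback rule, the rounding rule, and the seed are all hypothetical; the card does not say how the original pipeline handles these details.

```python
import ast
import random
import re

# Three backticks, built programmatically so this example embeds cleanly in Markdown.
TICKS = "`" * 3
# Matches a fenced block with an optional "python" language tag.
FENCE = re.compile(TICKS + r"(?:python)?\n(.*?)" + TICKS, re.DOTALL)


def syntax_ok(assistant_content):
    """True if every fenced block (or the whole text when unfenced) parses as Python."""
    blocks = FENCE.findall(assistant_content) or [assistant_content]
    for code in blocks:
        try:
            ast.parse(code)
        except (SyntaxError, ValueError):
            return False
    return True


def split_records(records, seed=0, val_frac=0.05, test_frac=0.05):
    """Shuffle deterministically, then cut into train / validation / test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_val - n_test  # remainder goes to train
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


good_reply = TICKS + "python\nname = request.args.get('name')\n" + TICKS
bad_reply = TICKS + "python\ndef broken(:\n    pass\n" + TICKS
print(syntax_ok(good_reply), syntax_ok(bad_reply))

train, val, test = split_records(range(6342))
print(len(train), len(val), len(test))  # 5708 317 317, matching the Dataset Summary table
```

With 6,342 records this rounding rule happens to reproduce the split sizes reported in the Dataset Summary, which is consistent with, but does not confirm, how the original split was computed.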
