ivitopow/SecuCoder
Hugging Face · Updated 2026-03-31 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ivitopow/SecuCoder
---
license: cc-by-nc-sa-4.0
language:
- en
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- code
- security
- python
- vulnerability
- cwe
- sft
- cybersecurity
- secure-coding
- fine-tuning
- unsloth
- axolotl
pretty_name: SecuCoder Messages Corpus
size_categories:
- 1K<n<10K
---
# SecuCoder — Messages Corpus
SecuCoder is a supervised fine-tuning (SFT) dataset for training Python code generation models with a focus on **security**. It contains instruction-following examples in `messages` (chat) format covering vulnerability remediation, secure code generation, and vulnerability classification in Python.
This dataset is part of the **SecuCoder** project, whose goal is to build language models capable of producing secure, production-ready Python code and identifying common security weaknesses.
---
## Dataset Summary
| Split | Examples |
|-------|----------|
| train | 5,708 |
| validation | 317 |
| test | 317 |
| **Total** | **6,342** |
### Task distribution
| Task | Examples | Description |
|------|----------|-------------|
| `fix` | 4,037 | Fix a security vulnerability in a Python snippet |
| `conversation` | 2,210 | Multi-turn conversation about secure coding practices |
| `classify` | 52 | Classify a snippet as SECURE or VULNERABLE |
| `prompt_to_code` | 43 | Generate secure Python code from a natural language prompt |
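As a quick sanity check on the two tables above, the sketch below recomputes each task's share from the published counts and shows one way to tally tasks from loaded records. It assumes the task label sits at `meta["task"]`, as in the record shown under Format. The helper name `task_counts` and the three sample records are made up for illustration.

```python
from collections import Counter

# Counts copied from the task distribution table.
PUBLISHED = {"fix": 4037, "conversation": 2210, "classify": 52, "prompt_to_code": 43}

total = sum(PUBLISHED.values())
shares = {task: n / total for task, n in PUBLISHED.items()}
for task, share in shares.items():
    print(f"{task:15s} {share:6.2%}")


def task_counts(records):
    """Tally records by their `meta["task"]` label (hypothetical helper)."""
    return Counter(rec["meta"]["task"] for rec in records)


# Hand-made stand-ins for real records, for illustration only.
sample = [
    {"messages": [], "meta": {"task": "fix"}},
    {"messages": [], "meta": {"task": "fix"}},
    {"messages": [], "meta": {"task": "classify"}},
]
counts = task_counts(sample)
```

With the real data, `task_counts(dataset["train"])` should behave the same way, since iterating over a split yields plain dictionaries.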
---
## Format
Each example follows the **messages** format, compatible with SFTTrainer, Unsloth, and Axolotl:
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a secure Python assistant. Help identify, explain, and fix security issues in Python code. Prefer safe, practical, and production-ready solutions."
    },
    {
      "role": "user",
      "content": "Fix the security vulnerability in this Python code.\n\n```python\nname = request.args.get('name')\nresp = make_response(\"Your name is \" + name)\n```\n\nCWE: CWE-079"
    },
    {
      "role": "assistant",
      "content": "```python\nname = request.args.get('name')\nresp = make_response(\"Your name is {}\".format(name))\n```"
    }
  ],
  "meta": {
    "task": "fix",
    "language": "python",
    "cwe": ["CWE-079"],
    "syntax_ok": true
  }
}
```
Each record also includes a `meta` field with: `task`, `language`, `source`, `dataset_style`, `cwe` (when applicable), and `syntax_ok` (Python syntax validation of the output).
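The `meta` field makes it straightforward to select subsets. The following is a minimal sketch, assuming records are plain dictionaries shaped like the example above; `is_clean_fix` and `last_assistant_message` are hypothetical helper names, and the sample record is abbreviated for illustration.

```python
def is_clean_fix(record):
    """True for `fix` records whose assistant output passed the syntax check."""
    meta = record.get("meta") or {}
    return meta.get("task") == "fix" and meta.get("syntax_ok") is True


def last_assistant_message(record):
    """Return the content of the final assistant turn, or None if there is none."""
    for msg in reversed(record["messages"]):
        if msg["role"] == "assistant":
            return msg["content"]
    return None


# Abbreviated stand-in for a real record (illustrative only).
record = {
    "messages": [
        {"role": "system", "content": "You are a secure Python assistant."},
        {"role": "user", "content": "Fix the security vulnerability in this Python code."},
        {"role": "assistant", "content": "resp = make_response(escape(name))"},
    ],
    "meta": {"task": "fix", "language": "python", "cwe": ["CWE-079"], "syntax_ok": True},
}

print(is_clean_fix(record))
print(last_assistant_message(record))
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset["train"].filter(is_clean_fix)`.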
---
## CWE Coverage
The dataset covers a wide range of Common Weakness Enumeration (CWE) categories. The most represented are:
| CWE | Description | Examples |
|-----|-------------|----------|
| CWE-020 | Improper Input Validation | 263 |
| CWE-079 | Cross-site Scripting (XSS) | 250 |
| CWE-601 | Open Redirect | 240 |
| CWE-022 | Path Traversal | 239 |
| CWE-502 | Deserialization of Untrusted Data | 211 |
| CWE-611 | XML External Entity (XXE) | 195 |
| CWE-117 | Improper Output Neutralization for Logs | 181 |
| CWE-089 | SQL Injection | 128 |
| CWE-094 | Code Injection | 126 |
| CWE-078 | OS Command Injection | 120 |
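A table like the one above can be rebuilt by tallying the `cwe` lists in `meta`. The sketch below assumes `cwe` is a list of zero-padded labels such as `CWE-079` and may be missing or empty when no CWE applies. Both helper names are hypothetical. `canonical_cwe` is an optional convenience that strips the zero padding, since MITRE writes these identifiers without it (for example `CWE-79`).

```python
import re
from collections import Counter


def cwe_counts(records):
    """Count how many records carry each CWE label."""
    counts = Counter()
    for rec in records:
        # `cwe` may be absent or None for tasks without a CWE label.
        for label in (rec.get("meta") or {}).get("cwe") or []:
            counts[label] += 1
    return counts


def canonical_cwe(label):
    """Turn a zero-padded label such as "CWE-079" into "CWE-79"."""
    match = re.fullmatch(r"CWE-0*(\d+)", label.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a CWE label: {label!r}")
    return f"CWE-{int(match.group(1))}"


# Hand-made stand-ins for real records (illustrative only).
sample = [
    {"meta": {"task": "fix", "cwe": ["CWE-079"]}},
    {"meta": {"task": "fix", "cwe": ["CWE-079", "CWE-020"]}},
    {"meta": {"task": "conversation", "cwe": None}},
    {"meta": {"task": "prompt_to_code"}},
]
counts = cwe_counts(sample)
print(counts.most_common())
print(canonical_cwe("CWE-079"))
```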
---
## Usage
```python
from datasets import load_dataset

# Load all splits (train / validation / test)
dataset = load_dataset("ivitopow/secucoder")

# Access a training example
example = dataset["train"][0]
for msg in example["messages"]:
    print(f"[{msg['role']}]: {msg['content'][:100]}...")
```
### Fine-tuning with Unsloth / Axolotl
The dataset already uses the `messages` format that Unsloth and Axolotl expect for SFT training, so no preprocessing is needed.
```python
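# With TRL / SFTTrainer
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,  # a model (or model name) prepared beforehand
    train_dataset=dataset["train"],  # `dataset` as loaded in the Usage example
    # ... remaining trainer arguments (elided in the original)
)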
```
---
## Construction
The corpus was built using a custom pipeline (`01_data`) that:
1. Ingests heterogeneous security datasets from multiple sources.
2. Normalises schemas, mapping source fields to the canonical `messages` format.
3. Deduplicates using SHA-1 (exact) and SimHash (near-duplicate) strategies.
4. Validates Python syntax on assistant outputs.
5. Splits into train / val / test (90 / 5 / 5).
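The `01_data` pipeline itself is not shown here, so the following is only a rough sketch of how steps 3 to 5 could look, not the project's actual implementation. Every function name is hypothetical, and the details are assumptions: whitespace normalisation before hashing, word-level tokens for a 64-bit SimHash, a Hamming-distance threshold of 3, `ast.parse` as the syntax check, and a seeded shuffle with rounded split sizes.

```python
import ast
import hashlib
import random
import re


def sha1_key(text):
    """Exact-duplicate key: SHA-1 of the whitespace-normalised text."""
    return hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()


def simhash(text, bits=64):
    """SimHash fingerprint built from lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha1(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedupe(texts, max_distance=3):
    """Drop exact duplicates (SHA-1) and near duplicates (SimHash)."""
    seen_keys, seen_prints, kept = set(), [], []
    for text in texts:
        key = sha1_key(text)
        if key in seen_keys:
            continue
        fingerprint = simhash(text)
        if any(hamming(fingerprint, p) <= max_distance for p in seen_prints):
            continue
        seen_keys.add(key)
        seen_prints.append(fingerprint)
        kept.append(text)
    return kept


def syntax_ok(code):
    """True if the code parses as Python."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def split_90_5_5(items, seed=0):
    """Shuffle with a fixed seed, then split into train / val / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = n_test = round(len(items) * 0.05)
    n_train = len(items) - n_val - n_test
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


train, val, test = split_90_5_5(range(6342))
print(len(train), len(val), len(test))
```

Under these rounding assumptions, 6,342 examples split into 5,708 / 317 / 317, which matches the split sizes in the Dataset Summary. The linear scan over fingerprints in `dedupe` keeps the sketch short; a real pipeline over larger corpora would likely bucket fingerprints instead.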
### Source datasets
This corpus was compiled and derived from the following publicly available datasets:
- [CodeLLMExp](https://huggingface.co/datasets/CodeLLMExp) — vulnerability fix examples
- [scthornton/securecode-mlai](https://huggingface.co/datasets/scthornton/securecode-mlai) — secure coding conversations
- [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) — web security conversations
- [cmonplz/Python_Vulnerability_Remediation](https://huggingface.co/datasets/cmonplz/Python_Vulnerability_Remediation) — vulnerability remediation pairs
- [CyberNative/Code_Vulnerability_Security_SFT](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_SFT) — secure programming examples
- [darkknight25/vulerable_codes_programming_languages_dataset](https://huggingface.co/datasets/darkknight25/vulerable_codes_programming_languages_dataset) — vulnerable code samples
- [codelmsec/prompt_code_pairs](https://huggingface.co/datasets/codelmsec/prompt_code_pairs) — prompt-to-code pairs
> If you are the author of one of these datasets and have concerns about its inclusion, please open an issue.
---
## Limitations
- All examples are in **English** and cover **Python** only.
- The `conversation` subset is less structured and may contain off-topic turns.
- CWE labels come from source datasets and have not been independently verified.
- The `classify` and `prompt_to_code` tasks are underrepresented compared to `fix`.
---
## License
This dataset is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
You are free to share and adapt this dataset for **non-commercial purposes**, as long as you give appropriate credit and distribute any derivatives under the same license.
Note that individual source datasets may carry their own licenses. Please review them before use.
---
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{secucoder_dataset,
title = {SecuCoder Messages Corpus},
author = {SecuCoder Project},
year = {2025},
license = {CC-BY-NC-SA-4.0},
url = {https://huggingface.co/datasets/ivitopow/secucoder}
}
```
---
## Related
- 🤖 **SecuCoder Model** — Fine-tuned model trained on this corpus: `ivitopow/secucoder`