code_x_glue_cc_defect_detection
Source: ModelScope community (updated 2025-11-07, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/code_x_glue_cc_defect_detection
Resource description:
# Dataset Card for "code_x_glue_cc_defect_detection"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection
### Dataset Summary
CodeXGLUE Defect-detection dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection
Given a piece of source code, the task is to identify whether it is insecure code that may expose software systems to attack, for example through resource leaks, use-after-free vulnerabilities, or DoS attacks. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure code.
The dataset comes from the paper "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks". We combine all projects and split them 80%/10%/10% into training/dev/test sets.
### Supported Tasks and Leaderboards
- `multi-class-classification`: The dataset can be used to train a model that detects whether a piece of code contains a defect. The labels themselves are binary (defective or not).
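Since each sample carries a single binary label, a model's predictions can be scored by comparing them with the `target` values. The sketch below assumes plain accuracy as the metric, which this card does not specify; the function name `accuracy` and the sample predictions are illustrative only.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], targets: Sequence[bool]) -> float:
    """Return the fraction of samples whose predicted label matches the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty set of samples")
    correct = sum(bool(p) == bool(t) for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical labels for four samples (True = insecure, False = secure).
example_targets = [True, False, True, False]
example_predictions = [True, False, False, False]
print(accuracy(example_predictions, example_targets))  # 3 of 4 match -> 0.75
```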
### Languages
- C programming language
## Dataset Structure
### Data Instances
An example from the 'validation' split looks as follows.
```
{
"commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
"func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf);\n Chardev *chr;\n chr = qemu_chr_find(s->outdev);\n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev);\n qemu_chr_fe_init(&s->chr_out, chr, errp);",
"id": 8,
"project": "qemu",
"target": true
}
```
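A record of this shape can be read as ordinary JSON. The following is a minimal sketch, assuming the record arrives as a JSON string; the `func` value is shortened here to the signature line of the example, and the integer label mapping (1 = insecure, 0 = secure) follows the Dataset Summary.

```python
import json

# Record modelled on the example instance; "func" is shortened for brevity.
raw = r'''{
  "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
  "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n",
  "id": 8,
  "project": "qemu",
  "target": true
}'''

record = json.loads(raw)

# Map the boolean target to the 0/1 label used for binary classification.
label = int(record["target"])

# The function body is a single string with embedded newlines.
first_line = record["func"].splitlines()[0]

print(record["project"], label, first_line)
```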
### Data Fields
Each data field is explained below for each config. The data fields are the same among all splits.
#### default
|field name| type | description |
|----------|------|------------------------------------------|
|id |int32 | Index of the sample |
|func |string| The source code |
|target    |bool  | 1 if the code is insecure, 0 otherwise   |
|project |string| Original project that contains this code |
|commit_id |string| Commit identifier in the original project|
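The field table can be mirrored in code as a small typed record. This is a sketch under stated assumptions: the class name `DefectSample` and its validation rules are hypothetical and are not part of the dataset or of any library; they only restate the types listed in the table.

```python
from dataclasses import dataclass
from typing import Any, Mapping

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


@dataclass(frozen=True)
class DefectSample:
    """One row of the default config, following the field table."""
    id: int          # int32 index of the sample
    func: str        # source code of the function
    target: bool     # True = insecure, False = secure
    project: str     # original project containing the code
    commit_id: str   # commit identifier in the original project

    @classmethod
    def from_mapping(cls, row: Mapping[str, Any]) -> "DefectSample":
        sample = cls(
            id=row["id"],
            func=row["func"],
            target=row["target"],
            project=row["project"],
            commit_id=row["commit_id"],
        )
        sample.validate()
        return sample

    def validate(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly for id.
        if isinstance(self.id, bool) or not isinstance(self.id, int):
            raise TypeError("id must be an integer")
        if not INT32_MIN <= self.id <= INT32_MAX:
            raise ValueError("id does not fit in int32")
        if not isinstance(self.target, bool):
            raise TypeError("target must be a bool")
        for name in ("func", "project", "commit_id"):
            if not isinstance(getattr(self, name), str):
                raise TypeError(f"{name} must be a string")


sample = DefectSample.from_mapping({
    "id": 8,
    "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n",
    "target": True,
    "project": "qemu",
    "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
})
print(sample.project, sample.target)
```

Validating at construction time keeps malformed rows (for example, an integer `target`) from silently reaching training code.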
### Data Splits
| name |train|validation|test|
|-------|----:|---------:|---:|
|default|21854| 2732|2732|
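As a quick arithmetic check, the split sizes in the table can be compared with the 80%/10%/10% proportions stated in the summary. The variable names below are illustrative.

```python
# Split sizes taken from the table above.
split_sizes = {"train": 21854, "validation": 2732, "test": 2732}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total samples: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 80%, 10%, and 10% of 27318 samples, consistent with the stated split.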
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@inproceedings{zhou2019devign,
title={Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks},
author={Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang},
booktitle={Advances in Neural Information Processing Systems},
pages={10197--10207},
year={2019}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
Provider: maas
Created: 2025-04-21



