code_x_glue_cc_defect_detection

Name: code_x_glue_cc_defect_detection
Creator: maas
Published: 2025-11-07 16:30:57
License: No description available

ModelScope community. Updated 2025-11-07; indexed 2025-04-26.

Download link:

https://modelscope.cn/datasets/google/code_x_glue_cc_defect_detection

Resource description:

# Dataset Card for "code_x_glue_cc_defect_detection"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection

### Dataset Summary

CodeXGLUE Defect-detection dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection

Given a piece of source code, the task is to identify whether it is insecure code that may expose software systems to attack, for example through resource leaks, use-after-free vulnerabilities, or DoS attacks. We treat the task as binary classification (0/1), where 1 stands for insecure code and 0 for secure code.

The dataset comes from the paper Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. We combine all projects and split them 80%/10%/10% for training/dev/test.

### Supported Tasks and Leaderboards

- `multi-class-classification`: The dataset can be used to train a model that detects whether code contains a defect.
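The 80%/10%/10% split described in the Dataset Summary can be sketched as follows. This is a minimal illustration, not the authors' procedure: the shuffle, the seed, the rounding of the boundaries, and the helper name `split_80_10_10` are all assumptions, since the card does not say how samples were ordered or how the split points were computed.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items deterministically and cut them into train/dev/test.

    Hypothetical helper: the seed and the rounding rule are assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(0.8 * n)  # 80% for training
    n_valid = round(0.1 * n)  # 10% for dev/validation

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains (about 10%)
    return train, valid, test


# 27318 = 21854 + 2732 + 2732, the total number of samples listed under Data Splits.
train, valid, test = split_80_10_10(range(27318))
print(len(train), len(valid), len(test))  # prints: 21854 2732 2732
```

With this rounding rule the sizes happen to match the published split sizes, but the membership of each split will differ from the released files, so use the released splits for any comparison with reported results.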

### Languages

- C **programming** language

## Dataset Structure

### Data Instances

An example of 'validation' looks as follows.

```
{
  "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
  "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)\n{\n MirrorState *s = FILTER_MIRROR(nf);\n Chardev *chr;\n chr = qemu_chr_find(s->outdev);\n if (chr == NULL) {\n error_set(errp, ERROR_CLASS_DEVICE_NOT_FOUND,\n \"Device '%s' not found\", s->outdev);\n qemu_chr_fe_init(&s->chr_out, chr, errp);",
  "id": 8,
  "project": "qemu",
  "target": true
}
```

### Data Fields

Each data field is described below for each config. The data fields are the same among all splits.

#### default

|field name|type  |description                                                        |
|----------|------|-------------------------------------------------------------------|
|id        |int32 |Index of the sample                                                |
|func      |string|The source code                                                    |
|target    |bool  |1 (true) if the code is insecure (vulnerable), 0 (false) otherwise |
|project   |string|Original project that contains this code                           |
|commit_id |string|Commit identifier in the original project                          |

### Data Splits

| name  |train|validation|test|
|-------|----:|---------:|---:|
|default|21854|      2732|2732|

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@inproceedings{zhou2019devign,
  title={Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks},
  author={Zhou, Yaqin and Liu, Shangqing and Siow, Jingkai and Du, Xiaoning and Liu, Yang},
  booktitle={Advances in Neural Information Processing Systems},
  pages={10197--10207},
  year={2019}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
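
As a supplement to the Data Fields section of this card, the record layout can be checked in code. The sketch below is only an illustration under stated assumptions: the `DefectRecord` class and `parse_record` helper are hypothetical names, not part of the dataset or of any library, and the `func` value is a placeholder rather than the full function from the Data Instances example.

```python
import json
from dataclasses import dataclass

# Field names taken from the Data Fields table of this card.
FIELDS = ("id", "func", "target", "project", "commit_id")


@dataclass(frozen=True)
class DefectRecord:
    """Hypothetical container for one sample."""
    id: int
    func: str
    target: bool
    project: str
    commit_id: str

    @property
    def label(self) -> int:
        # Per the Dataset Summary: 1 = insecure code, 0 = secure code.
        return int(self.target)


def parse_record(raw: str) -> DefectRecord:
    """Parse one JSON object and verify that it has the documented fields."""
    data = json.loads(raw)
    missing = [name for name in FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["target"], bool):
        raise TypeError("target must be a boolean")
    return DefectRecord(**{name: data[name] for name in FIELDS})


raw = json.dumps({
    "commit_id": "aa1530dec499f7525d2ccaa0e3a876dc8089ed1e",
    # Placeholder body; the real sample holds the full C function text.
    "func": "static void filter_mirror_setup(NetFilterState *nf, Error **errp)",
    "id": 8,
    "project": "qemu",
    "target": True,
})
record = parse_record(raw)
print(record.project, record.label)  # prints: qemu 1
```

Keeping `target` as a boolean and deriving the integer label in one place avoids mixing up the 0/1 convention when the samples are fed to a classifier.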


Provider: maas

Created: 2025-04-21
