code_x_glue_cc_code_refinement

Name: code_x_glue_cc_code_refinement
Creator: maas
Published: 2025-12-05 12:14:07
License: No description provided

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-04-26.

Download link:

https://modelscope.cn/datasets/google/code_x_glue_cc_code_refinement


Resource summary:

# Dataset Card for "code_x_glue_cc_code_refinement"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
- **Paper:** https://arxiv.org/abs/2102.04664

### Dataset Summary

CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement

The data comes from the dataset released with this paper: https://arxiv.org/pdf/1812.08693.pdf. The source side is a Java function with bugs and the target side is the refined (fixed) function. All function and variable names are normalized. The dataset contains two subsets, small and medium, split by function length.

### Supported Tasks and Leaderboards

- `text2text-generation-other-debugging`: The dataset can be used to train a model for automatically fixing buggy code.

### Languages

- Java **programming** language

## Dataset Structure

### Data Instances

#### medium

An example of 'train' looks as follows.
```
{
  "buggy": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n",
  "fixed": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = null ; if ( date != null ) { VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; } VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n",
  "id": 0
}
```

#### small

An example of 'validation' looks as follows.

```
{
  "buggy": "public java.util.List < TYPE_1 > METHOD_1 ( ) { java.util.ArrayList < TYPE_1 > VAR_1 = new java.util.ArrayList < TYPE_1 > ( ) ; for ( TYPE_2 VAR_2 : VAR_3 ) { VAR_1 . METHOD_2 ( VAR_2 . METHOD_1 ( ) ) ; } return VAR_1 ; } \n",
  "fixed": "public java.util.List < TYPE_1 > METHOD_1 ( ) { return VAR_1 ; } \n",
  "id": 0
}
```

### Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

#### medium, small

| field name | type   | description                     |
|------------|--------|---------------------------------|
| id         | int32  | Index of the sample             |
| buggy      | string | The buggy version of the code   |
| fixed      | string | The correct version of the code |

### Data Splits

| name   | train | validation | test |
|--------|------:|-----------:|-----:|
| medium | 52364 |       6546 | 6545 |
| small  | 46680 |       5835 | 5835 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Every public GitHub event between March 2011 and October 2017 was downloaded from GitHub Archive, using the Google BigQuery APIs.

[More Information Needed]

#### Who are the source language producers?

Software engineering developers.
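The record layout described under Data Fields can be explored with a short script. The following is a minimal Python sketch, not part of the original card: it copies the `small` validation example shown under Data Instances and uses the standard-library `difflib` to list the token spans that differ between `buggy` and `fixed`. The helper name `token_edits` and the whitespace tokenization are assumptions made for illustration.

```python
import difflib

# The 'small' validation example from the card, written as a Python dict.
record = {
    "buggy": (
        "public java.util.List < TYPE_1 > METHOD_1 ( ) { "
        "java.util.ArrayList < TYPE_1 > VAR_1 = new java.util.ArrayList < TYPE_1 > ( ) ; "
        "for ( TYPE_2 VAR_2 : VAR_3 ) { VAR_1 . METHOD_2 ( VAR_2 . METHOD_1 ( ) ) ; } "
        "return VAR_1 ; } \n"
    ),
    "fixed": "public java.util.List < TYPE_1 > METHOD_1 ( ) { return VAR_1 ; } \n",
    "id": 0,
}


def token_edits(buggy: str, fixed: str):
    """Return (operation, buggy_tokens, fixed_tokens) for each differing span.

    Hypothetical helper: tokens are obtained by splitting on whitespace,
    which suits the space-separated code strings shown in the card.
    """
    a, b = buggy.split(), fixed.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


edits = token_edits(record["buggy"], record["fixed"])
for op, removed, added in edits:
    print(op, "| buggy:", " ".join(removed), "| fixed:", " ".join(added))
```

Because both fields are already space-tokenized, a token-level diff like this is a quick way to see what kind of change a given bug/fix pair encodes before training on it.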
### Annotations

#### Annotation process

Automatically annotated by filtering commit messages containing the pattern: ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"). A statistically significant sample (95% confidence level with 5% confidence interval) was manually evaluated by two authors to check whether the filtered bug/fix pairs were correct. After all disagreements were settled, the authors concluded that 97.6% were true positives.

#### Who are the annotators?

Heuristics and the authors of the paper.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@article{DBLP:journals/corr/abs-2102-04664,
  author  = {Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and
             Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and
             Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and
             Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and
             Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and
             Shao Kun Deng and Shengyu Fu and Shujie Liu},
  title   = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code
             Understanding and Generation},
  journal = {CoRR},
  volume  = {abs/2102.04664},
  year    = {2021}
}

@article{tufano2019empirical,
  title     = {An empirical study on learning bug-fixing patches in the wild
               via neural machine translation},
  author    = {Tufano, Michele and Watson, Cody and Bavota, Gabriele and
               Penta, Massimiliano Di and White, Martin and Poshyvanyk, Denys},
  journal   = {ACM Transactions on Software Engineering and Methodology (TOSEM)},
  volume    = {28},
  number    = {4},
  pages     = {1--29},
  year      = {2019},
  publisher = {ACM New York, NY, USA}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
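The commit-message filter described under Annotation process can be sketched in a few lines. This is an illustrative Python sketch, not the authors' implementation: the card gives only the keyword pattern, so the lowercase substring matching and the function name `looks_like_bug_fix` are assumptions.

```python
# Keyword groups taken from the pattern quoted in the card:
# ("fix" or "solve") and ("bug" or "issue" or "problem" or "error").
FIX_WORDS = ("fix", "solve")
BUG_WORDS = ("bug", "issue", "problem", "error")


def looks_like_bug_fix(message: str) -> bool:
    """Return True if a commit message matches the keyword pattern.

    Assumption: matching is case-insensitive and substring-based, so
    "Fixed" and "bugs" also match. The card does not specify this.
    """
    text = message.lower()
    has_fix_word = any(word in text for word in FIX_WORDS)
    has_bug_word = any(word in text for word in BUG_WORDS)
    return has_fix_word and has_bug_word


# Hypothetical commit messages, made up for illustration.
for msg in ("Fixed bug in parser", "Fix typo in README", "Solve the timeout problem"):
    print(f"{msg!r}: {looks_like_bug_fix(msg)}")
```

A keyword filter of this kind can admit commits that are not real bug fixes, which is consistent with the card reporting a manual check of a sample of the filtered pairs.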


Provider: maas

Created: 2025-04-21
