code_x_glue_cc_code_refinement
收藏魔搭社区2025-12-05 更新2025-04-26 收录
下载链接:
https://modelscope.cn/datasets/google/code_x_glue_cc_code_refinement
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for "code_x_glue_cc_code_refinement"
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
- **Paper:** https://arxiv.org/abs/2102.04664
### Dataset Summary
CodeXGLUE code-refinement dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
We use the dataset released by this paper(https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one. All the function and variable names are normalized. Their dataset contains two subsets ( i.e.small and medium) based on the function length.
### Supported Tasks and Leaderboards
- `text2text-generation-other-debugging`: The dataset can be used to train a model for automatically fixing buggy code.
### Languages
- Java **programming** language
## Dataset Structure
### Data Instances
#### medium
An example of 'train' looks as follows.
```
{
"buggy": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n",
"fixed": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = null ; if ( date != null ) { VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; } VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }\n",
"id": 0
}
```
#### small
An example of 'validation' looks as follows.
```
{
"buggy": "public java.util.List < TYPE_1 > METHOD_1 ( ) { java.util.ArrayList < TYPE_1 > VAR_1 = new java.util.ArrayList < TYPE_1 > ( ) ; for ( TYPE_2 VAR_2 : VAR_3 ) { VAR_1 . METHOD_2 ( VAR_2 . METHOD_1 ( ) ) ; } return VAR_1 ; } \n",
"fixed": "public java.util.List < TYPE_1 > METHOD_1 ( ) { return VAR_1 ; } \n",
"id": 0
}
```
### Data Fields
In the following each data field in go is explained for each config. The data fields are the same among all splits.
#### medium, small
|field name| type | description |
|----------|------|--------------------------------|
|id |int32 | Index of the sample |
|buggy |string| The buggy version of the code |
|fixed |string| The correct version of the code|
### Data Splits
| name |train|validation|test|
|------|----:|---------:|---:|
|medium|52364| 6546|6545|
|small |46680| 5835|5835|
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
Downloaded from GitHub Archive every public GitHub event between March 2011 and October 2017 and used the Google BigQuery APIs.
[More Information Needed]
#### Who are the source language producers?
Software Engineering developers.
### Annotations
#### Annotation process
Automatically annotated by filtering commit messages containing the pattern: ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"). A statistically significant amount of samples (95% confidence level with 5% confidence interval) were manually evaluated by two authors to check if the filtered bug/fix pairs were correct. After all disagreements were settled, authors conclude that 97.6% were true positives.
#### Who are the annotators?
Heuristics and the authors of the paper.
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@article{DBLP:journals/corr/abs-2102-04664,
author = {Shuai Lu and
Daya Guo and
Shuo Ren and
Junjie Huang and
Alexey Svyatkovskiy and
Ambrosio Blanco and
Colin B. Clement and
Dawn Drain and
Daxin Jiang and
Duyu Tang and
Ge Li and
Lidong Zhou and
Linjun Shou and
Long Zhou and
Michele Tufano and
Ming Gong and
Ming Zhou and
Nan Duan and
Neel Sundaresan and
Shao Kun Deng and
Shengyu Fu and
Shujie Liu},
title = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding
and Generation},
journal = {CoRR},
volume = {abs/2102.04664},
year = {2021}
}
@article{tufano2019empirical,
title={An empirical study on learning bug-fixing patches in the wild via neural machine translation},
author={Tufano, Michele and Watson, Cody and Bavota, Gabriele and Penta, Massimiliano Di and White, Martin and Poshyvanyk, Denys},
journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
volume={28},
number={4},
pages={1--29},
year={2019},
publisher={ACM New York, NY, USA}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
# "code_x_glue_cc_code_refinement"数据集卡片
## 目录
- [数据集描述](#数据集描述)
- [数据集摘要](#数据集摘要)
- [支持的任务与排行榜](#支持的任务与排行榜)
- [语言](#语言)
- [数据集结构](#数据集结构)
- [数据实例](#数据实例)
- [数据字段](#数据字段)
- [数据划分](#数据划分)
- [数据集构建](#数据集构建)
- [构建初衷](#构建初衷)
- [源数据](#源数据)
- [标注信息](#标注信息)
- [个人与敏感信息](#个人与敏感信息)
- [数据集使用注意事项](#数据集使用注意事项)
- [数据集的社会影响](#数据集的社会影响)
- [偏差讨论](#偏差讨论)
- [其他已知局限性](#其他已知局限性)
- [附加信息](#附加信息)
- [数据集维护者](#数据集维护者)
- [许可协议](#许可协议)
- [引用信息](#引用信息)
- [贡献致谢](#贡献致谢)
## 数据集描述
- **主页:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement
- **论文:** https://arxiv.org/abs/2102.04664
### 数据集摘要
CodeXGLUE代码精修(code-refinement)数据集,可通过https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-refinement获取。
本次所用数据集源自该论文(https://arxiv.org/pdf/1812.08693.pdf)发布的原始数据。数据集的输入侧为存在缺陷的Java函数,输出侧为修复后的正确代码;所有函数名与变量名都已做标准化处理。该数据集依据函数长度分为两个子集,即small(小型)与medium(中型)。
### 支持的任务与排行榜
- `文本到文本生成-其他调试类`:该数据集可用于训练自动修复缺陷代码的模型。
### 语言
- Java编程语言
## 数据集结构
### 数据实例
#### medium(中型)
训练集的一个示例如下:
json
{
"buggy": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }
",
"fixed": "public static TYPE_1 init ( java.lang.String name , java.util.Date date ) { TYPE_1 VAR_1 = new TYPE_1 ( ) ; VAR_1 . METHOD_1 ( name ) ; java.util.Calendar VAR_2 = null ; if ( date != null ) { VAR_2 = java.util.Calendar.getInstance ( ) ; VAR_2 . METHOD_2 ( date ) ; } VAR_1 . METHOD_3 ( VAR_2 ) ; return VAR_1 ; }
",
"id": 0
}
#### small(小型)
验证集的一个示例如下:
json
{
"buggy": "public java.util.List < TYPE_1 > METHOD_1 ( ) { java.util.ArrayList < TYPE_1 > VAR_1 = new java.util.ArrayList < TYPE_1 > ( ) ; for ( TYPE_2 VAR_2 : VAR_3 ) { VAR_1 . METHOD_2 ( VAR_2 . METHOD_1 ( ) ) ; } return VAR_1 ; }
",
"fixed": "public java.util.List < TYPE_1 > METHOD_1 ( ) { return VAR_1 ; }
",
"id": 0
}
### 数据字段
下文将针对各配置逐一解释数据字段,所有划分的数据集字段均保持一致。
#### medium、small子集
|字段名| 类型 | 描述 |
|----------|------|--------------------------------|
|id |int32 | 样本索引 |
|buggy |string| 存在缺陷的代码版本 |
|fixed |string| 修复后的正确代码版本|
### 数据划分
| 数据集划分 | 训练集 | 验证集 | 测试集 |
|------|----:|---------:|---:|
|medium|52364| 6546|6545|
|small |46680| 5835|5835|
## 数据集构建
### 构建初衷
[需补充更多信息]
### 源数据
#### 初始数据收集与标准化
从GitHub Archive下载2011年3月至2017年10月期间的所有公开GitHub事件,并使用Google BigQuery API进行处理。
[需补充更多信息]
#### 源语言的创作者是谁?
软件工程开发者。
### 标注信息
#### 标注流程
采用自动标注方式:筛选包含("fix"或"solve")与("bug"或"issue"或"problem"或"error")组合关键词的提交信息,以此自动生成缺陷-修复样本对。随后由两位作者手动评估了具有统计显著性的样本子集(置信水平95%,置信区间5%),以验证筛选出的缺陷-修复对的正确性。在解决所有标注分歧后,作者确认97.6%的样本为真阳性样本。
#### 标注人员是谁?
标注采用启发式规则完成,同时由论文作者参与标注。
### 个人与敏感信息
[需补充更多信息]
## 数据集使用注意事项
### 数据集的社会影响
[需补充更多信息]
### 偏差讨论
[需补充更多信息]
### 其他已知局限性
[需补充更多信息]
## 附加信息
### 数据集维护者
https://github.com/microsoft, https://github.com/madlag
### 许可协议
计算数据使用协议(Computational Use of Data Agreement,C-UDA)。
### 引用信息
bibtex
@article{DBLP:journals/corr/abs-2102-04664,
author = {Shuai Lu and
Daya Guo and
Shuo Ren and
Junjie Huang and
Alexey Svyatkovskiy and
Ambrosio Blanco and
Colin B. Clement and
Dawn Drain and
Daxin Jiang and
Duyu Tang and
Ge Li and
Lidong Zhou and
Linjun Shou and
Long Zhou and
Michele Tufano and
Ming Gong and
Ming Zhou and
Nan Duan and
Neel Sundaresan and
Shao Kun Deng and
Shengyu Fu and
Shujie Liu},
title = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding
and Generation},
journal = {CoRR},
volume = {abs/2102.04664},
year = {2021}
}
@article{tufano2019empirical,
title={An empirical study on learning bug-fixing patches in the wild via neural machine translation},
author={Tufano, Michele and Watson, Cody and Bavota, Gabriele and Penta, Massimiliano Di and White, Martin and Poshyvanyk, Denys},
journal={ACM Transactions on Software Engineering and Methodology (TOSEM)},
volume={28},
number={4},
pages={1--29},
year={2019},
publisher={ACM New York, NY, USA}
}
### 贡献致谢
感谢@madlag(部分贡献来自@ncoop57)添加本数据集。
提供机构:
maas
创建时间:
2025-04-21



