h4iku/coconut_c2005
Source: Hugging Face · Updated: 2023-09-28 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/h4iku/coconut_c2005
Resource description:
---
tags:
- code
pretty_name: CoCoNuT-C(2005)
---
# Dataset Card for CoCoNuT-C(2005)
## Dataset Description
- **Homepage:** [CoCoNuT training data](https://github.com/lin-tan/CoCoNut-Artifact/releases/tag/training_data_1.0.0)
- **Repository:** [CoCoNuT repository](https://github.com/lin-tan/CoCoNut-Artifact)
- **Paper:** [CoCoNuT: Combining Context-Aware Neural Translation Models using Ensemble for Program Repair](https://dl.acm.org/doi/abs/10.1145/3395363.3397369)
### Dataset Summary
Part of the data used to train the models in the "CoCoNuT: Combining Context-Aware Neural Translation Models using Ensemble for Program Repair" paper.
These datasets contain raw data extracted from GitHub, GitLab, and Bitbucket, and have neither been shuffled nor tokenized.
The year in the dataset's name is the cutoff year: it is the year of the newest commit in the dataset.
### Languages
- C
## Dataset Structure
### Data Fields
The dataset consists of 4 columns: `add`, `rem`, `context`, and `meta`.
These match the original dataset files: `add.txt`, `rem.txt`, `context.txt`, and `meta.txt`.
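Because the columns mirror four parallel text files, one way to rebuild per-instance records from the original release is to read the files in lockstep. The following is a minimal sketch, assuming each file stores one instance per line and all four files have the same number of lines. The helper name `read_instances`, the directory argument, and the UTF-8 decoding with replacement are illustrative assumptions, not part of the release.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Dict, Iterator

# Column order used throughout this sketch; names follow the dataset card.
COLUMNS = ("add", "rem", "context", "meta")


def read_instances(data_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per instance by reading the four files in lockstep.

    Assumption: line i of add.txt, rem.txt, context.txt and meta.txt
    all describe the same instance.
    """
    root = Path(data_dir)
    handles = [
        open(root / f"{name}.txt", encoding="utf-8", errors="replace")
        for name in COLUMNS
    ]
    try:
        for lines in zip_longest(*handles):
            # zip_longest pads exhausted files with None, which signals
            # that the files are not aligned.
            if None in lines:
                raise ValueError("column files have different lengths")
            yield {name: line.rstrip("\n") for name, line in zip(COLUMNS, lines)}
    finally:
        for handle in handles:
            handle.close()
```

Reading lazily keeps memory use flat, which matters for a dataset with millions of rows.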
### Data Instances
The 4 columns are aligned row by row: row *i* of each column belongs to the same instance.
For example (the sample rows below are Java code):
The first 5 rows of `rem` (i.e., the buggy line/hunk):
```
1 public synchronized StringBuffer append(char ch)
2 ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this;
3 public String substring(int beginIndex, int endIndex)
4 if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) throw new StringIndexOutOfBoundsException(); if (beginIndex == 0 && endIndex == count) return this; int len = endIndex - beginIndex; return new String(value, beginIndex + offset, len, (len << 2) >= value.length);
5 public Object next() {
```
The first 5 rows of `add` (i.e., the fixed line/hunk):
```
1 public StringBuffer append(Object obj)
2 return append(obj == null ? "null" : obj.toString());
3 public String substring(int begin)
4 return substring(begin, count);
5 public FSEntry next() {
```
These map to the 5 instances:
```diff
- public synchronized StringBuffer append(char ch)
+ public StringBuffer append(Object obj)
```
```diff
- ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this;
+ return append(obj == null ? "null" : obj.toString());
```
```diff
- public String substring(int beginIndex, int endIndex)
+ public String substring(int begin)
```
```diff
- if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) throw new StringIndexOutOfBoundsException(); if (beginIndex == 0 && endIndex == count) return this; int len = endIndex - beginIndex; return new String(value, beginIndex + offset, len, (len << 2) >= value.length);
+ return substring(begin, count);
```
```diff
- public Object next() {
+ public FSEntry next() {
```
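The pairing above is purely positional. As a small sketch, row *i* of `rem` and row *i* of `add` can be rendered as a diff-style snippet like this (`to_diff` is a hypothetical helper, not something shipped with the dataset):

```python
def to_diff(rem: str, add: str) -> str:
    """Format one instance as two diff lines: the buggy hunk, then the fix."""
    return f"- {rem}\n+ {add}"


# Two of the sample rows shown above, kept in the same order in both lists.
rem_rows = [
    "public synchronized StringBuffer append(char ch)",
    "public Object next() {",
]
add_rows = [
    "public StringBuffer append(Object obj)",
    "public FSEntry next() {",
]

# zip pairs the rows by position, which is the mapping the card describes.
diffs = [to_diff(rem, add) for rem, add in zip(rem_rows, add_rows)]
print(diffs[0])
```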
`context` contains the context associated with each instance: the (in-lined) buggy function, including the buggy lines and comments.
For example, the context of
```
public synchronized StringBuffer append(char ch)
```
is its associated function:
```java
public synchronized StringBuffer append(char ch) { ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this; }
```
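Since the context is described as the buggy function including the buggy lines, a quick sanity check is whether `rem` occurs inside `context`. The sketch below assumes whitespace may differ between the two fields and normalizes it before comparing. That normalization and the helper names are assumptions, and rows in the real data may not always satisfy the check.

```python
def normalize(code: str) -> str:
    """Collapse every run of whitespace to a single space."""
    return " ".join(code.split())


def rem_in_context(rem: str, context: str) -> bool:
    """Return True if the buggy hunk appears verbatim in its context."""
    return normalize(rem) in normalize(context)


rem = "public synchronized StringBuffer append(char ch)"
context = (
    "public synchronized StringBuffer append(char ch) "
    "{ ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this; }"
)
print(rem_in_context(rem, context))  # True
```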
`meta` contains some metadata about the project:
```
1056 /local/tlutelli/issta_data/temp/all_java0context/java/2006_temp/2006/1056/68a6301301378680519f2b146daec37812a1bc22/StringBuffer.java/buggy/core/src/classpath/java/java/lang/StringBuffer.java
```
`1056` is the project id. `/local/...` is the absolute path to the buggy file. This path can be parsed to extract the commit id (`68a6301301378680519f2b146daec37812a1bc22`), the file name (`StringBuffer.java`), and the original path within the project (`core/src/classpath/java/java/lang/StringBuffer.java`).
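The parsing described above can be sketched with a regular expression. The layout it assumes, `<project id> <prefix>/<40-hex commit id>/<file name>/buggy/<path within project>`, is inferred from the single example line shown here and is not a documented format. `Meta` and `parse_meta` are hypothetical names.

```python
import re
from typing import NamedTuple


class Meta(NamedTuple):
    project_id: str
    commit_id: str
    file_name: str
    project_path: str


# Assumed layout (inferred from one example, not an official specification):
#   <project id> <prefix>/<40-hex commit id>/<file name>/buggy/<path in project>
META_RE = re.compile(
    r"^(?P<project_id>\S+)\s+\S*?/"
    r"(?P<commit_id>[0-9a-f]{40})/"
    r"(?P<file_name>[^/]+)/buggy/"
    r"(?P<project_path>.+)$"
)


def parse_meta(line: str) -> Meta:
    """Split one meta row into its parts; raise if the layout differs."""
    match = META_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized meta line: {line!r}")
    return Meta(**match.groupdict())


example = (
    "1056 /local/tlutelli/issta_data/temp/all_java0context/java/2006_temp/2006/1056/"
    "68a6301301378680519f2b146daec37812a1bc22/StringBuffer.java/buggy/"
    "core/src/classpath/java/java/lang/StringBuffer.java"
)
print(parse_meta(example))
```

Raising on a mismatch, rather than returning partial results, makes it easy to spot rows whose paths do not follow the assumed layout.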
| Number of projects | Number of Instances |
| ------------------ |-------------------- |
| 12,577 | 2,735,506 |
## Dataset Creation
### Curation Rationale
The data was collected to train automated program repair (APR) models.
### Citation Information
```bib
@inproceedings{lutellierCoCoNuTCombiningContextaware2020,
title = {{{CoCoNuT}}: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair},
shorttitle = {{{CoCoNuT}}},
booktitle = {Proceedings of the 29th {{ACM SIGSOFT International Symposium}} on {{Software Testing}} and {{Analysis}}},
author = {Lutellier, Thibaud and Pham, Hung Viet and Pang, Lawrence and Li, Yitong and Wei, Moshi and Tan, Lin},
year = {2020},
month = jul,
series = {{{ISSTA}} 2020},
pages = {101--114},
publisher = {{Association for Computing Machinery}},
address = {{New York, NY, USA}},
doi = {10.1145/3395363.3397369},
url = {https://doi.org/10.1145/3395363.3397369},
urldate = {2022-12-06},
isbn = {978-1-4503-8008-9},
keywords = {AI and Software Engineering,Automated program repair,Deep Learning,Neural Machine Translation}
}
```
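Provider:
h4iku
Summary of original information
# Dataset Overview: CoCoNuT-C(2005)
## Dataset Description
- Dataset name: CoCoNuT-C(2005)
- Purpose: training automated program repair (APR) models.
- Data source: extracted from GitHub, GitLab, and Bitbucket.
- Data processing: the data has been neither shuffled nor tokenized.
- Dataset year: the year in the dataset's name is the year of the newest commit in the dataset.
## Dataset Structure
### Data Fields
- Number of fields: 4
- Field names: `add`, `rem`, `context`, `meta`
- Field descriptions:
  - `add`: the fixed line/hunk
  - `rem`: the buggy line/hunk
  - `context`: the function containing the bug (including the buggy lines and comments)
  - `meta`: project metadata, including the project id, the absolute path of the file, the commit id, the file name, and the original path within the project
### Data Instances
- Instance mapping: the four fields of each instance correspond to one another.
- Examples: the first five rows of `rem` and `add` are given, together with how they map to each other.
### Data Statistics
- Number of projects: 12,577
- Number of instances: 2,735,506
## Dataset Creation
### Curation Rationale
- Purpose: the data was collected to train automated program repair models.
### Citation Information
- Reference: Lutellier, Thibaud et al. "CoCoNuT: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair." Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (2020): 101-114.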



