h4iku/coconut_c2005

Name: h4iku/coconut_c2005
Creator: h4iku
Published: 2023-09-28 23:19:25
License: No description available

Hugging Face | Updated 2023-09-28 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/h4iku/coconut_c2005


Resource description:

---
tags:
- code
pretty_name: CoCoNuT-C(2005)
---

# Dataset Card for CoCoNuT-C(2005)

## Dataset Description

- **Homepage:** [CoCoNuT training data](https://github.com/lin-tan/CoCoNut-Artifact/releases/tag/training_data_1.0.0)
- **Repository:** [CoCoNuT repository](https://github.com/lin-tan/CoCoNut-Artifact)
- **Paper:** [CoCoNuT: Combining Context-Aware Neural Translation Models using Ensemble for Program Repair](https://dl.acm.org/doi/abs/10.1145/3395363.3397369)

### Dataset Summary

Part of the data used to train the models in the paper "CoCoNuT: Combining Context-Aware Neural Translation Models using Ensemble for Program Repair". These datasets contain raw data extracted from GitHub, GitLab, and Bitbucket, and have been neither shuffled nor tokenized. The year in the dataset’s name is the cutoff year: the year of the newest commit in the dataset.

### Languages

- C

## Dataset Structure

### Data Fields

The dataset consists of 4 columns: `add`, `rem`, `context`, and `meta`. These match the original dataset files: `add.txt`, `rem.txt`, `context.txt`, and `meta.txt`.

### Data Instances

Each instance maps across the 4 columns: the same row in every column describes one instance. For example:

First 5 rows of `rem` (i.e., the buggy line/hunk):

```
1 public synchronized StringBuffer append(char ch)
2 ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this;
3 public String substring(int beginIndex, int endIndex)
4 if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) throw new StringIndexOutOfBoundsException(); if (beginIndex == 0 && endIndex == count) return this; int len = endIndex - beginIndex; return new String(value, beginIndex + offset, len, (len << 2) >= value.length);
5 public Object next() {
```

First 5 rows of `add` (i.e., the fixed line/hunk):

```
1 public StringBuffer append(Object obj)
2 return append(obj == null ? "null" : obj.toString());
3 public String substring(int begin)
4 return substring(begin, count);
5 public FSEntry next() {
```

These map to the 5 instances:

```diff
- public synchronized StringBuffer append(char ch)
+ public StringBuffer append(Object obj)
```

```diff
- ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this;
+ return append(obj == null ? "null" : obj.toString());
```

```diff
- public String substring(int beginIndex, int endIndex)
+ public String substring(int begin)
```

```diff
- if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) throw new StringIndexOutOfBoundsException(); if (beginIndex == 0 && endIndex == count) return this; int len = endIndex - beginIndex; return new String(value, beginIndex + offset, len, (len << 2) >= value.length);
+ return substring(begin, count);
```

```diff
- public Object next() {
+ public FSEntry next() {
```

`context` holds the context of each instance: the (in-lined) buggy function, including the buggy lines and comments. For example, the context of

```
public synchronized StringBuffer append(char ch)
```

is its associated function:

```java
public synchronized StringBuffer append(char ch) { ensureCapacity_unsynchronized(count + 1); value[count++] = ch; return this; }
```

`meta` contains some metadata about the project:

```
1056 /local/tlutelli/issta_data/temp/all_java0context/java/2006_temp/2006/1056/68a6301301378680519f2b146daec37812a1bc22/StringBuffer.java/buggy/core/src/classpath/java/java/lang/StringBuffer.java
```

`1056` is the project id. `/local/...` is the absolute path to the buggy file. This path can be parsed to extract the commit id (`68a6301301378680519f2b146daec37812a1bc22`), the file name (`StringBuffer.java`), and the original path within the project (`core/src/classpath/java/java/lang/StringBuffer.java`).

| Number of projects | Number of Instances |
| ------------------ | ------------------- |
| 12,577             | 2,735,506           |

## Dataset Creation

### Curation Rationale

Data is collected to train automated program repair (APR) models.

### Citation Information

```bib
@inproceedings{lutellierCoCoNuTCombiningContextaware2020,
  title = {{{CoCoNuT}}: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair},
  shorttitle = {{{CoCoNuT}}},
  booktitle = {Proceedings of the 29th {{ACM SIGSOFT International Symposium}} on {{Software Testing}} and {{Analysis}}},
  author = {Lutellier, Thibaud and Pham, Hung Viet and Pang, Lawrence and Li, Yitong and Wei, Moshi and Tan, Lin},
  year = {2020},
  month = jul,
  series = {{{ISSTA}} 2020},
  pages = {101--114},
  publisher = {{Association for Computing Machinery}},
  address = {{New York, NY, USA}},
  doi = {10.1145/3395363.3397369},
  url = {https://doi.org/10.1145/3395363.3397369},
  urldate = {2022-12-06},
  isbn = {978-1-4503-8008-9},
  keywords = {AI and Software Engineering,Automated program repair,Deep Learning,Neural Machine Translation}
}
```
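
The `meta` parsing described in the card can be sketched in Python as follows. This is a minimal sketch, not the authors' tooling: the layout `<project id> <prefix>/<commit id>/<file name>/buggy/<original path>` is an assumption inferred from the single example above, and the names `MetaRecord` and `parse_meta` are hypothetical.

```python
import re
from typing import NamedTuple


class MetaRecord(NamedTuple):
    project_id: str
    commit_id: str
    file_name: str
    original_path: str


# Assumed layout (inferred from one example, not documented by the authors):
# ".../<40-hex commit id>/<file name>/buggy/<path within the project>"
_META_PATTERN = re.compile(
    r"/(?P<commit>[0-9a-f]{40})/(?P<file>[^/]+)/buggy/(?P<path>.+)$"
)


def parse_meta(line: str) -> MetaRecord:
    """Split one `meta` row into project id, commit id, file name, and path."""
    # The project id is the first whitespace-separated token; the rest is the path.
    project_id, absolute_path = line.strip().split(None, 1)
    match = _META_PATTERN.search(absolute_path)
    if match is None:
        raise ValueError(f"unrecognized meta layout: {line!r}")
    return MetaRecord(project_id, match["commit"], match["file"], match["path"])


# The example row from the card.
example = (
    "1056 /local/tlutelli/issta_data/temp/all_java0context/java/2006_temp/2006/1056/"
    "68a6301301378680519f2b146daec37812a1bc22/StringBuffer.java/buggy/"
    "core/src/classpath/java/java/lang/StringBuffer.java"
)
record = parse_meta(example)
print(record)
```

Rows that do not follow the assumed layout raise `ValueError` rather than returning partial fields, so unexpected path shapes surface early.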

Provided by:

h4iku

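Summary of original information

Dataset overview: CoCoNuT-C(2005)

Dataset description

Dataset name: CoCoNuT-C(2005)
Purpose: Training automated program repair (APR) models.
Data sources: Extracted from GitHub, GitLab, and Bitbucket.
Processing: The data has been neither shuffled nor tokenized.
Dataset year: The year in the dataset name is the year of the newest commit in the dataset.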

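Dataset structure

Data fields

Number of fields: 4
Field names: add, rem, context, meta
Field descriptions:
- add: the fixed line/hunk
- rem: the buggy line/hunk
- context: the function containing the bug (including the buggy lines and comments)
- meta: project metadata, including the project ID, the absolute path of the file, the commit ID, the file name, and the original path within the project.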
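
As an illustration of how the four fields fit together, the sketch below pairs row-aligned `add`, `rem`, `context`, and `meta` sources into one record per instance. It assumes that row *i* of every source belongs to the same instance; `iter_instances` is a hypothetical helper, not part of the dataset or of any library.

```python
from typing import Dict, Iterable, Iterator

FIELDS = ("add", "rem", "context", "meta")


def iter_instances(
    add: Iterable[str],
    rem: Iterable[str],
    context: Iterable[str],
    meta: Iterable[str],
) -> Iterator[Dict[str, str]]:
    """Yield one record per instance from four row-aligned sources.

    Assumption: row i of every source describes the same instance.
    Each argument can be a list of strings or an open text file.
    """
    for row in zip(add, rem, context, meta):
        # Drop the trailing newline that file iteration leaves on each row.
        yield {name: value.rstrip("\n") for name, value in zip(FIELDS, row)}


# First row of each field, copied from the dataset card's example.
instances = list(
    iter_instances(
        add=["public StringBuffer append(Object obj)\n"],
        rem=["public synchronized StringBuffer append(char ch)\n"],
        context=[
            "public synchronized StringBuffer append(char ch) { "
            "ensureCapacity_unsynchronized(count + 1); value[count++] = ch; "
            "return this; }\n"
        ],
        meta=[
            "1056 /local/tlutelli/issta_data/temp/all_java0context/java/"
            "2006_temp/2006/1056/68a6301301378680519f2b146daec37812a1bc22/"
            "StringBuffer.java/buggy/core/src/classpath/java/java/lang/"
            "StringBuffer.java\n"
        ],
    )
)
print(instances[0]["rem"], "->", instances[0]["add"])
```

Because `zip` stops at the shortest input, a length mismatch between the sources would be silently truncated; checking the row counts beforehand is advisable.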

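Data instances

Instance mapping: The four fields of each instance correspond to one another.
Examples: The first five rows of the rem and add fields are shown together with their mapping.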
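
The correspondence between a `rem` row and its `add` row can be displayed as a small diff. The following is a minimal sketch; `render_diff` is a hypothetical helper that only prefixes lines with `-` and `+`, and it does not compute a real line-level diff.

```python
def render_diff(rem: str, add: str) -> str:
    """Render a rem/add pair as '- buggy' lines followed by '+ fixed' lines."""
    removed = [f"- {line}" for line in rem.splitlines()]
    added = [f"+ {line}" for line in add.splitlines()]
    return "\n".join(removed + added)


# Fifth example pair from the dataset card.
diff_text = render_diff("public Object next() {", "public FSEntry next() {")
print(diff_text)
```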

Dataset statistics

Number of projects: 12,577
Number of instances: 2,735,506

Dataset creation

Curation rationale

Purpose: Data is collected to train automated program repair models.

Citation information

Reference: Lutellier, Thibaud et al. "CoCoNuT: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair." Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (2020): 101-114.
