Resource overview:
## Concode dataset
Concode is a large dataset of over 100,000 examples built from Java classes in online code repositories. The paper that introduced it also proposes an encoder-decoder architecture that models the interaction between the method documentation and the class environment.
The Concode dataset is a widely used code generation dataset introduced in Iyer et al.'s EMNLP 2018 paper [Mapping Language to Code in Programmatic Context](https://www.aclweb.org/anthology/D18-1192.pdf).
The table below shows the dataset statistics:
| | #Examples |
| --------- | :---------: |
| Train | 100,000 |
| Validation | 2,000 |
| Test | 2,000 |
## Data Format
The code corpus is stored in JSON Lines files, with one JSON object per line:
```
{
"nl": "Increment this vector in this place. con_elem_sep double[] vecElement con_elem_sep double[] weights con_func_sep void add(double)",
"code": "public void inc ( ) { this . add ( 1 ) ; }"
}
```
The `nl` field combines the natural language description with the class environment. Elements of the class environment are separated by special tokens such as `con_elem_sep` and `con_func_sep`.
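The layout of the `nl` field described above can be unpacked with a few lines of Python. This is a minimal sketch, not the dataset's official loader: the helper name `split_nl` is hypothetical, and it assumes (judging from the sample record only) that `con_elem_sep` introduces a member variable and `con_func_sep` introduces a member function.

```python
import json
import re

ELEM_SEP = "con_elem_sep"  # assumed to precede each member variable
FUNC_SEP = "con_func_sep"  # assumed to precede each member function

# Split on either separator token and keep the token itself (capturing group).
_SEP_PATTERN = re.compile(r"\s*\b(con_elem_sep|con_func_sep)\b\s*")


def split_nl(nl):
    """Return (description, member_variables, member_functions) for an `nl` string."""
    parts = _SEP_PATTERN.split(nl)
    description = parts[0].strip()
    variables, functions = [], []
    # After the description, parts alternate: separator, text, separator, text, ...
    for sep, text in zip(parts[1::2], parts[2::2]):
        if sep == ELEM_SEP:
            variables.append(text.strip())
        else:
            functions.append(text.strip())
    return description, variables, functions


# One line of a JSON Lines file, taken from the sample record above.
line = (
    '{"nl": "Increment this vector in this place. con_elem_sep double[] vecElement '
    'con_elem_sep double[] weights con_func_sep void add(double)", '
    '"code": "public void inc ( ) { this . add ( 1 ) ; }"}'
)
record = json.loads(line)
description, variables, functions = split_nl(record["nl"])
```

For a whole file, the same call can be applied to each line in turn, since every line is an independent JSON object.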
## Task Definition
The task is to generate the source code of a Java class member function, given a natural language description and the class environment. The class environment is the programmatic context provided by the rest of the class, including its other member variables and member functions. Models are evaluated by exact match and BLEU.
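To make the exact match metric concrete, the sketch below scores a list of predictions against reference code strings. It is an illustration under stated assumptions rather than the official evaluation script: the function name `exact_match` is hypothetical, and comparing whitespace-separated tokens is an assumed normalization chosen because the sample `code` field is already space-tokenized. BLEU is not covered here.

```python
def exact_match(predictions, references):
    """Percentage of predictions whose tokens equal the reference tokens.

    Assumption: two code strings match when their whitespace-separated
    token sequences are identical, so differences in spacing are ignored.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        pred.split() == ref.split()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


references = [
    "public void inc ( ) { this . add ( 1 ) ; }",
    "public int size ( ) { return n ; }",
]
predictions = [
    "public void inc ( ) { this . add ( 1 ) ; }",  # identical to the reference
    "public int size ( ) { return 0 ; }",          # differs in one token
]
score = exact_match(predictions, references)
```

In this toy case one of the two predictions matches, so the score is 50.0. Exact match is strict: a prediction that is functionally equivalent but differs in a single token counts as a miss, which is one reason to report BLEU alongside it.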
The task is challenging because the desired code can vary greatly depending on the functionality the class provides. Models must (a) understand the NL description well enough to map it to environment variables, library API calls, and user-defined methods in the class, and (b) decide on the structure of the resulting code.
## Reference
Concode dataset:
<pre><code>@article{iyer2018mapping,
title={Mapping language to code in programmatic context},
author={Iyer, Srinivasan and Konstas, Ioannis and Cheung, Alvin and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:1808.09588},
year={2018}
}</code></pre>