conala
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/neulab/conala
Resource description:
## Dataset Description
- **Repository:** https://conala-corpus.github.io/
- **Paper:** [Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow](https://arxiv.org/pdf/1805.08949.pdf)
### Dataset Summary
[CoNaLa](https://conala-corpus.github.io/) is a benchmark of paired code and natural language for evaluating code generation. The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators; the curated data is split into 2,379 training and 500 test examples. An automatically mined version with almost 600k examples is also available.
### Supported Tasks and Leaderboards
This dataset is used to evaluate code generation models.
### Languages
English - Python code.
## Dataset Structure
```python
from datasets import load_dataset

# Curated version (default configuration): train and test splits.
dataset_curated = load_dataset("neulab/conala")
print(dataset_curated)
# DatasetDict({
#     train: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 2379
#     })
#     test: Dataset({
#         features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
#         num_rows: 500
#     })
# })

# Automatically mined version: train split only.
dataset_mined = load_dataset("neulab/conala", "mined")
print(dataset_mined)
# DatasetDict({
#     train: Dataset({
#         features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'id'],
#         num_rows: 593891
#     })
# })
```
### Data Instances
#### CoNaLa - curated
This is the dataset curated by annotators.
```
{
'question_id': 41067960,
'intent': 'How to convert a list of multiple integers into a single integer?',
'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))'
}
```
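To see how a curated record ties its `rewritten_intent` to its `snippet`, the snippet can be evaluated on a small input. The following is a minimal sketch: the sample list `[1, 2, 3]` is an assumption chosen for illustration, and `eval` is used only because the snippet is a trusted expression copied from the record above.

```python
# Curated record copied from the example above.
record = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}

# Hypothetical input: the rewritten intent names the variable 'x'.
x = [1, 2, 3]

# Evaluate the snippet with 'x' bound to the sample list.
# Only do this with trusted snippets, since eval runs arbitrary code.
result = eval(record["snippet"], {"x": x})
print(result)  # 123
```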
#### CoNaLa - mined
This is the automatically mined dataset, before curation.
```
{
'question_id': 34705205,
'parent_answer_post_id': 34705233,
'prob': 0.8690001442846342,
'snippet': 'sorted(l, key=lambda x: (-int(x[1]), x[0]))',
'intent': 'Sort a nested list by two elements',
'id': '34705205_34705233_0'
}
```
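In the mined version the `intent` is the raw question title, so it can be less specific than the snippet it is paired with. As a small illustration, the snippet above can be run on a made-up nested list (the input `l` is an assumption, not part of the dataset):

```python
# Snippet from the mined record above, applied to a hypothetical nested list.
l = [["b", "2"], ["a", "2"], ["c", "10"]]

# Sort by the second element as an integer, descending,
# then by the first element, ascending, to break ties.
result = sorted(l, key=lambda x: (-int(x[1]), x[0]))
print(result)  # [['c', '10'], ['a', '2'], ['b', '2']]
```

The title "Sort a nested list by two elements" mentions neither the sort direction nor the integer conversion. The curated `rewritten_intent` field aims to capture this kind of detail.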
### Data Fields
Curated:
|Field|Type|Description|
|---|---|---|
|question_id|int64|Id of the Stack Overflow question|
|intent|string|Natural Language intent (i.e., the title of a Stack Overflow question)|
|rewritten_intent|string|Crowdsourced revised intents that try to better reflect the full meaning of the code|
|snippet|string| Code snippet that implements the intent|
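
For code generation, a curated record can be turned into an input/target pair. The sketch below prefers `rewritten_intent` and falls back to `intent` when the rewritten text is missing or empty. The helper name `to_pair`, the `CuratedRecord` type, and the fallback rule are assumptions for illustration, not part of the dataset's tooling.

```python
from typing import Optional, Tuple, TypedDict


class CuratedRecord(TypedDict):
    question_id: int
    intent: str
    rewritten_intent: Optional[str]
    snippet: str


def to_pair(record: CuratedRecord) -> Tuple[str, str]:
    """Return (natural-language input, target code) for one record."""
    # Hypothetical rule: use the rewritten intent when present, else the title.
    text = record["rewritten_intent"] or record["intent"]
    return text.strip(), record["snippet"]


# Curated record copied from the Data Instances section.
record: CuratedRecord = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}

source, target = to_pair(record)
print(source)
print(target)
```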
Mined:
|Field|Type|Description|
|---|---|---|
|question_id|int64|Id of the Stack Overflow question|
|parent_answer_post_id|int64|Id of the answer post from which the candidate snippet is extracted|
|intent|string|Natural Language intent (i.e., the title of a Stack Overflow question)|
|snippet|string| Code snippet that implements the intent|
|id|string|Unique id for this intent/snippet pair|
|prob|float64|Probability given by the mining model|
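
Because the mined pairs are not manually checked, a natural use of `prob` is to keep only pairs above a confidence cutoff. A minimal sketch with plain Python dicts follows; the threshold `0.5`, the helper `filter_by_prob`, and the second record are made up for illustration.

```python
# The first record is the mined example from Data Instances;
# the second is a made-up low-confidence pair.
mined = [
    {
        "id": "34705205_34705233_0",
        "prob": 0.8690001442846342,
        "intent": "Sort a nested list by two elements",
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
    },
    {
        "id": "hypothetical_0",
        "prob": 0.12,
        "intent": "Sort a nested list by two elements",
        "snippet": "import operator",
    },
]

THRESHOLD = 0.5  # assumed cutoff, not a value recommended by the source


def filter_by_prob(records, threshold):
    """Keep records whose mining-model probability is at least `threshold`."""
    return [r for r in records if r["prob"] >= threshold]


kept = filter_by_prob(mined, THRESHOLD)
print([r["id"] for r in kept])  # ['34705205_34705233_0']
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset_mined["train"].filter(lambda r: r["prob"] >= THRESHOLD)`.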
### Data Splits
There are two versions of the dataset, curated and mined. The mined version has only a train split; the curated version has two splits, train and test.
## Dataset Creation
The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators. For more details, please refer to the original [paper](https://arxiv.org/pdf/1805.08949.pdf).
### Citation Information
```
@inproceedings{yin2018learning,
title={Learning to mine aligned code and natural language pairs from stack overflow},
author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
pages={476--486},
year={2018},
organization={IEEE}
}
```
Provider: maas
Created: 2025-10-10



