neulab/docprompting-conala
Hugging Face | Updated 2023-03-14 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/neulab/docprompting-conala
Resource description:
---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
language:
- code
license:
- mit
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: DocPrompting-CoNaLa
tags:
- code-generation
- doc retrieval
- retrieval augmented generation
---
## Dataset Description
- **Repository:** https://github.com/shuyanzhou/docprompting
- **Paper:** [DocPrompting: Generating Code by Retrieving the Docs](https://arxiv.org/pdf/2207.05987.pdf)
### Dataset Summary
This is a re-split of the [CoNaLa](https://conala-corpus.github.io/) dataset.
For each code snippet in the dev and test sets, at least one function is held out from the training set.
This split tests a code generation model's ability to generate *unseen* functions.
We also make sure that examples from the same Stack Overflow post (same `question_id` prefix before `-`) are in the same split.
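The two split properties described above can be checked with a small script. The following is a minimal sketch, not the authors' split procedure: it assumes each row is a dict with a `question_id` string and an `oracle_man` list of doc IDs, and it treats "held-out function" as "a doc ID that never appears in any training row's `oracle_man`". The helper names and the toy rows (including their doc IDs) are hypothetical.

```python
from collections import defaultdict


def post_id(question_id):
    """Return the part of `question_id` before the first `-` (the post ID)."""
    return question_id.split("-", 1)[0]


def posts_spanning_splits(splits):
    """Map each post ID found in more than one split to the set of those splits."""
    seen = defaultdict(set)
    for split_name, rows in splits.items():
        for row in rows:
            seen[post_id(row["question_id"])].add(split_name)
    return {pid: names for pid, names in seen.items() if len(names) > 1}


def lacks_unseen_function(train_rows, eval_rows):
    """Return question_ids of eval rows whose oracle_man IDs all occur in training.

    Assumption: a held-out function shows up as a doc ID absent from training.
    """
    train_docs = {doc for row in train_rows for doc in row["oracle_man"]}
    return [
        row["question_id"]
        for row in eval_rows
        if not any(doc not in train_docs for doc in row["oracle_man"])
    ]


# Hypothetical toy rows; real rows would come from the loaded dataset splits.
toy_splits = {
    "train": [
        {"question_id": "100-1", "oracle_man": ["os.listdir"]},
        {"question_id": "100-2", "oracle_man": ["os.listdir"]},
    ],
    "test": [
        {"question_id": "200-7", "oracle_man": ["os.listdir", "re.sub"]},
    ],
}

print(posts_spanning_splits(toy_splits))
print(lacks_unseen_function(toy_splits["train"], toy_splits["test"]))
```

On the toy rows both checks come back empty: no post ID is shared across splits, and the test row uses one doc ID that training never sees.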
### Supported Tasks and Leaderboards
This dataset is used to evaluate code generation models.
### Languages
English - Python code.
## Dataset Structure
```python
from datasets import load_dataset

# Code generation examples (train / test / validation splits).
dataset = load_dataset("neulab/docprompting-conala")
print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 2135
#     })
#     test: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 543
#     })
#     validation: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 201
#     })
# })

# Documentation pool, stored under the "docs" configuration.
code_docs = load_dataset("neulab/docprompting-conala", "docs")
print(code_docs)
# DatasetDict({
#     train: Dataset({
#         features: ['doc_id', 'doc_content'],
#         num_rows: 34003
#     })
# })
```
### Data Fields
train/dev/test:
- nl: The natural language intent
- cmd: The reference code snippet
- question_id: `x-y`, where `x` is the Stack Overflow post ID
- oracle_man: The `doc_id`s of the functions used in the reference code snippet. The corresponding contents are in the `docs` configuration
- canonical_cmd: The canonical version of the reference code snippet
docs:
- doc_id: the id of a doc
- doc_content: the content of the doc
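The link between `oracle_man` and the docs can be sketched as a dictionary lookup. This is an illustrative sketch only: it assumes `oracle_man` is a list of `doc_id` strings, and the helper names, toy doc IDs, and toy contents are hypothetical. With the real data, the rows of the `docs` configuration would be passed in place of `toy_docs`.

```python
def build_doc_index(doc_rows):
    """Index doc rows by `doc_id` so contents can be looked up directly."""
    return {row["doc_id"]: row["doc_content"] for row in doc_rows}


def oracle_docs(example, doc_index):
    """Pair each doc ID in `oracle_man` with its content (None if missing)."""
    return [(doc_id, doc_index.get(doc_id)) for doc_id in example["oracle_man"]]


# Hypothetical toy rows mirroring the `doc_id` / `doc_content` fields.
toy_docs = [
    {"doc_id": "os.listdir", "doc_content": "Return a list of entries in a directory."},
    {"doc_id": "re.sub", "doc_content": "Replace occurrences of a pattern in a string."},
]

# Hypothetical example row with the fields listed above.
toy_example = {
    "nl": "list all files in directory `path`",
    "cmd": "os.listdir(path)",
    "question_id": "100-1",
    "oracle_man": ["os.listdir"],
}

doc_index = build_doc_index(toy_docs)
for doc_id, content in oracle_docs(toy_example, doc_index):
    print(doc_id, "->", content)
```

Building the index once keeps each lookup constant-time, which matters when resolving `oracle_man` for every example against a doc pool of tens of thousands of entries.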
## Dataset Creation
The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For more details, please refer to the original CoNaLa [paper](https://arxiv.org/pdf/1805.08949.pdf).
### Citation Information
```
@article{zhou2022doccoder,
title={DocCoder: Generating Code by Retrieving and Reading Docs},
author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
journal={arXiv preprint arXiv:2207.05987},
year={2022}
}
```
Provider:
neulab
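Summary of original information
Dataset overview
- Name: DocPrompting-CoNaLa
- Task type: text-to-text generation
- Language: English - Python code
- License: MIT
- Multilinguality: monolingual
- Data source: original dataset
- Size: unknown
- Tags: code generation, doc retrieval, retrieval augmented generation
Dataset details
- Dataset summary: This dataset is a re-split of the CoNaLa dataset. Every code snippet in the dev and test sets has at least one function that does not appear in the training set, which tests a code generation model's ability to generate unseen functions. Examples from the same Stack Overflow post (same `question_id` prefix) are kept in the same split.
- Supported tasks: evaluating code generation.
- Dataset structure:
  - Training set: 2135 records, with the features natural language intent (`nl`), reference code snippet (`cmd`), Stack Overflow post ID (`question_id`), function name (`cmd_name`), doc ID (`oracle_man`), and canonical reference code snippet (`canonical_cmd`).
  - Test set: 543 records, with the same features as the training set.
  - Validation set: 201 records, with the same features as the training set.
  - Docs set: 34003 records, with the features doc ID (`doc_id`) and doc content (`doc_content`).
Dataset creation
The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For details, see the original paper.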



