neulab/docprompting-conala

Name: neulab/docprompting-conala
Creator: neulab
Published: 2023-03-14 17:59:47
License: MIT

Hugging Face · Updated 2023-03-14 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/neulab/docprompting-conala


Resource description:

---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
language:
- code
license:
- mit
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: DocPrompting-CoNaLa
tags:
- code-generation
- doc retrieval
- retrieval augmented generation
---

## Dataset Description

- **Repository:** https://github.com/shuyanzhou/docprompting
- **Paper:** [DocPrompting: Generating Code by Retrieving the Docs](https://arxiv.org/pdf/2207.05987.pdf)

### Dataset Summary

This is a re-split of the [CoNaLa](https://conala-corpus.github.io/) dataset. For each code snippet in the dev and test sets, at least one function is held out from the training set. This split aims to test a code generation model's ability to generate *unseen* functions. We further ensure that examples from the same StackOverflow post (same `question_id` prefix before the `-`) are in the same split.

### Supported Tasks and Leaderboards

This dataset is used to evaluate code generation.

### Languages

English - Python code.

## Dataset Structure

```python
from datasets import load_dataset

# Main configuration: natural language intents paired with reference code
dataset = load_dataset("neulab/docprompting-conala")
# DatasetDict({
#     train: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 2135
#     })
#     test: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 543
#     })
#     validation: Dataset({
#         features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
#         num_rows: 201
#     })
# })

# "docs" configuration: the documentation pool used for retrieval
code_docs = load_dataset("neulab/docprompting-conala", "docs")
# DatasetDict({
#     train: Dataset({
#         features: ['doc_id', 'doc_content'],
#         num_rows: 34003
#     })
# })
```

### Data Fields

train/dev/test:
- nl: The natural language intent
- cmd: The reference code snippet
- question_id: `x-y` where `x` is the StackOverflow post ID
- oracle_man: The `doc_id` of the functions used in the reference code snippet. The corresponding contents are in the `docs` configuration
- canonical_cmd: The canonical version of the reference code snippet

docs:
- doc_id: The ID of a doc
- doc_content: The content of the doc

## Dataset Creation

The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For more details, please refer to the original [paper](https://arxiv.org/pdf/1805.08949.pdf).

### Citation Information

```
@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}
```
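To illustrate how `oracle_man` relates to the docs records, the sketch below builds a `doc_id` to `doc_content` lookup and resolves the oracle docs for one example. It runs on made-up stand-in records that only mirror the field names listed above. The assumption that `oracle_man` holds a list of `doc_id` strings, the toy IDs, and the helper names `build_doc_index` and `oracle_docs` are illustrative and do not come from the dataset's repository.

```python
from typing import Dict, Iterable, List, Mapping


def build_doc_index(docs: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each doc_id to its doc_content."""
    return {doc["doc_id"]: doc["doc_content"] for doc in docs}


def oracle_docs(example: Mapping, doc_index: Mapping[str, str]) -> List[str]:
    """Return doc contents for an example's oracle_man ids.

    Assumes oracle_man is a list of doc_id strings; ids that are
    missing from the index are skipped rather than raising.
    """
    return [doc_index[d] for d in example["oracle_man"] if d in doc_index]


# Hypothetical toy records (not real dataset rows or real doc_id values).
toy_docs = [
    {"doc_id": "toy.os.getcwd", "doc_content": "Return the current working directory."},
    {"doc_id": "toy.os.listdir", "doc_content": "Return a list of entries in a directory."},
]
toy_example = {
    "nl": "get the current working directory",
    "cmd": "os.getcwd()",
    "question_id": "0-0",
    "oracle_man": ["toy.os.getcwd", "toy.missing.id"],
}

doc_index = build_doc_index(toy_docs)
resolved = oracle_docs(toy_example, doc_index)
print(resolved)
```

With the real data, the same helpers could presumably be applied to `code_docs["train"]` and to any row of `dataset["train"]`, provided the field types match the assumption above.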

Provided by:

neulab

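Summary of original information

Dataset overview

Name: DocPrompting-CoNaLa
Task type: Text-to-text generation
Language: English - Python code
License: MIT
Multilinguality: Monolingual
Data source: Original dataset
Size: Unknown
Tags: code generation, doc retrieval, retrieval-augmented generation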

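Dataset details

Dataset summary: This dataset is a re-split of the CoNaLa dataset. Each code snippet in the dev and test sets uses at least one function that does not appear in the training set, in order to test a code generation model's ability to generate unseen functions. Examples from the same StackOverflow post (same question_id prefix) are kept in the same split.
Supported tasks: Evaluating code generation.
Dataset structure:
- Training set: 2,135 records, with the features natural language intent (nl), reference code snippet (cmd), StackOverflow post ID (question_id), function name (cmd_name), doc ID (oracle_man), and canonical reference code snippet (canonical_cmd).
- Test set: 543 records, with the same features as the training set.
- Validation set: 201 records, with the same features as the training set.
- Docs set: 34,003 records, with the features doc ID (doc_id) and doc content (doc_content).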
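The two split properties described in the summary (no StackOverflow post shared across splits, and at least one unseen function per held-out snippet) can be spot-checked with a sketch like the one below. It uses toy records, treats the `oracle_man` doc IDs as a proxy for "functions used", and assumes `oracle_man` is a list of strings. These are assumptions for illustration; the authors' actual splitting procedure may differ, and the helper names are hypothetical.

```python
from typing import Iterable, List, Mapping, Set


def post_id(question_id: str) -> str:
    """Return the part of an `x-y` question_id before the first '-'."""
    return question_id.split("-", 1)[0]


def shared_posts(split_a: Iterable[Mapping], split_b: Iterable[Mapping]) -> Set[str]:
    """Post ids that occur in both splits (expected to be empty)."""
    posts_a = {post_id(ex["question_id"]) for ex in split_a}
    posts_b = {post_id(ex["question_id"]) for ex in split_b}
    return posts_a & posts_b


def examples_without_unseen_doc(
    train: Iterable[Mapping], heldout: Iterable[Mapping]
) -> List[str]:
    """question_ids of held-out examples whose oracle_man ids all occur in train.

    An example with an empty oracle_man list is also reported, because it
    has no unseen id to offer.
    """
    seen = {doc for ex in train for doc in ex["oracle_man"]}
    return [
        ex["question_id"]
        for ex in heldout
        if all(doc in seen for doc in ex["oracle_man"])
    ]


# Hypothetical toy splits (not real dataset rows).
toy_train = [
    {"question_id": "100-0", "oracle_man": ["doc.a"]},
    {"question_id": "100-1", "oracle_man": ["doc.a", "doc.b"]},
]
toy_test = [
    {"question_id": "200-0", "oracle_man": ["doc.a", "doc.c"]},  # doc.c is unseen
    {"question_id": "300-0", "oracle_man": ["doc.b"]},  # nothing unseen
]

overlap = shared_posts(toy_train, toy_test)
violations = examples_without_unseen_doc(toy_train, toy_test)
print(overlap, violations)
```

In the toy data the second test record is deliberately constructed to fail the held-out check, so that the function has something to report.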

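Dataset creation

The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For details, please refer to the original paper.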

Citation information

@article{zhou2022doccoder, title={DocCoder: Generating Code by Retrieving and Reading Docs}, author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham}, journal={arXiv preprint arXiv:2207.05987}, year={2022} }
