IslamMesabah/CoderAPI_Dataset

Name: IslamMesabah/CoderAPI_Dataset
Creator: IslamMesabah
Published: 2024-03-28 04:45:09
License: MIT

Hugging Face · Updated 2024-03-28 · Indexed 2024-06-22

Download link:

https://hf-mirror.com/datasets/IslamMesabah/CoderAPI_Dataset


Resource overview:

---
license: mit
tags:
- code
- 'API '
size_categories:
- n<1K
---

### Large Language Models for instructed and effective code generation using Documentation of APIs

This thesis explores the effective utilization of Large Language Models, specifically the Instruct CodeT5+ 16 Billion model, for the generation of multi-line, ready-to-execute code in Python. Departing from conventional reliance solely on pre-trained LLM knowledge, we employ API documentation to enhance the correctness of generated code for both seen and unseen APIs in the LLM knowledge. We utilize the Retrieval-Augmented Generation technique to incorporate user intents expressed in English, specifically targeting APIs, to select the most suitable segments from the relevant API documentation. Subsequently, these user intents and API documentation segments are utilized in model prompt engineering and fine-tuning procedures. We collect a newly synthesized dataset comprising 938 data points encompassing 46 distinct APIs.

Furthermore, we demonstrate significant advancements in code generation accuracy and utility, resulting in a remarkable 0.2 increase in ICE score and a 0.33% elevation in CodeBLEU. Our experimental evaluation provides valuable insights into code generation complexities, including the impact of seen and unseen API documentation on model performance and the effectiveness of prompt engineering strategies. This work underscores the importance of leveraging natural language processing techniques to address real-world challenges in software engineering, with implications for automated software development and enhanced developer productivity.

This dataset supports research on using Large Language Models, specifically the Instruct CodeT5+ 16 Billion model, to generate multi-line, ready-to-execute Python code. It incorporates API documentation and user intents, combined through Retrieval-Augmented Generation techniques, and comprises 938 data points covering 46 distinct APIs. The aim is to improve the correctness and utility of generated code.
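The retrieval step described above (matching a user intent to the most relevant documentation segment, then placing both in a prompt) can be illustrated with a minimal sketch. This is not the authors' pipeline: the source does not say which retriever or prompt template was used. The bag-of-words cosine similarity, the function names, the prompt wording, and the three paraphrased documentation segments below are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical documentation segments, paraphrased for illustration only.
SEGMENTS = [
    "requests.get(url, params=None): sends a GET request to the given url and returns a Response.",
    "json.dumps(obj): serializes obj to a JSON formatted string.",
    "os.listdir(path): returns a list of the entries in the directory given by path.",
]

# Hypothetical user intent expressed in English.
INTENT = "send a GET request to a url and print the response"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_segments(intent, segments, top_k=1):
    """Return the top_k segments ranked by similarity to the intent."""
    intent_vec = Counter(tokenize(intent))
    scored = [(cosine_similarity(intent_vec, Counter(tokenize(s))), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [segment for _, segment in scored[:top_k]]


def build_prompt(intent, retrieved):
    """Combine the retrieved documentation and the intent into one prompt string."""
    docs = "\n".join(f"- {s}" for s in retrieved)
    return (
        "API documentation:\n"
        f"{docs}\n\n"
        f"Task: {intent}\n"
        "Write Python code that completes the task."
    )


if __name__ == "__main__":
    best = retrieve_segments(INTENT, SEGMENTS, top_k=1)
    print(build_prompt(INTENT, best))
```

A real system would more likely use dense embeddings than word counts; the word-count version is used here only because it runs with the standard library alone.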

Provided by:

IslamMesabah

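Summary of original information

Dataset overview

Dataset name

Large Language Models for instructed and effective code generation using Documentation of APIs

Dataset description

The dataset is used to explore how Large Language Models, specifically the Instruct CodeT5+ 16 Billion model, can be used effectively to generate multi-line, executable Python code. API documentation is used to improve the correctness of code the model generates for both seen and unseen APIs. Retrieval-Augmented Generation combines user intents with API documentation segments, which are then used in prompt engineering and fine-tuning.

Dataset size

The dataset contains 938 data points covering 46 distinct APIs.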

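Dataset applications

The dataset is used to evaluate the accuracy and utility of code generation. The reported experiments show a 0.2 increase in ICE score and a 0.33% increase in CodeBLEU. The dataset also helps in understanding the complexities of code generation, including the impact of seen and unseen API documentation on model performance and the effectiveness of prompt engineering strategies.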
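One way to examine the impact of seen versus unseen APIs is to average a per-example score separately for each group. The sketch below assumes each evaluation result is a dictionary with hypothetical `api` and `score` fields and that the set of unseen APIs is supplied by the caller. The source does not describe the dataset's actual columns or how APIs were labeled seen or unseen, and the numbers in the example are placeholders, not reported results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_group(results, unseen_apis):
    """Average the 'score' field separately for seen and unseen APIs.

    `results` is a list of dicts with hypothetical keys 'api' and 'score';
    `unseen_apis` is a set of API names treated as unseen.
    """
    groups = defaultdict(list)
    for record in results:
        label = "unseen" if record["api"] in unseen_apis else "seen"
        groups[label].append(record["score"])
    return {label: mean(scores) for label, scores in groups.items()}


if __name__ == "__main__":
    # Placeholder records for illustration only.
    example_results = [
        {"api": "api_a", "score": 0.5},
        {"api": "api_a", "score": 1.0},
        {"api": "api_b", "score": 0.25},
        {"api": "api_b", "score": 0.75},
    ]
    print(mean_score_by_group(example_results, unseen_apis={"api_b"}))
```

The same grouping works for any per-example metric, so it could be applied to whichever score an evaluation produces.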

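Dataset significance

The dataset underscores the importance of applying natural language processing techniques to real-world challenges in software engineering, with implications for automated software development and developer productivity.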
