DecoderLLMs-CodeSearch-main

Name: DecoderLLMs-CodeSearch-main
Creator: IEEE DataPort
Published: 2025-02-19 11:43:41
License: No description available

DataCite Commons: updated 2025-02-19, indexed 2025-04-16

Download link:

https://ieee-dataport.org/documents/decoderllms-codesearch-main

Description:

Code search is essential for code reuse: it lets developers locate relevant code snippets efficiently. Traditional encoder-based models, however, generalize poorly and are constrained by input length limits. Decoder-only large language models (LLMs), with their larger size, extensive pre-training, and ability to handle longer inputs, are a promising answer to both problems, but their effectiveness in code search has not been fully explored.

To address this gap, our study offers the first systematic evaluation of decoder-only LLMs for code search. We assessed nine state-of-the-art decoder-only models across two fine-tuning methods, two datasets (CSN and CoSQA+), and three model sizes. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models such as UniXcoder. On the CSN dataset, CodeGemma achieved a 12.1% improvement in Mean Reciprocal Rank (MRR) over UniXcoder; on the CoSQA+ dataset, it achieved a 49.6% increase in Mean Average Precision (MAP) over UniXcoder. These findings highlight the adaptability of decoder-only models, especially after fine-tuning, and their ability to generalize to unseen datasets.

Our analysis further reveals that three factors contribute to the performance gains: fine-tuning on code-specific datasets, supervised contrastive learning, and larger model size. Despite their higher computational cost, decoder-only LLMs train more efficiently and generalize better on limited data than smaller encoder-only models.
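The two metrics named in the description, MRR and MAP, can be illustrated with a minimal sketch. This is not the evaluation code shipped with the dataset: the function names, the `runs` structure, and the toy queries below are all hypothetical and only show how the two scores are conventionally defined. MRR rewards the rank of the first relevant snippet, while MAP averages precision over every relevant snippet, so it suits queries that have several correct answers.

```python
from typing import Callable, List, Sequence, Set, Tuple

# One evaluated query: (ranked candidate ids, set of relevant ids).
Run = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Average the precision values measured at each relevant item's rank."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_metric(metric: Callable[[Sequence[str], Set[str]], float],
                runs: List[Run]) -> float:
    """Average a per-query metric over all queries (gives MRR or MAP)."""
    return sum(metric(ranked, rel) for ranked, rel in runs) / len(runs)


# Toy data (hypothetical): two queries with ranked snippet ids.
runs: List[Run] = [
    (["a", "b", "c"], {"b"}),        # single relevant snippet at rank 2
    (["x", "y", "z"], {"x", "z"}),   # relevant snippets at ranks 1 and 3
]

mrr = mean_metric(reciprocal_rank, runs)
map_score = mean_metric(average_precision, runs)
print(f"MRR={mrr:.4f} MAP={map_score:.4f}")
```

A relative improvement such as the 12.1% reported above would then be computed as (new score - baseline score) / baseline score, under the assumption that the figure is relative rather than an absolute difference; the description does not say which.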

Provider:

IEEE DataPort

Created:

2025-02-19
