DecoderLLMs-CodeSearch-main
Source: DataCite Commons | Updated: 2025-02-19 | Indexed: 2025-04-16
Download link:
https://ieee-dataport.org/documents/decoderllms-codesearch-main
Description:
Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. Traditional encoder-based models, however, face challenges with poor generalization and input length limitations. In contrast, decoder-only large language models (LLMs), with their larger size, extensive pre-training, and ability to handle longer inputs, present a promising solution to these issues. However, their effectiveness in code search has not been fully explored.

To address this gap, our study offers the first systematic evaluation of decoder-only LLMs for code search. We assessed nine state-of-the-art decoder-only models, employing two fine-tuning methods, two datasets (CSN and CoSQA+), and three model sizes. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder. On the CSN dataset, CodeGemma achieved a 12.1% improvement in Mean Reciprocal Rank (MRR) over UniXcoder, and on the CoSQA+ dataset, it outperformed UniXcoder with a 49.6% increase in Mean Average Precision (MAP). These findings highlight the superior adaptability and performance of decoder-only models, especially after fine-tuning, in generalizing across unseen datasets.

Our analysis further reveals that fine-tuning on code-specific datasets, employing supervised contrastive learning, and increasing model size contribute to performance gains. Despite the higher computational costs, decoder-only LLMs demonstrate greater training efficiency and generalization on limited data compared to smaller encoder-only models.
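The two metrics quoted in the description, MRR and MAP, can be illustrated with a minimal Python sketch. This is not the dataset authors' evaluation code: the function names and the toy relevance lists are hypothetical, and the sketch assumes each query comes with a ranked list of 0/1 relevance labels in which every relevant snippet appears somewhere in the list.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR: average over queries of 1 / (rank of the first relevant result)."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, is_relevant in enumerate(labels, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(ranked_relevance)


def mean_average_precision(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MAP: average over queries of the mean precision at each relevant hit."""
    total = 0.0
    for labels in ranked_relevance:
        hits = 0
        precision_sum = 0.0
        for rank, is_relevant in enumerate(labels, start=1):
            if is_relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this cut-off
        if hits:
            # Assumption: all relevant items are present in the ranked list.
            total += precision_sum / hits
    return total / len(ranked_relevance)


# Hypothetical toy data: two queries, each with three ranked candidates.
toy_rankings = [[0, 1, 0], [1, 0, 1]]
mrr = mean_reciprocal_rank(toy_rankings)
map_score = mean_average_precision(toy_rankings)
```

MRR suits a setting where each query has a single correct snippet, while MAP accounts for queries that have several relevant snippets, which is why a benchmark may report one or the other.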
Provider:
IEEE DataPort
Created:
2025-02-19



