five

DecoderLLMs-CodeSearch-main

收藏
IEEE2026-04-17 收录
下载链接:
https://ieee-dataport.org/documents/decoderllms-codesearch-main
下载链接
链接失效反馈
官方服务:
资源简介:
Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. Traditional encoder-based models, however, face challenges with poor generalization and input length limitations. In contrast, decoder-only large language models (LLMs), with their larger size, extensive pre-training, and ability to handle longer inputs, present a promising solution to these issues. However, their effectiveness in code search has not been fully explored.To address this gap, our study offers the first systematic evaluation of decoder-only LLMs for code search. We assessed nine state-of-the-art decoder-only models, employing two fine-tuning methods, two datasets (CSN and CoSQA$^+$), and three model sizes. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder. On the CSN dataset, CodeGemma achieved a 12.1\% improvement in Mean Reciprocal Rank (MRR) over UniXcoder, and on the CoSQA$^+$ dataset, it outperformed UniXcoder with a 49.6\% increase in Mean Average Precision (MAP). These findings highlight the superior adaptability and performance of decoder-only models, especially after fine-tuning, in generalizing across unseen datasets.Our analysis further reveals that fine-tuning on code-specific datasets, employing supervised contrastive learning, and increasing model size contribute to performance gains. Despite the higher computational costs, decoder-only LLMs demonstrate greater training efficiency and generalization on limited data compared to smaller encoder-only models.
提供机构:
Chen, Yxuan
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作