GenCodeSearchNet (GeCS)

Source: arXiv · Updated: 2023-11-16 · Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2311.09707v1
Resource description:
GenCodeSearchNet (GeCS) is a benchmark dataset for evaluating how well programming language understanding models generalize, with a particular focus on the natural language code search task. It combines existing code search datasets with a new, manually curated subset named StatCodeSearch, which targets R, a programming language that is widely used for statistical analysis yet under-resourced. StatCodeSearch contains 1,070 text-code pairs drawn from research projects in the social sciences and psychology. To build it, the authors screened projects on the Open Science Framework (OSF), then extracted and validated code-comment pairs through a combination of automated processing and manual review. GeCS is designed to address the generalization problems that existing models face across different programming languages and domains, especially with under-resourced languages.
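
The description above mentions an automated step that extracts code-comment pairs from R scripts before manual review. As a rough illustration only, the following Python sketch pairs each run of full-line `#` comments with the code lines that immediately follow it. The function name `extract_comment_code_pairs` and the pairing heuristic are assumptions made for this example; they are not taken from the GeCS authors' pipeline, which may work quite differently.

```python
from typing import List, Tuple


def extract_comment_code_pairs(r_source: str) -> List[Tuple[str, str]]:
    """Pair each run of full-line `#` comments with the code that follows it.

    Hypothetical heuristic for illustration; not the GeCS authors' method.
    """
    pairs: List[Tuple[str, str]] = []
    comment: List[str] = []
    code: List[str] = []

    def flush() -> None:
        # Keep a pair only when both a comment and some code were collected.
        if comment and code:
            pairs.append((" ".join(comment), "\n".join(code)))
        comment.clear()
        code.clear()

    for line in r_source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            if code:
                # A comment that appears after code starts a new pair.
                flush()
            text = stripped.lstrip("#").strip()
            if text:
                comment.append(text)
        elif not stripped:
            # A blank line ends the current block.
            flush()
        elif comment:
            # Code is kept only when a comment directly precedes it.
            code.append(line.rstrip())
    flush()
    return pairs


if __name__ == "__main__":
    sample = (
        "# Fit a linear model\n"
        "model <- lm(y ~ x, data = df)\n"
        "summary(model)\n"
        "\n"
        "# Plot residuals\n"
        "plot(resid(model))\n"
    )
    for text, snippet in extract_comment_code_pairs(sample):
        print(repr(text), "->", repr(snippet))
```

A heuristic like this would only produce candidates; code without a preceding comment is skipped, and the manual review stage described above would still be needed to judge whether each comment actually describes its code.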
Provided by:
Ulm University
Created:
2023-11-16