Galeras
Source: arXiv (updated 2023-08-24, indexed 2024-08-06)
Download link:
http://arxiv.org/abs/2308.12415v1
Description:
Galeras is a dataset created by the Department of Computer Science at the College of William & Mary for interpreting the performance of Large Language Models (LLMs) on source code generation tasks. It contains approximately 227,000 filtered Python code snippets for evaluating software engineering tasks such as code completion, code summarization, and commit generation. The snippets were collected from open-source repositories such as GitHub, then passed through preprocessing, data validation, and deduplication to ensure data quality. Galeras is intended mainly for causal inference analyses that reduce confounding bias when explaining LLM performance, helping researchers better understand and optimize code generation models.
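The description names deduplication as one of the data quality steps but does not say how it is performed. As a rough illustration only, the Python sketch below removes exact duplicates by hashing a whitespace-normalized form of each snippet. The functions `normalize` and `deduplicate` and the sample snippets are hypothetical and are not taken from the Galeras pipeline, which may use a different technique (for example, near-duplicate detection).

```python
import hashlib


def normalize(snippet: str) -> str:
    """Drop trailing whitespace and blank lines so trivial formatting
    differences do not make two snippets look distinct."""
    lines = [line.rstrip() for line in snippet.splitlines()]
    return "\n".join(line for line in lines if line)


def deduplicate(snippets):
    """Keep the first occurrence of each snippet, comparing by the
    SHA-256 digest of its normalized text (assumed strategy)."""
    seen = set()
    unique = []
    for snippet in snippets:
        digest = hashlib.sha256(normalize(snippet).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique


# Hypothetical sample data: the first two snippets differ only in whitespace.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):   \n\n    return a + b",
    "def sub(a, b):\n    return a - b\n",
]
unique = deduplicate(samples)
print(len(unique))
```

Hashing normalized text catches only exact copies up to whitespace; snippets that differ in identifier names or comments would still count as distinct under this sketch.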
Provided by:
Department of Computer Science, College of William & Mary
Created:
2023-08-24



