tianyang/repobench-p

Name: tianyang/repobench-p
Creator: tianyang
Published: 2023-07-19 06:13:35
License: CC BY-NC-ND 4.0

Hugging Face · Updated 2023-07-19 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/tianyang/repobench-p

Resource description:

---
language_creators:
- found
language:
- code
license:
- cc-by-nc-nd-4.0
multilinguality:
- multilingual
pretty_name: RepoBench-Pipeline
source_datasets:
- original
task_categories:
- text-retrieval
- text-generation
task_ids:
- document-retrieval
tags:
- code
---

# Dataset Card for RepoBench-P

## Dataset Description

- **Homepage:** https://github.com/Leolty/repobench
- **Paper:** https://arxiv.org/abs/2306.03091

## Dataset Summary

**RepoBench-P (Pipeline)** is a subtask of **RepoBench** ([GitHub](https://github.com/Leolty/repobench), [arXiv](https://arxiv.org/abs/2306.03091)) that combines the retrieval and code completion tasks. The retrieval step first selects the most relevant code snippet; the completion step then predicts the next line, using the retrieved snippet as cross-file context. This mirrors the complex real-world scenarios that a practical auto-completion system would face.

## Settings

- `cff`: short for cross_file_first; the cross-file module in the next line is used for the first time in the current file.
- `cfr`: short for cross_file_random; the cross-file module in the next line is not used for the first time in the current file (it has already appeared earlier).
- `if`: short for in_file; the next line does not contain any cross-file module.

## Supported Languages

- `python` and `java`

## Loading Data

For example, to load the `python` dataset, pass `"python"` as the configuration name. The `split` argument selects a specific setting.

```python
from datasets import load_dataset

dataset = load_dataset("tianyang/repobench-p", "python", split="cff")
```

> Note: The `split` argument is optional. If not provided, the entire dataset will be loaded.
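The loading call can be wrapped so that language and setting names are checked before anything is downloaded. This is only a minimal sketch: `SETTINGS`, `LANGUAGES`, and `load_setting` are hypothetical names introduced here, not part of the dataset or of the `datasets` library, and the final call assumes network access and an installed `datasets` package.

```python
# Hypothetical helper names; only load_dataset and its arguments come from the card.
SETTINGS = {
    "cff": "cross_file_first",
    "cfr": "cross_file_random",
    "if": "in_file",
}
LANGUAGES = ("python", "java")


def load_setting(language: str, setting: str):
    """Load one RepoBench-P setting after checking the names against the card."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    # Imported lazily so the name checks work without the library installed.
    from datasets import load_dataset

    return load_dataset("tianyang/repobench-p", language, split=setting)
```

With this helper, `load_setting("java", "cfr")` would request the Java data for the cross_file_random setting, while a misspelled setting fails immediately with a `ValueError`.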
## Dataset Structure

```json
{
    "repo_name": "repository name of the data point",
    "file_path": "path/to/current_file",
    "context": [
        {
            "path": "path/to/cross_file_1",
            "identifier": "identifier of the cross-file module",
            "snippet": "the code snippet of the cross-file module",
            "tokenized_snippet": "tokenized code snippet of the cross-file module"
        },
        // ...
        {
            "path": "path/to/cross_file_k",
            "identifier": "identifier of the cross-file module",
            "snippet": "the code snippet of the cross-file module",
            "tokenized_snippet": "tokenized code snippet of the cross-file module"
        }
    ],
    "import_statement": "all import statements in current file",
    "code": "the code for next-line prediction",
    "next_line": "the next line of the code",
    "gold_snippet_index": 2 // NOTE: Only for "cross_file_first" and "cross_file_random" settings, for "in_file" setting, we set it to -1.
}
```

## Licensing Information

CC BY-NC-ND 4.0

## Citation Information

```bibtex
@misc{liu2023repobench,
      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
      author={Tianyang Liu and Canwen Xu and Julian McAuley},
      year={2023},
      eprint={2306.03091},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Contributions

Thanks to [@Leolty](https://github.com/Leolty) for adding this dataset.

Provided by:

tianyang

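Summary of original information

Dataset overview

Dataset name: RepoBench-P (Pipeline)

Dataset summary: RepoBench-P is a subtask of RepoBench that combines retrieval and code completion. The retrieval step finds the most relevant code snippet, which is then used as cross-file context for next-line prediction. This mirrors the complex scenarios that a practical auto-completion system faces.

Dataset settings

cff: the cross-file module in the next line is used for the first time in the current file.
cfr: the cross-file module in the next line is not used for the first time in the current file.
if: the next line does not contain any cross-file module.

Supported languages

python
java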

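Dataset structure

{
  "repo_name": "repository name of the data point",
  "file_path": "path to the current file",
  "context": [
    {
      "path": "path to cross file 1",
      "identifier": "identifier of the cross-file module",
      "snippet": "code snippet of the cross-file module",
      "tokenized_snippet": "tokenized code snippet of the cross-file module"
    },
    // ...
    {
      "path": "path to cross file k",
      "identifier": "identifier of the cross-file module",
      "snippet": "code snippet of the cross-file module",
      "tokenized_snippet": "tokenized code snippet of the cross-file module"
    }
  ],
  "import_statement": "all import statements in the current file",
  "code": "the code for next-line prediction",
  "next_line": "the next line of the code",
  "gold_snippet_index": 2 // Only for the "cross_file_first" and "cross_file_random" settings; set to -1 for the "in_file" setting.
}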
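To make the role of gold_snippet_index concrete, the sketch below looks up the gold entry of a record and treats -1 as "no gold snippet", as the structure above describes for the in_file setting. The record is hand-written for illustration and is not taken from the dataset; `gold_snippet` is a hypothetical helper name.

```python
from typing import Optional


def gold_snippet(sample: dict) -> Optional[dict]:
    """Return the gold context entry, or None when the index is -1 (in_file)."""
    index = sample["gold_snippet_index"]
    if index == -1:
        return None
    return sample["context"][index]


# Hand-written record that follows the documented field names (illustrative only).
sample = {
    "repo_name": "example/repo",
    "file_path": "pkg/urls.py",
    "context": [
        {"path": "pkg/io.py", "identifier": "read_json",
         "snippet": "def read_json(path): ...", "tokenized_snippet": ""},
        {"path": "pkg/text.py", "identifier": "slugify",
         "snippet": "def slugify(title): ...", "tokenized_snippet": ""},
    ],
    "import_statement": "from pkg.text import slugify",
    "code": "def make_url(title):",
    "next_line": "    return slugify(title)",
    "gold_snippet_index": 1,
}

# The same record as it would look in the in_file setting.
in_file_sample = dict(sample, gold_snippet_index=-1)
```

Checking for -1 explicitly matters in Python, because `context[-1]` would otherwise silently return the last entry of the list.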

License information

CC BY-NC-ND 4.0

Citation information

@misc{liu2023repobench,
  title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
  author={Tianyang Liu and Canwen Xu and Julian McAuley},
  year={2023},
  eprint={2306.03091},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

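Collected summary

Dataset introduction

Construction

RepoBench-P was built to study repository-level code auto-completion and is designed to mimic realistic completion scenarios. It chains two tasks: a retrieval step first locates the most relevant code snippet in the repository, and the retrieved cross-file context is then used as input for next-line prediction. The data covers two programming languages, Python and Java, and is divided into three settings according to how the cross-file module is used: cff (first use of the cross-file module), cfr (non-first use of the cross-file module), and if (in-file context only). Each sample contains the repository name, the file path, a list of cross-file context entries, the import statements, the current code, the target next line, and the gold snippet index.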

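Features

The defining feature of RepoBench-P is that it models cross-file dependencies as they occur in real development, rather than restricting attention to local context as single-file completion datasets do. By chaining retrieval and completion, it evaluates whether a model can locate and use context from elsewhere in the repository. The three settings (cff, cfr, if) let researchers examine performance under different cross-file dependency scenarios separately. The dataset also provides a tokenized_snippet field for each context entry, and the gold_snippet_index field marks which context entry is the correct one, which makes retrieval performance straightforward to quantify.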

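Usage

RepoBench-P can be loaded with the Hugging Face datasets library. Specify the language subset (python or java) and use the split argument to select a setting (cff, cfr, or if); if split is omitted, all data is loaded. Each sample is a JSON-style record whose context list stores, for each cross-file module, its path, identifier, code snippet, and tokenized snippet. The code field serves as the model input and next_line as the prediction target, for training or evaluation. For retrieval, gold_snippet_index gives the position of the correct entry in context, which enables end-to-end evaluation of a combined retrieval and completion system.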
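The retrieve-then-complete flow described above can be sketched as follows. The retriever here is a simple token-overlap (Jaccard) ranking chosen only for illustration; it is not claimed to be the retriever used by the RepoBench authors. The function names and the sample record are hypothetical, and the prompt layout is an assumption rather than a prescribed format.

```python
import re


def tokens(text: str) -> set:
    """Split source text into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_snippets(sample: dict, last_n_lines: int = 3) -> list:
    """Rank context indices by overlap with the last lines of the in-file code."""
    query = tokens("\n".join(sample["code"].splitlines()[-last_n_lines:]))
    scores = [jaccard(query, tokens(c["snippet"])) for c in sample["context"]]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def build_prompt(sample: dict, top_k: int = 1) -> str:
    """Prepend the top-k retrieved snippets, then imports, then the code."""
    parts = []
    for i in rank_snippets(sample)[:top_k]:
        entry = sample["context"][i]
        parts.append(f"# {entry['path']}\n{entry['snippet']}")
    parts.append(sample["import_statement"])
    parts.append(sample["code"])
    return "\n\n".join(parts)


# Hand-written record following the documented fields (illustrative only).
sample = {
    "context": [
        {"path": "pkg/io.py", "identifier": "read_json",
         "snippet": "def read_json(path):\n    return json.load(open(path))"},
        {"path": "pkg/text.py", "identifier": "slugify",
         "snippet": "def slugify(title):\n    return title.lower().replace(' ', '-')"},
    ],
    "import_statement": "from pkg.io import read_json\nfrom pkg.text import slugify",
    "code": "def make_url(title):\n    base = 'https://example.org/'",
    "next_line": "    slug = slugify(title)",
    "gold_snippet_index": 1,
}

ranking = rank_snippets(sample)
prompt = build_prompt(sample)
```

The resulting prompt would be passed to a code model, and the first generated line compared with next_line; comparing `ranking` with gold_snippet_index scores the retrieval step on its own.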

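Background and challenges

Background

Code auto-completion systems have become a key productivity tool in software engineering, yet existing benchmarks are mostly limited to single-file or local context and therefore say little about how models perform at repository level. RepoBench-P (Pipeline), a subtask of RepoBench, was proposed in 2023 by Tianyang Liu, Canwen Xu, and Julian McAuley to address this gap. It combines retrieval and code completion to reproduce the cross-file dependency problem that practical auto-completion systems face: a model must first retrieve the most relevant code snippet and then predict the next line based on it. The central question is how well a model generates code given complex repository-level context. The dataset covers Python and Java and offers a more realistic evaluation standard for repository-level code completion.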

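Current challenges

RepoBench-P targets a gap in existing code completion benchmarks, which often ignore cross-file dependencies and so overlook the performance drop models can show in real repositories. Building it involved several challenges. First, usage relationships between cross-file modules have to be identified and extracted accurately from a large number of open-source repositories, so that each context snippet is logically related to the prediction target. Second, the three settings (cff, cfr, if) must cover different cross-file dependency scenarios while balancing coverage and difficulty. Third, data cleaning has to filter out irrelevant import statements and noisy code that could mislead prediction. Finally, the gold_snippet_index annotation makes retrieval results quantifiable and gives later work a consistent basis for comparison.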

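Common scenarios

Typical use

As a benchmark for repository-level code auto-completion, RepoBench-P is typically used to evaluate retrieval and generation with cross-file context. It chains the two tasks: a retrieval step locates the most relevant code snippet in the repository, and next-line prediction is then performed on top of the retrieved result. This pipeline reflects what an IDE auto-completion system faces in practice, where a model must understand not only the syntax of the current file but also its cross-file dependencies. For Python and Java, the cff, cfr, and if settings separately test first-time cross-file references, repeated cross-file references, and purely in-file completion.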
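Reporting results separately for each setting might look like the sketch below. It assumes retrieval is scored with accuracy@k against gold_snippet_index and completion with exact match on the whitespace-stripped next line; both metric choices, the `summarize` function, and the toy `results` rows are assumptions made for illustration, not a prescribed evaluation protocol.

```python
from collections import defaultdict


def summarize(results: list, k: int = 1) -> dict:
    """Aggregate retrieval accuracy@k and exact match for each setting."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row["setting"]].append(row)

    report = {}
    for setting, rows in buckets.items():
        # in_file rows carry gold_snippet_index == -1 and have no retrieval target.
        with_gold = [r for r in rows if r["gold_snippet_index"] != -1]
        hits = sum(r["gold_snippet_index"] in r["ranked"][:k] for r in with_gold)
        exact = sum(r["prediction"].strip() == r["next_line"].strip() for r in rows)
        report[setting] = {
            "n": len(rows),
            "acc_at_k": hits / len(with_gold) if with_gold else None,
            "exact_match": exact / len(rows),
        }
    return report


# Toy rows: "ranked" is a retriever's ordering of context indices and
# "prediction" is a model's first generated line (both made up here).
results = [
    {"setting": "cff", "gold_snippet_index": 1, "ranked": [1, 0],
     "prediction": "slug = slugify(title)", "next_line": "    slug = slugify(title)"},
    {"setting": "cff", "gold_snippet_index": 0, "ranked": [1, 0],
     "prediction": "return x", "next_line": "return read_json(p)"},
    {"setting": "if", "gold_snippet_index": -1, "ranked": [],
     "prediction": "return total", "next_line": "return total"},
]

report = summarize(results)
```

Keeping the in_file rows out of the retrieval score avoids counting samples that have no gold snippet, while they still contribute to exact match.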

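Derived and related work

This listing attributes several follow-up works to the release of RepoBench-P. Building on the retrieve-then-complete pipeline, it cites RepoFusion, described as a model that fuses cross-file context dynamically through attention, and RepoGraph, a method that encodes the repository dependency graph with a graph neural network to improve retrieval accuracy. For pre-training, it cites a CodeGPT-Repo variant that applies masked language modeling to repository-level code and is reported to improve results on RepoBench-P. The dataset is also said to be used to evaluate the cross-file code generation ability of large language models such as CodeLlama and StarCoder, as part of a shift from understanding isolated snippets to modeling repository-level semantics.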

Recent research on the dataset