tianyang/repobench_python_v1.1
Source: Hugging Face. Updated 2024-02-27; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/tianyang/repobench_python_v1.1
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: cross_file_first
    path: data/cross_file_first-*
  - split: cross_file_random
    path: data/cross_file_random-*
  - split: in_file
    path: data/in_file-*
dataset_info:
  features:
  - name: repo_name
    dtype: string
  - name: file_path
    dtype: string
  - name: context
    list:
    - name: identifier
      dtype: string
    - name: path
      dtype: string
    - name: snippet
      dtype: string
  - name: import_statement
    dtype: string
  - name: token_num
    dtype: int64
  - name: cropped_code
    dtype: string
  - name: all_code
    dtype: string
  - name: next_line
    dtype: string
  - name: gold_snippet_index
    dtype: int64
  - name: created_at
    dtype: string
  - name: level
    dtype: string
  splits:
  - name: cross_file_first
    num_bytes: 504528431
    num_examples: 8033
  - name: cross_file_random
    num_bytes: 467242455
    num_examples: 7618
  - name: in_file
    num_bytes: 488999100
    num_examples: 7910
  download_size: 472994299
  dataset_size: 1460769986
license: cc
task_categories:
- text-generation
language:
- en
tags:
- code
---
# RepoBench v1.1 (Python)
## Introduction
This dataset presents the **Python** portion of [RepoBench](https://arxiv.org/abs/2306.03091) v1.1 (ICLR 2024). The data encompasses a collection from GitHub, spanning the period from **October 6th to December 31st, 2023**. With a commitment to data integrity, we've implemented a deduplication process based on file content against the Stack v2 dataset (coming soon), aiming to mitigate data leakage and memorization concerns.
## Resources and Links
- [Paper](https://arxiv.org/abs/2306.03091)
- [GitHub](https://github.com/Leolty/repobench)
- [Dataset Introduction](https://github.com/Leolty/repobench/blob/main/data/README.md)
## FAQs
- **Q:** What do the features in the dataset mean?
**A:** Imagine you're coding in Python and you want to write the next line of your code. The dataset provides you the following information:
- `repo_name` (string): the name of the repository
- `file_path` (string): the path of the current file
- `context` (list): the cross-file code snippets that might be helpful for writing the next line:
  - `identifier` (string): the identifier of the code snippet
  - `path` (string): the path of the code snippet
  - `snippet` (string): the code snippet
- `import_statement` (string): the import statement of the current file
- `cropped_code` (string): the cropped code of the current file (up to previous 120 lines)
- `all_code` (string): the entire code of the current file (not cropped)
- `next_line` (string): the next line of the code (this serves as the target)
- `gold_snippet_index` (int): the index of the gold snippet in `context`, i.e. the snippet that the next line actually uses. It is provided for reference only and should not be used as input for next-line prediction.
- `created_at` (string): the creation time of the repository
- `level` (string): the level of next-line completion, determined by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code and some necessary separator tokens)
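To make the field layout concrete, here is a small hand-written sketch of what a record might look like and how `gold_snippet_index` points into `context`. Every value in it is invented for illustration (including the `created_at` format), and the `gold_snippet` helper is hypothetical, not part of the dataset or its tooling. How records without a gold snippet encode the index is not stated here, so the helper simply treats any out-of-range index as "no gold snippet".

```python
from typing import Optional

# A toy record with the documented fields; all values are made up.
record = {
    "repo_name": "example/repo",
    "file_path": "app/main.py",
    "context": [
        {
            "identifier": "load_config",
            "path": "app/config.py",
            "snippet": "def load_config(path):\n    return {}",
        },
        {
            "identifier": "Logger",
            "path": "app/log.py",
            "snippet": "class Logger:\n    pass",
        },
    ],
    "import_statement": "from app.config import load_config\nfrom app.log import Logger",
    "cropped_code": "def main():\n    logger = Logger()",
    "all_code": "def main():\n    logger = Logger()\n    config = load_config('settings.toml')",
    "next_line": "    config = load_config('settings.toml')",
    "gold_snippet_index": 0,
    "created_at": "2023-10-06",  # placeholder value; the real format may differ
    "level": "2k",
}


def gold_snippet(rec: dict) -> Optional[dict]:
    """Return the context entry that the target line relies on, if any."""
    idx = rec["gold_snippet_index"]
    if 0 <= idx < len(rec["context"]):
        return rec["context"][idx]
    return None  # assumption: an out-of-range index means no gold snippet


snippet = gold_snippet(record)
print(snippet["identifier"], "->", record["next_line"].strip())
```

The helper is meant for analysis after the fact, for example checking whether a retriever ranked the gold snippet highly. It should not feed into the prompt itself.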
- **Q:** How is the level defined?
**A:** The level is determined by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code and some necessary separator tokens). The token count is computed with the GPT-4 tokenizer via [tiktoken](https://github.com/openai/tiktoken). The following table shows the level definition:
| Level | Prompt Length (Number of Tokens) |
|-------|------------------------|
| 2k | 640 - 1,600 |
| 4k | 1,600 - 3,600 |
| 8k | 3,600 - 7,200 |
| 12k | 7,200 - 10,800 |
| 16k | 10,800 - 14,400 |
| 24k | 14,400 - 21,600 |
| 32k | 21,600 - 28,800 |
| 64k | 28,800 - 57,600 |
| 128k | 57,600 - 100,000 |
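The table can be read as a lookup from token count to level. The sketch below is one way to express that; the function name `level_for_token_count` is hypothetical. Because adjacent rows share a boundary value (for example 1,600), the code assumes each range includes its lower bound and excludes its upper bound. That is an assumption, not something the table specifies.

```python
from typing import Optional

# (level, lower bound, upper bound) copied from the table above.
# Assumption: lower bound inclusive, upper bound exclusive.
LEVEL_RANGES = [
    ("2k", 640, 1_600),
    ("4k", 1_600, 3_600),
    ("8k", 3_600, 7_200),
    ("12k", 7_200, 10_800),
    ("16k", 10_800, 14_400),
    ("24k", 14_400, 21_600),
    ("32k", 21_600, 28_800),
    ("64k", 28_800, 57_600),
    ("128k", 57_600, 100_000),
]


def level_for_token_count(token_num: int) -> Optional[str]:
    """Map a prompt token count to its level, or None if outside the table."""
    for level, lower, upper in LEVEL_RANGES:
        if lower <= token_num < upper:
            return level
    return None


# The token count itself would come from the GPT-4 tokenizer, e.g.
# len(tiktoken.encoding_for_model("gpt-4").encode(prompt)).
print(level_for_token_count(5_000))
```

Counts below 640 or at 100,000 and above fall outside every row, so the sketch returns `None` for them rather than guessing a level.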
- **Q:** What do the different splits mean?
**A:** The dataset is split into three parts:
- `cross_file_first`: the next line of code uses content from a cross-file code snippet, and this is the first usage of that content within the current file.
- `cross_file_random`: the next line of code uses content from a cross-file code snippet, and this is NOT the first usage of that content within the current file.
- `in_file`: the next line of code does not utilize content from a cross-file code snippet.
- **Q:** How to construct the prompt for next line prediction?
**A:** We provide the official implementation for constructing prompts below. Please note that this method is not necessarily the optimal way to construct a prompt. Reordering, retrieval augmentation, or different cropping/construction techniques could lead to varying degrees of improvement. Ensure that your model evaluations are conducted in a fair manner.
```python
import re


def construct_prompt(
    data: dict,
    language: str = "python",
    tokenizer=None,
    max_token_nums: int = 15800
) -> str:
    """
    Construct the prompt for next line prediction.

    :param data: data point from the dataset
    :param language: the language of the code
    :param tokenizer: the tokenizer of the evaluation model
    :param max_token_nums: the maximum number of tokens constraint for the prompt
    :return: the constructed prompt
    """
    # comment symbol for different languages
    comment_symbol = "#" if language == "python" else "//"

    # construct the cross-file prompt and in-file prompt separately
    # cross-file prompt
    cross_file_prompt = f"{comment_symbol} Repo Name: {data['repo_name']}\n"
    for snippet in data['context']:
        cross_file_prompt += f"{comment_symbol} Path: {snippet['path']}\n{snippet['snippet']}" + "\n\n"

    # in-file prompt
    in_file_prompt = f"{comment_symbol} Path: {data['file_path']}\n{data['import_statement']}\n{data['cropped_code']}\n"

    # if we assign the tokenizer and the max_token_nums, we will truncate the cross-file prompt to meet the constraint
    if tokenizer is not None and max_token_nums is not None:
        cross_file_prompt_token_nums = len(tokenizer.encode(cross_file_prompt))
        in_file_prompt_token_nums = len(tokenizer.encode(in_file_prompt))
        exceed_token_nums = cross_file_prompt_token_nums + in_file_prompt_token_nums - max_token_nums

        if exceed_token_nums > 0:
            # split the cross-file prompt into lines
            cross_file_prompt_lines = cross_file_prompt.split("\n")
            # drop lines from the end until the number of excess tokens is less than 0
            for i in range(len(cross_file_prompt_lines) - 1, -1, -1):
                exceed_token_nums -= len(tokenizer.encode(cross_file_prompt_lines[i]))
                if exceed_token_nums < 0:
                    break
            # join the remaining lines back
            cross_file_prompt = "\n".join(cross_file_prompt_lines[:i]) + "\n\n"

    # combine the cross-file prompt and in-file prompt
    prompt = cross_file_prompt + in_file_prompt

    # normalize some empty lines
    prompt = re.sub(r'\n{4,}', '\n\n', prompt)

    return prompt
```
- **Q:** How to load the dataset?
**A:** You can simply use the following code to load the dataset:
```python
from datasets import load_dataset
dataset = load_dataset("tianyang/repobench_python_v1.1")
```
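Since every record carries a `level` field, a quick first check after loading is to see how the examples in a split are distributed across levels. The helper below is a minimal sketch with a hypothetical name, `count_levels`; it accepts any iterable of dict-like records, so it is shown here on a toy list rather than on the downloaded data.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_levels(records: Iterable[Mapping]) -> Counter:
    """Count how many records fall into each prompt-length level."""
    return Counter(record["level"] for record in records)


# Toy stand-in for a split; only the `level` field matters here.
toy_records = [{"level": "2k"}, {"level": "8k"}, {"level": "2k"}]
print(count_levels(toy_records))

# With the real data this would be, for example:
#   count_levels(dataset["cross_file_first"])
```

To keep only one level for evaluation, the `filter` method of the `datasets` library can be applied to a split, for instance `dataset["in_file"].filter(lambda r: r["level"] == "2k")`.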
To construct the prompt for next line prediction, you can refer to the official implementation provided in the previous question and use the `construct_prompt` function to construct the prompt, for example:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
prompt = construct_prompt(dataset['cross_file_first'][0], tokenizer=tokenizer, max_token_nums=15800)
```
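Once a model has generated a continuation of the prompt, the output has to be reduced to a single line and compared with `next_line`. The sketch below shows one plausible way to do that; it is not the benchmark's official scoring code. The helper names are hypothetical, the first non-blank line is taken as the prediction (skipping comment lines would be another reasonable choice), and `difflib.SequenceMatcher` from the standard library is used as a stand-in similarity measure.

```python
from difflib import SequenceMatcher


def first_code_line(completion: str) -> str:
    """Return the first non-blank line of a model completion."""
    for line in completion.split("\n"):
        if line.strip():
            return line
    return ""


def exact_match(prediction: str, target: str) -> bool:
    """Compare two lines, ignoring leading and trailing whitespace."""
    return prediction.strip() == target.strip()


def similarity(prediction: str, target: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not an official metric."""
    return SequenceMatcher(None, prediction.strip(), target.strip()).ratio()


# Toy completion and target, made up for illustration.
completion = "\n    config = load_config('settings.toml')\n    return config\n"
target = "    config = load_config('settings.toml')"

prediction = first_code_line(completion)
print(exact_match(prediction, target), similarity(prediction, target))
```

Whatever post-processing is chosen, applying the same rule to every model being compared keeps the evaluation fair.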
- **Q:** How often will the dataset be updated?
**A:** We plan to update the dataset every three months, but there might be slight delays considering the time required for data crawling and our own schedules. If you require updated data, please feel free to contact us, and we can coordinate the timing and expedite the process.
- **Q:** What models should I use to evaluate the dataset?
**A:** RepoBench is designed to evaluate base models, not those that have been instruction fine-tuned. Please use base models for evaluation.
- **Q:** I am training a new model but the knowledge cutoff date is after the dataset's. Can you provide me with the latest data?
**A:** Sure! We are happy to provide you with the latest data (even customized data with specific requirements). Please feel free to contact us.
- **Q:** Can I opt out?
**A:** Yes, you can opt your repository out of the dataset. Please check [Am I in RepoBench?](https://huggingface.co/spaces/tianyang/in-the-repobench). We will upload the raw repository information we crawled at least 15 days before the dataset is created and released. If you opt out, we will respect your decision and remove your repository from the dataset.
## Citation
If you find RepoBench useful in your research, please consider citing the paper using the following BibTeX entry:
```bibtex
@misc{liu2023repobench,
title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
author={Tianyang Liu and Canwen Xu and Julian McAuley},
year={2024},
url={https://arxiv.org/abs/2306.03091},
booktitle={International Conference on Learning Representations}
}
```
Your interest and contributions to RepoBench are immensely valued. Happy coding! 🚀
Provider:
tianyang
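Summary of original information
RepoBench v1.1 (Python) dataset overview
Dataset configuration
- Default configuration, data files:
  - cross_file_first: path data/cross_file_first-*
  - cross_file_random: path data/cross_file_random-*
  - in_file: path data/in_file-*
Dataset information
- Features:
  - repo_name (string): repository name
  - file_path (string): path of the current file
  - context (list): cross-file code snippets, each containing:
    - identifier (string): identifier of the code snippet
    - path (string): path of the code snippet
    - snippet (string): content of the code snippet
  - import_statement (string): import statements of the current file
  - token_num (int64): number of tokens
  - cropped_code (string): cropped code of the current file (up to the previous 120 lines)
  - all_code (string): full code of the current file
  - next_line (string): the next line of code (the target)
  - gold_snippet_index (int64): index of the gold snippet in the context (for reference only; should not be used for next-line prediction)
  - created_at (string): creation time of the repository
  - level (string): next-line completion level, determined by the number of tokens in the whole prompt
- Splits:
  - cross_file_first: the next line uses a cross-file code snippet, and it is the first usage (504528431 bytes, 8033 examples)
  - cross_file_random: the next line uses a cross-file code snippet, but it is not the first usage (467242455 bytes, 7618 examples)
  - in_file: the next line does not use a cross-file code snippet (488999100 bytes, 7910 examples)
- Download size: 472994299 bytes
- Dataset size: 1460769986 bytes
- License: cc
- Task category: text generation
- Language: English
- Tags: code
Dataset loading
- Load the dataset with `from datasets import load_dataset` followed by `dataset = load_dataset("tianyang/repobench_python_v1.1")`.
Prompt construction
- An official prompt construction function, construct_prompt, is provided for building next-line prediction prompts.
Dataset updates
- The dataset is planned to be updated every three months.
Model evaluation
- Base models are recommended for evaluation, rather than instruction fine-tuned models.
Data customization
- The latest data or customized data can be provided on request.
Opt-out
- Specific repositories can be removed from the dataset on request.
Citation
- To cite the dataset, use the provided BibTeX entry.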
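Collected summary
Dataset introduction
Construction
In the field of code generation, the RepoBench Python v1.1 dataset is built systematically to support benchmarking of repository-level code auto-completion. It collects Python repositories from GitHub covering the period from October 6 to December 31, 2023, and deduplicates them by file content against the Stack v2 dataset to reduce the risk of data leakage and memorization. The data is organized into three splits: cross-file first usage, cross-file random usage, and in-file usage. For each split, automated scripts extract code snippets, import statements, and context information, so that the data is representative and practical for code completion tasks.
Features
As a code completion benchmark, the dataset has several notable characteristics. At its core it simulates real programming scenarios, covering both cross-file and in-file code dependencies and completion needs that range from simple to complex. Each sample contains the repository name, file path, context code snippets, and the target next line, and is labeled with a prompt-length level from 2k to 128k tokens to match the capacity of different models. This tiered design lets researchers measure model performance at different context lengths and supports fine-grained evaluation of code generation models.
Usage
Evaluation with this dataset should follow a standard workflow. Researchers can load the dataset directly with the Hugging Face datasets library and build prompt text with the officially provided construct_prompt function. Given a language and a token limit, the function combines the cross-file prompt with the in-file prompt and truncates the cross-file part when needed to fit the model input. Evaluation should use base models rather than instruction fine-tuned models to keep the benchmark fair. The three splits allow targeted testing of different code-dependency scenarios and give a systematic framework for assessing repository-level code completion.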
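Background and challenges
Background overview
In code intelligence, as large language models have become widely used for code generation, traditional single-file code completion benchmarks can no longer fully assess a model's ability in real software development environments. RepoBench, proposed by Tianyang Liu and colleagues (ICLR 2024), was created in response. Its Python subset v1.1 contains data from open-source GitHub repositories from October to December 2023. The core research question is repository-level code auto-completion: by introducing cross-file context, the dataset pushes models to understand project-level code structure and dependencies. It sets a new benchmark for evaluating code language models and has had a far-reaching influence on improving the practicality and generalization of code intelligence systems.
Current challenges
RepoBench targets the complex task of repository-level code auto-completion. The primary challenge is modeling cross-file code dependencies and references accurately, which requires models to understand not only local syntax but also project-level semantic relationships. During construction, the research team faced the demanding problems of deduplication and leakage prevention, and addressed them by deduplicating on file content against the Stack v2 dataset to lower the risk of memorization. Data collection also has to balance code quality against scale, and the prompt construction and length-level scheme must be designed to suit models of different sizes, so that the evaluation is fair and sound.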
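Common scenarios
Classic use cases
In code intelligence research, the RepoBench Python dataset provides a benchmark for evaluating code auto-completion systems. It simulates real development scenarios by requiring a model to predict the next line of code from cross-file code snippets and the context of the current file. Its classic use is the systematic evaluation of large language models on code generation, especially with long contexts and complex dependencies, where it tests how well a model understands and uses repository-level information.
Practical applications
In practical software development, the dataset supports the development and tuning of intelligent programming assistants. The evaluation framework built on it can guide performance improvements in real systems such as IDE plugins and code completion engines. By simulating the real situation in which a developer must consult code in other files to finish the current edit, it helps engineering teams train and select AI assistants that better understand overall project structure and reduce context switching, which improves development efficiency and code quality.
Derived related work
A body of research on repository-level code intelligence has grown around RepoBench. This work mainly explores how to retrieve and integrate cross-file context more effectively and how to design model architectures suited to long-sequence code generation. Some studies extend the evaluation framework to multiple programming languages or combine it with techniques such as instruction fine-tuning and retrieval-augmented generation, advancing both the practical application and the theory of code completion systems in complex software engineering tasks.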
The content above was collected and summarized by 遇见数据集.