tianyang/repobench_java_v1.1
Source: Hugging Face. Updated: 2024-02-27. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tianyang/repobench_java_v1.1
Description:
---
configs:
- config_name: default
  data_files:
  - split: cross_file_first
    path: data/cross_file_first-*
  - split: cross_file_random
    path: data/cross_file_random-*
  - split: in_file
    path: data/in_file-*
dataset_info:
  features:
  - name: repo_name
    dtype: string
  - name: file_path
    dtype: string
  - name: context
    list:
    - name: identifier
      dtype: string
    - name: path
      dtype: string
    - name: snippet
      dtype: string
  - name: import_statement
    dtype: string
  - name: token_num
    dtype: int64
  - name: cropped_code
    dtype: string
  - name: all_code
    dtype: string
  - name: next_line
    dtype: string
  - name: gold_snippet_index
    dtype: int64
  - name: created_at
    dtype: string
  - name: level
    dtype: string
  splits:
  - name: cross_file_first
    num_bytes: 504528431
    num_examples: 8033
  - name: cross_file_random
    num_bytes: 467242455
    num_examples: 7618
  - name: in_file
    num_bytes: 488999100
    num_examples: 7910
  download_size: 472994299
  dataset_size: 1460769986
license: cc
task_categories:
- text-generation
language:
- en
tags:
- code
---
# RepoBench v1.1 (Java)
## Introduction
This dataset presents the **Java** portion of [RepoBench](https://arxiv.org/abs/2306.03091) v1.1 (ICLR 2024). The data was collected from GitHub and spans the period from **October 6th to December 31st, 2023**. To mitigate data leakage and memorization concerns, we deduplicated the data by file content against the Stack v2 dataset (coming soon).
## Resources and Links
- [Paper](https://arxiv.org/abs/2306.03091)
- [GitHub](https://github.com/Leolty/repobench)
- [Dataset Introduction](https://github.com/Leolty/repobench/blob/main/data/README.md)
## FAQs
- **Q:** What do the features in the dataset mean?
**A:** Imagine you're coding and you want to write the next line of your code. The dataset provides you with the following information:
  - `repo_name` (string): the name of the repository
  - `file_path` (string): the path of the current file
  - `context` (list): the cross-file code snippets that might be helpful for writing the next line:
    - `identifier` (string): the identifier of the code snippet
    - `path` (string): the path of the code snippet
    - `snippet` (string): the code snippet
  - `import_statement` (string): the import statement of the current file
  - `token_num` (int): the token count for this example
  - `cropped_code` (string): the cropped code of the current file (up to the previous 120 lines)
  - `all_code` (string): the entire code of the current file (not cropped)
  - `next_line` (string): the next line of the code (this serves as the target)
  - `gold_snippet_index` (int): the index of the gold snippet in `context`, i.e. the snippet that the next line uses (provided for reference only; do not use it for next-line prediction)
  - `created_at` (string): the creation time of the repository
  - `level` (string): the level of next-line completion, determined by the number of tokens in the whole prompt (all the context, the import statement, the cropped code, and some necessary separator tokens)
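To make the field list concrete, here is a minimal sketch of what one record might look like as a Python dict. The record is hand-written and hypothetical (the repository, paths, and code are invented for illustration, and several fields are omitted for brevity); `gold_snippet` is an illustrative helper, not part of the dataset tooling.

```python
# A hand-written, hypothetical record that follows the field list above.
# Real records also carry all_code, token_num, created_at, and level.
record = {
    "repo_name": "example/demo",
    "file_path": "src/main/java/demo/App.java",
    "context": [
        {
            "identifier": "Config",
            "path": "src/main/java/demo/Config.java",
            "snippet": "public class Config { }",
        },
        {
            "identifier": "Greeter",
            "path": "src/main/java/demo/Greeter.java",
            "snippet": "public class Greeter {\n    public String greet() { return \"hi\"; }\n}",
        },
    ],
    "import_statement": "import demo.Greeter;",
    "cropped_code": "public class App {\n    public static void main(String[] args) {",
    "next_line": "        Greeter greeter = new Greeter();",
    "gold_snippet_index": 1,
}


def gold_snippet(rec: dict) -> dict:
    """Return the context entry that the target line relies on (analysis only)."""
    return rec["context"][rec["gold_snippet_index"]]


gold = gold_snippet(record)
print(gold["identifier"], "->", gold["path"])
```

A helper like this is only suitable for inspecting data or analysing results; as noted above, `gold_snippet_index` should not be fed into the prediction itself.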
- **Q:** How is the level defined?
**A:** The level is determined by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code, and some necessary separator tokens). Token counts are computed with the GPT-4 tokenizer via [tiktoken](https://github.com/openai/tiktoken). The following table shows the level definition:
| Level | Prompt Length (Number of Tokens) |
|-------|------------------------|
| 2k | 640 - 1,600 |
| 4k | 1,600 - 3,600 |
| 8k | 3,600 - 7,200 |
| 12k | 7,200 - 10,800 |
| 16k | 10,800 - 14,400 |
| 24k | 14,400 - 21,600 |
| 32k | 21,600 - 28,800 |
| 64k | 28,800 - 57,600 |
| 128k | 57,600 - 100,000 |
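The table can be read as a simple lookup from token count to level. The sketch below encodes it that way; `LEVEL_RANGES` and `level_for` are hypothetical names, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the table does not say which level owns a boundary value such as 1,600.

```python
from typing import Optional

# (level, lower bound, upper bound) in tokens, copied from the table above.
LEVEL_RANGES = [
    ("2k", 640, 1_600),
    ("4k", 1_600, 3_600),
    ("8k", 3_600, 7_200),
    ("12k", 7_200, 10_800),
    ("16k", 10_800, 14_400),
    ("24k", 14_400, 21_600),
    ("32k", 21_600, 28_800),
    ("64k", 28_800, 57_600),
    ("128k", 57_600, 100_000),
]


def level_for(token_num: int) -> Optional[str]:
    """Return the level whose range contains token_num, or None if out of range.

    Assumption: lower bounds are inclusive and upper bounds are exclusive.
    """
    for level, low, high in LEVEL_RANGES:
        if low <= token_num < high:
            return level
    return None
```

If you want to recompute the count for your own prompt, something like `len(tiktoken.encoding_for_model("gpt-4").encode(prompt))` should give a comparable number, though the stored `level` field remains the authoritative value.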
- **Q:** What do the different splits mean?
**A:** The dataset is split into three parts:
  - `cross_file_first`: the next line of code uses content from a cross-file code snippet, and this is its first usage within the current file.
  - `cross_file_random`: the next line of code uses content from a cross-file code snippet, and this is NOT its first usage within the current file.
  - `in_file`: the next line of code does not use content from a cross-file code snippet.
- **Q:** How to construct the prompt for next line prediction?
**A:** We provide the official implementation for constructing prompts below. Please note that this method is not necessarily the optimal way of construction. Reordering, retrieval augmentation, or different cropping/construction techniques could lead to varying degrees of improvement. Ensure that your model evaluations are conducted in a fair manner.
```python
import re

def construct_prompt(
    data: dict,
    language: str = "java",
    tokenizer=None,
    max_token_nums: int = 15800
) -> str:
    """
    Construct the prompt for next line prediction.

    :param data: data point from the dataset
    :param language: the language of the code
    :param tokenizer: the tokenizer of the evaluation model
    :param max_token_nums: the maximum number of tokens constraint for the prompt
    :return: the constructed prompt
    """
    # comment symbol for different languages
    comment_symbol = "#" if language == "python" else "//"

    # construct the cross-file prompt and in-file prompt separately
    # cross-file prompt
    cross_file_prompt = f"{comment_symbol} Repo Name: {data['repo_name']}\n"
    for snippet in data['context']:
        cross_file_prompt += f"{comment_symbol} Path: {snippet['path']}\n{snippet['snippet']}" + "\n\n"

    # in-file prompt
    in_file_prompt = f"{comment_symbol} Path: {data['file_path']}\n{data['import_statement']}\n{data['cropped_code']}\n"

    # if we assign the tokenizer and the max_token_nums, we will truncate the cross-file prompt to meet the constraint
    if tokenizer is not None and max_token_nums is not None:
        cross_file_prompt_token_nums = len(tokenizer.encode(cross_file_prompt))
        in_file_prompt_token_nums = len(tokenizer.encode(in_file_prompt))
        exceed_token_nums = cross_file_prompt_token_nums + in_file_prompt_token_nums - max_token_nums

        if exceed_token_nums > 0:
            # split the cross-file prompt into lines
            cross_file_prompt_lines = cross_file_prompt.split("\n")
            # drop lines from the end until the excess token count falls below 0
            for i in range(len(cross_file_prompt_lines) - 1, -1, -1):
                exceed_token_nums -= len(tokenizer.encode(cross_file_prompt_lines[i]))
                if exceed_token_nums < 0:
                    break
            # join the remaining lines back
            cross_file_prompt = "\n".join(cross_file_prompt_lines[:i]) + "\n\n"

    # combine the cross-file prompt and in-file prompt
    prompt = cross_file_prompt + in_file_prompt

    # normalize some empty lines
    prompt = re.sub(r'\n{4,}', '\n\n', prompt)

    return prompt
```
- **Q:** How to load the dataset?
**A:** You can simply use the following code to load the dataset:
```python
from datasets import load_dataset
dataset = load_dataset("tianyang/repobench_java_v1.1")
```
To construct the prompt for next-line prediction, use the `construct_prompt` function from the official implementation in the previous question, for example:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
prompt = construct_prompt(dataset['cross_file_first'][0], language="java", tokenizer=tokenizer, max_token_nums=15800)
```
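The example above stops once the prompt is built. A model's raw completion usually spans several lines, so some post-processing is needed before comparing it with `next_line`. The sketch below shows one possible approach; `first_code_line` and `exact_match` are hypothetical helpers written for illustration, not the official evaluation code, and the comment handling only covers `//` line comments (not `/* ... */` blocks).

```python
def first_code_line(completion: str, comment_prefix: str = "//") -> str:
    """Return the first non-empty line that is not a line comment, stripped."""
    for line in completion.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            return stripped
    # Nothing usable was generated.
    return ""


def exact_match(prediction: str, target: str) -> bool:
    """Compare prediction and target while ignoring surrounding whitespace."""
    return prediction.strip() == target.strip()


# Hypothetical model output: a blank line, a comment, then two code lines.
completion = "\n// create the greeter\n    Greeter greeter = new Greeter();\n    greeter.greet();"
prediction = first_code_line(completion)
print(exact_match(prediction, "        Greeter greeter = new Greeter();"))
```

Whatever post-processing you choose, apply the same rule to every model you compare so that the evaluation stays fair.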
- **Q:** How often will the dataset be updated?
**A:** We plan to update the dataset every three months, but there might be slight delays considering the time required for data crawling and our own schedules. If you require updated data, please feel free to contact us, and we can coordinate the timing and expedite the process.
- **Q:** What models should I use to evaluate the dataset?
**A:** RepoBench is designed to evaluate base models, not those that have been instruction fine-tuned. Please use base models for evaluation.
- **Q:** I am training a new model but the knowledge cutoff date is after the dataset's. Can you provide me with the latest data?
**A:** Sure! We are happy to provide you with the latest data (even customized data with specific requirements). Please feel free to contact us.
- **Q:** Can I opt out?
**A:** Yes, you can opt your repository out of the dataset. Please check [Am I in RepoBench?](https://huggingface.co/spaces/tianyang/in-the-repobench). We will upload the raw repository information we crawled at least 15 days before each dataset creation and release, and we will remove your repository from the dataset if you opt out.
## Citation
If you find RepoBench useful in your research, please consider citing the paper using the following BibTeX entry:
```bibtex
@inproceedings{liu2023repobench,
  title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
  author={Tianyang Liu and Canwen Xu and Julian McAuley},
  year={2024},
  url={https://arxiv.org/abs/2306.03091},
  booktitle={International Conference on Learning Representations}
}
```
Your interest and contributions to RepoBench are immensely valued. Happy coding! 🚀
Provider: tianyang