
tianyang/repobench_java_v1.1

Source: Hugging Face. Updated 2024-02-27; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tianyang/repobench_java_v1.1
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: cross_file_first
    path: data/cross_file_first-*
  - split: cross_file_random
    path: data/cross_file_random-*
  - split: in_file
    path: data/in_file-*
dataset_info:
  features:
  - name: repo_name
    dtype: string
  - name: file_path
    dtype: string
  - name: context
    list:
    - name: identifier
      dtype: string
    - name: path
      dtype: string
    - name: snippet
      dtype: string
  - name: import_statement
    dtype: string
  - name: token_num
    dtype: int64
  - name: cropped_code
    dtype: string
  - name: all_code
    dtype: string
  - name: next_line
    dtype: string
  - name: gold_snippet_index
    dtype: int64
  - name: created_at
    dtype: string
  - name: level
    dtype: string
  splits:
  - name: cross_file_first
    num_bytes: 504528431
    num_examples: 8033
  - name: cross_file_random
    num_bytes: 467242455
    num_examples: 7618
  - name: in_file
    num_bytes: 488999100
    num_examples: 7910
  download_size: 472994299
  dataset_size: 1460769986
license: cc
task_categories:
- text-generation
language:
- en
tags:
- code
---

# RepoBench v1.1 (Java)

## Introduction

This dataset presents the **Java** portion of [RepoBench](https://arxiv.org/abs/2306.03091) v1.1 (ICLR 2024). The data is a collection from GitHub spanning the period from **October 6th to December 31st, 2023**. To mitigate data leakage and memorization concerns, we deduplicate by file content against the Stack v2 dataset (coming soon).

## Resources and Links

- [Paper](https://arxiv.org/abs/2306.03091)
- [GitHub](https://github.com/Leolty/repobench)
- [Dataset Introduction](https://github.com/Leolty/repobench/blob/main/data/README.md)

## FAQs

- **Q:** What do the features in the dataset mean?

  **A:** Imagine you're coding and you want to write the next line of your code. The dataset provides you with the following information:

  - `repo_name` (string): the name of the repository
  - `file_path` (string): the path of the current file
  - `context` (list): the cross-file code snippets that might be helpful for writing the next line:
    - `identifier` (string): the identifier of the code snippet
    - `path` (string): the path of the code snippet
    - `snippet` (string): the code snippet
  - `import_statement` (string): the import statement of the current file
  - `cropped_code` (string): the cropped code of the current file (up to the previous 120 lines)
  - `all_code` (string): the entire code of the current file (not cropped)
  - `next_line` (string): the next line of the code (this serves as the target)
  - `gold_snippet_index` (int): the index of the gold snippet in the context, i.e. the snippet that the next line uses (for reference only; do not use it for next-line prediction)
  - `created_at` (string): the creation time of the repository
  - `level` (string): the level of the next-line completion, measured by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code and some necessary separator tokens)

- **Q:** How is the level defined?

  **A:** The level is determined by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code and some necessary separator tokens). The token count is computed with the GPT-4 tokenizer via [tiktoken](https://github.com/openai/tiktoken). The following table shows the level definition:

  | Level | Prompt Length (Number of Tokens) |
  |-------|----------------------------------|
  | 2k    | 640 - 1,600                      |
  | 4k    | 1,600 - 3,600                    |
  | 8k    | 3,600 - 7,200                    |
  | 12k   | 7,200 - 10,800                   |
  | 16k   | 10,800 - 14,400                  |
  | 24k   | 14,400 - 21,600                  |
  | 32k   | 21,600 - 28,800                  |
  | 64k   | 28,800 - 57,600                  |
  | 128k  | 57,600 - 100,000                 |

- **Q:** What do the different splits mean?

  **A:** The dataset is split into three parts:

  - `cross_file_first`: the next line of code uses content from a cross-file code snippet, and this is its first usage within the current file.
  - `cross_file_random`: the next line of code uses content from a cross-file code snippet, and this is NOT its first usage within the current file.
  - `in_file`: the next line of code does not use content from a cross-file code snippet.

- **Q:** How do I construct the prompt for next-line prediction?

  **A:** We provide the official implementation for constructing prompts below. Please note that this is not necessarily the optimal way to construct them. Reordering, retrieval augmentation, or different cropping/construction techniques could lead to varying degrees of improvement. Ensure that your model evaluations are conducted in a fair manner.

  ```python
  import re

  def construct_prompt(
      data: dict,
      language: str = "java",
      tokenizer=None,
      max_token_nums: int = 15800
  ) -> str:
      """
      Construct the prompt for next line prediction.

      :param data: data point from the dataset
      :param language: the language of the code
      :param tokenizer: the tokenizer of the evaluation model
      :param max_token_nums: the maximum number of tokens constraint for the prompt

      :return: the constructed prompt
      """

      # comment symbol for different languages
      comment_symbol = "#" if language == "python" else "//"

      # construct the cross-file prompt and in-file prompt separately
      # cross-file prompt
      cross_file_prompt = f"{comment_symbol} Repo Name: {data['repo_name']}\n"

      for snippet in data['context']:
          cross_file_prompt += f"{comment_symbol} Path: {snippet['path']}\n{snippet['snippet']}" + "\n\n"

      # in-file prompt
      in_file_prompt = f"{comment_symbol} Path: {data['file_path']}\n{data['import_statement']}\n{data['cropped_code']}\n"

      # if a tokenizer and max_token_nums are given, truncate the cross-file prompt to meet the constraint
      if tokenizer is not None and max_token_nums is not None:

          cross_file_prompt_token_nums = len(tokenizer.encode(cross_file_prompt))
          in_file_prompt_token_nums = len(tokenizer.encode(in_file_prompt))

          exceed_token_nums = cross_file_prompt_token_nums + in_file_prompt_token_nums - max_token_nums

          if exceed_token_nums > 0:
              # split the cross-file prompt into lines
              cross_file_prompt_lines = cross_file_prompt.split("\n")
              # drop lines from the end until the number of excess tokens is below 0
              for i in range(len(cross_file_prompt_lines) - 1, -1, -1):
                  exceed_token_nums -= len(tokenizer.encode(cross_file_prompt_lines[i]))
                  if exceed_token_nums < 0:
                      break

              # join the remaining lines back together
              cross_file_prompt = "\n".join(cross_file_prompt_lines[:i]) + "\n\n"

      # combine the cross-file prompt and in-file prompt
      prompt = cross_file_prompt + in_file_prompt

      # normalize runs of empty lines
      prompt = re.sub(r'\n{4,}', '\n\n', prompt)

      return prompt
  ```

- **Q:** How do I load the dataset?

  **A:** You can use the following code to load the dataset:

  ```python
  from datasets import load_dataset

  dataset = load_dataset("tianyang/repobench_java_v1.1")
  ```

  To construct the prompt for next-line prediction, use the `construct_prompt` function from the official implementation given in the previous question, for example:

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

  tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
  model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")

  prompt = construct_prompt(dataset['cross_file_first'][0], language="java", tokenizer=tokenizer, max_token_nums=15800)
  ```

- **Q:** How often will the dataset be updated?

  **A:** We plan to update the dataset every three months, but there might be slight delays given the time required for data crawling and our own schedules. If you require updated data, please feel free to contact us, and we can coordinate the timing and expedite the process.

- **Q:** What models should I use to evaluate the dataset?

  **A:** RepoBench is designed to evaluate base models, not those that have been instruction fine-tuned. Please use base models for evaluation.

- **Q:** I am training a new model but the knowledge cutoff date is after the dataset's. Can you provide me with the latest data?

  **A:** Sure! We are happy to provide you with the latest data (even customized data with specific requirements). Please feel free to contact us.

- **Q:** Can I opt out?

  **A:** Yes, you can opt your repository out of the dataset. Please check [Am I in RepoBench?](https://huggingface.co/spaces/tianyang/in-the-repobench); we will upload the raw repository information we crawled at least 15 days before the dataset is created and released. We will respect your decision and remove your repository from the dataset if you opt out.
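
The level table in the FAQ above can be read as a lookup from prompt token count to level label. The following is a minimal sketch of that lookup, not part of the official implementation: `LEVEL_BOUNDS` and `level_for_token_count` are hypothetical names, and treating each shared boundary (for example 1,600) as belonging to the higher level is an assumption, since the table does not say which side it falls on.

```python
from typing import Optional

# (label, lower bound inclusive, upper bound exclusive), copied from the level table.
# Assumption: a shared boundary such as 1,600 is assigned to the higher level.
LEVEL_BOUNDS = [
    ("2k", 640, 1_600),
    ("4k", 1_600, 3_600),
    ("8k", 3_600, 7_200),
    ("12k", 7_200, 10_800),
    ("16k", 10_800, 14_400),
    ("24k", 14_400, 21_600),
    ("32k", 21_600, 28_800),
    ("64k", 28_800, 57_600),
    ("128k", 57_600, 100_000),
]


def level_for_token_count(token_num: int) -> Optional[str]:
    """Return the level label for a prompt length, or None if it is outside the table."""
    for label, low, high in LEVEL_BOUNDS:
        if low <= token_num < high:
            return label
    return None
```

For a loaded example, the `token_num` field would be the natural input, and the result could be compared against the stored `level` field as a sanity check.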

## Citation

If you find RepoBench useful in your research, please consider citing the paper using the following BibTeX entry:

```bibtex
@misc{liu2023repobench,
      title={RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems},
      author={Tianyang Liu and Canwen Xu and Julian McAuley},
      year={2024},
      url={https://arxiv.org/abs/2306.03091},
      booktitle={International Conference on Learning Representations}
}
```

Your interest and contributions to RepoBench are immensely valued. Happy coding! 🚀
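Provided by:
tianyang
Summary of original information

RepoBench v1.1 (Java) dataset overview

Dataset configuration

  • Default configuration
    • Data files:
      • cross_file_first: path data/cross_file_first-*
      • cross_file_random: path data/cross_file_random-*
      • in_file: path data/in_file-*

Dataset information

  • Features

    • repo_name (string): the name of the repository
    • file_path (string): the path of the current file
    • context (list): cross-file code snippets that might help in writing the next line:
      • identifier (string): the identifier of the code snippet
      • path (string): the path of the code snippet
      • snippet (string): the code snippet
    • import_statement (string): the import statement of the current file
    • token_num (int64): the number of tokens
    • cropped_code (string): the cropped code of the current file (up to the previous 120 lines)
    • all_code (string): the entire code of the current file (not cropped)
    • next_line (string): the next line of code (the target)
    • gold_snippet_index (int64): the index of the gold snippet in the context (for reference only; not to be used for next-line prediction)
    • created_at (string): the creation time of the repository
    • level (string): the level of the next-line completion, determined by the number of tokens in the whole prompt (including all the context, the import statement, the cropped code and some necessary separator tokens)
  • Splits

    • cross_file_first: the next line of code uses a cross-file code snippet, and this is its first usage in the current file
      • Bytes: 504528431
      • Examples: 8033
    • cross_file_random: the next line of code uses a cross-file code snippet, but this is not its first usage in the current file
      • Bytes: 467242455
      • Examples: 7618
    • in_file: the next line of code does not use a cross-file code snippet
      • Bytes: 488999100
      • Examples: 7910
  • Download size: 472994299 bytes

  • Dataset size: 1460769986 bytes

  • License: cc

  • Task category: text generation

  • Language: English

  • Tags: code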
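
As an illustration of the feature list above, the sketch below checks that a record (a plain `dict`, such as one example from a loaded split) carries the listed fields with the expected Python types. It is a hedged sketch rather than anything shipped with the dataset: `EXPECTED_FIELDS`, `CONTEXT_FIELDS` and `schema_problems` are hypothetical names, and mapping `int64` to `int` and `string` to `str` is an assumption about how the values arrive in Python.

```python
from typing import Any, Dict, List

# Top-level fields and the Python types assumed for them, following the feature list.
EXPECTED_FIELDS = {
    "repo_name": str,
    "file_path": str,
    "context": list,
    "import_statement": str,
    "token_num": int,
    "cropped_code": str,
    "all_code": str,
    "next_line": str,
    "gold_snippet_index": int,
    "created_at": str,
    "level": str,
}
# Fields expected inside each entry of `context`.
CONTEXT_FIELDS = ("identifier", "path", "snippet")


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    context = record.get("context")
    if isinstance(context, list):
        for index, snippet in enumerate(context):
            for key in CONTEXT_FIELDS:
                if not isinstance(snippet, dict) or not isinstance(snippet.get(key), str):
                    problems.append(f"context[{index}] lacks string field: {key}")
    return problems
```

A check like this is mainly useful when examples are customized or re-serialized before evaluation, where a dropped or renamed field would otherwise only surface inside prompt construction.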

Dataset usage

  • Loading the dataset:

    ```python
    from datasets import load_dataset

    dataset = load_dataset("tianyang/repobench_java_v1.1")
    ```

  • Constructing the prompt for next-line prediction:

    ```python
    import re

    def construct_prompt(
        data: dict,
        language: str = "java",
        tokenizer=None,
        max_token_nums: int = 15800
    ) -> str:
        """
        Construct the prompt for next line prediction.

        :param data: data point from the dataset
        :param language: the language of the code
        :param tokenizer: the tokenizer of the evaluation model
        :param max_token_nums: the maximum number of tokens constraint for the prompt

        :return: the constructed prompt
        """

        # comment symbol for different languages
        comment_symbol = "#" if language == "python" else "//"

        # construct the cross-file prompt and in-file prompt separately
        # cross-file prompt
        cross_file_prompt = f"{comment_symbol} Repo Name: {data['repo_name']}\n"

        for snippet in data['context']:
            cross_file_prompt += f"{comment_symbol} Path: {snippet['path']}\n{snippet['snippet']}" + "\n\n"

        # in-file prompt
        in_file_prompt = f"{comment_symbol} Path: {data['file_path']}\n{data['import_statement']}\n{data['cropped_code']}\n"

        # if a tokenizer and max_token_nums are given, truncate the cross-file prompt to meet the constraint
        if tokenizer is not None and max_token_nums is not None:

            cross_file_prompt_token_nums = len(tokenizer.encode(cross_file_prompt))
            in_file_prompt_token_nums = len(tokenizer.encode(in_file_prompt))

            exceed_token_nums = cross_file_prompt_token_nums + in_file_prompt_token_nums - max_token_nums

            if exceed_token_nums > 0:
                # split the cross-file prompt into lines
                cross_file_prompt_lines = cross_file_prompt.split("\n")
                # drop lines from the end until the number of excess tokens is below 0
                for i in range(len(cross_file_prompt_lines) - 1, -1, -1):
                    exceed_token_nums -= len(tokenizer.encode(cross_file_prompt_lines[i]))
                    if exceed_token_nums < 0:
                        break

                # join the remaining lines back together
                cross_file_prompt = "\n".join(cross_file_prompt_lines[:i]) + "\n\n"

        # combine the cross-file prompt and in-file prompt
        prompt = cross_file_prompt + in_file_prompt

        # normalize runs of empty lines
        prompt = re.sub(r'\n{4,}', '\n\n', prompt)

        return prompt
    ```
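
The truncation step inside `construct_prompt` drops cross-file lines from the end until more tokens have been removed than the prompt exceeds its budget by. The sketch below isolates that step so it can be tried without a real model tokenizer. `WhitespaceTokenizer` and `drop_trailing_lines` are hypothetical names introduced here, and counting one token per whitespace-separated word is only a stand-in assumption for a real tokenizer's `encode`.

```python
from typing import List


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text: str) -> List[str]:
        return text.split()


def drop_trailing_lines(lines: List[str], tokenizer, excess_tokens: int) -> List[str]:
    """Drop lines from the end until the remaining excess is negative.

    The line on which the excess becomes negative is dropped as well; if the
    excess never becomes negative, every line is dropped.
    """
    remaining = excess_tokens
    keep = len(lines)
    for i in range(len(lines) - 1, -1, -1):
        remaining -= len(tokenizer.encode(lines[i]))
        keep = i  # line i is dropped, so only lines before it are kept
        if remaining < 0:
            break
    return lines[:keep]


lines = ["a b c", "d e", "f g h i"]  # 3, 2 and 4 tokens
trimmed = drop_trailing_lines(lines, WhitespaceTokenizer(), excess_tokens=3)
```

Because whole lines are removed and the loop stops only once the excess is strictly negative, the result can end up somewhat below the budget rather than exactly at it.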