RepoFusion/Stack-Repo

Name: RepoFusion/Stack-Repo
Creator: RepoFusion
Published: 2023-07-10 19:43:46
License: no description available

Source: Hugging Face. Updated 2023-07-10. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/RepoFusion/Stack-Repo

Resource description:

---
license: other
---

# Summary of the Dataset

## Description

Stack-Repo is a dataset of 200 Java repositories from GitHub with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts.

- Prompt Proposal (PP) Contexts: These contexts are based on the prompt proposals from the paper [Repository-Level Prompt Generation for Large Language Models of Code](https://arxiv.org/abs/2206.12839).
- BM25 Contexts: These contexts are obtained based on the BM25 similarity scores.
- RandomNN Contexts: These contexts are obtained using the nearest neighbors in the representation space of an embedding model.

For more details, please check our paper [RepoFusion: Training Code Models to Understand Your Repository](https://arxiv.org/abs/2306.10998).

The original Java source files are obtained using a [modified version](https://huggingface.co/datasets/bigcode/the-stack-dedup) of [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

## Data Splits

The dataset consists of three splits: `train`, `validation` and `test`, comprising 100, 50, and 50 repositories, respectively.

## Data Organization

Each split contains a separate folder for each repository. Each repository folder contains all `.java` source code files of the repository in the original directory structure, along with three `.json` files corresponding to the PP, BM25 and RandomNN repo contexts. In terms of the HuggingFace Datasets terminology, we have four subdatasets or configurations.

- `PP_contexts`: Prompt Proposal repo contexts.
- `bm25_contexts`: BM25 repo contexts.
- `randomNN_contexts`: RandomNN repo contexts.
- `sources`: actual Java (`.java`) source code files.

# Dataset Usage

To clone the dataset locally:

```
git clone https://huggingface.co/datasets/RepoFusion/Stack-Repo <local_path>
```

To load the desired configuration and split of the dataset:

```python
import datasets

ds = datasets.load_dataset(
    "RepoFusion/Stack-Repo",
    name="<configuration_name>",
    split="<split_name>",
    data_dir="<local_path>"
)
```

NOTE: The configurations for the repo contexts `bm25_contexts`, `PP_contexts` and `randomNN_contexts` can be loaded directly by specifying the corresponding `<configuration_name>` along with the `<split_name>` in the `load_dataset` command listed above, without cloning the repo locally. For `sources`, if the repo has not been cloned beforehand or `data_dir` is not specified, a `ManualDownloadError` will be raised.

## Data Format

The expected data format of the `.json` files is a list of target holes and corresponding repo contexts. Each entry in the `.json` file corresponds to a target hole and consists of the location of the target hole, the target hole as a string, the surrounding context as a string, and a list of repo contexts as strings. Specifically, each row is a dictionary containing

- `id`: hole_id (location of the target hole)
- `question`: surrounding context
- `target`: target hole
- `ctxs`: a list of repo contexts where each item is a dictionary containing
  - `title`: name of the repo context
  - `text`: content of the repo context

The actual Java sources can be accessed directly via the file system. The path format is `[<data_set_root>/data/<split_name>/<github_user>/<repo_name>/<path/to/every/java/file/in/the/repo>.java]`.

When accessed through `datasets.load_dataset`, the data fields for the `sources` can be specified as below.

```python
features = datasets.Features({
    'file': datasets.Value('string'),
    'content': datasets.Value('string')
})
```

When accessed through `datasets.load_dataset`, the data fields for the repo contexts can be specified as below.

```python
features = datasets.Features({
    'id': datasets.Value('string'),
    'hole_file': datasets.Value('string'),
    'hole_line': datasets.Value('int32'),
    'hole_pos': datasets.Value('int32'),
    'question': datasets.Value('string'),
    'target': datasets.Value('string'),
    'answers': datasets.Sequence(
        datasets.Value('string')
    ),
    'ctxs': [{
        'title': datasets.Value('string'),
        'text': datasets.Value('string'),
        'score': datasets.Value('float64')
    }]
})
```

# Additional Information

## Dataset Curators

- Disha Shrivastava, dishu.905@gmail.com
- Denis Kocetkov, denis.kocetkov@servicenow.com

## Licensing Information

Stack-Repo is derived from a modified version of The Stack. The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant. We facilitate this by providing provenance information for each data point.

The list of [SPDX license identifiers](https://spdx.org/licenses/) included in the dataset can be found [here](https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json).

## Citation

```
@article{shrivastava2023repofusion,
  title={RepoFusion: Training Code Models to Understand Your Repository},
  author={Shrivastava, Disha and Kocetkov, Denis and de Vries, Harm and Bahdanau, Dzmitry and Scholak, Torsten},
  journal={arXiv preprint arXiv:2306.10998},
  year={2023}
}
```

Provider:

RepoFusion

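Summary of original information

# Dataset Overview

## Description

Stack-Repo is a dataset of 200 Java repositories from GitHub with permissive licenses and near-deduplicated files. The dataset is augmented with three types of repository contexts:

- Prompt Proposal (PP) Contexts: based on the prompt proposals from the paper [Repository-Level Prompt Generation for Large Language Models of Code](https://arxiv.org/abs/2206.12839).
- BM25 Contexts: contexts obtained based on BM25 similarity scores.
- RandomNN Contexts: contexts obtained using the nearest neighbors in the representation space of an embedding model.

For more details, see the paper [RepoFusion: Training Code Models to Understand Your Repository](https://arxiv.org/abs/2306.10998).

The original Java source files are obtained from a modified version of The Stack.

## Data Splits

The dataset contains three splits: `train`, `validation` and `test`, with 100, 50, and 50 repositories, respectively.

## Data Organization

Each split contains a separate folder for each repository. Each repository folder contains all `.java` source code files along with three `.json` files corresponding to the PP, BM25 and RandomNN repo contexts. In HuggingFace Datasets terminology, there are four subdatasets or configurations:

- `PP_contexts`: Prompt Proposal repo contexts.
- `bm25_contexts`: BM25 repo contexts.
- `randomNN_contexts`: RandomNN repo contexts.
- `sources`: the actual Java (`.java`) source code files.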
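The four configurations and three splits above map onto the `name` and `split` arguments of `datasets.load_dataset`. The following is a minimal sketch, not part of the original dataset card: `build_load_kwargs` and `load_stack_repo` are hypothetical helper names, and the early `ValueError` for `sources` without `data_dir` is an assumption that mirrors the README note about `ManualDownloadError`.

```python
from typing import Optional

# Configuration and split names as listed in the dataset card.
CONTEXT_CONFIGS = {"PP_contexts", "bm25_contexts", "randomNN_contexts"}
ALL_CONFIGS = CONTEXT_CONFIGS | {"sources"}
SPLITS = {"train", "validation", "test"}


def build_load_kwargs(config: str, split: str, data_dir: Optional[str] = None) -> dict:
    """Assemble keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if config not in ALL_CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if config == "sources" and data_dir is None:
        # Assumption: fail early here, since the README says `sources`
        # needs a local clone passed via data_dir.
        raise ValueError("`sources` requires data_dir pointing at a local clone")
    kwargs = {"path": "RepoFusion/Stack-Repo", "name": config, "split": split}
    if data_dir is not None:
        kwargs["data_dir"] = data_dir
    return kwargs


def load_stack_repo(config: str, split: str, data_dir: Optional[str] = None):
    """Load one configuration and split; needs the `datasets` package and network access."""
    import datasets  # imported lazily so build_load_kwargs works without it

    return datasets.load_dataset(**build_load_kwargs(config, split, data_dir))
```

For example, `load_stack_repo("bm25_contexts", "validation")` would request the BM25 contexts of the validation split without a local clone, while `load_stack_repo("sources", "train", "<local_path>")` would read the Java files from a clone.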

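## Data Format

The expected format of each `.json` file is a list of target holes with their repo contexts. Each entry corresponds to one target hole and contains the location of the hole, the hole as a string, the surrounding context, and a list of repo contexts. Specifically, each row is a dictionary containing:

- `id`: hole ID (location of the target hole)
- `question`: surrounding context
- `target`: target hole
- `ctxs`: a list of repo contexts, where each item is a dictionary containing
  - `title`: name of the repo context
  - `text`: content of the repo context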
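The row structure listed above can be read with the standard `json` module. This is a minimal sketch under stated assumptions: `SAMPLE_JSON` is a made-up entry shaped like the listed fields (the `id` value in particular is hypothetical and does not reflect the real hole ID format), and `summarize_entries` is an illustrative helper name.

```python
import json

# Hypothetical sample shaped like the fields listed above; not real dataset content.
SAMPLE_JSON = """
[
  {
    "id": "hypothetical-hole-id",
    "question": "int total = ",
    "target": "a + b;",
    "ctxs": [
      {"title": "context one", "text": "int a = 1;"},
      {"title": "context two", "text": "int b = 2;"}
    ]
  }
]
"""


def summarize_entries(raw: str) -> list:
    """Parse a list of target-hole entries and return a compact summary of each."""
    entries = json.loads(raw)
    summaries = []
    for entry in entries:
        summaries.append({
            "id": entry["id"],
            "target": entry["target"],
            "question_chars": len(entry["question"]),
            "ctx_titles": [ctx["title"] for ctx in entry["ctxs"]],
        })
    return summaries


for summary in summarize_entries(SAMPLE_JSON):
    print(summary["id"], summary["target"], summary["ctx_titles"])
```

The same function could be applied to the text of one of the per-repository `.json` files, assuming the file follows the layout described above.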

The actual Java source files can be accessed directly through the file system. The path format is `[<data_set_root>/data/<split_name>/<github_user>/<repo_name>/<path/to/every/java/file/in/the/repo>.java]`. When accessed through `datasets.load_dataset`, the data fields for `sources` can be specified as follows:

```python
features = datasets.Features({
    'file': datasets.Value('string'),
    'content': datasets.Value('string')
})
```
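For a local clone, the directory layout above can also be walked directly. The sketch below assumes that layout and yields records with the same two field names as the `sources` features. `iter_java_sources` is a hypothetical helper, and storing the path relative to the split folder in `file` is an assumption; the dataset loader may store a different path form.

```python
from pathlib import Path
from typing import Dict, Iterator


def iter_java_sources(data_set_root: str, split_name: str) -> Iterator[Dict[str, str]]:
    """Yield {'file', 'content'} records for every .java file of one split."""
    split_dir = Path(data_set_root) / "data" / split_name
    if not split_dir.is_dir():
        return  # nothing to yield for a missing split folder
    for java_path in sorted(split_dir.rglob("*.java")):
        if java_path.is_file():
            yield {
                # Assumption: path relative to the split folder,
                # i.e. <github_user>/<repo_name>/<path>.java
                "file": java_path.relative_to(split_dir).as_posix(),
                "content": java_path.read_text(encoding="utf-8", errors="replace"),
            }
```

Calling `list(iter_java_sources("<local_path>", "train"))` would collect every Java file of the training repositories into memory, so iterating lazily is likely preferable for larger splits.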

When accessed through `datasets.load_dataset`, the data fields for the repo contexts can be specified as follows:

```python
features = datasets.Features({
    'id': datasets.Value('string'),
    'hole_file': datasets.Value('string'),
    'hole_line': datasets.Value('int32'),
    'hole_pos': datasets.Value('int32'),
    'question': datasets.Value('string'),
    'target': datasets.Value('string'),
    'answers': datasets.Sequence(
        datasets.Value('string')
    ),
    'ctxs': [{
        'title': datasets.Value('string'),
        'text': datasets.Value('string'),
        'score': datasets.Value('float64')
    }]
})
```
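Since each repo context carries a `score`, one possible way to use a row is to keep only the highest-scoring contexts. The sketch below works on a plain dictionary with the fields listed above. `top_k_contexts` and `hole_location` are hypothetical helpers, the sample row is invented, and treating a higher `score` as more relevant is an assumption that the dataset card does not state.

```python
from typing import Any, Dict, List, Tuple


def top_k_contexts(row: Dict[str, Any], k: int) -> List[Dict[str, Any]]:
    """Return the k repo contexts with the highest `score` (ties keep input order)."""
    # Assumption: a larger score means a more relevant context.
    ranked = sorted(row["ctxs"], key=lambda ctx: ctx["score"], reverse=True)
    return ranked[:k]


def hole_location(row: Dict[str, Any]) -> Tuple[str, int, int]:
    """Return the (hole_file, hole_line, hole_pos) triple of a row."""
    return row["hole_file"], row["hole_line"], row["hole_pos"]


# Invented row for illustration only; values do not come from the dataset.
sample_row = {
    "hole_file": "hypothetical/path/Foo.java",
    "hole_line": 12,
    "hole_pos": 8,
    "ctxs": [
        {"title": "a", "text": "first", "score": 0.2},
        {"title": "b", "text": "second", "score": 0.9},
        {"title": "c", "text": "third", "score": 0.5},
    ],
}

best = top_k_contexts(sample_row, k=2)
print([ctx["title"] for ctx in best], hole_location(sample_row))
```

Whether `hole_line` and `hole_pos` are zero-based or one-based is not specified in the source, so any code that slices a source file at the hole should verify this against a known example first.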

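## Dataset Curators

- Disha Shrivastava, dishu.905@gmail.com
- Denis Kocetkov, denis.kocetkov@servicenow.com

## Licensing Information

Stack-Repo is derived from a modified version of The Stack. The Stack is a collection of source code from repositories with various licenses. Any use of the code in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant. This is facilitated by providing provenance information for each data point.

The list of SPDX license identifiers included in the dataset can be found [here](https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json).

## Citation

```
@article{shrivastava2023repofusion,
  title={RepoFusion: Training Code Models to Understand Your Repository},
  author={Shrivastava, Disha and Kocetkov, Denis and de Vries, Harm and Bahdanau, Dzmitry and Scholak, Torsten},
  journal={arXiv preprint arXiv:2306.10998},
  year={2023}
}
```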
