JetBrains-Research/lca-codegen-huge

Name: JetBrains-Research/lca-codegen-huge
Creator: JetBrains-Research
Published: 2024-05-30 15:41:31
License: not specified

Source: Hugging Face. Updated 2024-05-30. Indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/JetBrains-Research/lca-codegen-huge


Description:

---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: commit_hash
    dtype: string
  - name: completion_file
    struct:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines
    struct:
    - name: infile
      sequence: int32
    - name: inproject
      sequence: int32
    - name: common
      sequence: int32
    - name: commited
      sequence: int32
    - name: non_informative
      sequence: int32
    - name: random
      sequence: int32
  - name: repo_snapshot
    sequence:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines_raw
    struct:
    - name: commited
      sequence: int64
    - name: common
      sequence: int64
    - name: infile
      sequence: int64
    - name: inproject
      sequence: int64
    - name: non_informative
      sequence: int64
    - name: other
      sequence: int64
  splits:
  - name: test
    num_bytes: 5220255729
    num_examples: 296
  download_size: 1810961403
  dataset_size: 5220255729
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---

# LCA Project Level Code Completion

## How to load the dataset

```
from datasets import load_dataset

ds = load_dataset('JetBrains-Research/lca-codegen-huge', split='test')
```

## Data Point Structure

* `repo` – repository name in the format `{GitHub_user_name}__{repository_name}`
* `commit_hash` – commit hash
* `completion_file` – dictionary with the completion file content in the following format:
  * `filename` – path to the completion file
  * `content` – content of the completion file
* `completion_lines` – dictionary where keys are classes of lines and values are lists of integers (numbers of the lines to complete). The classes are:
  * `committed` – line contains at least one function or class that was declared in the committed files from `commit_hash` (the schema above spells this key `commited`)
  * `inproject` – line contains at least one function or class that was declared in the project (excluding previous)
  * `infile` – line contains at least one function or class that was declared in the completion file (excluding previous)
  * `common` – line contains at least one function or class that was classified as common, e.g., `main`, `get`, etc. (excluding previous)
  * `non_informative` – line that was classified as non-informative, e.g., too short, contains comments, etc.
  * `random` – randomly sampled from the rest of the lines
* `repo_snapshot` – dictionary with a snapshot of the repository before the commit. It has the same structure as `completion_file`, but filenames and contents are organized as lists.
* `completion_lines_raw` – the same as `completion_lines`, but before sampling.

## How we collected the data

To collect the data, we cloned repositories from GitHub where the main language is Python.
The completion file for each data point is a `.py` file that was added to the repository in a commit.
The state of the repository before this commit is the repo snapshot.

The huge dataset is defined by the number of characters in `.py` files from the repository snapshot. This number is larger than 768K.

## Dataset Stats

* Number of datapoints: 296
* Number of repositories: 75
* Number of commits: 252

### Completion File

* Number of lines, median: 313.5
* Number of lines, min: 200
* Number of lines, max: 1877

### Repository Snapshot

* `.py` files: median 261, from 47 to 5227
* non `.py` files: median 262, from 24 to 7687
* `.py` lines: median 49811
* non `.py` lines: median 60163

### Line Counts:

* infile: 2608
* inproject: 2901
* common: 692
* committed: 1019
* non-informative: 1164
* random: 1426
* **total**: 9810

## Scores

[HF Space](https://huggingface.co/spaces/JetBrains-Research/long-code-arena)

Provider:

JetBrains-Research

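Summary of the original information

LCA Project Level Code Completion: dataset overview

Dataset features

repo: repository name in the format {GitHub_user_name}__{repository_name}; data type string.
commit_hash: commit hash; data type string.
completion_file: dictionary with the completion file content, structured as follows:
- filename: path to the completion file; data type string.
- content: content of the completion file; data type string.
completion_lines: dictionary keyed by line class, whose values are lists of integers (line numbers). The classes are:
- committed: the line contains at least one function or class declared in the committed files.
- inproject: the line contains at least one function or class declared in the project (excluding previous).
- infile: the line contains at least one function or class declared in the completion file (excluding previous).
- common: the line contains at least one function or class classified as common, e.g. main, get, etc. (excluding previous).
- non_informative: the line was classified as non-informative, e.g. too short, contains comments, etc.
- random: randomly sampled from the rest of the lines.
repo_snapshot: dictionary with a snapshot of the repository before the commit; same structure as completion_file, but filenames and contents are organized as lists.
completion_lines_raw: the same as completion_lines, but before sampling.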
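
The field layout above can be turned into line-completion tasks roughly as follows. This is a minimal sketch, not the authors' evaluation code: the helper `iter_completion_targets` and the toy datapoint are hypothetical, and the sketch assumes that line numbers are 0-based indices into the file content split on newlines, which the dataset card does not state.

```python
from typing import Dict, Iterator, List, Tuple


def iter_completion_targets(datapoint: Dict) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (category, line_number, prefix, target_line) for each line to complete.

    ASSUMPTION: line numbers are 0-based indices into the content split on "\n".
    """
    lines: List[str] = datapoint["completion_file"]["content"].split("\n")
    for category, line_numbers in datapoint["completion_lines"].items():
        for n in line_numbers:
            prefix = "\n".join(lines[:n])  # everything above the target line
            yield category, n, prefix, lines[n]


# Hand-made toy datapoint with the documented structure (not real dataset content).
toy = {
    "repo": "user__project",
    "commit_hash": "0000000",
    "completion_file": {
        "filename": "pkg/module.py",
        "content": "import os\n\ndef get():\n    return os.getcwd()\n",
    },
    "completion_lines": {"infile": [], "common": [3], "random": [0]},
}

targets = list(iter_completion_targets(toy))
for category, n, prefix, target in targets:
    print(f"{category:>8} line {n}: {target!r}")
```

With a real datapoint, `toy` would be replaced by one element of the loaded `test` split; the repository snapshot could then be prepended to `prefix` as additional context.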

Dataset splits

test: test split, 5220255729 bytes, 296 examples.

Dataset size

Download size: 1810961403 bytes
Dataset size: 5220255729 bytes
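
To put these byte counts in perspective, the following sketch converts them to GiB and derives the ratio between the two sizes and the average size per example. The variable names are illustrative; the three input numbers are taken from the figures above.

```python
DOWNLOAD_BYTES = 1_810_961_403   # download size from the dataset metadata
DATASET_BYTES = 5_220_255_729    # dataset size once loaded
NUM_EXAMPLES = 296               # examples in the test split

GIB = 2 ** 30

download_gib = DOWNLOAD_BYTES / GIB
dataset_gib = DATASET_BYTES / GIB
ratio = DATASET_BYTES / DOWNLOAD_BYTES           # how much the data grows after download
bytes_per_example = DATASET_BYTES / NUM_EXAMPLES

print(f"download: {download_gib:.2f} GiB, dataset: {dataset_gib:.2f} GiB")
print(f"size ratio: {ratio:.2f}x, average example: {bytes_per_example / 1e6:.1f} MB")
```

Because each example carries a full repository snapshot, individual datapoints are large; processing them one at a time rather than materializing all fields at once may be preferable on memory-constrained machines.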

Dataset configuration

default: default configuration; data file path data/test-*.

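Data collection method

Repositories whose main language is Python were cloned from GitHub.
The completion file for each data point is a .py file that was added to the repository in a commit.
The state of the repository before that commit is the repository snapshot.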
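
The dataset card additionally defines the "huge" subset by the number of characters in the `.py` files of the repository snapshot, which must exceed 768K. A sketch of that size check is given here under stated assumptions: the function names are hypothetical, "768K" is read as 768,000 characters (the card does not say whether K means 1000 or 1024), and the snapshot is assumed to be a dictionary of parallel `filename` and `content` lists as described in the feature list.

```python
from typing import Dict, List

# ASSUMPTION: "768K" is interpreted as 768 * 1000 characters.
HUGE_MIN_CHARS = 768_000


def py_char_count(repo_snapshot: Dict[str, List[str]]) -> int:
    """Total number of characters across all .py files in a repository snapshot."""
    return sum(
        len(content)
        for filename, content in zip(repo_snapshot["filename"], repo_snapshot["content"])
        if filename.endswith(".py")
    )


def is_huge(repo_snapshot: Dict[str, List[str]]) -> bool:
    """True if the snapshot's .py content exceeds the assumed threshold."""
    return py_char_count(repo_snapshot) > HUGE_MIN_CHARS


# Toy snapshot (not real dataset content): only the two .py files are counted.
toy_snapshot = {
    "filename": ["a.py", "README.md", "pkg/b.py"],
    "content": ["x = 1\n", "# docs\n", "y = 2\n" * 3],
}
print(py_char_count(toy_snapshot), is_huge(toy_snapshot))
```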

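Dataset statistics

Number of datapoints: 296
Number of repositories: 75
Number of commits: 252

Completion file statistics

Number of lines, median: 313.5
Number of lines, min: 200
Number of lines, max: 1877

Repository snapshot statistics

.py files: median 261, from 47 to 5227
non .py files: median 262, from 24 to 7687
.py lines: median 49811
non .py lines: median 60163

Line counts

infile: 2608
inproject: 2901
common: 692
committed: 1019
non-informative: 1164
random: 1426
total: 9810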
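
As a quick consistency check, the per-class counts can be summed and compared with the reported total. The snippet below is only illustrative arithmetic on the numbers listed above; the dictionary and variable names are not part of the dataset.

```python
# Per-class line counts as reported in the statistics.
line_counts = {
    "infile": 2608,
    "inproject": 2901,
    "common": 692,
    "committed": 1019,
    "non_informative": 1164,
    "random": 1426,
}
REPORTED_TOTAL = 9810
NUM_DATAPOINTS = 296

total = sum(line_counts.values())
shares = {name: round(100 * count / total, 1) for name, count in line_counts.items()}
lines_per_datapoint = total / NUM_DATAPOINTS

print("sum matches reported total:", total == REPORTED_TOTAL)
for name, share in shares.items():
    print(f"{name:>16}: {share}%")
print(f"average lines to complete per datapoint: {lines_per_datapoint:.1f}")
```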
