JetBrains-Research/lca-codegen-huge
Hugging Face · updated 2024-05-30 · indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/JetBrains-Research/lca-codegen-huge
Resource description:
---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: commit_hash
    dtype: string
  - name: completion_file
    struct:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines
    struct:
    - name: infile
      sequence: int32
    - name: inproject
      sequence: int32
    - name: common
      sequence: int32
    - name: commited
      sequence: int32
    - name: non_informative
      sequence: int32
    - name: random
      sequence: int32
  - name: repo_snapshot
    sequence:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines_raw
    struct:
    - name: commited
      sequence: int64
    - name: common
      sequence: int64
    - name: infile
      sequence: int64
    - name: inproject
      sequence: int64
    - name: non_informative
      sequence: int64
    - name: other
      sequence: int64
  splits:
  - name: test
    num_bytes: 5220255729
    num_examples: 296
  download_size: 1810961403
  dataset_size: 5220255729
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
# LCA Project Level Code Completion
## How to load the dataset
```
from datasets import load_dataset
ds = load_dataset('JetBrains-Research/lca-codegen-huge', split='test')
```
## Data Point Structure
* `repo` – repository name in the format `{GitHub_user_name}__{repository_name}`
* `commit_hash` – commit hash
* `completion_file` – dictionary with the completion file content in the following format:
    * `filename` – path to the completion file
    * `content` – content of the completion file
* `completion_lines` – dictionary whose keys are classes of lines and whose values are lists of integers (numbers of the lines to complete). The classes are:
    * `committed` – line contains at least one function or class that was declared in the files committed in `commit_hash` (the key is spelled `commited` in the dataset schema)
    * `inproject` – line contains at least one function or class that was declared in the project (excluding lines already assigned to the previous class)
    * `infile` – line contains at least one function or class that was declared in the completion file (excluding lines already assigned to the previous classes)
    * `common` – line contains at least one function or class that was classified as common, e.g., `main`, `get`, etc. (excluding lines already assigned to the previous classes)
    * `non_informative` – line that was classified as non-informative, e.g., too short, contains comments, etc.
    * `random` – lines randomly sampled from the rest of the lines
* `repo_snapshot` – dictionary with a snapshot of the repository before the commit. It has the same structure as `completion_file`, but filenames and contents are organized as lists.
* `completion_lines_raw` – the same as `completion_lines`, but before sampling. In the dataset schema its keys include `other` in place of `random`.
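The structure above can be explored with a couple of small helpers. This is a minimal sketch on a hand-written toy data point, not on real dataset rows. The helper names `lines_to_complete` and `snapshot_files` are hypothetical, and the `zero_based` flag is an assumption: the card does not say whether line numbers start at 0 or 1, so check this against a real row before relying on it.

```python
from typing import List, Tuple


def lines_to_complete(datapoint: dict, category: str,
                      zero_based: bool = True) -> List[Tuple[int, str]]:
    """Return (line_number, line_text) pairs for one class of completion lines.

    `zero_based` is an assumption about how line numbers are stored.
    """
    file_lines = datapoint["completion_file"]["content"].split("\n")
    offset = 0 if zero_based else 1
    return [(n, file_lines[n - offset])
            for n in datapoint["completion_lines"][category]]


def snapshot_files(datapoint: dict) -> List[Tuple[str, str]]:
    """Pair up the parallel `filename` and `content` lists of the snapshot."""
    snapshot = datapoint["repo_snapshot"]
    return list(zip(snapshot["filename"], snapshot["content"]))


# Toy data point with the same field layout as the card describes (values invented).
toy = {
    "repo": "user__project",
    "commit_hash": "0000000",
    "completion_file": {
        "filename": "pkg/new_module.py",
        "content": "import os\n\ndef main():\n    return os.getcwd()\n",
    },
    "completion_lines": {
        "infile": [], "inproject": [], "common": [2],
        "commited": [], "non_informative": [], "random": [3],
    },
    "repo_snapshot": {
        "filename": ["pkg/__init__.py", "README.md"],
        "content": ["", "# project\n"],
    },
}

print(lines_to_complete(toy, "common"))
print(snapshot_files(toy))
```

Note that the toy dictionary uses the key `commited`, matching the spelling in the dataset schema.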
## How we collected the data
To collect the data, we cloned repositories from GitHub where the main language is Python.
The completion file for each data point is a `.py` file that was added to the repository in a commit.
The state of the repository before this commit is the repo snapshot.
The huge dataset is defined by the total number of characters in the `.py` files of the repository snapshot: this number is larger than 768K.
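The size criterion can be sketched as a simple check over a `repo_snapshot` dictionary. The names `HUGE_MIN_CHARS`, `py_char_count` and `is_huge` are hypothetical, and reading "768K" as 768,000 characters is an assumption (it could also mean 768 × 1024); the authors' actual filtering code is not shown in the card.

```python
# Assumption: "768K" is read as 768,000 characters.
HUGE_MIN_CHARS = 768_000


def py_char_count(repo_snapshot: dict) -> int:
    """Total number of characters across the `.py` files of a snapshot."""
    return sum(
        len(content)
        for filename, content in zip(repo_snapshot["filename"],
                                     repo_snapshot["content"])
        if filename.endswith(".py")
    )


def is_huge(repo_snapshot: dict, threshold: int = HUGE_MIN_CHARS) -> bool:
    """True when the `.py` character count exceeds the threshold."""
    return py_char_count(repo_snapshot) > threshold


# Toy snapshots (invented contents); non-`.py` files do not count.
big_snapshot = {
    "filename": ["big.py", "small.py", "notes.txt"],
    "content": ["x" * 800_000, "y = 1\n", "z" * 1_000_000],
}
small_snapshot = {
    "filename": ["small.py", "notes.txt"],
    "content": ["y = 1\n", "z" * 1_000_000],
}

print(py_char_count(big_snapshot), is_huge(big_snapshot))
print(py_char_count(small_snapshot), is_huge(small_snapshot))
```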
## Dataset Stats
* Number of datapoints: 296
* Number of repositories: 75
* Number of commits: 252
### Completion File
* Number of lines, median: 313.5
* Number of lines, min: 200
* Number of lines, max: 1877
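Statistics of this kind could be recomputed along the following lines. This is a sketch on toy data: the function name `completion_file_line_stats` is hypothetical, and counting lines with `str.splitlines()` is an assumption about how the authors counted them. The toy sizes are chosen only to show how a fractional median arises with an even number of files.

```python
from statistics import median


def completion_file_line_stats(datapoints) -> dict:
    """Median, min and max number of lines over the completion files."""
    counts = [len(dp["completion_file"]["content"].splitlines())
              for dp in datapoints]
    return {"median": median(counts), "min": min(counts), "max": max(counts)}


# Four toy completion files with invented line counts.
toy_datapoints = [
    {"completion_file": {"filename": f"file_{n}.py", "content": "x = 1\n" * n}}
    for n in (200, 300, 327, 1877)
]

stats = completion_file_line_stats(toy_datapoints)
print(stats)
```

With an even number of files, `statistics.median` averages the two middle values, which is why a value such as 313.5 is possible.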
### Repository Snapshot
* `.py` files: <u>median 261</u>, from 47 to 5227
* non `.py` files: <u>median 262</u>, from 24 to 7687
* `.py` lines: <u>median 49811</u>
* non `.py` lines: <u>median 60163</u>
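The per-snapshot quantities behind these medians could be computed roughly as follows. The function name `snapshot_counts` is hypothetical, and both the `.py` suffix test and line counting via `str.splitlines()` are assumptions, not the authors' documented method.

```python
def snapshot_counts(repo_snapshot: dict) -> dict:
    """Count files and lines in a snapshot, split into `.py` and non-`.py`."""
    counts = {"py_files": 0, "non_py_files": 0, "py_lines": 0, "non_py_lines": 0}
    for filename, content in zip(repo_snapshot["filename"],
                                 repo_snapshot["content"]):
        kind = "py" if filename.endswith(".py") else "non_py"
        counts[f"{kind}_files"] += 1
        counts[f"{kind}_lines"] += len(content.splitlines())
    return counts


# Toy snapshot (invented contents).
toy_snapshot = {
    "filename": ["a.py", "b.py", "README.md"],
    "content": ["x = 1\ny = 2\n", "", "# title\n\ntext\n"],
}

print(snapshot_counts(toy_snapshot))
```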
### Line Counts:
* infile: 2608
* inproject: 2901
* common: 692
* committed: 1019
* non-informative: 1164
* random: 1426
* **total**: 9810
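The reported total is the sum of the six per-class counts, which can be checked directly. The second half of the sketch shows how such counts could be tallied from data points; `count_completion_lines` is a hypothetical helper and the toy rows are invented.

```python
from collections import Counter

# Per-class counts as listed above.
reported = {
    "infile": 2608, "inproject": 2901, "common": 692,
    "committed": 1019, "non-informative": 1164, "random": 1426,
}
reported_total = sum(reported.values())
print(reported_total)


def count_completion_lines(datapoints) -> Counter:
    """Tally how many lines to complete each class has across data points."""
    totals = Counter()
    for dp in datapoints:
        for category, line_numbers in dp["completion_lines"].items():
            totals[category] += len(line_numbers)
    return totals


# Two toy data points (invented line numbers).
toy_rows = [
    {"completion_lines": {"infile": [1, 5], "random": [7]}},
    {"completion_lines": {"infile": [3], "random": []}},
]
toy_totals = count_completion_lines(toy_rows)
print(dict(toy_totals))
```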
## Scores
[HF Space](https://huggingface.co/spaces/JetBrains-Research/long-code-arena)
Provided by:
JetBrains-Research



