
JetBrains-Research/lca-codegen-huge

Source: Hugging Face. Updated 2024-05-30; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/JetBrains-Research/lca-codegen-huge
Resource overview:
---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: commit_hash
    dtype: string
  - name: completion_file
    struct:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines
    struct:
    - name: infile
      sequence: int32
    - name: inproject
      sequence: int32
    - name: common
      sequence: int32
    - name: commited
      sequence: int32
    - name: non_informative
      sequence: int32
    - name: random
      sequence: int32
  - name: repo_snapshot
    sequence:
    - name: filename
      dtype: string
    - name: content
      dtype: string
  - name: completion_lines_raw
    struct:
    - name: commited
      sequence: int64
    - name: common
      sequence: int64
    - name: infile
      sequence: int64
    - name: inproject
      sequence: int64
    - name: non_informative
      sequence: int64
    - name: other
      sequence: int64
  splits:
  - name: test
    num_bytes: 5220255729
    num_examples: 296
  download_size: 1810961403
  dataset_size: 5220255729
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---

# LCA Project Level Code Completion

## How to load the dataset

```
from datasets import load_dataset

ds = load_dataset('JetBrains-Research/lca-codegen-huge', split='test')
```

## Data Point Structure

* `repo` – repository name in the format `{GitHub_user_name}__{repository_name}`
* `commit_hash` – commit hash
* `completion_file` – dictionary with the completion file content in the following format:
    * `filename` – filepath to the completion file
    * `content` – content of the completion file
* `completion_lines` – dictionary where keys are classes of lines and values are lists of integers (numbers of the lines to complete). The classes are:
    * `committed` – line contains at least one function or class that was declared in the committed files from `commit_hash` (note that the schema above spells this key `commited`)
    * `inproject` – line contains at least one function or class that was declared in the project (excluding previous)
    * `infile` – line contains at least one function or class that was declared in the completion file (excluding previous)
    * `common` – line contains at least one function or class that was classified as common, e.g., `main`, `get`, etc. (excluding previous)
    * `non_informative` – line that was classified as non-informative, e.g., too short, contains comments, etc.
    * `random` – randomly sampled from the rest of the lines
* `repo_snapshot` – dictionary with a snapshot of the repository before the commit. It has the same structure as `completion_file`, but filenames and contents are organized as lists.
* `completion_lines_raw` – the same as `completion_lines`, but before sampling.

## How we collected the data

To collect the data, we cloned repositories from GitHub where the main language is Python.
The completion file for each data point is a `.py` file that was added to the repository in a commit.
The state of the repository before this commit is the repo snapshot.

The huge dataset is defined by the number of characters in `.py` files from the repository snapshot: this number is larger than 768K.

## Dataset Stats

* Number of datapoints: 296
* Number of repositories: 75
* Number of commits: 252

### Completion File

* Number of lines, median: 313.5
* Number of lines, min: 200
* Number of lines, max: 1877

### Repository Snapshot

* `.py` files: <u>median 261</u>, from 47 to 5227
* non `.py` files: <u>median 262</u>, from 24 to 7687
* `.py` lines: <u>median 49811</u>
* non `.py` lines: <u>median 60163</u>

### Line Counts:

* infile: 2608
* inproject: 2901
* common: 692
* committed: 1019
* non-informative: 1164
* random: 1426
* **total**: 9810

## Scores

[HF Space](https://huggingface.co/spaces/JetBrains-Research/long-code-arena)
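The data point structure described above suggests how a single-line completion task could be assembled from one record. The following is a minimal sketch, not the benchmark's official evaluation code. It assumes that the integers in `completion_lines` are 0-based indices into `content.split("\n")`, which the card does not state, and it uses a hand-made `toy_datapoint` (hypothetical values) instead of downloading the real dataset. The helper name `iter_completion_tasks` is also an illustration only.

```python
from typing import Dict, Iterator, List, Tuple


def iter_completion_tasks(datapoint: Dict) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (category, line_index, prefix, target) for each line to complete.

    Assumption: line numbers are 0-based indices into content.split("\n").
    """
    lines: List[str] = datapoint["completion_file"]["content"].split("\n")
    for category, line_numbers in datapoint["completion_lines"].items():
        for idx in line_numbers:
            prefix = "\n".join(lines[:idx])  # everything above the target line
            target = lines[idx]              # the line a model should produce
            yield category, idx, prefix, target


# A tiny hand-made record that mimics the documented structure (hypothetical values).
toy_datapoint = {
    "repo": "some_user__some_repo",
    "commit_hash": "0" * 40,
    "completion_file": {
        "filename": "pkg/module.py",
        "content": "import os\n\ndef main():\n    print(os.getcwd())\n",
    },
    "completion_lines": {"infile": [], "common": [2], "random": [3]},
}

tasks = list(iter_completion_tasks(toy_datapoint))
for category, idx, prefix, target in tasks:
    print(f"{category:>8} line {idx}: {target!r}")
```

With a real record, `ds[0]` from the loading snippet could be passed in place of `toy_datapoint`; if the line numbers turn out to be 1-based, the indexing would need to be shifted by one.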
提供机构:
JetBrains-Research
Summary of original information

LCA Project Level Code Completion dataset overview

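Dataset features

  • repo: repository name in the format {GitHub_user_name}__{repository_name}; string.
  • commit_hash: commit hash; string.
  • completion_file: dictionary with the completion file content, structured as follows:
    • filename: path to the completion file; string.
    • content: content of the completion file; string.
  • completion_lines: dictionary keyed by line class, whose values are lists of integers (line numbers). The classes are:
    • committed: line contains at least one function or class declared in the committed files.
    • inproject: line contains at least one function or class declared in the project (excluding previous).
    • infile: line contains at least one function or class declared in the completion file (excluding previous).
    • common: line contains at least one function or class classified as common, e.g. main, get, etc. (excluding previous).
    • non_informative: line classified as non-informative, e.g. too short, contains comments, etc.
    • random: randomly sampled from the rest of the lines.
  • repo_snapshot: dictionary with a snapshot of the repository before the commit; same structure as completion_file, but filenames and contents are organized as lists.
  • completion_lines_raw: the same as completion_lines, but before sampling.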
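Because `repo_snapshot` stores filenames and contents as two parallel lists, it is often convenient to pair them up before use. The sketch below does that and then counts characters in `.py` files, the quantity the dataset card uses to define the "huge" subset. It is an illustration under stated assumptions: the helper names are hypothetical, `toy_snapshot` is made-up data, and "768K" is read as 768,000 characters because the card does not say whether K means 1,000 or 1,024.

```python
from typing import Dict, List

# Assumption: "768K" is interpreted as 768,000 characters; the source does not
# specify whether K means 1,000 or 1,024.
HUGE_MIN_PY_CHARS = 768_000


def snapshot_to_mapping(repo_snapshot: Dict[str, List[str]]) -> Dict[str, str]:
    """Pair the parallel filename/content lists into a {filename: content} dict."""
    return dict(zip(repo_snapshot["filename"], repo_snapshot["content"]))


def python_char_count(repo_snapshot: Dict[str, List[str]]) -> int:
    """Total number of characters across all .py files in the snapshot."""
    files = snapshot_to_mapping(repo_snapshot)
    return sum(len(text) for name, text in files.items() if name.endswith(".py"))


def is_huge(repo_snapshot: Dict[str, List[str]],
            threshold: int = HUGE_MIN_PY_CHARS) -> bool:
    """True if the snapshot exceeds the (assumed) character threshold."""
    return python_char_count(repo_snapshot) > threshold


# Made-up snapshot with two Python files and one non-Python file.
toy_snapshot = {
    "filename": ["a.py", "README.md", "b/c.py"],
    "content": ["x = 1\n", "# hi", "pass"],
}

print(snapshot_to_mapping(toy_snapshot)["README.md"])
print(python_char_count(toy_snapshot), is_huge(toy_snapshot))
```

Keeping the threshold as a parameter makes it easy to swap in a different reading of "768K" without touching the counting logic.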

Dataset splits

  • test: test split; 5220255729 bytes, 296 examples.

Dataset size

  • Download size: 1810961403 bytes
  • Dataset size: 5220255729 bytes

Dataset configuration

  • default: default configuration; data files path is data/test-*

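Data collection method

  • Repositories whose main language is Python were cloned from GitHub.
  • The completion file for each data point is a .py file that was added to the repository in a commit.
  • The state of the repository before that commit is the repository snapshot.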

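Dataset statistics

  • Number of data points: 296
  • Number of repositories: 75
  • Number of commits: 252

Completion file statistics

  • Number of lines, median: 313.5
  • Number of lines, min: 200
  • Number of lines, max: 1877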

Repository snapshot statistics

  • .py files: median 261, from 47 to 5227
  • non .py files: median 262, from 24 to 7687
  • .py lines: median 49811
  • non .py lines: median 60163

Line counts

  • infile: 2608
  • inproject: 2901
  • common: 692
  • committed: 1019
  • non-informative: 1164
  • random: 1426
  • Total: 9810
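As a quick sanity check, the per-class line counts listed above can be summed and turned into shares of the total. This is plain arithmetic on the figures from the listing, not a result computed from the dataset itself; the variable names are illustrative.

```python
# Per-class counts of lines to complete, copied from the statistics above.
line_counts = {
    "infile": 2608,
    "inproject": 2901,
    "common": 692,
    "committed": 1019,
    "non_informative": 1164,
    "random": 1426,
}

total = sum(line_counts.values())
assert total == 9810  # matches the reported total

# Share of each class, in percent of all lines to complete.
shares = {name: 100 * count / total for name, count in line_counts.items()}
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:>16}: {share:5.1f}%")

# Average number of lines to complete per data point (296 data points).
print(f"average per data point: {total / 296:.1f}")
```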