devngho/the-stack-mini
Source: Hugging Face. Updated 2024-09-22. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/devngho/the-stack-mini
Description:
This dataset contains code-related data. Each record has the following features:

- File: code hash (hexsha), size, extension (ext), language (lang).
- Repository with the most stars: path (max_stars_repo_path), name (max_stars_repo_name), head hash (max_stars_repo_head_hexsha), licenses (max_stars_repo_licenses), number of stars (max_stars_count), and the minimum and maximum datetime of its star events (max_stars_repo_stars_event_min_datetime and max_stars_repo_stars_event_max_datetime).
- Repository with the most issues: path (max_issues_repo_path), name (max_issues_repo_name), head hash (max_issues_repo_head_hexsha), licenses (max_issues_repo_licenses), number of issues (max_issues_count), and the minimum and maximum datetime of its issue events (max_issues_repo_issues_event_min_datetime and max_issues_repo_issues_event_max_datetime).
- Repository with the most forks: path (max_forks_repo_path), name (max_forks_repo_name), head hash (max_forks_repo_head_hexsha), licenses (max_forks_repo_licenses), number of forks (max_forks_count), and the minimum and maximum datetime of its fork events (max_forks_repo_forks_event_min_datetime and max_forks_repo_forks_event_max_datetime).
- File text and statistics: content, average line length (avg_line_length), maximum line length (max_line_length), and alphanumeric fraction (alphanum_fraction).

The dataset has a single training split containing 6219883 samples, with a total size of 39589908892 bytes.
The data is sourced from bigcode/the-stack-dedup. The task categories are text generation and fill-mask, the language is code, and the license is listed as "other".
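The three derived features (avg_line_length, max_line_length, alphanum_fraction) can be illustrated with a small sketch. The listing does not state how the upstream dataset computes them, so the definitions below are assumptions: line lengths are measured in characters without line terminators, the alphanumeric fraction is the share of alphanumeric characters in the whole file, and size is taken to be the UTF-8 byte length. The function name `file_stats` is hypothetical.

```python
def file_stats(content: str) -> dict:
    """Compute per-file statistics under the assumed definitions above."""
    lines = content.splitlines()
    line_lengths = [len(line) for line in lines]
    return {
        # Assumption: size is the UTF-8 encoded byte length of the file.
        "size": len(content.encode("utf-8")),
        # Mean number of characters per line (0.0 for an empty file).
        "avg_line_length": (
            sum(line_lengths) / len(line_lengths) if line_lengths else 0.0
        ),
        # Length of the longest line (0 for an empty file).
        "max_line_length": max(line_lengths, default=0),
        # Share of characters for which str.isalnum() is true.
        "alphanum_fraction": (
            sum(ch.isalnum() for ch in content) / len(content) if content else 0.0
        ),
    }


# A tiny stand-in for the `content` field of one record.
sample = "def add(a, b):\n    return a + b\n"
stats = file_stats(sample)
print(stats)
```

Values computed this way may differ slightly from the stored columns if the upstream pipeline counts newlines or bytes differently, so treat the result as a sanity check rather than a reproduction.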
Provider:
devngho



