BEE-spoke-data/the-stack-smol-xl-readable
Source: Hugging Face. Updated: 2024-06-07. Indexed: 2024-06-15.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/the-stack-smol-xl-readable
Overview:
the-stack-smol-xl-readable is a subset of bigcode/the-stack-smol-xl, filtered to keep files whose code readability index is greater than 0.8. It ships two configurations, default and original, each with a data file path, a feature list, and a single train split with its byte and example counts. Both configurations carry the same per-file features: code size, extension, language, the name, licenses, and count for the repositories with the most stars, issues, and forks, the text itself, average line length, maximum line length, alphanumeric fraction, and code readability. The default configuration additionally records token length (token_len). Summary statistics for code readability and token length are provided as well. The dataset is licensed under odc-by.
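The readability cut-off described above can be sketched as a plain filter over rows. This is a minimal sketch, not the curators' actual pipeline: the strict `>` comparison follows the wording "greater than 0.8" and is an assumption, the sample rows and their scores are invented for illustration, and `load_config` is a hypothetical helper that relies on the `datasets` library and network access.

```python
from typing import Any, Iterable, Iterator, Mapping

REPO_ID = "BEE-spoke-data/the-stack-smol-xl-readable"
READABILITY_THRESHOLD = 0.8  # cut-off quoted in the description


def keep_readable(
    rows: Iterable[Mapping[str, Any]],
    threshold: float = READABILITY_THRESHOLD,
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose code_readability is strictly greater than threshold."""
    for row in rows:
        if row["code_readability"] > threshold:
            yield row


def load_config(name: str = "default"):
    """Load the train split of one configuration ("default" or "original").

    Hypothetical helper: needs the `datasets` package and network access.
    """
    # Imported lazily so the filter above works without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train")


if __name__ == "__main__":
    # Invented sample rows; the scores are illustrative, not real data.
    sample = [
        {"text": "def add(a, b):\n    return a + b\n", "code_readability": 0.91},
        {"text": "x=1;y=2;z=x+y", "code_readability": 0.42},
    ]
    for row in keep_readable(sample):
        print(row["code_readability"])
```

Because the published rows are already filtered, running `keep_readable` over a loaded configuration should mostly serve as a sanity check or as a way to apply a stricter threshold.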
Provider:
BEE-spoke-data
Original information summary
Dataset overview
Dataset configurations
- Configuration name: default
  - Data file path: data/train-*
  - Features:
    - name: size, dtype: int64
    - name: ext, dtype: string
    - name: lang, dtype: string
    - name: max_stars_repo_name, dtype: string
    - name: max_stars_repo_licenses, sequence: string
    - name: max_stars_count, dtype: float64
    - name: max_issues_repo_name, dtype: string
    - name: max_issues_repo_licenses, sequence: string
    - name: max_issues_count, dtype: float64
    - name: max_forks_repo_name, dtype: string
    - name: max_forks_repo_licenses, sequence: string
    - name: max_forks_count, dtype: float64
    - name: text, dtype: string
    - name: avg_line_length, dtype: float64
    - name: max_line_length, dtype: int64
    - name: alphanum_fraction, dtype: float64
    - name: code_readability, dtype: float64
    - name: token_len, dtype: int64
  - Splits:
    - name: train, num_bytes: 1138355923.3692224, num_examples: 205173
  - Download size: 394010253 bytes
  - Dataset size: 1138355923.3692224 bytes
- Configuration name: original
  - Data file path: original/train-*
  - Features:
    - name: size, dtype: int64
    - name: ext, dtype: string
    - name: lang, dtype: string
    - name: max_stars_repo_name, dtype: string
    - name: max_stars_repo_licenses, sequence: string
    - name: max_stars_count, dtype: float64
    - name: max_issues_repo_name, dtype: string
    - name: max_issues_repo_licenses, sequence: string
    - name: max_issues_count, dtype: float64
    - name: max_forks_repo_name, dtype: string
    - name: max_forks_repo_licenses, sequence: string
    - name: max_forks_count, dtype: float64
    - name: text, dtype: string
    - name: avg_line_length, dtype: float64
    - name: max_line_length, dtype: int64
    - name: alphanum_fraction, dtype: float64
    - name: code_readability, dtype: float64
  - Splits:
    - name: train, num_bytes: 1210173023, num_examples: 218432
  - Download size: 401845994 bytes
  - Dataset size: 1210173023 bytes
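To make the differences between the two configurations easier to see, the split metadata listed above can be compared with a few lines of arithmetic. This is only a sketch over the numbers on this page: the constant and function names are hypothetical, and nothing here explains why the default configuration has fewer rows, since the listing does not say.

```python
from typing import Dict, Set

# Feature names shared by both configurations, copied from the lists above.
SHARED_FEATURES = [
    "size", "ext", "lang",
    "max_stars_repo_name", "max_stars_repo_licenses", "max_stars_count",
    "max_issues_repo_name", "max_issues_repo_licenses", "max_issues_count",
    "max_forks_repo_name", "max_forks_repo_licenses", "max_forks_count",
    "text", "avg_line_length", "max_line_length",
    "alphanum_fraction", "code_readability",
]

FEATURES: Dict[str, Set[str]] = {
    "original": set(SHARED_FEATURES),
    "default": set(SHARED_FEATURES) | {"token_len"},
}

# Train split metadata, copied from the listing above.
SPLITS: Dict[str, Dict[str, float]] = {
    "default": {"num_bytes": 1138355923.3692224, "num_examples": 205173},
    "original": {"num_bytes": 1210173023, "num_examples": 218432},
}


def mean_bytes_per_example(config: str) -> float:
    """Average uncompressed size of one row in the given configuration."""
    split = SPLITS[config]
    return split["num_bytes"] / split["num_examples"]


def row_difference() -> int:
    """Number of rows present in original but absent from default."""
    return int(SPLITS["original"]["num_examples"] - SPLITS["default"]["num_examples"])


if __name__ == "__main__":
    print("extra features in default:", FEATURES["default"] - FEATURES["original"])
    print("rows only in original:", row_difference())
    for name in SPLITS:
        print(name, "mean bytes per example:", round(mean_bytes_per_example(name), 1))
```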
License
- License: odc-by
Statistics
- Readability index (code_readability):
  - count: 218432
  - mean: 0.840515
  - std: 0.0288337
  - min: 0.8
  - 25%: 0.816354
  - 50%: 0.836042
  - 75%: 0.863379
  - max: 0.987153
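A summary with these eight fields can be reproduced from any list of values using only the standard library. The layout resembles the output of pandas' `describe()`, but whether the figures above were produced that way is an assumption; the sketch below uses the sample standard deviation and linearly interpolated quartiles, and `describe` is a hypothetical helper name.

```python
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Return count, mean, std, min, quartiles, and max for a sequence."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute a standard deviation")
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


if __name__ == "__main__":
    # Invented scores for illustration only.
    for key, value in describe([0.81, 0.83, 0.84, 0.86, 0.95]).items():
        print(f"{key}: {value}")
```

Passing the `code_readability` column of a loaded configuration to this function gives a quick check against the figures listed above.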
- Token counts (token_len, llama2 tokenizer):
  - count: 205173
  - mean: 2654.322581
  - std: 18141.069771
  - min: 33
  - 25%: 166
  - 50%: 390
  - 75%: 1078
  - max: 686163
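The token statistics show a long tail: the mean (about 2654) sits well above the median (390), and the longest file reaches 686163 tokens. One way to handle this when preparing training data is to drop or set aside rows that exceed a token budget. The sketch below assumes rows expose the `token_len` field of the default configuration; the 4096-token budget and the helper names are hypothetical choices, not something the dataset prescribes.

```python
from typing import Any, Iterable, List, Mapping

MAX_TOKENS = 4096  # hypothetical budget; choose one that fits your model


def within_budget(
    rows: Iterable[Mapping[str, Any]],
    max_tokens: int = MAX_TOKENS,
) -> List[Mapping[str, Any]]:
    """Keep rows whose token_len does not exceed max_tokens."""
    return [row for row in rows if row["token_len"] <= max_tokens]


def share_within_budget(token_lens: Iterable[int], max_tokens: int = MAX_TOKENS) -> float:
    """Fraction of the given token lengths that fit within max_tokens."""
    lens = list(token_lens)
    if not lens:
        raise ValueError("token_lens must not be empty")
    return sum(1 for n in lens if n <= max_tokens) / len(lens)


if __name__ == "__main__":
    # The five values are the min, quartiles, and max listed above,
    # used here only as sample inputs.
    sample = [{"token_len": n} for n in (33, 166, 390, 1078, 686163)]
    kept = within_budget(sample)
    print(len(kept), "of", len(sample), "sample rows fit the budget")
```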



