thepowerfuldeez/the-stack-v2-train-smol-ids-updated-content
Source: Hugging Face. Updated: 2025-09-16. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/thepowerfuldeez/the-stack-v2-train-smol-ids-updated-content
Resource description:
The dataset contains code from GitHub repositories together with associated text. Each record has two fields, `repo_name` and `text`. The data is provided as a single training split, stored in Parquet format, and the content has been formatted. The total size is approximately 100 billion tokens.
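Given the two-field schema described above, a minimal Python sketch of reading the data might look like the following. It assumes the Hugging Face `datasets` package is installed, that the training split is named `train`, and that the dataset is reachable under the ID shown in the download link; the helper names `stream_rows` and `chars_per_repo` are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Dataset ID taken from the listing; the split name "train" is an assumption.
DATASET_ID = "thepowerfuldeez/the-stack-v2-train-smol-ids-updated-content"


def stream_rows(dataset_id: str = DATASET_ID) -> Iterator[Dict[str, str]]:
    """Lazily yield rows from the training split (hypothetical helper)."""
    # Imported here so chars_per_repo works without the `datasets` package.
    from datasets import load_dataset

    # streaming=True fetches data on demand instead of downloading
    # the full ~100 billion token corpus up front.
    return iter(load_dataset(dataset_id, split="train", streaming=True))


def chars_per_repo(rows: Iterable[Dict[str, str]], limit: int) -> Counter:
    """Sum the length of `text` per `repo_name` over the first `limit` rows."""
    totals: Counter = Counter()
    for i, row in enumerate(rows):
        if i >= limit:
            break
        totals[row["repo_name"]] += len(row["text"])
    return totals
```

Calling `chars_per_repo(stream_rows(), limit=1000)` would then give a rough view of how much text each repository contributes among the first thousand rows, without materializing the whole dataset locally.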
Provider:
thepowerfuldeez



