starcoderdata
Source: ModelScope community. Updated 2026-05-16; indexed 2024-05-15.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/starcoderdata
Resource description:
# StarCoder Training Dataset
## Dataset description
This is the dataset used to train [StarCoder](https://huggingface.co/bigcode/starcoder) and [StarCoderBase](https://huggingface.co/bigcode/starcoderbase). It contains 783GB of code in 86 programming languages, along with 54GB of GitHub issues, 13GB of Jupyter notebooks (as scripts and as text-code pairs), and 32GB of GitHub commits, for a total of approximately 250 billion tokens.
## Dataset creation
The creation and filtering of The Stack are explained in the [original dataset](https://huggingface.co/datasets/bigcode/the-stack-dedup). We additionally decontaminate and clean all 86 programming languages in the dataset, as well as the GitHub issues, Jupyter notebooks, and GitHub commits. We also apply near-deduplication and remove PII. All details are given in our [paper: 💫 StarCoder, May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view).
## How to use the dataset
```python
from datasets import load_dataset
# to load python for example
ds = load_dataset("bigcode/starcoderdata", data_dir="python", split="train")
```
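Because a single language subset can be very large, it may be more convenient to stream it than to download it in full. The sketch below assumes the `streaming=True` option of `load_dataset`; the helper names `stream_language` and `first_records` are hypothetical and not part of the dataset's own tooling. Access to the repository may also require authentication.

```python
from itertools import islice


def first_records(dataset, n=3):
    """Return the first n records of any iterable dataset as a list."""
    return list(islice(dataset, n))


def stream_language(language, repo_id="bigcode/starcoderdata"):
    """Open one language subset lazily instead of downloading it in full.

    Hypothetical helper: it only wraps the call shown above and adds
    streaming=True.
    """
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_dir=language, split="train", streaming=True)


# Example (needs network access and the `datasets` package):
# for record in first_records(stream_language("python"), n=2):
#     print(sorted(record.keys()))
```

Printing the keys of a few records is a cheap way to check which columns a subset exposes before writing any processing code.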
The GitHub issues, GitHub commits, and Jupyter notebooks subsets have different columns from the rest, so loading the entire dataset at once may fail. We suggest loading the programming languages separately from these categories:
````
jupyter-scripts-dedup-filtered
jupyter-structured-clean-dedup
github-issues-filtered-structured
git-commits-cleaned
````
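One way to follow this advice is to load every directory on its own and compare column names before combining anything. The following is a minimal sketch under that assumption: the directory names are copied from the list above, while `split_requested`, `column_names`, and `load_each` are hypothetical helper names introduced only for illustration.

```python
from itertools import islice

# Directory names copied from the list above.
SPECIAL_SUBSETS = (
    "jupyter-scripts-dedup-filtered",
    "jupyter-structured-clean-dedup",
    "github-issues-filtered-structured",
    "git-commits-cleaned",
)


def split_requested(data_dirs):
    """Separate programming-language directories from the special subsets."""
    languages = [d for d in data_dirs if d not in SPECIAL_SUBSETS]
    special = [d for d in data_dirs if d in SPECIAL_SUBSETS]
    return languages, special


def column_names(dataset, n=1):
    """Return the sorted union of keys seen in the first n records."""
    columns = set()
    for record in islice(dataset, n):
        columns.update(record.keys())
    return sorted(columns)


def load_each(data_dirs, repo_id="bigcode/starcoderdata"):
    """Load each directory as its own streaming dataset, keyed by name."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return {
        d: load_dataset(repo_id, data_dir=d, split="train", streaming=True)
        for d in data_dirs
    }
```

Keeping one dataset object per directory means subsets with different columns are never merged into a single table, which is the failure mode described above.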
Provider:
maas
Created:
2023-12-06



