starcoder2data-extras
Source: ModelScope community. Updated 2025-11-27; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/bigcode/starcoder2data-extras
# StarCoder2 Extras
This is the dataset of extra sources (besides Stack v2 code data) used to train the [StarCoder2](https://arxiv.org/abs/2402.19173) family of models. It contains the following subsets:
- Kaggle (`kaggle`): Kaggle notebooks from the [Meta-Kaggle-Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers have a format similar to Jupyter Structured, but the code content is a single script.
- StackOverflow (`stackoverflow`): StackOverflow conversations from this [StackExchange dump](https://archive.org/details/stackexchange).
- Issues (`issues`): processed GitHub issues, same as the Stack v1 issues.
- OWM (`owm`): the [Open-Web-Math](https://huggingface.co/datasets/open-web-math/open-web-math) dataset.
- LHQ (`lhq`): Leandro's High Quality dataset, a compilation of high-quality code files from APPS-train, CodeContests, GSM8K-train, GSM8K-SciRel, DeepMind-Mathematics, Rosetta-Code, MultiPL-T, ProofSteps, and ProofSteps-lean.
- Wiki (`wikipedia`): the English subset of the Wikipedia dump in [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- ArXiv (`arxiv`): the ArXiv subset of the [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset, further processed to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from those files.
- IR_language (`ir_cpp`, `ir_low_resource`, `ir_python`, `ir_rust`): intermediate representations of Python, Rust, C++, and other low-resource languages.
- Documentation (`documentation`): documentation of popular libraries.
For more details on the processing of each subset, check the [StarCoder2 paper](https://arxiv.org/abs/2402.19173) or The Stack v2 [GitHub repository](https://github.com/bigcode-project/the-stack-v2/).
## Usage
```python
from datasets import load_dataset
# replace `kaggle` with one of the config names listed above
ds = load_dataset("bigcode/starcoder2data-extras", "kaggle", split="train")
```
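Because the config name is a free-form string, a typo only surfaces once the load call fails. The following is a minimal sketch, not part of the official card, that checks the name against the subsets listed above before loading. The helper names `load_subset` and `peek` are hypothetical, and the sketch assumes that streaming mode (`streaming=True` in `datasets`) is acceptable for a first look, since it avoids downloading a whole subset up front.

```python
# Config names copied from the subset list above.
CONFIGS = (
    "kaggle", "stackoverflow", "issues", "owm", "lhq", "wikipedia",
    "arxiv", "ir_cpp", "ir_low_resource", "ir_python", "ir_rust",
    "documentation",
)


def load_subset(name, streaming=True):
    """Hypothetical helper: load one subset, failing early on an unknown name."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/starcoder2data-extras", name, split="train", streaming=streaming
    )


def peek(name, n=3):
    """Hypothetical helper: print the field names of the first n records."""
    ds = load_subset(name, streaming=True)
    # No column names are assumed here; they are read from each record.
    for example in ds.take(n):
        print(sorted(example.keys()))
```

Calling `peek("kaggle")` prints the field names of a few records, which is a quick way to inspect a subset's schema before committing to a full download. Pass `streaming=False` to `load_subset` to get the regular, fully downloaded dataset instead.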
## Citation
```
@article{lozhkov2024starcoder,
title={Starcoder 2 and the stack v2: The next generation},
author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
journal={arXiv preprint arXiv:2402.19173},
year={2024}
}
```
Provider: maas
Created: 2025-10-11



