starcoder2data-extras

Name: starcoder2data-extras
Creator: maas
Published: 2025-11-27 16:52:25
License: no description available

ModelScope community. Updated 2025-11-27; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/bigcode/starcoder2data-extras


Resource description:

# StarCoder2 Extras

This is the dataset of extra sources (besides Stack v2 code data) used to train the [StarCoder2](https://arxiv.org/abs/2402.19173) family of models. It contains the following subsets:

- Kaggle (`kaggle`): Kaggle notebooks from the [Meta-Kaggle-Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers have a format similar to Jupyter Structured, but the code content is a single script.
- StackOverflow (`stackoverflow`): StackOverflow conversations from this [StackExchange dump](https://archive.org/details/stackexchange).
- Issues (`issues`): processed GitHub issues, same as the Stack v1 issues.
- OWM (`owm`): the [Open-Web-Math](https://huggingface.co/datasets/open-web-math/open-web-math) dataset.
- LHQ (`lhq`): Leandro's High Quality dataset, a compilation of high-quality code files from APPS-train, CodeContests, GSM8K-train, GSM8K-SciRel, DeepMind-Mathematics, Rosetta-Code, MultiPL-T, ProofSteps, and ProofSteps-lean.
- Wiki (`wikipedia`): the English subset of the Wikipedia dump in [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- ArXiv (`arxiv`): the ArXiv subset of the [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset, further processed to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from those files.
- IR_language (`ir_cpp`, `ir_low_resource`, `ir_python`, `ir_rust`): intermediate representations of Python, Rust, C++, and other low-resource languages.
- Documentation (`documentation`): documentation of popular libraries.

For more details on the processing of each subset, check the [StarCoder2 paper](https://arxiv.org/abs/2402.19173) or The Stack v2 [GitHub repository](https://github.com/bigcode-project/the-stack-v2/).

## Usage

```python
from datasets import load_dataset

# replace `kaggle` with one of the config names listed above
ds = load_dataset("bigcode/starcoder2data-extras", "kaggle", split="train")
```

## Citation

```
@article{lozhkov2024starcoder,
  title={Starcoder 2 and the stack v2: The next generation},
  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
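The loading call in the Usage section takes one config name at a time, so a typo in the name only surfaces once the request is made. The sketch below is one possible way to check the name first and to preview a subset without fetching all of it. The helper names `build_load_kwargs` and `peek` are hypothetical and not part of the dataset or of the `datasets` library; the config list is copied from the subset descriptions in this card, and `streaming=True` is an assumption about how one might want to read the data, not something the card prescribes. The card does not document record fields, so the code makes no assumption about column names.

```python
from itertools import islice
from typing import Any

# Config names copied from the subset list in this card.
CONFIG_NAMES = (
    "kaggle", "stackoverflow", "issues", "owm", "lhq", "wikipedia", "arxiv",
    "ir_cpp", "ir_low_resource", "ir_python", "ir_rust", "documentation",
)


def build_load_kwargs(config: str, streaming: bool = True) -> dict[str, Any]:
    """Hypothetical helper: validate `config` and build load_dataset kwargs."""
    if config not in CONFIG_NAMES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {CONFIG_NAMES}"
        )
    return {
        "path": "bigcode/starcoder2data-extras",
        "name": config,
        "split": "train",
        "streaming": streaming,  # assumption: iterate lazily instead of downloading the split
    }


def peek(config: str, n: int = 3) -> list:
    """Hypothetical helper: return the first `n` records of one subset."""
    # Imported lazily so that build_load_kwargs works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(**build_load_kwargs(config))
    return list(islice(ds, n))
```

Calling `peek("arxiv")` would then request only the first few records of the ArXiv subset, while `build_load_kwargs("wiki")` raises a `ValueError` because the config is named `wikipedia`.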


Provider:

maas

Created:

2025-10-11

Collected summary

Dataset introduction