
starcoder2data-extras

ModelScope community. Updated 2025-11-27. Indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/bigcode/starcoder2data-extras
Description:
# StarCoder2 Extras

This is the dataset of extra sources (besides Stack v2 code data) used to train the [StarCoder2](https://arxiv.org/abs/2402.19173) family of models. It contains the following subsets:

- Kaggle (`kaggle`): Kaggle notebooks from the [Meta-Kaggle-Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers follow a format similar to Jupyter Structured, but the code content is a single script.
- StackOverflow (`stackoverflow`): StackOverflow conversations from this [StackExchange dump](https://archive.org/details/stackexchange).
- Issues (`issues`): processed GitHub issues, the same as the Stack v1 issues.
- OWM (`owm`): the [Open-Web-Math](https://huggingface.co/datasets/open-web-math/open-web-math) dataset.
- LHQ (`lhq`): Leandro's High Quality dataset, a compilation of high-quality code files from APPS-train, CodeContests, GSM8K-train, GSM8K-SciRel, DeepMind-Mathematics, Rosetta-Code, MultiPL-T, ProofSteps, and ProofSteps-lean.
- Wiki (`wikipedia`): the English subset of the Wikipedia dump in [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- ArXiv (`arxiv`): the ArXiv subset of the [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset, further processed to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from these files.
- IR_language (`ir_cpp`, `ir_low_resource`, `ir_python`, `ir_rust`): intermediate representations of Python, Rust, C++, and other low-resource languages.
- Documentation (`documentation`): documentation of popular libraries.

For more details on the processing of each subset, check the [StarCoder2 paper](https://arxiv.org/abs/2402.19173) or The Stack v2 [GitHub repository](https://github.com/bigcode-project/the-stack-v2/).

## Usage

```python
from datasets import load_dataset

# replace `kaggle` with one of the config names listed above
ds = load_dataset("bigcode/starcoder2data-extras", "kaggle", split="train")
```

## Citation

```
@article{lozhkov2024starcoder,
  title={Starcoder 2 and the stack v2: The next generation},
  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
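Some subsets may be large, so it can be convenient to validate the config name first and read examples lazily rather than downloading a whole split. The sketch below extends the Usage snippet under a few assumptions: the helper names `SUBSETS`, `check_subset`, and `stream_subset` are hypothetical and not part of the dataset or of any library, and `streaming=True` is the standard `datasets` option for lazy iteration. The card does not list column names, so the example only prints whatever keys each record has.

```python
# Config names copied from the subset list in the dataset card.
SUBSETS = (
    "kaggle", "stackoverflow", "issues", "owm", "lhq", "wikipedia",
    "arxiv", "ir_cpp", "ir_low_resource", "ir_python", "ir_rust",
    "documentation",
)


def check_subset(name: str) -> str:
    """Return `name` unchanged if it is a listed config, else raise ValueError."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    return name


def stream_subset(name: str, split: str = "train"):
    """Open one subset lazily (hypothetical helper; needs `datasets` and network access)."""
    # Imported here so the helpers above work even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/starcoder2data-extras",
        check_subset(name),
        split=split,
        streaming=True,  # yields examples on the fly instead of downloading everything
    )


# Example (requires network access):
#   for example in stream_subset("kaggle").take(2):
#       print(sorted(example.keys()))
```

Checking the name before calling `load_dataset` is only a convenience: a typo such as `"wiki"` instead of `"wikipedia"` fails immediately with a message listing the valid configs.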

Provider:
maas
Created:
2025-10-11