StarCoderData
Source: arXiv (indexed 2025-09-30)
Download link:
https://huggingface.co/datasets/bigcode/starcoderdata
Description:
StarCoderData is curated from The Stack v1 and was used to train the StarCoderBase model. Its Java subset, about 22 billion tokens, was used to train NT-Java-1.1B, a specialized code language model for Java programming tasks.
Provider:
BigCode



