
ArnabPluxury/starcoder2data-extras

Source: Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/ArnabPluxury/starcoder2data-extras
Description:
---
dataset_info:
- config_name: arxiv
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 89223183645.0
    num_examples: 1558306
  download_size: 40911186876
  dataset_size: 89223183645.0
- config_name: documentation
  features:
  - name: project
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 5421472234.0
    num_examples: 59733
  download_size: 1853451922
  dataset_size: 5421472234.0
- config_name: ir_cpp
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 102081135272.0
    num_examples: 2916655
  download_size: 26047978422
  dataset_size: 102081135272.0
- config_name: ir_low_resource
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  - name: size
    dtype: int64
  splits:
  - name: train
    num_bytes: 10383382043.0
    num_examples: 393988
  download_size: 2464513603
  dataset_size: 10383382043.0
- config_name: ir_python
  features:
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 12446664464.0
    num_examples: 154507
  download_size: 3039297625
  dataset_size: 12446664464.0
- config_name: ir_rust
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 4764927851.0
    num_examples: 32720
  download_size: 1254786199
  dataset_size: 4764927851.0
- config_name: issues
  features:
  - name: repo_name
    dtype: string
  - name: content
    dtype: string
  - name: issue_id
    dtype: string
  splits:
  - name: train
    num_bytes: 31219575534.38484
    num_examples: 15549682
  download_size: 16483899047
  dataset_size: 31219575534.38484
- config_name: kaggle
  features:
  - name: content
    dtype: string
  - name: file_id
    dtype: string
  splits:
  - name: train
    num_bytes: 5228745262.0
    num_examples: 580195
  download_size: 2234440007
  dataset_size: 5228745262.0
- config_name: lhq
  features:
  - name: content
    dtype: string
  - name: metadata
    struct:
    - name: difficulty
      dtype: string
    - name: field
      dtype: string
    - name: topic
      dtype: string
  splits:
  - name: train
    num_bytes: 751273849.0
    num_examples: 7037500
  download_size: 272913202
  dataset_size: 751273849.0
- config_name: owm
  features:
  - name: url
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: metadata
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 56294728333.0
    num_examples: 6315233
  download_size: 27160071916
  dataset_size: 56294728333.0
- config_name: stackoverflow
  features:
  - name: date
    dtype: string
  - name: nb_tokens
    dtype: int64
  - name: text_size
    dtype: int64
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 35548199612.0
    num_examples: 10404628
  download_size: 17008831030
  dataset_size: 35548199612.0
- config_name: wikipedia
  features:
  - name: content
    dtype: string
  - name: meta
    dtype: string
  - name: red_pajama_subset
    dtype: string
  splits:
  - name: train
    num_bytes: 21572720540.0
    num_examples: 6630651
  download_size: 12153445493
  dataset_size: 21572720540.0
configs:
- config_name: arxiv
  data_files:
  - split: train
    path: arxiv/train-*
- config_name: documentation
  data_files:
  - split: train
    path: documentation/train-*
- config_name: ir_cpp
  data_files:
  - split: train
    path: ir_cpp/train-*
- config_name: ir_low_resource
  data_files:
  - split: train
    path: ir_low_resource/train-*
- config_name: ir_python
  data_files:
  - split: train
    path: ir_python/train-*
- config_name: ir_rust
  data_files:
  - split: train
    path: ir_rust/train-*
- config_name: issues
  data_files:
  - split: train
    path: issues/train-*
- config_name: kaggle
  data_files:
  - split: train
    path: kaggle/train-*
- config_name: lhq
  data_files:
  - split: train
    path: lhq/train-*
- config_name: owm
  data_files:
  - split: train
    path: owm/train-*
- config_name: stackoverflow
  data_files:
  - split: train
    path: stackoverflow/train-*
- config_name: wikipedia
  data_files:
  - split: train
    path: wikipedia/train-*
---

# StarCoder2 Extras

This is the dataset of extra sources (besides Stack v2 code data) used to train the [StarCoder2](https://arxiv.org/abs/2402.19173) family of models. It contains the following subsets:

- Kaggle (`kaggle`): Kaggle notebooks from the [Meta-Kaggle-Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers have a format similar to Jupyter Structured, but the code content is a single script.
- StackOverflow (`stackoverflow`): Stack Overflow conversations from this [StackExchange dump](https://archive.org/details/stackexchange).
- Issues (`issues`): processed GitHub issues, same as the Stack v1 issues.
- OWM (`owm`): the [Open-Web-Math](https://huggingface.co/datasets/open-web-math/open-web-math) dataset.
- LHQ (`lhq`): Leandro's High Quality dataset, a compilation of high-quality code files from APPS-train, CodeContests, GSM8K-train, GSM8K-SciRel, DeepMind-Mathematics, Rosetta-Code, MultiPL-T, ProofSteps, and ProofSteps-lean.
- Wiki (`wikipedia`): the English subset of the Wikipedia dump in [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- ArXiv (`arxiv`): the ArXiv subset of the [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset, further processed to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from these files.
- IR_language (`ir_cpp`, `ir_low_resource`, `ir_python`, `ir_rust`): intermediate representations of Python, Rust, C++, and other low-resource languages.
- Documentation (`documentation`): documentation of popular libraries.

For more details on the processing of each subset, check the [StarCoder2 paper](https://arxiv.org/abs/2402.19173) or The Stack v2 [GitHub repository](https://github.com/bigcode-project/the-stack-v2/).

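
The sizes declared in the `dataset_info` metadata can be tallied before deciding which subsets to download. The following is a minimal sketch: the numbers are copied from the metadata above (the fractional `issues` byte count is truncated to an integer), and the names `SIZES`, `total_download_bytes`, `mean_bytes_per_example`, and `largest_config` are illustrative helpers, not part of any library.

```python
# Per-config sizes copied from the dataset card's dataset_info block:
# (num_examples, download_size_bytes, dataset_size_bytes).
SIZES = {
    "arxiv": (1_558_306, 40_911_186_876, 89_223_183_645),
    "documentation": (59_733, 1_853_451_922, 5_421_472_234),
    "ir_cpp": (2_916_655, 26_047_978_422, 102_081_135_272),
    "ir_low_resource": (393_988, 2_464_513_603, 10_383_382_043),
    "ir_python": (154_507, 3_039_297_625, 12_446_664_464),
    "ir_rust": (32_720, 1_254_786_199, 4_764_927_851),
    "issues": (15_549_682, 16_483_899_047, 31_219_575_534),  # fraction truncated
    "kaggle": (580_195, 2_234_440_007, 5_228_745_262),
    "lhq": (7_037_500, 272_913_202, 751_273_849),
    "owm": (6_315_233, 27_160_071_916, 56_294_728_333),
    "stackoverflow": (10_404_628, 17_008_831_030, 35_548_199_612),
    "wikipedia": (6_630_651, 12_153_445_493, 21_572_720_540),
}


def total_download_bytes(sizes=SIZES):
    """Sum the compressed download size over all configs."""
    return sum(download for _, download, _ in sizes.values())


def mean_bytes_per_example(config, sizes=SIZES):
    """Average uncompressed bytes per example for one config."""
    examples, _, dataset = sizes[config]
    return dataset / examples


def largest_config(sizes=SIZES):
    """Name of the config with the largest uncompressed dataset size."""
    return max(sizes, key=lambda name: sizes[name][2])


if __name__ == "__main__":
    print(f"total download: {total_download_bytes() / 1e9:.1f} GB")
    print(f"largest config: {largest_config()}")
    for name in SIZES:
        print(f"{name}: {mean_bytes_per_example(name):,.0f} bytes/example")
```

The per-example average gives a rough sense of document length in each subset: `lhq` holds many short records, while the `ir_*` configs hold far fewer but much larger ones.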
## Usage

```python
from datasets import load_dataset

# replace `kaggle` with one of the config names listed above
ds = load_dataset("bigcode/starcoder2data-extras", "kaggle", split="train")
```

## Citation

```
@article{lozhkov2024starcoder,
  title={Starcoder 2 and the stack v2: The next generation},
  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
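
Several configs are tens of gigabytes, so it can help to inspect a few rows before downloading a whole subset. The sketch below builds on the usage call above and uses the `streaming=True` option of `load_dataset`. The helpers `check_config` and `preview` are hypothetical names introduced here, and the sketch assumes the repository is reachable from your environment (authentication may be needed if access is gated). It reads the `content` column, which every config in the metadata declares.

```python
from itertools import islice

# Config names as listed in the dataset card.
CONFIG_NAMES = (
    "arxiv", "documentation", "ir_cpp", "ir_low_resource", "ir_python",
    "ir_rust", "issues", "kaggle", "lhq", "owm", "stackoverflow", "wikipedia",
)


def check_config(name):
    """Return `name` unchanged, or raise if it is not a listed config."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIG_NAMES}")
    return name


def preview(config, n=3, repo="bigcode/starcoder2data-extras"):
    """Stream the first `n` rows of a config and return their `content` fields.

    `datasets` is imported lazily so that this module can be imported
    without the library installed or a network connection available.
    """
    from datasets import load_dataset

    ds = load_dataset(repo, check_config(config), split="train", streaming=True)
    return [row["content"] for row in islice(ds, n)]
```

Calling `preview("lhq")` would fetch only the first few records instead of the full subset. Validating the name first turns a typo into an immediate `ValueError` rather than a failed remote lookup.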

Provided by:
ArnabPluxury