ArnabPluxury/starcoder2data-extras
Source: Hugging Face | Updated: 2026-04-18 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/ArnabPluxury/starcoder2data-extras
Resource description:
---
dataset_info:
- config_name: arxiv
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 89223183645.0
    num_examples: 1558306
  download_size: 40911186876
  dataset_size: 89223183645.0
- config_name: documentation
  features:
  - name: project
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 5421472234.0
    num_examples: 59733
  download_size: 1853451922
  dataset_size: 5421472234.0
- config_name: ir_cpp
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 102081135272.0
    num_examples: 2916655
  download_size: 26047978422
  dataset_size: 102081135272.0
- config_name: ir_low_resource
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  - name: size
    dtype: int64
  splits:
  - name: train
    num_bytes: 10383382043.0
    num_examples: 393988
  download_size: 2464513603
  dataset_size: 10383382043.0
- config_name: ir_python
  features:
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 12446664464.0
    num_examples: 154507
  download_size: 3039297625
  dataset_size: 12446664464.0
- config_name: ir_rust
  features:
  - name: __index_level_0__
    dtype: string
  - name: id
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 4764927851.0
    num_examples: 32720
  download_size: 1254786199
  dataset_size: 4764927851.0
- config_name: issues
  features:
  - name: repo_name
    dtype: string
  - name: content
    dtype: string
  - name: issue_id
    dtype: string
  splits:
  - name: train
    num_bytes: 31219575534.38484
    num_examples: 15549682
  download_size: 16483899047
  dataset_size: 31219575534.38484
- config_name: kaggle
  features:
  - name: content
    dtype: string
  - name: file_id
    dtype: string
  splits:
  - name: train
    num_bytes: 5228745262.0
    num_examples: 580195
  download_size: 2234440007
  dataset_size: 5228745262.0
- config_name: lhq
  features:
  - name: content
    dtype: string
  - name: metadata
    struct:
    - name: difficulty
      dtype: string
    - name: field
      dtype: string
    - name: topic
      dtype: string
  splits:
  - name: train
    num_bytes: 751273849.0
    num_examples: 7037500
  download_size: 272913202
  dataset_size: 751273849.0
- config_name: owm
  features:
  - name: url
    dtype: string
  - name: date
    dtype: timestamp[s]
  - name: metadata
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 56294728333.0
    num_examples: 6315233
  download_size: 27160071916
  dataset_size: 56294728333.0
- config_name: stackoverflow
  features:
  - name: date
    dtype: string
  - name: nb_tokens
    dtype: int64
  - name: text_size
    dtype: int64
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 35548199612.0
    num_examples: 10404628
  download_size: 17008831030
  dataset_size: 35548199612.0
- config_name: wikipedia
  features:
  - name: content
    dtype: string
  - name: meta
    dtype: string
  - name: red_pajama_subset
    dtype: string
  splits:
  - name: train
    num_bytes: 21572720540.0
    num_examples: 6630651
  download_size: 12153445493
  dataset_size: 21572720540.0
configs:
- config_name: arxiv
  data_files:
  - split: train
    path: arxiv/train-*
- config_name: documentation
  data_files:
  - split: train
    path: documentation/train-*
- config_name: ir_cpp
  data_files:
  - split: train
    path: ir_cpp/train-*
- config_name: ir_low_resource
  data_files:
  - split: train
    path: ir_low_resource/train-*
- config_name: ir_python
  data_files:
  - split: train
    path: ir_python/train-*
- config_name: ir_rust
  data_files:
  - split: train
    path: ir_rust/train-*
- config_name: issues
  data_files:
  - split: train
    path: issues/train-*
- config_name: kaggle
  data_files:
  - split: train
    path: kaggle/train-*
- config_name: lhq
  data_files:
  - split: train
    path: lhq/train-*
- config_name: owm
  data_files:
  - split: train
    path: owm/train-*
- config_name: stackoverflow
  data_files:
  - split: train
    path: stackoverflow/train-*
- config_name: wikipedia
  data_files:
  - split: train
    path: wikipedia/train-*
---
# StarCoder2 Extras
This dataset collects the extra data sources (beyond the Stack v2 code data) used to train the [StarCoder2](https://arxiv.org/abs/2402.19173) family of models. It contains the following subsets:
- Kaggle (`kaggle`): Kaggle notebooks from the [Meta-Kaggle-Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, converted to scripts and prefixed with information on the Kaggle datasets used in the notebook. The file headers have a format similar to Jupyter Structured, but the code content is a single script.
- StackOverflow (`stackoverflow`): StackOverflow conversations from this [StackExchange dump](https://archive.org/details/stackexchange).
- Issues (`issues`): processed GitHub issues, same as the Stack v1 issues.
- OWM (`owm`): the [Open-Web-Math](https://huggingface.co/datasets/open-web-math/open-web-math) dataset.
- LHQ (`lhq`): Leandro's High Quality dataset, a compilation of high-quality code files from APPS-train, CodeContests, GSM8K-train, GSM8K-SciRel, DeepMind-Mathematics, Rosetta-Code, MultiPL-T, ProofSteps, and ProofSteps-lean.
- Wiki (`wikipedia`): the English subset of the Wikipedia dump in [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T).
- ArXiv (`arxiv`): the ArXiv subset of the [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) dataset, further processed to retain only LaTeX source files and to remove preambles, comments, macros, and bibliographies from these files.
- IR_language (`ir_cpp`, `ir_low_resource`, `ir_python`, `ir_rust`): intermediate representations of Python, Rust, C++, and other low-resource languages.
- Documentation (`documentation`): documentation of popular libraries.
For more details on the processing of each subset, check the [StarCoder2 paper](https://arxiv.org/abs/2402.19173) or The Stack v2 [GitHub repository](https://github.com/bigcode-project/the-stack-v2/).
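The sizes declared in the card metadata can be turned into a quick per-subset summary before deciding what to download. The sketch below is only illustrative: `CONFIG_STATS`, `summarize`, and `total_bytes` are hypothetical names, and the numbers are copied by hand from the `dataset_info` block above (the fractional `num_bytes` of `issues` is truncated to an integer).

```python
# (num_bytes, num_examples) per config, copied from the dataset_info metadata.
# Assumption: the fractional num_bytes of `issues` is truncated to an integer.
CONFIG_STATS = {
    "arxiv": (89_223_183_645, 1_558_306),
    "documentation": (5_421_472_234, 59_733),
    "ir_cpp": (102_081_135_272, 2_916_655),
    "ir_low_resource": (10_383_382_043, 393_988),
    "ir_python": (12_446_664_464, 154_507),
    "ir_rust": (4_764_927_851, 32_720),
    "issues": (31_219_575_534, 15_549_682),
    "kaggle": (5_228_745_262, 580_195),
    "lhq": (751_273_849, 7_037_500),
    "owm": (56_294_728_333, 6_315_233),
    "stackoverflow": (35_548_199_612, 10_404_628),
    "wikipedia": (21_572_720_540, 6_630_651),
}


def summarize(stats):
    """Return (name, size in GB, mean bytes per example), largest subset first."""
    rows = [
        (name, num_bytes / 1e9, num_bytes / num_examples)
        for name, (num_bytes, num_examples) in stats.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def total_bytes(stats):
    """Sum the uncompressed size of all subsets, in bytes."""
    return sum(num_bytes for num_bytes, _ in stats.values())


if __name__ == "__main__":
    for name, gigabytes, mean_bytes in summarize(CONFIG_STATS):
        print(f"{name:16s} {gigabytes:8.1f} GB {mean_bytes:12.0f} bytes/example")
    print(f"total: {total_bytes(CONFIG_STATS) / 1e9:.1f} GB")
```

The mean bytes per example gives a rough sense of record length: `lhq` holds many short records, while the `ir_*` subsets hold far fewer but much longer ones.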
## Usage
```python
from datasets import load_dataset
# replace `kaggle` with one of the config names listed above
ds = load_dataset("bigcode/starcoder2data-extras", "kaggle", split="train")
```
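Several subsets are tens of gigabytes, so it may be preferable to stream records rather than download a whole config first. The following is a minimal sketch, assuming the `streaming=True` option of the `datasets` library and network access to the Hub; `open_stream` and `preview` are hypothetical helper names, not part of the dataset or the library.

```python
from itertools import islice


def open_stream(config, repo="bigcode/starcoder2data-extras"):
    """Open one subset lazily (requires the `datasets` package and network access)."""
    # Imported here so that `preview` stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo, config, split="train", streaming=True)


def preview(records, n=3, field="content", width=80):
    """Return the first `n` values of `field`, each truncated to `width` characters."""
    return [str(record[field])[:width] for record in islice(records, n)]


if __name__ == "__main__":
    # Every config listed in the metadata has a `content` column.
    for snippet in preview(open_stream("kaggle")):
        print(snippet)
```

Because `preview` accepts any iterable of dictionaries, it can be tried on a small in-memory list before pointing it at a real stream.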
## Citation
```
@article{lozhkov2024starcoder,
title={StarCoder 2 and The Stack v2: The Next Generation},
author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
journal={arXiv preprint arXiv:2402.19173},
year={2024}
}
```
Provider:
ArnabPluxury



