issues-kaggle-notebooks
ModelScope community · updated 2025-12-05 · indexed 2025-09-27
Download link: https://modelscope.cn/datasets/HuggingFaceTB/issues-kaggle-notebooks
Resource description:
# GitHub Issues & Kaggle Notebooks
## Description
GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the [StarCoder2](https://arxiv.org/abs/2402.19173) model training corpus, specifically the [bigcode/StarCoder2-Extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras) dataset. We reformat the samples to remove StarCoder2's special tokens, use natural text to delimit comments in issues, and display Kaggle notebooks as Markdown with code blocks.
The dataset includes:
- 🐛 GitHub Issues – 11B tokens of discussions from GitHub issues sourced from [GH Archive](https://www.gharchive.org/).
- 📊 Kaggle Notebooks – 1.7B tokens of data analysis notebooks in Markdown format, curated from Kaggle's [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset.
These datasets have undergone filtering to remove low-quality content, duplicates, and PII. More details are available in the StarCoder2 [paper](https://arxiv.org/abs/2402.19173).
## How to load the dataset
You can load a specific subset using the following code:
```python
from datasets import load_dataset
# The repository ID matches the dataset name and download link of this card.
issues = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "issues", split="train") # GitHub Issues
kaggle_notebooks = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "kaggle", split="train") # Kaggle Notebooks
```
## Dataset curation
These curation details are from the StarCoder2 pipeline. The original datasets can be found at https://huggingface.co/datasets/bigcode/starcoder2data-extras, and more details can be found in the StarCoder2 paper.
### 🐛 GitHub Issues
The GitHub Issues dataset consists of discussions from GitHub repositories, sourced from GHArchive. It contains issue reports, bug tracking, and technical Q&A discussions.
To ensure high-quality data, the StarCoder2 processing pipeline included:
- Removing bot-generated comments and auto-replies from email responses.
- Filtering out short issues (<200 characters) and extremely long comments.
- Keeping only discussions with multiple users (or highly detailed single-user reports).
- Anonymizing usernames while preserving the conversation structure, and redacting names, emails, keys, passwords, and IP addresses using [StarPII](https://huggingface.co/bigcode/starpii).
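The filtering rules above can be illustrated with a minimal sketch. This is not the StarCoder2 implementation: the `Comment` type, the `keep_issue` function, and the two thresholds marked hypothetical are assumptions made for illustration. Only the 200-character minimum comes from this card, and applying it to the total text of an issue is also an assumption.

```python
from dataclasses import dataclass

MIN_ISSUE_CHARS = 200               # from the card: issues under 200 characters are dropped
MAX_COMMENT_CHARS = 10_000          # hypothetical cap; the card only says "extremely long"
DETAILED_SINGLE_USER_CHARS = 2_000  # hypothetical bar for a "highly detailed" solo report


@dataclass
class Comment:
    author: str
    text: str
    is_bot: bool = False


def keep_issue(comments: list[Comment]) -> bool:
    """Return True if an issue passes the simplified filters."""
    # Drop bot comments and overly long comments before judging the issue.
    human = [c for c in comments if not c.is_bot and len(c.text) <= MAX_COMMENT_CHARS]
    if not human:
        return False
    total_chars = sum(len(c.text) for c in human)
    if total_chars < MIN_ISSUE_CHARS:
        return False
    # Keep multi-user discussions, or single-user reports that are long enough.
    n_users = len({c.author for c in human})
    return n_users > 1 or total_chars >= DETAILED_SINGLE_USER_CHARS
```

In this sketch a two-person exchange of 300 characters is kept, while a 300-character report from a single user is dropped.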
We format the conversations using this template:
```
Title: [Issue title]
Question:
username_0: [Issue content]
Answers:
username_1: [Answer from user 1]
username_0: [Author reply]
username_2: [Answer from user 2]
...
Status: Issue closed (optional)
```
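As a rough illustration of the template, the sketch below renders a list of `(author, text)` pairs into that layout. The `format_issue` function and its input shape are hypothetical. Numbering authors in order of first appearance is an assumption consistent with the `username_0`, `username_1` pattern shown above, not a confirmed detail of the pipeline.

```python
def format_issue(title: str, comments: list[tuple[str, str]], closed: bool = False) -> str:
    """Render (author, text) pairs using the Title/Question/Answers layout."""
    if not comments:
        raise ValueError("an issue needs at least one comment (the issue body)")
    aliases: dict[str, str] = {}

    def alias(author: str) -> str:
        # Assign username_0, username_1, ... in order of first appearance.
        if author not in aliases:
            aliases[author] = f"username_{len(aliases)}"
        return aliases[author]

    (first_author, body), replies = comments[0], comments[1:]
    lines = [f"Title: {title}", "Question:", f"{alias(first_author)}: {body}"]
    if replies:
        lines.append("Answers:")
        lines.extend(f"{alias(author)}: {text}" for author, text in replies)
    if closed:
        lines.append("Status: Issue closed")
    return "\n".join(lines)


print(format_issue(
    "Crash on start",
    [("alice", "It crashes."), ("bob", "Which version?"), ("alice", "v1.2")],
    closed=True,
))
```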
### 📊 Kaggle Notebooks
The Kaggle Notebooks are sourced from the [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, licensed under Apache 2.0. They were cleaned using a multi-step filtering process, which included:
- Removing notebooks with syntax errors or fewer than 100 characters.
- Extracting metadata for notebooks that reference Kaggle datasets. When possible, we retrieve the dataset references in the notebook and add information about them to the beginning of the notebook (description, `ds.info()` output, and 4 examples).
- Filtering out duplicates, which reduced the dataset volume by 78%, and redacting PII.
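The syntax, length, and duplicate filters above can be sketched as follows. This is a simplified illustration, not the StarCoder2 pipeline: `is_valid_script` and `filter_notebooks` are hypothetical helpers, and exact hash matching stands in for whatever deduplication method the original pipeline used. The sketch also assumes each notebook has already been converted to plain Python, since IPython magics would fail `ast.parse`.

```python
import ast
import hashlib

MIN_NOTEBOOK_CHARS = 100  # from the card: notebooks under 100 characters are removed


def is_valid_script(source: str) -> bool:
    """Return True if the script is long enough and parses as Python."""
    if len(source) < MIN_NOTEBOOK_CHARS:
        return False
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def filter_notebooks(sources: list[str]) -> list[str]:
    """Keep valid scripts, dropping exact duplicates (first occurrence wins)."""
    seen: set[str] = set()
    kept: list[str] = []
    for source in sources:
        if not is_valid_script(source):
            continue
        digest = hashlib.sha256(source.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(source)
    return kept
```

Exact hashing only catches byte-identical copies after stripping surrounding whitespace, so a real pipeline would likely need a fuzzier comparison to remove lightly edited forks.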
Each notebook is formatted in Markdown: we start with the notebook title, add the dataset description when available, and put the notebook (converted to a Python script) in a code block.
Below is an example of a Kaggle notebook:
````
# Iris Flower Dataset
### Context
The Iris flower data set is a multivariate data set introduced ... (truncated)
```python
import pandas as pd
df = pd.read_csv('iris-flower-dataset/IRIS.csv')
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
```
Examples from the dataset:
```
{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "Iris-setosa"
}
... (truncated)
```
Code:
```python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib.pyplot as plt
data = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
data.head()
X = data.drop("species", axis=1)
... (truncated)
````
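The layout shown in the example can be approximated with a small helper. This is a sketch under assumptions: `format_notebook` is a hypothetical function, and it covers only the title, the optional dataset description, and the final `Code:` block, leaving out the `df.info()` output and the sample rows.

```python
from typing import Optional

FENCE = "`" * 3  # built programmatically so the sketch contains no literal fence


def format_notebook(title: str, code: str, dataset_description: Optional[str] = None) -> str:
    """Assemble a notebook sample: title, optional description, then the code block."""
    parts = [f"# {title}"]
    if dataset_description:
        parts.append(dataset_description)
    parts.append("Code:")
    parts.append(f"{FENCE}python\n{code.rstrip()}\n{FENCE}")
    return "\n".join(parts)


print(format_notebook(
    "Iris Flower Dataset",
    "import pandas as pd\n",
    "The Iris flower data set ...",
))
```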
## Citation
```
@article{lozhkov2024starcoder,
title={Starcoder 2 and the stack v2: The next generation},
author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
journal={arXiv preprint arXiv:2402.19173},
year={2024}
}
```
Provider: maas
Created: 2025-09-08



