issues-kaggle-notebooks

Name: issues-kaggle-notebooks
Creator: maas
Published: 2025-12-05 16:49:57
License: not specified

ModelScope community; updated 2025-12-05; indexed 2025-09-27

Download link:

https://modelscope.cn/datasets/HuggingFaceTB/issues-kaggle-notebooks


Description:

# GitHub Issues & Kaggle Notebooks

## Description

GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the [StarCoder2](https://arxiv.org/abs/2402.19173) model training corpus, specifically the [bigcode/StarCoder2-Extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras) dataset. We reformat the samples to remove StarCoder2's special tokens, use natural text to delimit comments in issues, and display Kaggle notebooks as markdown and code blocks.

The dataset includes:

- 🐛 GitHub Issues – 11B tokens of discussions from GitHub issues sourced from [GH Archive](https://www.gharchive.org/).
- 📊 Kaggle Notebooks – 1.7B tokens of data analysis notebooks in markdown format, curated from Kaggle's [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset.

These datasets have undergone filtering to remove low-quality content, duplicates and PII. More details are in the StarCoder2 [paper](https://arxiv.org/abs/2402.19173).

## How to load the dataset

You can load a specific subset using the following code:

```python
from datasets import load_dataset

# Repository id as given in the download link for this dataset
issues = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "issues", split="train")  # GitHub Issues
kaggle_notebooks = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "kaggle", split="train")  # Kaggle Notebooks
```

## Dataset curation

These curation details are from the StarCoder2 pipeline. The original datasets can be found at https://huggingface.co/datasets/bigcode/starcoder2data-extras, and more details can be found in the StarCoder2 paper.

### 🐛 GitHub Issues

The GitHub Issues dataset consists of discussions from GitHub repositories, sourced from GHArchive. It contains issue reports, bug tracking, and technical Q&A discussions.
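
At roughly 11B tokens, the issues subset may be more convenient to stream than to download in full. The sketch below uses the `streaming=True` option of `load_dataset` and assumes the repository id from this page's download link; `stream_subset` and `peek` are hypothetical helper names, not part of the dataset's tooling.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Assumption: repository id taken from the download link on this page.
REPO_ID = "HuggingFaceTB/issues-kaggle-notebooks"


def stream_subset(config: str) -> Iterable[Dict[str, Any]]:
    """Return a lazily evaluated iterable over one subset ("issues" or "kaggle")."""
    # Imported here so that peek() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=True)


def peek(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Materialize only the first n samples of a (possibly very large) iterable."""
    return list(islice(samples, n))
```

For instance, `peek(stream_subset("issues"), 2)` would fetch two issue samples without downloading the whole subset; that call needs network access and the `datasets` library.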

To ensure high-quality data, the StarCoder2 processing pipeline included:

- Removing bot-generated comments and auto-replies from email responses.
- Filtering out short issues (<200 characters) and extremely long comments.
- Keeping only discussions with multiple users (or highly detailed single-user reports).
- Anonymizing usernames while preserving the conversation structure, and redacting names, emails, keys, passwords, and IP addresses using [StarPII](https://huggingface.co/bigcode/starpii).

We format the conversations using this template:

```
Title: [Issue title]
Question:
username_0: [Issue content]
Answers:
username_1: [Answer from user 1]
username_0: [Author reply]
username_2: [Answer from user 2]
...
Status: Issue closed (optional)
```

### 📊 Kaggle Notebooks

The Kaggle Notebooks are sourced from the [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, licensed under Apache 2.0. They were cleaned using a multi-step filtering process, which included:

- Removing notebooks with syntax errors or fewer than 100 characters.
- Extracting metadata for notebooks that reference Kaggle datasets. When possible, we retrieve the datasets referenced in the notebook and add information about them to the beginning of the notebook (description, `ds.info()` output and 4 examples).
- Filtering out duplicates, which reduced the dataset volume by 78%, and redacting PII.

Each notebook is formatted in Markdown: we start with the notebook title and the dataset description when available, then put the notebook (converted to a Python script) in a code block.

Below is an example of a Kaggle notebook:

````
# Iris Flower Dataset

### Context
The Iris flower data set is a multivariate data set introduced ... (truncated)

```python
import pandas as pd

df = pd.read_csv('iris-flower-dataset/IRIS.csv')
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
```

Examples from the dataset:
```
{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "Iris-setosa"
}
... (truncated)
```

Code:
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
import os

for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib.pyplot as plt

data = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
data.head()
X = data.drop("species", axis=1)
... (truncated)
````

## Citation

```
@article{lozhkov2024starcoder,
  title={Starcoder 2 and the stack v2: The next generation},
  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
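
The first Kaggle filter described in the Kaggle Notebooks section (dropping notebooks with syntax errors or fewer than 100 characters) can be approximated as follows. This is a minimal sketch, not the StarCoder2 implementation: it assumes the notebook has already been converted to a Python script and uses the standard-library `ast.parse` as the syntax check. The names `MIN_CHARS` and `keep_notebook` are hypothetical.

```python
import ast

# Threshold taken from the card: notebooks shorter than this are dropped.
MIN_CHARS = 100


def keep_notebook(script: str) -> bool:
    """Return True if the script is long enough and parses as Python."""
    if len(script) < MIN_CHARS:
        return False
    try:
        ast.parse(script)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True


# A script that parses and is longer than MIN_CHARS.
good = "import math\n" + "\n".join(f"x{i} = math.sqrt({i})" for i in range(10))
# The same script with a syntactically invalid loop appended.
broken = good + "\nfor i in range(3) print(i)"
```

Note that a plain `ast.parse` check also rejects IPython-specific syntax such as `!pip install ...`; whether the original pipeline treated such lines as syntax errors is not stated in the card.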

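
Samples in the issues subset follow the `Title:` / `Question:` / `Answers:` / `Status:` template described in the GitHub Issues section, so they can be split back into speaker turns. The sketch below is one way to do that, under the assumptions that every turn starts at the beginning of a line with `username_<n>: ` and that the section labels sit on their own lines. `ParsedIssue` and `parse_issue` are hypothetical names, and real samples may need more careful handling (for instance, comment text that itself contains one of these labels).

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: a turn begins at the start of a line with "username_<n>: ".
TURN_RE = re.compile(r"^username_(\d+): ?(.*)$")


@dataclass
class ParsedIssue:
    title: str = ""
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    closed: bool = False


def parse_issue(text: str) -> ParsedIssue:
    """Split one formatted issue into its title, speaker turns and status."""
    issue = ParsedIssue()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Title:") and not issue.title and not issue.turns:
            issue.title = stripped[len("Title:"):].strip()
        elif stripped in ("Question:", "Answers:"):
            continue  # section labels carry no content of their own
        elif stripped == "Status: Issue closed":
            issue.closed = True
        else:
            match = TURN_RE.match(line)
            if match:
                issue.turns.append((f"username_{match.group(1)}", match.group(2)))
            elif issue.turns:
                # Continuation line: append it to the current turn.
                speaker, body = issue.turns[-1]
                issue.turns[-1] = (speaker, f"{body}\n{line}")
    issue.turns = [(speaker, body.strip()) for speaker, body in issue.turns]
    return issue


sample = """Title: Crash on startup

Question:
username_0: The app crashes when I open it.
Here is the log.

Answers:
username_1: Which version are you on?
username_0: 1.2.0
Status: Issue closed
"""
parsed = parse_issue(sample)
```

Here `parsed.turns` holds three `(speaker, text)` pairs, which makes it straightforward to count participants or to convert a thread into a chat-style format.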

Provider:

maas

Created:

2025-09-08
