
issues-kaggle-notebooks

ModelScope Community · Updated 2025-12-05 · Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/HuggingFaceTB/issues-kaggle-notebooks
Description:
# GitHub Issues & Kaggle Notebooks

## Description

GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language model training, sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the [StarCoder2](https://arxiv.org/abs/2402.19173) model training corpus, specifically the [bigcode/StarCoder2-Extras](https://huggingface.co/datasets/bigcode/starcoder2data-extras) dataset. We reformat the samples to remove StarCoder2's special tokens, use natural text to delimit comments in issues, and display Kaggle notebooks as Markdown with code blocks.

The dataset includes:

- 🐛 GitHub Issues – 11B tokens of discussions from GitHub issues sourced from [GH Archive](https://www.gharchive.org/).
- 📊 Kaggle Notebooks – 1.7B tokens of data analysis notebooks in Markdown format, curated from Kaggle's [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset.

These datasets have undergone filtering to remove low-quality content, duplicates, and PII. More details are in the StarCoder2 [paper](https://arxiv.org/abs/2402.19173).

## How to load the dataset

You can load a specific subset using the following code:

```python
from datasets import load_dataset

# The repository ID matches the download link of this dataset page.
issues = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "issues", split="train")  # GitHub Issues
kaggle_notebooks = load_dataset("HuggingFaceTB/issues-kaggle-notebooks", "kaggle", split="train")  # Kaggle Notebooks
```

## Dataset curation

These curation details are from the StarCoder2 pipeline. The original datasets can be found at https://huggingface.co/datasets/bigcode/starcoder2data-extras, and more details can be found in the StarCoder2 paper.

### 🐛 GitHub Issues

The GitHub Issues dataset consists of discussions from GitHub repositories, sourced from GH Archive. It contains issue reports, bug tracking, and technical Q&A discussions.

To ensure high-quality data, the StarCoder2 processing pipeline included:

- Removing bot-generated comments and auto-replies from email responses.
- Filtering out short issues (<200 characters) and extremely long comments.
- Keeping only discussions with multiple users (or highly detailed single-user reports).
- Anonymizing usernames while preserving the conversation structure, and redacting names, emails, keys, passwords, and IP addresses using [StarPII](https://huggingface.co/bigcode/starpii).

We format the conversations using this template:

```
Title: [Issue title]

Question:
username_0: [Issue content]

Answers:
username_1: [Answer from user 1]
username_0: [Author reply]
username_2: [Answer from user 2]
...
Status: Issue closed (optional)
```

### 📊 Kaggle Notebooks

The Kaggle Notebooks are sourced from the [Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code) dataset, licensed under Apache 2.0. They were cleaned using a multi-step filtering process, which included:

- Removing notebooks with syntax errors or fewer than 100 characters.
- Extracting metadata for notebooks that reference Kaggle datasets. When possible, we retrieve the datasets referenced in the notebook and add information about them to the beginning of the notebook (description, `ds.info()` output, and 4 examples).
- Filtering out duplicates, which reduced the dataset volume by 78%, and redacting PII.

Each notebook is formatted in Markdown: it starts with the notebook title, followed by the dataset description when available, and then the notebook itself (converted to a Python script) in a code block.

Below is an example of a Kaggle notebook:

````
# Iris Flower Dataset

### Context
The Iris flower data set is a multivariate data set introduced ... (truncated)

```python
import pandas as pd

df = pd.read_csv('iris-flower-dataset/IRIS.csv')
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
```

Examples from the dataset:
```
{
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "Iris-setosa"
}
... (truncated)
```

Code:
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
import os

for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib.pyplot as plt

data = pd.read_csv("/kaggle/input/iris-flower-dataset/IRIS.csv")
data.head()
X = data.drop("species", axis=1)
... (truncated)
````

## Citation

```
@article{lozhkov2024starcoder,
  title={Starcoder 2 and the stack v2: The next generation},
  author={Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
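
## Usage sketch: parsing an issue sample

The issue template described under GitHub Issues is plain text, so a sample can be split back into a title, a list of comments, and a closed flag with a few string checks. The sketch below is a minimal illustration that assumes samples follow the template exactly as written in this card; the names `ParsedIssue` and `parse_issue` and the sample text are hypothetical, and the exact whitespace of real records has not been verified here.

```python
import re
from dataclasses import dataclass, field

# Assumption: every comment starts with "username_<n>: " as in the card's template.
_COMMENT_RE = re.compile(r"^(username_\d+): ?(.*)$")


@dataclass
class ParsedIssue:  # hypothetical container, not part of the dataset
    title: str = ""
    comments: list = field(default_factory=list)  # (username, text) pairs
    closed: bool = False


def parse_issue(text: str) -> ParsedIssue:
    """Heuristically split one formatted issue into title, comments, and status."""
    issue = ParsedIssue()
    for line in text.splitlines():
        if line.startswith("Title: ") and not issue.title:
            issue.title = line[len("Title: "):].strip()
        elif line.strip() in ("Question:", "Answers:"):
            continue  # section markers carry no content of their own
        elif line.strip() == "Status: Issue closed":
            issue.closed = True
        elif (m := _COMMENT_RE.match(line)):
            issue.comments.append((m.group(1), m.group(2)))
        elif issue.comments:
            # Any other line is treated as a continuation of the previous comment.
            user, body = issue.comments[-1]
            issue.comments[-1] = (user, body + "\n" + line)
    # Drop blank lines that were attached to the end of a comment.
    issue.comments = [(user, body.strip()) for user, body in issue.comments]
    return issue


# Made-up sample that follows the template; it is not taken from the dataset.
sample = """Title: Crash on startup

Question:
username_0: The app crashes when I run it.
Here is the traceback.

Answers:
username_1: Which version are you using?
username_0: Version 2.1.
Status: Issue closed"""

parsed = parse_issue(sample)
print(parsed.title, len(parsed.comments), parsed.closed)
```

Because comment bodies are free text, a body line that happens to equal `Question:` or `Answers:` would be skipped by this heuristic; a stricter parser would need to track which section it is in.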

Provider:
maas
Created:
2025-09-08