five

jupyter-agent-dataset

收藏
魔搭社区2025-12-05 更新2025-09-13 收录
下载链接:
https://modelscope.cn/datasets/data-agents/jupyter-agent-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
# Jupyter Agent Dataset ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/ZyF9foqe5SLECwkq0dOpT.png) ## Dataset Details ### Dataset Description The dataset uses real Kaggle notebooks processed through a multi-stage pipeline to de-duplicate, fetch referenced datasets, score educational quality, filter to data-analysis–relevant content, generate dataset-grounded question–answer (QA) pairs, and produce executable reasoning traces by running notebooks. The resulting examples include natural questions about a dataset/notebook, verified answers, and step-by-step execution traces suitable for agent training. You can load the dataset using the following code: ```python from datasets import load_dataset # To load the train split of a specific subset, such as non-thinking, you can do ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="non-thinking") # apply chat template tokenizer.apply_chat_template(ds[0]["text"]) ``` The dataset contains in total 51389 synthetic notebooks, which amounts to ~200M training tokens. The dataset is provided in two subsets - `thinking` and `non-thinking`, where the code generation thinking commentary is wrapped with or without thinkinng tags, depending on base model type. We provide both subsets for convenince and ability to use the dataset for fine-tuning out-of-the-box. - Created by: [Hugging Face Jupyter-Agent Team](https://huggingface.co/jupyter-agent) ([Baptiste Colle](https://huggingface.co/baptistecolle), [Hanna Yukhymenko](https://huggingface.co/hannayukhymenko), [Leandro von Werra](https://huggingface.co/lvwerra)) - Source Code: [github link](https://github.com/huggingface/jupyter-agent) - Blog: [blog link](https://huggingface.co/blog/jupyter-agent-2) - Demo: [Jupyter Agent 2 Demo](https://huggingface.co/spaces/lvwerra/jupyter-agent-2) - License: Apache-2.0 ### Updates **04/09/2025**: We have added original tool calls used in the notebook generation and renamed `text` column to `messages` to enable straightforward out-of-the-box training with [TRL](https://github.com/huggingface/trl)! Now you just need `messages` and `tools` columns in your training dataset which you can directly pass to [SFTTrainer](https://huggingface.co/docs/trl/en/dataset_formats#tool-calling). ## Uses Jupyter Agent Dataset allows users to train code agents that are able to: - Read notebook and dataset context - Execute Python code (e.g., pandas, numpy, matplotlib) to answer dataset-grounded questions - Produce step-by-step solutions with intermediate computations We trained [Qwen-3-4b-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and [Qwen-3-4b-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) on Jupyter Agent Dataset using [TRL](https://github.com/huggingface/trl) and evaluated the agent efficiency on DABstep benchmarks, which evaluates models on their ability to generate code which answers questions about provided datasets. The dataset helps both models to achieve significant gains **up to 20%** on the DABstep easy score: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/WAgyjhdh-ObZ_bmT-9R59.png) We also observed the ability of the dataset to enhance model's EDA and coding skills which improve the hard score: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/8FHBTNSpfbCtHY3Ti0G4e.png) ## Dataset Structure Each example contains the LLM-generated notebook and its respective QA pair, derived from the linked Kaggle notebook using real Kaggle datasets with the following keys: - `id`: Unique identifier for the notebook and question pair number. - `messages`: Synthetic notebook in ChatML format which enables out-of-the-box training. - `question`: Natural-language question grounded in the notebook/dataset. - `answer`: Verified final answer in short form. - `edu_score`: Educational quality score used for filtering (LLM-assigned). - `files_used`: Files used in the original referenced Kaggle notebook for which the analysis was done. - `packages_used`: Packages used in the original referenced Kaggle notebook whic were used for the analysis. - `kaggle_dataset_name`: Full Kaggle source dataset name, suited for Kaggle Hub download. - `executor_type`: Code executor, used for generating execution traces (either E2B or LLM/Qwen-Coder). - `original_notebook`: Original Kaggle source notebook, used for QA and code generation. - `tools`: Tool calls used for the notebook generation. Notes: - The dataset contains derived synthetic QA pairs and traces; it does not redistribute original Kaggle datasets or full notebook contents. ## Dataset Creation ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/Qbu-WR9wcbWVquy7bZlYg.png) ### Data Sourcing and Preparation 1. Large-scale deduplication of Kaggle notebooks — Derived from public Kaggle notebooks ([Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code)) and linked datasets metadata using [Datatrove](https://github.com/huggingface/datatrove/). 2. Downloading linked datasets — Auto-fetched via Kaggle metadata ([Meta Kaggle](https://www.kaggle.com/datasets/kaggle/meta-kaggle)); ensured notebooks are end-to-end runnable for trace execution and agent training. 3. Educational scoring — Used [Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B) scoring notebooks for their educational quality; selected high-quality sections (not whole notebooks) to avoid trivial/broken code - better notebook sources allowed us to yield better synthetic data. 4. Filtering irrelevant notebooks — Excluded LLM-training and non–data analysis notebooks; removed notebooks that didn’t use datasets via an LLM-assisted filter. You can use sourced Kaggle datasets directly with E2B code execution using the following code: ```python import kagglehub import e2b_code_interpreter as e2b from datasets import load_dataset # load the Jupyter Agent Dataset ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="thinking") # get the kaggle dataset name dataset_name = ds[0]["kaggle_dataset_name"] # load the dataset locally from Kaggle Hub path = kagglehub.dataset_download(dataset_name) print(path) # this is the folder path where the dataset is downloaded # initialize sandbox sandbox_init = e2b.Sandbox(timeout=240) # write used file to E2B sandbox file_name = ds[0]["files_used"][0] file_name = file_name.split('/')[-1] if '/' in file_name else file_name with open(f"{path}/{file_name}", "rb") as file: sandbox_init.files.write(f"/home/user/input/{file_name}", file) # execute code with E2B execution = sandbox_init.run_code("<some code>") ``` ### Synthetic Notebook Generation 1. QA generation — Produced dataset-grounded QA pairs from cleaned notebooks using a two-step process: (a) [Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B) generates question and candidate answer, (b) another LLM validates with notebook context to reduce hallucinations. 2. Traces generation — Used [Qwen-Coder-480B](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct) for code/thinking; executed with [E2B](https://e2b.dev/) when Kaggle datasets were locally available, otherwise simulated an LLM sandbox with Qwen-Coder. ### Summary - [Datatrove](https://github.com/huggingface/datatrove/) for large-scale processing of real Kaggle notebooks and their linked Kaggle datasets. - [Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B) for scoring and QA generation; [Qwen-Coder-480B](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct) for notebook and code execution traces generation. - [E2B](https://e2b.dev/) for secure, sandboxed execution with authetntic code execution traces. ### Recommendations Users should be made aware of the risks, biases and limitations of the dataset: - Licensing and terms: upstream Kaggle notebooks and datasets have their own licenses/ToS. This dataset provides derived artifacts and references; users are responsible for complying with Kaggle ToS and any upstream licenses when accessing original content. - Data quality: notebooks may contain errors, non-deterministic outputs, or environment-specific behavior. Traces may not be perfectly reproducible across environments. - LLM-generated artifacts: QA pairs and validations are machine-generated and may contain mistakes. Verify results before use in critical settings. - Bias: source notebooks and datasets may reflect author/domain biases; generated QAs may inherit those biases. - Safety: executable traces may include environment-specific code. Run code in secure E2B sandboxes and review before execution. ## Additional Information ### Dataset Creators 1. Baptiste Colle, Hugging Face, baptiste.colle@huggingface.co 2. Hanna Yukhymenko, Hugging Face, hanna.yukhymenko@huggingface.co 3. Leandro von Werra, Hugging Face, leandro@huggingface.co ### Licensing Information This dataset is released under the Apache License 2.0. - SPDX identifier: Apache-2.0 - License text: https://www.apache.org/licenses/LICENSE-2.0 Note: While this dataset is Apache-2.0 licensed, any use of referenced Kaggle notebooks or datasets must comply with Kaggle’s Terms of Service and the original authors’ licenses. This dataset aims to include only derived artifacts (e.g., QA pairs, execution traces, metadata references) and not redistribute upstream data. ### Citation Information ``` @misc{jupyteragentdataset, title={Jupyter Agent Dataset}, author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra}, year={2025} } ```

# Jupyter智能体数据集 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/ZyF9foqe5SLECwkq0dOpT.png) ## 数据集详情 ### 数据集描述 本数据集基于真实的Kaggle笔记本,通过多阶段流水线处理:先进行去重、获取关联数据集、评估教育质量、筛选与数据分析相关的内容,随后生成基于数据集的问答(QA)对,并通过运行笔记本生成可执行的推理轨迹。最终的示例包含针对数据集/笔记本的自然语言问题、经过验证的答案,以及适用于智能体训练的分步执行轨迹。 你可通过如下代码加载本数据集: python from datasets import load_dataset # 若要加载特定子集的训练拆分(例如non-thinking子集),可执行如下代码: ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="non-thinking") # 应用对话模板 tokenizer.apply_chat_template(ds[0]["text"]) 本数据集共包含51389个合成笔记本,总训练Token数约2亿。数据集提供`thinking`与`non-thinking`两个子集:根据基座模型类型的不同,代码生成的思考注释会被包裹或不包裹思考标签。我们提供这两个子集以方便用户直接开箱即用地进行微调。 - 制作方:[Hugging Face Jupyter-Agent团队](https://huggingface.co/jupyter-agent)([Baptiste Colle](https://huggingface.co/baptistecolle)、[Hanna Yukhymenko](https://huggingface.co/hannayukhymenko)、[Leandro von Werra](https://huggingface.co/lvwerra)) - 源代码:[GitHub链接](https://github.com/huggingface/jupyter-agent) - 博客:[博客链接](https://huggingface.co/blog/jupyter-agent-2) - 演示:[Jupyter Agent 2演示](https://huggingface.co/spaces/lvwerra/jupyter-agent-2) - 许可证:Apache-2.0 ### 更新记录 **2025年4月9日**:我们新增了笔记本生成过程中使用的原始工具调用记录,并将`text`列重命名为`messages`,以支持直接使用[TRL](https://github.com/huggingface/trl)开箱即用地进行训练!现在你的训练数据集仅需包含`messages`与`tools`列,即可直接传递给[SFTTrainer](https://huggingface.co/docs/trl/en/dataset_formats#tool-calling)。 ## 数据集用途 Jupyter智能体数据集可用于训练具备以下能力的代码智能体: - 读取笔记本与数据集上下文 - 执行Python代码(如pandas、numpy、matplotlib)以回答基于数据集的问题 - 生成包含中间计算的分步解决方案 我们使用[TRL](https://github.com/huggingface/trl)在Jupyter智能体数据集上训练了[Qwen-3-4b-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)与[Qwen-3-4b-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507),并在DABstep基准测试中评估了智能体的效率——该基准测试用于评估模型生成代码以回答给定数据集相关问题的能力。 该数据集可帮助这两个模型在DABstep简易得分上实现最高达20%的显著提升: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/WAgyjhdh-ObZ_bmT-9R59.png) 我们还观察到,该数据集能够提升模型的探索性数据分析(EDA,Exploratory Data Analysis)与编码能力,进而改善困难得分: ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/8FHBTNSpfbCtHY3Ti0G4e.png) ## 数据集结构 每个示例均包含由大语言模型(Large Language Model,LLM)生成的笔记本,以及源自关联Kaggle笔记本、基于真实Kaggle数据集构建的对应问答对,其包含以下字段: - `id`:笔记本与问答对的唯一标识符。 - `messages`:采用ChatML格式的合成笔记本,支持开箱即用地训练。 - `question`:基于笔记本/数据集的自然语言问题。 - `answer`:经过验证的简短最终答案。 - `edu_score`:用于筛选的教育质量评分(由大语言模型标注)。 - `files_used`:原始关联Kaggle笔记本中用于数据分析的文件。 - `packages_used`:原始关联Kaggle笔记本中用于数据分析的依赖包。 - `kaggle_dataset_name`:完整的Kaggle源数据集名称,可用于从Kaggle Hub下载。 - `executor_type`:代码执行器类型,用于生成执行轨迹(可选E2B或大语言模型/Qwen-Coder)。 - `original_notebook`:原始Kaggle源笔记本,用于问答生成与代码生成。 - `tools`:笔记本生成过程中使用的工具调用记录。 注意事项: - 本数据集仅包含衍生的合成问答对与轨迹,不会重新分发原始Kaggle数据集或完整的笔记本内容。 ## 数据集构建 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/Qbu-WR9wcbWVquy7bZlYg.png) ### 数据获取与预处理 1. 大规模Kaggle笔记本去重——基于公开Kaggle笔记本([Meta Kaggle Code](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code))与关联数据集元数据,使用[Datatrove](https://github.com/huggingface/datatrove/)完成处理。 2. 下载关联数据集——通过Kaggle元数据([Meta Kaggle](https://www.kaggle.com/datasets/kaggle/meta-kaggle))自动获取;确保笔记本可端到端运行,以支持轨迹生成与智能体训练。 3. 教育质量评分——使用[Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B)对笔记本的教育质量进行评分;仅选取高质量的章节(而非整个笔记本)以避免无关或损坏的代码——更优质的笔记本源可产出更优质的合成数据。 4. 筛选无关笔记本——排除用于大语言模型训练的笔记本与非数据分析类笔记本;通过大语言模型辅助过滤器移除未使用数据集的笔记本。 你可通过如下代码直接使用获取的Kaggle数据集结合E2B代码执行器: python import kagglehub import e2b_code_interpreter as e2b from datasets import load_dataset # 加载Jupyter智能体数据集 ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="thinking") # 获取Kaggle数据集名称 dataset_name = ds[0]["kaggle_dataset_name"] # 从Kaggle Hub本地下载数据集 path = kagglehub.dataset_download(dataset_name) print(path) # 输出数据集下载所在的文件夹路径 # 初始化沙箱 sandbox_init = e2b.Sandbox(timeout=240) # 将使用的文件写入E2B沙箱 file_name = ds[0]["files_used"][0] file_name = file_name.split('/')[-1] if '/' in file_name else file_name with open(f"{path}/{file_name}", "rb") as file: sandbox_init.files.write(f"/home/user/input/{file_name}", file) # 使用E2B执行代码 execution = sandbox_init.run_code("<some code>") ### 合成笔记本生成 1. 问答对生成——从预处理后的笔记本中生成基于数据集的问答对,采用两步流程:(a) [Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B)生成问题与候选答案;(b) 另一台大语言模型基于笔记本上下文进行验证,以减少幻觉问题。 2. 轨迹生成——使用[Qwen-Coder-480B](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)生成代码与思考内容;当Kaggle数据集可本地获取时,通过[E2B](https://e2b.dev/)执行代码,否则使用Qwen-Coder模拟大语言模型沙箱。 ### 总结 - 使用[Datatrove](https://github.com/huggingface/datatrove/)对真实Kaggle笔记本及其关联的Kaggle数据集进行大规模处理。 - 使用[Qwen-32B](https://huggingface.co/Qwen/Qwen3-32B)进行教育质量评分与问答对生成;使用[Qwen-Coder-480B](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)生成笔记本与代码执行轨迹。 - 使用[E2B](https://e2b.dev/)实现安全的沙箱执行,生成真实的代码执行轨迹。 ### 使用建议 用户需知晓本数据集的风险、偏差与局限性: - 许可与条款:上游Kaggle笔记本与数据集拥有各自的许可条款与服务协议。本数据集仅提供衍生制品与引用信息;用户在访问原始内容时,需自行遵守Kaggle服务协议与所有上游许可条款。 - 数据质量:笔记本可能包含错误、非确定性输出或特定环境下的行为。执行轨迹在不同环境中可能无法完全复现。 - 大语言模型生成的制品:问答对与验证结果均为机器生成,可能存在错误。在关键场景中使用前,请验证结果。 - 偏差:源笔记本与数据集可能反映作者或领域的固有偏差;生成的问答对可能继承此类偏差。 - 安全性:可执行轨迹可能包含特定环境的代码。请在安全的E2B沙箱中运行代码,并在执行前进行审查。 ## 附加信息 ### 数据集创作者 1. Baptiste Colle, Hugging Face, baptiste.colle@huggingface.co 2. Hanna Yukhymenko, Hugging Face, hanna.yukhymenko@huggingface.co 3. Leandro von Werra, Hugging Face, leandro@huggingface.co ### 许可信息 本数据集采用Apache License 2.0协议发布。 - SPDX标识符:Apache-2.0 - 许可文本:https://www.apache.org/licenses/LICENSE-2.0 注:尽管本数据集采用Apache-2.0许可,引用的Kaggle笔记本或数据集的使用仍需遵守Kaggle服务协议与原作者的许可条款。本数据集仅包含衍生制品(如问答对、执行轨迹、元数据引用),不会重新分发上游数据。 ### 引用信息 @misc{jupyteragentdataset, title={Jupyter Agent Dataset}, author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra}, year={2025} }
提供机构:
maas
创建时间:
2025-09-04
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作