jupyter-agent/jupyter-agent-dataset
Source: Hugging Face. Updated 2025-09-10. Indexed 2025-09-13.
Download link:
https://hf-mirror.com/datasets/jupyter-agent/jupyter-agent-dataset
Description:
The Jupyter Agent Dataset is a machine-generated dataset built from real Kaggle notebooks. A multi-stage pipeline de-duplicates the notebooks, fetches the datasets they reference, scores educational quality, filters to data-analysis-relevant content, generates dataset-grounded question-answer (QA) pairs, and produces executable reasoning traces by running the notebooks. Each example includes a natural-language question about a dataset or notebook, a verified answer, and a step-by-step execution trace suitable for agent training. The dataset contains 51,389 synthetic notebooks in total, or roughly 200M training tokens. It is provided in two subsets, `thinking` and `non-thinking`, in which the thinking commentary that accompanies code generation is either wrapped in thinking tags or left unwrapped, depending on the base model type. The dataset is released under the Apache-2.0 license.
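As a rough illustration of working with the two subsets, the Python sketch below loads one of them with the Hugging Face `datasets` library and strips thinking commentary from a piece of text. Several details are assumptions that the description above does not confirm: that the subsets are exposed as splits with underscores in their names (`thinking`, `non_thinking`), that the thinking tags are `<think>...</think>`, and the helper names `split_name`, `load_subset`, and `strip_thinking`, which are hypothetical. Check the dataset card before relying on any of them.

```python
import re

REPO_ID = "jupyter-agent/jupyter-agent-dataset"
SUBSETS = ("thinking", "non-thinking")  # subset names as given in the description

# Assumption: thinking commentary is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def split_name(subset: str) -> str:
    """Map a subset name to a split name (assumed: hyphens become underscores)."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    return subset.replace("-", "_")


def load_subset(subset: str, streaming: bool = True):
    """Load one subset; streaming avoids downloading everything up front."""
    name = split_name(subset)
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the subsets are exposed as splits rather than configurations.
    return load_dataset(REPO_ID, split=name, streaming=streaming)


def strip_thinking(text: str) -> str:
    """Remove every <think>...</think> block and the whitespace that follows it."""
    return THINK_BLOCK.sub("", text)
```

Calling `load_subset("non-thinking")` would return an iterable dataset if the assumptions hold. `strip_thinking` shows one way to turn thinking-style text into its non-thinking form for a base model that does not use such tags.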
Provider:
jupyter-agent



