JetBrains-Research/jupyter-errors-dataset
Source: Hugging Face. Updated 2024-03-19. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/JetBrains-Research/jupyter-errors-dataset
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: repo_name
    dtype: string
  - name: repo_owner
    dtype: string
  - name: file_link
    dtype: string
  - name: line_link
    dtype: string
  - name: path
    dtype: string
  - name: content_sha
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: test
    num_bytes: 32708409
    num_examples: 50
  - name: train
    num_bytes: 8081954107
    num_examples: 10000
  download_size: 5914651135
  dataset_size: 8114662516
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
tags:
- jupyter notebook
size_categories:
- 1K<n<10K
---
# Dataset Summary
This dataset contains `10000` Jupyter notebooks,
each of which contains at least one error. In addition to the notebook content,
the dataset provides information about the repository where each notebook is stored.
This information can help restore the notebook's environment if needed.
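The fields carried by each row are listed in the card metadata: `id`, `repo_name`, `repo_owner`, `file_link`, `line_link`, `path`, `content_sha`, and `content`. As a minimal sketch, the row shape can be written down as a mapping from field name to Python type, with a small checker for rows pulled from the dataset. The names `EXPECTED_FIELDS` and `check_row` are illustrative and are not part of the dataset or of the `datasets` library.

```python
from typing import Any, Dict, List, Mapping

# Feature names and Python types, mirroring the dataset card
# (int64 is mapped to int, string to str). Hypothetical helper table.
EXPECTED_FIELDS: Dict[str, type] = {
    "id": int,
    "repo_name": str,
    "repo_owner": str,
    "file_link": str,
    "line_link": str,
    "path": str,
    "content_sha": str,
    "content": str,  # raw notebook text
}


def check_row(row: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a row; an empty list means it looks fine."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            actual = type(row[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```

Calling `check_row(row)` on a loaded row should return an empty list if the row matches the declared features.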
# Getting Started
This dataset is organized so that it can be loaded natively with the Hugging Face `datasets` library. We recommend streaming because the files are large.
```Python
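import nbformat
from datasets import load_dataset

# Stream the test split instead of downloading the whole dataset up front.
dataset = load_dataset(
    "JetBrains-Research/jupyter-errors-dataset", split="test", streaming=True
)

# Take the first row and parse its notebook text without version conversion.
row = next(iter(dataset))
notebook = nbformat.reads(row["content"], as_version=nbformat.NO_CONVERT)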
```
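Since every notebook contains at least one error, a natural next step is to locate the failing cells. The sketch below assumes the notebook follows the nbformat v4 layout, where the top-level `cells` list holds code cells whose `outputs` entries mark errors with `output_type` set to `"error"` and carry `ename` and `evalue` fields. Older notebook versions store cells differently and are not handled here. It works directly on the JSON text in `row["content"]` using only the standard library; `find_errors` is an illustrative helper name.

```python
import json
from typing import List, Tuple


def find_errors(notebook_json: str) -> List[Tuple[int, str, str]]:
    """Return (cell_index, ename, evalue) for every error output found.

    Assumes the nbformat v4 layout (a top-level "cells" list).
    """
    notebook = json.loads(notebook_json)
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        # Only code cells can carry execution outputs.
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    (index, output.get("ename", ""), output.get("evalue", ""))
                )
    return errors
```

With a row loaded as in the snippet above, `find_errors(row["content"])` would list the error outputs by cell index.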
# Citation
```
@misc{JupyterErrorsDataset,
title = {Dataset of Errors in Jupyter Notebooks},
author = {Konstantin Grotov and Sergey Titov and Yaroslav Zharov and Timofey Bryksin},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/datasets/JetBrains-Research/jupyter-errors-dataset}},
}
```
Provided by:
JetBrains-Research
Summary of original information
Dataset overview
This dataset contains 10000 Jupyter notebooks, each with at least one error. In addition to the notebook content, it provides information about the repository where each notebook is stored, which can help restore the environment if needed.
Dataset information
Features
- id: int64
- repo_name: string
- repo_owner: string
- file_link: string
- line_link: string
- path: string
- content_sha: string
- content: string
Splits
- test: 50 examples, 32708409 bytes
- train: 10000 examples, 8081954107 bytes
Data size
- Download size: 5914651135 bytes
- Dataset size: 8114662516 bytes
Configs
- default:
  - test: path data/test-*
  - train: path data/train-*
Tags
- jupyter notebook
Size category
1K<n<10K
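
The byte and example counts reported in the metadata can be cross-checked with a little arithmetic, which also shows why streaming is recommended. The sketch below copies the figures from the card, confirms that the two split sizes add up to the reported dataset size, and prints the average notebook size per split. The names `SPLITS`, `DATASET_SIZE`, and `average_bytes_per_example` are illustrative.

```python
# Figures copied from the dataset card metadata.
SPLITS = {
    "test": {"num_bytes": 32_708_409, "num_examples": 50},
    "train": {"num_bytes": 8_081_954_107, "num_examples": 10_000},
}
DATASET_SIZE = 8_114_662_516


def average_bytes_per_example(split: str) -> float:
    """Average size in bytes of one notebook in the given split."""
    info = SPLITS[split]
    return info["num_bytes"] / info["num_examples"]


# The per-split byte counts sum to the reported total dataset size.
total_bytes = sum(info["num_bytes"] for info in SPLITS.values())
assert total_bytes == DATASET_SIZE

for name in SPLITS:
    megabytes = average_bytes_per_example(name) / 1e6
    print(f"{name}: ~{megabytes:.2f} MB per notebook on average")
```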



