JetBrains-Research/jupyter-errors-dataset

Name: JetBrains-Research/jupyter-errors-dataset
Creator: JetBrains-Research
Published: 2024-03-19 10:47:26
License: apache-2.0

Source: Hugging Face. Updated 2024-03-19. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/JetBrains-Research/jupyter-errors-dataset

Resource overview:

---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: repo_name
    dtype: string
  - name: repo_owner
    dtype: string
  - name: file_link
    dtype: string
  - name: line_link
    dtype: string
  - name: path
    dtype: string
  - name: content_sha
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: test
    num_bytes: 32708409
    num_examples: 50
  - name: train
    num_bytes: 8081954107
    num_examples: 10000
  download_size: 5914651135
  dataset_size: 8114662516
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
tags:
- jupyter notebook
size_categories:
- 1K<n<10K
---

# Dataset Summary

The presented dataset contains `10000` Jupyter notebooks, each of which contains at least one error. In addition to the notebook content, the dataset also provides information about the repository where the notebook is stored. This information can help restore the environment if needed.

# Getting Started

This dataset is organized such that it can be naively loaded via the Hugging Face datasets library. We recommend using streaming due to the large size of the files.

```Python
import nbformat
from datasets import load_dataset

dataset = load_dataset(
    "JetBrains-Research/jupyter-errors-dataset", split="test", streaming=True
)
row = next(iter(dataset))
notebook = nbformat.reads(row["content"], as_version=nbformat.NO_CONVERT)
```

# Citation

```
@misc{JupyterErrorsDataset,
  title = {Dataset of Errors in Jupyter Notebooks},
  author = {Konstantin Grotov and Sergey Titov and Yaroslav Zharov and Timofey Bryksin},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/JetBrains-Research/jupyter-errors-dataset}},
}
```

Provided by:

JetBrains-Research

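Summary of original information

Dataset overview

The dataset contains 10000 Jupyter notebooks, each of which contains at least one error. In addition to the notebook content, the dataset provides information about the repository where each notebook is stored, which can help restore the environment if needed.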
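To illustrate what "contains at least one error" could look like in practice, here is a minimal sketch that scans a notebook's JSON for error outputs. It assumes the errors are stored as standard nbformat v4 outputs with `output_type` equal to `"error"`; the dataset card does not confirm this. The `find_errors` helper and the tiny synthetic notebook are hypothetical and only stand in for a real `content` value.

```python
import json


def find_errors(notebook_json: str) -> list[dict]:
    """Collect error outputs from a notebook given as a JSON string.

    Assumption: errors appear as nbformat v4 outputs with
    output_type == "error" (fields: ename, evalue, traceback).
    """
    notebook = json.loads(notebook_json)
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # only code cells carry outputs
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    {
                        "cell_index": index,
                        "ename": output.get("ename"),
                        "evalue": output.get("evalue"),
                    }
                )
    return errors


# Synthetic stand-in for row["content"]; not taken from the dataset.
sample_content = json.dumps(
    {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": "1 / 0",
                "outputs": [
                    {
                        "output_type": "error",
                        "ename": "ZeroDivisionError",
                        "evalue": "division by zero",
                        "traceback": ["ZeroDivisionError: division by zero"],
                    }
                ],
            },
        ],
    }
)

errors = find_errors(sample_content)
print(errors)
```

With a real row, the same function could be called on `row["content"]` after loading the dataset in streaming mode.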

Dataset information

Features

id: data type int64
repo_name: data type string
repo_owner: data type string
file_link: data type string
line_link: data type string
path: data type string
content_sha: data type string
content: data type string
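The feature list above can be written down as a typed record, which is handy for sanity-checking rows while streaming. The sketch below uses only the standard library; the field names and types come from the list, while `NotebookRow`, `validate_row`, and every value in the example row are hypothetical placeholders.

```python
from typing import TypedDict


class NotebookRow(TypedDict):
    """One dataset row, mirroring the listed features (int64 -> int, string -> str)."""

    id: int
    repo_name: str
    repo_owner: str
    file_link: str
    line_link: str
    path: str
    content_sha: str
    content: str


# Expected Python type per field, taken from the annotations above.
EXPECTED_TYPES = dict(NotebookRow.__annotations__)


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(row[name]).__name__}"
            )
    return problems


# Placeholder values for illustration only; not real dataset content.
example_row = {
    "id": 0,
    "repo_name": "example-repo",
    "repo_owner": "example-owner",
    "file_link": "https://example.invalid/file",
    "line_link": "https://example.invalid/file#L1",
    "path": "notebooks/example.ipynb",
    "content_sha": "0" * 40,
    "content": "{}",
}

print(validate_row(example_row))
```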

Data splits

test: 50 examples, 32708409 bytes in total
train: 10000 examples, 8081954107 bytes in total

Data size

Download size: 5914651135 bytes
Dataset size: 8114662516 bytes
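The size figures are easier to read after a little arithmetic. The short sketch below, using only the numbers listed on this page, checks that the two split sizes add up to the reported dataset size and derives the download-to-dataset ratio and the average bytes per notebook. The variable names are illustrative.

```python
# Figures copied from the split and size listings on this page.
SPLITS = {
    "test": {"num_bytes": 32_708_409, "num_examples": 50},
    "train": {"num_bytes": 8_081_954_107, "num_examples": 10_000},
}
DOWNLOAD_SIZE = 5_914_651_135
DATASET_SIZE = 8_114_662_516

GIB = 1024**3

# The split sizes should sum to the reported dataset size.
total_bytes = sum(split["num_bytes"] for split in SPLITS.values())
sizes_consistent = total_bytes == DATASET_SIZE

# Ratio of the download size to the in-memory dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# Average notebook size per split, in bytes.
average_bytes = {
    name: split["num_bytes"] / split["num_examples"]
    for name, split in SPLITS.items()
}

print(f"sizes consistent: {sizes_consistent}")
print(f"dataset size: {DATASET_SIZE / GIB:.2f} GiB")
print(f"download size: {DOWNLOAD_SIZE / GIB:.2f} GiB")
print(f"download / dataset ratio: {download_ratio:.3f}")
for name, value in average_bytes.items():
    print(f"average bytes per notebook ({name}): {value:.0f}")
```

The download is roughly 73% of the dataset size, and an average notebook is several hundred kilobytes, which is consistent with the card's recommendation to use streaming.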

Configuration

default:
- test: path data/test-*
- train: path data/train-*

Size category

1K<n<10K
