JetBrains-Research/jupyter-errors-dataset

Source: Hugging Face. Updated: 2024-03-19. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/JetBrains-Research/jupyter-errors-dataset
Description:
---
license: apache-2.0
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: repo_name
    dtype: string
  - name: repo_owner
    dtype: string
  - name: file_link
    dtype: string
  - name: line_link
    dtype: string
  - name: path
    dtype: string
  - name: content_sha
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: test
    num_bytes: 32708409
    num_examples: 50
  - name: train
    num_bytes: 8081954107
    num_examples: 10000
  download_size: 5914651135
  dataset_size: 8114662516
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
tags:
- jupyter notebook
size_categories:
- 1K<n<10K
---

# Dataset Summary

The presented dataset contains `10000` Jupyter notebooks, each of which contains at least one error. In addition to the notebook content, the dataset also provides information about the repository where the notebook is stored. This information can help restore the environment if needed.

# Getting Started

This dataset is organized so that it can be loaded natively via the Hugging Face datasets library. We recommend using streaming due to the large size of the files.

```Python
import nbformat
from datasets import load_dataset

# Stream the split instead of downloading all files up front.
dataset = load_dataset(
    "JetBrains-Research/jupyter-errors-dataset", split="test", streaming=True
)

# Take the first row and parse its raw notebook JSON without version conversion.
row = next(iter(dataset))
notebook = nbformat.reads(row["content"], as_version=nbformat.NO_CONVERT)
```

# Citation

```
@misc{JupyterErrorsDataset,
  title = {Dataset of Errors in Jupyter Notebooks},
  author = {Konstantin Grotov and Sergey Titov and Yaroslav Zharov and Timofey Bryksin},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/datasets/JetBrains-Research/jupyter-errors-dataset}},
}
```
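Since every notebook is said to contain at least one error, a natural next step after loading a row is to locate the failing cells. The following is a minimal sketch, not part of the dataset card: it assumes the `content` string follows the nbformat 4 layout (top-level `cells`, code cells with `outputs`, error outputs marked `output_type == "error"` with `ename` and `evalue`). The helper name `extract_errors` and the sample notebook are hypothetical and used only for illustration; older notebook versions store errors differently and would need separate handling.

```python
import json


def extract_errors(notebook_json: str) -> list[dict]:
    """Collect error outputs from a notebook given as a JSON string.

    Assumes the nbformat 4 layout; this is an illustrative helper,
    not an API provided by the dataset or by nbformat.
    """
    notebook = json.loads(notebook_json)
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        # Only code cells can carry execution outputs.
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    {
                        "cell_index": index,
                        "ename": output.get("ename"),
                        "evalue": output.get("evalue"),
                    }
                )
    return errors


# Hand-written sample notebook (hypothetical, not taken from the dataset).
sample_notebook = json.dumps(
    {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": "# Demo"},
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": 1,
                "source": "1 / 0",
                "outputs": [
                    {
                        "output_type": "error",
                        "ename": "ZeroDivisionError",
                        "evalue": "division by zero",
                        "traceback": [],
                    }
                ],
            },
        ],
    }
)

print(extract_errors(sample_notebook))
```

With a real row, the same helper could be called as `extract_errors(row["content"])`, provided the notebook matches the assumed layout.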
Provided by:
JetBrains-Research
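Summary of original information

Dataset overview

The dataset contains 10000 Jupyter notebooks, each of which contains at least one error. In addition to the notebook content, the dataset provides information about the repository where the notebook is stored, which can help restore the environment if needed.

Dataset information

Features

  • id: data type int64
  • repo_name: data type string
  • repo_owner: data type string
  • file_link: data type string
  • line_link: data type string
  • path: data type string
  • content_sha: data type string
  • content: data type string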
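The feature list above can be mirrored as a typed record, which makes it easy to sanity-check rows before processing them. This is a rough sketch under assumptions: `NotebookRow`, `EXPECTED_TYPES`, and `check_row` are hypothetical names, the `int64` and `string` dtypes are assumed to arrive as Python `int` and `str`, and the sample values are placeholders rather than real dataset entries.

```python
from typing import TypedDict


class NotebookRow(TypedDict):
    """Typed view of one row, following the listed features."""

    id: int
    repo_name: str
    repo_owner: str
    file_link: str
    line_link: str
    path: str
    content_sha: str
    content: str


# Python types assumed to correspond to the listed dtypes.
EXPECTED_TYPES = {
    "id": int,
    "repo_name": str,
    "repo_owner": str,
    "file_link": str,
    "line_link": str,
    "path": str,
    "content_sha": str,
    "content": str,
}


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; empty if the row looks fine."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            actual = type(row[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


# Placeholder row for illustration only (values are made up).
sample_row: NotebookRow = {
    "id": 0,
    "repo_name": "example-repo",
    "repo_owner": "example-owner",
    "file_link": "https://example.com/file",
    "line_link": "https://example.com/file#L1",
    "path": "notebooks/example.ipynb",
    "content_sha": "0" * 40,
    "content": "{}",
}

print(check_row(sample_row))
```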

Data splits

  • test: 50 examples, 32708409 bytes in total
  • train: 10000 examples, 8081954107 bytes in total

Data size

  • Download size: 5914651135 bytes
  • Dataset size: 8114662516 bytes
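The size figures listed above can be cross-checked with a little arithmetic, which also shows why streaming is recommended: the average notebook is several hundred kilobytes. The sketch below uses only the numbers stated in the metadata; the variable names are illustrative, and "GB" here means 10^9 bytes.

```python
# Figures copied from the dataset metadata.
TEST_BYTES = 32_708_409
TEST_EXAMPLES = 50
TRAIN_BYTES = 8_081_954_107
TRAIN_EXAMPLES = 10_000
DATASET_SIZE = 8_114_662_516
DOWNLOAD_SIZE = 5_914_651_135

# The two splits should add up to the reported dataset size.
splits_total = TEST_BYTES + TRAIN_BYTES
sizes_consistent = splits_total == DATASET_SIZE

# Average notebook size per split, in bytes.
avg_test_bytes = TEST_BYTES / TEST_EXAMPLES
avg_train_bytes = TRAIN_BYTES / TRAIN_EXAMPLES

# Ratio of the download size to the in-memory dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"splits sum matches dataset size: {sizes_consistent}")
print(f"average test notebook:  {avg_test_bytes / 1e6:.2f} MB")
print(f"average train notebook: {avg_train_bytes / 1e6:.2f} MB")
print(f"dataset size:  {DATASET_SIZE / 1e9:.2f} GB")
print(f"download size: {DOWNLOAD_SIZE / 1e9:.2f} GB ({download_ratio:.1%} of dataset size)")
```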

Configuration

  • default:
    • test: path data/test-*
    • train: path data/train-*

Tags

  • jupyter notebook

Size category

  • 1K<n<10K