codeparrot/github-jupyter-text-code-pairs

Name: codeparrot/github-jupyter-text-code-pairs
Creator: codeparrot
Published: 2022-10-25 09:30:34
License: No description available

Hugging Face · Updated 2022-10-25 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/codeparrot/github-jupyter-text-code-pairs

Resource overview:

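This is a parsed version of the [github-jupyter-parsed] dataset, containing pairs of Markdown text and code. The dataset has been deduplicated and contains 451662 examples. The card also mentions a similar dataset, CoNaLa, which contains text and Python code from StackOverflow, with some samples curated by annotators.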

---
annotations_creators: []
language:
- code
license:
- other
multilinguality:
- monolingual
size_categories:
- unknown
task_categories:
- text-generation
task_ids:
- language-modeling
pretty_name: github-jupyter-text-code-pairs
---

This dataset is a parsed version of [github-jupyter-parsed](https://huggingface.co/datasets/codeparrot/github-jupyter-parsed), containing paired Markdown and code. The preprocessing script is provided in [preprocessing.py](https://huggingface.co/datasets/codeparrot/github-jupyter-parsed-v2/blob/main/preprocessing.py). The dataset has been deduplicated and contains 451662 examples. For a similar dataset of text paired with Python code, see the [CoNaLa](https://huggingface.co/datasets/neulab/conala) benchmark, which is sourced from StackOverflow and includes some samples curated by annotators.
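
To make the idea of deduplicated Markdown and code pairs concrete, here is a minimal sketch of how such pairs could be derived from a notebook's cells. It is not the linked preprocessing.py script. The pairing rule (a Markdown cell followed directly by a code cell), the exact-match deduplication, the helper names `extract_pairs` and `deduplicate`, and the `markdown`/`code` field names are all assumptions made for illustration.

```python
import hashlib


def extract_pairs(cells):
    """Pair each Markdown cell with the code cell that directly follows it.

    Assumption: this pairing rule is illustrative only and may differ from
    the dataset's actual preprocessing.
    """
    pairs = []
    for prev, cur in zip(cells, cells[1:]):
        if prev["cell_type"] == "markdown" and cur["cell_type"] == "code":
            pairs.append({"markdown": prev["source"], "code": cur["source"]})
    return pairs


def deduplicate(pairs):
    """Drop exact duplicate pairs, keeping the first occurrence.

    Assumption: exact-match hashing stands in for whatever deduplication
    the dataset authors actually applied.
    """
    seen = set()
    unique = []
    for pair in pairs:
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        raw = pair["markdown"] + "\x00" + pair["code"]
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Hypothetical notebook cells, shaped like the "cells" list of an .ipynb file.
notebook_cells = [
    {"cell_type": "markdown", "source": "Load the data"},
    {"cell_type": "code", "source": "import pandas as pd"},
    {"cell_type": "code", "source": "df = pd.read_csv('data.csv')"},
    {"cell_type": "markdown", "source": "Load the data"},
    {"cell_type": "code", "source": "import pandas as pd"},
]

pairs = deduplicate(extract_pairs(notebook_cells))
print(pairs)
```

In this sketch the second code cell is skipped because no Markdown cell precedes it directly, and the repeated pair at the end is dropped by the deduplication step.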

Provider:

codeparrot

Summary of original information