23ws-LLMcoder/LLMcoder-GitHub-Python-Mix-Direct

Name: 23ws-LLMcoder/LLMcoder-GitHub-Python-Mix-Direct
Creator: 23ws-LLMcoder
Published: 2023-11-15 14:56:40
License: no description available

Source: Hugging Face. Updated 2023-11-15; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/23ws-LLMcoder/LLMcoder-GitHub-Python-Mix-Direct

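Summary:

The LLMcoder-GitHub-Python-Mix-Direct dataset is a fine-tuning dataset for OpenAI's GPT-3.5-Turbo model, intended to provide autocomplete suggestions for Python code. It contains 100 target conversations, each consisting of a system prompt, a user input, and an assistant response. It was built by randomly sampling Python files from 25 GitHub repositories, then processing and truncating them. The dataset may contain personal or sensitive information and may be biased toward a particular coding style.

Provided by: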

23ws-LLMcoder

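Original information summary

Dataset Card for LLMcoder-GitHub-Python-Mix-Direct

Python target autocomplete suggestions in a conversational format, suitable for fine-tuning with OpenAI.

Dataset Details

Dataset Description

Language(s) (NLP): [More Information Needed]
License: [More Information Needed]

Dataset Sources [optional]

The data was scraped from the following public GitHub repositories on 2023-11-15: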

https://github.com/numpy/numpy
https://github.com/pandas-dev/pandas
https://github.com/matplotlib/matplotlib
https://github.com/scikit-learn/scikit-learn
https://github.com/python-pillow/Pillow
https://github.com/psaegert/pmtrendviz
https://github.com/psaegert/nli-nec
https://github.com/graphdeco-inria/gaussian-splatting
https://github.com/lllyasviel/ControlNet
https://github.com/maltfield/awesome-lemmy-instances
https://github.com/Aleph-Alpha/aleph-alpha-client
https://github.com/MaartenGr/BERTopic
https://github.com/MilesCranmer/PySR
https://github.com/AUTOMATIC1111/stable-diffusion-webui
https://github.com/microsoft/Codex-CLI
https://github.com/dropbox/hydra
https://github.com/HLearning/unet_keras
https://github.com/hmason/ml_class
https://github.com/django/django
https://github.com/encode/django-rest-framework
https://github.com/pallets/flask
https://github.com/postmanlabs/httpbin
https://github.com/jakevdp/PythonDataScienceHandbook
https://github.com/donnemartin/data-science-ipython-notebooks
https://github.com/tensorflow/tensorflow

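Uses

The dataset is intended for fine-tuning GPT-3.5-Turbo through OpenAI's fine-tuning API.

Dataset Structure

train_completions.jsonl contains a list of 100 target conversations. Each conversation is structured as follows: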

[
  {"role": "system", "content": <system prompt>},
  {"role": "user", "content": <first part of the code>},
  {"role": "assistant", "content": <small target completion from the ground truth>}
]
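As a minimal sketch of the layout shown above, the following Python builds one conversation, serializes it as a JSONL line, and checks that a parsed record has the three expected turns. The helper names (make_conversation, is_valid_conversation, parse_jsonl) and the example strings are hypothetical and not taken from the dataset. OpenAI's chat fine-tuning format wraps such a list in an object under a "messages" key; whether train_completions.jsonl stores the bare list or the wrapped object is not stated here, so the sketch assumes the bare list.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def make_conversation(system_prompt, code_prefix, completion):
    """Build one conversation in the three-message layout described above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": code_prefix},
        {"role": "assistant", "content": completion},
    ]


def is_valid_conversation(conversation):
    """Check that a parsed record has exactly the system/user/assistant turns."""
    if not isinstance(conversation, list) or len(conversation) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(conversation, EXPECTED_ROLES):
        if not isinstance(message, dict):
            return False
        if message.get("role") != role or not isinstance(message.get("content"), str):
            return False
    return True


def parse_jsonl(text):
    """Parse JSONL text (one JSON value per line), skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical example values, not taken from the dataset.
example = make_conversation(
    "You are a Python autocomplete assistant.",
    "import numpy as np\n\ndef mean(xs):\n    return ",
    "np.mean(xs)",
)
jsonl_text = json.dumps(example) + "\n"  # one record per line
records = parse_jsonl(jsonl_text)
print(is_valid_conversation(records[0]))
```

To check the real file, the same parse_jsonl and is_valid_conversation calls could be applied to the contents of train_completions.jsonl once it has been downloaded.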

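Dataset Creation

Curation Rationale

The dataset was created to fine-tune GPT-3.5-Turbo so that it provides Python autocomplete suggestions in a more reliable format.

Data Collection and Processing

We scraped 25 Python-related GitHub repositories and randomly sampled 4 Python files per repository according to file length, which is consistent with the 25 × 4 = 100 conversations in the dataset. Each file was split into two parts at a random point. Next, if the input exceeded 10k tokens, we truncated the input file to between 250 and 10,000 tokens, counted from the beginning. The output was manually truncated to a reasonably short code completion.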
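The split and truncation steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, space-separated chunks stand in for real tokenizer tokens, and the truncation is assumed to drop tokens from the start of the input so that the kept text still ends at the split point (the description could also be read the other way). The manual shortening of the output is not modeled.

```python
import random


def split_at_random_point(source, rng):
    """Split source text into (prefix, suffix) at a random character offset."""
    if len(source) < 2:
        raise ValueError("need at least two characters to split")
    cut = rng.randrange(1, len(source))
    return source[:cut], source[cut:]


def truncate_prefix(prefix, rng, limit=10_000, min_keep=250, max_keep=10_000):
    """Trim an over-long prefix to a random budget of trailing tokens.

    Assumption: space-separated chunks approximate tokens, and tokens are
    dropped from the start so the prefix still ends at the split point.
    """
    tokens = prefix.split(" ")
    if len(tokens) <= limit:
        return prefix  # short inputs are left untouched
    keep = rng.randint(min_keep, max_keep)
    return " ".join(tokens[-keep:])


rng = random.Random(0)  # seeded so the sketch is repeatable

# Hypothetical file contents.
source = "def add(a, b):\n    return a + b\n"
prefix, suffix = split_at_random_point(source, rng)

# A synthetic 12,000-token input to exercise the truncation branch.
long_prefix = " ".join(f"x{i}" for i in range(12_000))
short_prefix = truncate_prefix(long_prefix, rng)
print(len(short_prefix.split(" ")))
```

Under these assumptions the prefix would become the user message and a shortened version of the suffix the assistant message.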

Who are the source data producers?

Paul Saegert

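[More Information Needed]

Personal and Sensitive Information

The dataset may contain personal or sensitive information.

Bias, Risks, and Limitations

The dataset contains only Python code from the most popular, trending, or personal projects. It may be biased toward a particular coding style.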
