
luisroque/instruct-python-llama2-20k

Hugging Face | Updated 2023-08-18 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/luisroque/instruct-python-llama2-20k
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 34661192.7
    num_examples: 19000
  - name: test
    num_bytes: 1824273.3
    num_examples: 1000
  download_size: 19060329
  dataset_size: 36485466
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
pretty_name: Instruct Python 500k
size_categories:
- 10K<n<100K
---

# Fine-tuning Instruct Llama2 Stack Overflow Python Q&A

## Transformed Dataset

### Objective

The transformed dataset is designed for fine-tuning LLMs to improve Python coding assistance by focusing on high-quality content from Stack Overflow. It has around 20k instructions.

### Structure

- **Question-Answer Pairing**: Questions and answers are paired using the `ParentId` linkage.
- **Quality Focus**: Only top-rated answers for each question are retained.
- **HTML Tag Removal**: All HTML tags in the content are removed.
- **Combined Question Field**: Each question's title and body are merged.
- **Filtering**: Entries with negative scores or those not containing Python code structures are excluded.

Final columns:

- `score_question`
- `score_answer`
- `question`
- `answer`

### Llama2 Transformation

The dataset has been transformed to match the Llama2 prompt structure, which is relevant for the model's fine-tuning. The format is the following:

`<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]`

Where:

- `system_prompt` gives context or instructions to the model.
- `user_message` is the user's query following the system prompt, expecting a particular response from the model.

This structure ensures the training aligns with Llama2's expectations, optimizing the fine-tuning quality.

## Original Dataset

The dataset contains questions and answers from Stack Overflow with the `python` tag, covering the period from August 2, 2008, to October 19, 2016.

## License

All contributions are under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/). Attribution is required.

The original dataset was posted [here](https://www.kaggle.com/datasets/stackoverflow/pythonquestions).

Keep in touch: [LinkedIn](https://www.linkedin.com/in/luisbrasroque/)
Provider:
luisroque
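Summary of original information

Dataset overview

Basic information

  • Dataset name: Instruct Python 500k
  • Language: English
  • License: CC-BY-SA 3.0
  • Task category: text generation
  • Size category: 10K<n<100K (number of examples)

Data structure

  • Features:
    • text: string
  • Splits:
    • train: 19,000 examples, 34661192.7 bytes
    • test: 1,000 examples, 1824273.3 bytes
  • Download size: 19060329 bytes
  • Dataset size: 36485466 bytes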
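
The fractional byte counts for the two splits are consistent with the total dataset size being divided in proportion to the example counts (19,000 to 1,000). The short check below only illustrates that reading of the metadata; it is not a description of how the figures were actually produced.

```python
# Figures copied from the dataset metadata above.
DATASET_SIZE_BYTES = 36_485_466
SPLIT_EXAMPLES = {"train": 19_000, "test": 1_000}

total_examples = sum(SPLIT_EXAMPLES.values())

# Allocate the total byte count to each split in proportion to its examples.
split_bytes = {
    name: DATASET_SIZE_BYTES * count / total_examples
    for name, count in SPLIT_EXAMPLES.items()
}

for name, size in split_bytes.items():
    share = SPLIT_EXAMPLES[name] / total_examples
    print(f"{name}: {size:.1f} bytes ({share:.0%} of examples)")
```

Under this assumption the train split holds 95% of the examples and the test split 5%, and the computed sizes agree with the listed 34661192.7 and 1824273.3 bytes.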

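Dataset transformation

  • Objective: fine-tuning large language models (LLMs) to improve the quality of Python coding assistance.
  • Structure:
    • Question-answer pairing: questions and answers are linked through ParentId.
    • Quality focus: only the top-rated answer for each question is retained.
    • HTML tag removal: all HTML tags in the content are removed.
    • Combined question field: each question's title and body are merged.
    • Filtering: entries with negative scores or without Python code structures are excluded.
  • Final columns:
    • score_question
    • score_answer
    • question
    • answer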
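
The transformation steps listed above can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline. The field names `Id`, `Score`, `Title`, and `Body` are assumed from the usual Stack Overflow export layout (only `ParentId` is named in the card), the negative-score filter is applied to both questions and answers as one possible reading, and `PYTHON_HINT_RE` is a hypothetical stand-in for the unspecified "Python code structures" check.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
# Hypothetical heuristic for "contains Python code structures".
PYTHON_HINT_RE = re.compile(
    r"\bdef \w+\(|\bimport \w+|\bclass \w+|\bprint\b|\blambda\b"
)


def strip_html(text):
    """Remove HTML tags first, then decode entities such as &lt;."""
    return unescape(TAG_RE.sub("", text)).strip()


def build_pairs(questions, answers):
    """Pair each question with its top-scored answer via ParentId."""
    best = {}
    for ans in answers:
        parent = ans["ParentId"]
        if parent not in best or ans["Score"] > best[parent]["Score"]:
            best[parent] = ans  # keep only the highest-scoring answer

    rows = []
    for q in questions:
        ans = best.get(q["Id"])
        if ans is None:
            continue  # unanswered question
        if q["Score"] < 0 or ans["Score"] < 0:
            continue  # drop negatively scored entries (assumed: either side)
        question = strip_html(f'{q["Title"]} {q["Body"]}')  # merge title + body
        answer = strip_html(ans["Body"])
        if not PYTHON_HINT_RE.search(f"{question} {answer}"):
            continue  # no Python-looking content under the heuristic
        rows.append({
            "score_question": q["Score"],
            "score_answer": ans["Score"],
            "question": question,
            "answer": answer,
        })
    return rows


# Tiny made-up sample, for illustration only.
questions = [
    {"Id": 1, "Score": 5, "Title": "Reverse a list",
     "Body": "<p>How do I reverse <code>x</code>?</p>"},
    {"Id": 2, "Score": -1, "Title": "Bad question",
     "Body": "<p>import foo fails</p>"},
]
answers = [
    {"Id": 10, "ParentId": 1, "Score": 3,
     "Body": "<p>Use <code>x.reverse()</code></p>"},
    {"Id": 11, "ParentId": 1, "Score": 9,
     "Body": "<pre><code>print(x[::-1])</code></pre>"},
    {"Id": 12, "ParentId": 2, "Score": 4,
     "Body": "<p>import bar instead</p>"},
]
rows = build_pairs(questions, answers)
print(rows)
```

On this sample, only question 1 survives, paired with its higher-scored answer; question 2 is dropped for its negative score.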

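Llama2 transformation

  • Format: <s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]
    • system_prompt: gives the model context or instructions.
    • user_message: the user's query following the system prompt, expecting a particular response from the model.
  • Purpose: keep training aligned with Llama2's expected prompt structure to improve fine-tuning quality.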
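
As a rough illustration of the template above, the sketch below fills in the two placeholders. The spacing follows the template as printed here; the original card may use line breaks instead of single spaces. The default system prompt is hypothetical, and the way the answer and the closing `</s>` token are appended is an assumption based on the common Llama 2 convention, since the card does not state it.

```python
# Template as printed in the dataset card (flattened to single spaces here).
LLAMA2_TEMPLATE = (
    "<s>[INST] <<SYS>> {system_prompt} <</SYS>> {user_message} [/INST]"
)

# Hypothetical system prompt, not taken from the dataset.
DEFAULT_SYSTEM_PROMPT = "Given a Python question, answer it concisely."


def build_llama2_prompt(user_message, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Fill the Llama2 prompt template with a system prompt and a user query."""
    return LLAMA2_TEMPLATE.format(
        system_prompt=system_prompt.strip(),
        user_message=user_message.strip(),
    )


def build_training_text(user_message, answer,
                        system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Append the answer and an end-of-sequence token (assumed convention)."""
    prompt = build_llama2_prompt(user_message, system_prompt)
    return f"{prompt} {answer.strip()} </s>"


example = build_training_text("How do I reverse a list x?", "Use x[::-1].")
print(example)
```

Keeping the template in one constant makes it easy to swap in the exact whitespace of the real `text` field once a few rows have been inspected.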

Original dataset

  • Content: questions and answers from Stack Overflow with the python tag, covering August 2, 2008, to October 19, 2016.