luisroque/instruct-python-llama2-20k

Name: luisroque/instruct-python-llama2-20k
Creator: luisroque
Published: 2023-08-18 09:44:00
License: CC-BY-SA 3.0

Source: Hugging Face. Updated 2023-08-18; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/luisroque/instruct-python-llama2-20k


Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 34661192.7
    num_examples: 19000
  - name: test
    num_bytes: 1824273.3
    num_examples: 1000
  download_size: 19060329
  dataset_size: 36485466
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
pretty_name: Instruct Python 500k
size_categories:
- 10K<n<100K
---

# Fine-tuning Instruct Llama2 Stack Overflow Python Q&A

## Transformed Dataset

### Objective

The transformed dataset is designed for fine-tuning LLMs to improve Python coding assistance by focusing on high-quality content from Stack Overflow. It has around 20k instructions.

### Structure

- **Question-Answer Pairing**: Questions and answers are paired using the `ParentId` linkage.
- **Quality Focus**: Only top-rated answers for each question are retained.
- **HTML Tag Removal**: All HTML tags in the content are removed.
- **Combined Question Field**: Each question's title and body are merged.
- **Filtering**: Entries with negative scores or those not containing Python code structures are excluded.

Final columns:

- `score_question`
- `score_answer`
- `question`
- `answer`

### Llama2 Transformation

The dataset has been transformed to match the Llama2 prompt structure, which is relevant for the model's fine-tuning. The format is the following:

`<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]`

Where:

- `system_prompt` gives context or instructions to the model.
- `user_message` is the user's query following the system prompt, expecting a particular response from the model.

This structure ensures the training aligns with Llama2's expectations, optimizing the fine-tuning quality.

## Original Dataset

The dataset contains questions and answers from Stack Overflow with the `python` tag, covering the period from August 2, 2008, to October 19, 2016.

## License

All contributions are under the [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/). Attribution is required.

The original dataset was posted [here](https://www.kaggle.com/datasets/stackoverflow/pythonquestions).

Keep in touch: [LinkedIn](https://www.linkedin.com/in/luisbrasroque/)

Provided by:

luisroque

Summary of Original Information

Dataset Overview

Basic Information

Dataset name: Instruct Python 500k
Language: English
License: CC-BY-SA 3.0
Task category: Text generation
Size category: 10K<n<100K

Data Structure

Features:
- text: string
Splits:
- train: 19000 examples, 34661192.7 bytes
- test: 1000 examples, 1824273.3 bytes
Download size: 19060329 bytes
Dataset size: 36485466 bytes
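
As a quick consistency check on the split figures above, the following sketch recomputes the totals from the per-split numbers. The values are copied from the card; the variable names are illustrative only.

```python
# Split metadata copied from the dataset card.
splits = {
    "train": {"num_examples": 19000, "num_bytes": 34661192.7},
    "test": {"num_examples": 1000, "num_bytes": 1824273.3},
}
dataset_size = 36485466  # bytes, as reported on the card

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())
train_fraction = splits["train"]["num_examples"] / total_examples

print(f"examples: {total_examples}")
print(f"bytes (rounded): {round(total_bytes)}")
print(f"train fraction: {train_fraction:.2f}")
```

The per-split byte counts add up to the reported dataset size, and the fractional values are consistent with a proportional 95/5 estimate of the total rather than exact per-split measurements.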

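Dataset Transformation

Objective: Fine-tuning large language models (LLMs) to improve the quality of Python coding assistance.
Structure:
- Question-answer pairing: Questions and answers are linked using ParentId.
- Quality focus: Only the top-rated answer for each question is retained.
- HTML tag removal: All HTML tags in the content are removed.
- Combined question field: Each question's title and body are merged.
- Filtering: Entries with negative scores or without Python code structures are excluded.
Final columns:
- score_question
- score_answer
- question
- answer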
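
The transformation steps listed above can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline: the input field names (`Id`, `Score`, `Title`, `Body`) and the regex used to detect "Python code structures" are assumptions, and only `ParentId` and the four output columns come from the card.

```python
import re
from html import unescape

# Crude tag matcher; it would also strip text like "a <b and c> d".
TAG_RE = re.compile(r"<[^>]+>")
# Assumed heuristic for "contains Python code structures".
CODE_RE = re.compile(r"\b(def|class|import|print|for|while)\b")


def strip_html(text):
    """Remove HTML tags, then decode entities such as &lt;."""
    return unescape(TAG_RE.sub("", text)).strip()


def build_pairs(questions, answers):
    """Pair each question with its top-rated answer and apply the filters."""
    # Keep only the highest-scoring answer per ParentId.
    best = {}
    for ans in answers:
        pid = ans["ParentId"]
        if pid not in best or ans["Score"] > best[pid]["Score"]:
            best[pid] = ans

    rows = []
    for q in questions:
        ans = best.get(q["Id"])
        if ans is None:
            continue  # question has no answer
        if q["Score"] < 0 or ans["Score"] < 0:
            continue  # drop entries with negative scores
        # Merge title and body into a single question field.
        question = strip_html(q["Title"] + "\n" + q["Body"])
        answer = strip_html(ans["Body"])
        if not CODE_RE.search(question + "\n" + answer):
            continue  # drop entries without code-like content
        rows.append({
            "score_question": q["Score"],
            "score_answer": ans["Score"],
            "question": question,
            "answer": answer,
        })
    return rows


# Hypothetical sample records for illustration.
questions = [
    {"Id": 1, "Score": 5, "Title": "How to reverse a list?",
     "Body": "<p>I tried <code>for x in xs</code> but it fails.</p>"},
    {"Id": 2, "Score": -1, "Title": "Bad question",
     "Body": "<p>import this</p>"},
]
answers = [
    {"ParentId": 1, "Score": 3, "Body": "<p>Use <code>xs[::-1]</code>.</p>"},
    {"ParentId": 1, "Score": 10,
     "Body": "<p>Call <code>xs.reverse()</code> or import nothing.</p>"},
    {"ParentId": 2, "Score": 4, "Body": "<p>def f(): pass</p>"},
]
rows = build_pairs(questions, answers)
print(rows)
```

With the sample records, only the first question survives, paired with its score-10 answer; the second question is dropped because of its negative score.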

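Llama2 Transformation

Format: <s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]
- system_prompt: Gives context or instructions to the model.
- user_message: The user's query following the system prompt, expecting a particular response from the model.
Objective: Keep training aligned with the prompt structure Llama2 expects, optimizing fine-tuning quality.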
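
A minimal sketch of how a record might be rendered into this template is shown below. The function name, the sample strings, and the optional answer suffix (answer followed by `</s>`) are assumptions for illustration; the card only shows the prompt portion, and the exact whitespace used in the dataset may differ (Meta's reference format places newlines around the `<<SYS>>` block).

```python
def build_llama2_prompt(system_prompt, user_message, answer=None):
    """Render one example using the template shown on the card.

    If `answer` is given, it is appended after [/INST] and closed with
    </s>. That suffix is an assumption, not something the card states.
    """
    prompt = (
        f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> "
        f"{user_message} [/INST]"
    )
    if answer is not None:
        prompt += f" {answer} </s>"
    return prompt


# Hypothetical strings for illustration.
example = build_llama2_prompt(
    "You are a helpful Python assistant.",
    "How do I reverse a list?",
)
example_with_answer = build_llama2_prompt(
    "You are a helpful Python assistant.",
    "How do I reverse a list?",
    answer="Use xs[::-1].",
)
print(example)
print(example_with_answer)
```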

Original Dataset

Content: Questions and answers from Stack Overflow with the python tag, covering August 2, 2008, to October 19, 2016.
