luisroque/instruct-python-llama2-20k
Source: Hugging Face, updated 2023-08-18, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/luisroque/instruct-python-llama2-20k
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 34661192.7
    num_examples: 19000
  - name: test
    num_bytes: 1824273.3
    num_examples: 1000
  download_size: 19060329
  dataset_size: 36485466
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- en
pretty_name: Instruct Python 500k
size_categories:
- 10K<n<100K
---
# Fine-tuning Instruct Llama2 Stack Overflow Python Q&A
## Transformed Dataset
### Objective
The transformed dataset is designed for fine-tuning LLMs to improve Python coding assistance by focusing on high-quality content from Stack Overflow. It contains about 20,000 instruction examples (19,000 for training and 1,000 for testing).
### Structure
- **Question-Answer Pairing**: Questions and answers are paired using the `ParentId` linkage.
- **Quality Focus**: Only top-rated answers for each question are retained.
- **HTML Tag Removal**: All HTML tags in the content are removed.
- **Combined Question Field**: Each question's title and body are merged.
- **Filtering**: Entries with negative scores or those not containing Python code structures are excluded.
Final columns:
- `score_question`
- `score_answer`
- `question`
- `answer`
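
The steps above can be sketched in plain Python. This is a minimal illustration, not the author's actual pipeline: the input field names (`Id`, `Title`, `Body`, `Score`, `ParentId`) are assumed to follow the original Kaggle CSVs, the `CODE_RE` heuristic for "Python code structures" is hypothetical, and treating a negative score on either the question or the answer as grounds for exclusion is an assumption.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
# Hypothetical heuristic; the source does not say how Python code is detected.
CODE_RE = re.compile(r"\bdef \w+\(|\bclass \w+|\bimport \w+|\bprint\(")


def strip_html(text):
    """Drop HTML tags first, then decode entities such as &lt;."""
    return unescape(TAG_RE.sub("", text)).strip()


def build_pairs(questions, answers):
    """Pair each question with its highest-scored answer, then filter."""
    # Keep only the top-scored answer per question, linked through ParentId.
    best = {}
    for ans in answers:
        current = best.get(ans["ParentId"])
        if current is None or ans["Score"] > current["Score"]:
            best[ans["ParentId"]] = ans

    rows = []
    for q in questions:
        ans = best.get(q["Id"])
        if ans is None:
            continue  # question has no answer
        if q["Score"] < 0 or ans["Score"] < 0:
            continue  # assumption: either negative score excludes the pair
        # Merge title and body into a single question field.
        question = strip_html(q["Title"] + "\n" + q["Body"])
        answer = strip_html(ans["Body"])
        if not CODE_RE.search(question + "\n" + answer):
            continue  # no recognizable Python code
        rows.append({
            "score_question": q["Score"],
            "score_answer": ans["Score"],
            "question": question,
            "answer": answer,
        })
    return rows


# Tiny made-up sample, for illustration only.
questions = [
    {"Id": 1, "Score": 5, "Title": "Reverse a list",
     "Body": "<p>How do I reverse <code>xs</code>?</p>"},
    {"Id": 2, "Score": -1, "Title": "Bad", "Body": "<p>import this</p>"},
    {"Id": 3, "Score": 4, "Title": "Opinion", "Body": "<p>Is Python nice?</p>"},
]
answers = [
    {"ParentId": 1, "Score": 2, "Body": "<p>Use a loop.</p>"},
    {"ParentId": 1, "Score": 9,
     "Body": "<pre><code>def rev(xs):\n    return xs[::-1]</code></pre>"},
    {"ParentId": 2, "Score": 3, "Body": "<p>import antigravity</p>"},
    {"ParentId": 3, "Score": 1, "Body": "<p>Yes.</p>"},
]
rows = build_pairs(questions, answers)
```

Tags are stripped before entities are decoded so that an escaped `&lt;` inside a code sample is not mistaken for the start of a tag.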
### Llama2 Transformation
Each example has been rendered in the Llama2 prompt structure so that it can be used directly for fine-tuning. The format is the following:
`<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]`
Where:
- `system_prompt` gives context or instructions to the model.
- `user_message` is the user's query following the system prompt, expecting a particular response from the model.
Matching this structure keeps the training data consistent with the prompt format Llama2 expects, which helps fine-tuning quality.
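
As a rough sketch of how the template above could be filled in, the helper below substitutes the two placeholders. The function names are hypothetical. The source shows the template only up to `[/INST]`, so appending the answer followed by `</s>` in `build_training_text` is an assumption about how a complete training example might be assembled, not a description of the dataset's actual `text` field.

```python
def build_llama2_prompt(system_prompt, user_message):
    """Fill the Llama2 prompt template shown above."""
    return f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {user_message} [/INST]"


def build_training_text(system_prompt, question, answer):
    """Assumption: the answer follows [/INST] and is closed with </s>."""
    return f"{build_llama2_prompt(system_prompt, question)} {answer} </s>"


prompt = build_llama2_prompt(
    "Answer the Python question concisely.",  # made-up system prompt
    "How do I reverse a list?",
)
example = build_training_text(
    "Answer the Python question concisely.",
    "How do I reverse a list?",
    "Use xs[::-1].",
)
```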
## Original Dataset
The dataset contains questions and answers from Stack Overflow with the `python` tag, covering the period from August 2, 2008, to October 19, 2016.
## License
All contributions are licensed under [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/). Attribution is required. The original dataset was posted [here](https://www.kaggle.com/datasets/stackoverflow/pythonquestions).
Keep in touch: [LinkedIn](https://www.linkedin.com/in/luisbrasroque/)
Provider:
luisroque
Summary of original information
Dataset overview
Basic information
- Dataset name: Instruct Python 500k
- Language: English
- License: CC-BY-SA 3.0
- Task category: text generation
- Dataset size: 10K<n<100K
Data structure
- Features:
  - `text`: string
- Splits:
  - train: 19000 examples, 34661192.7 bytes
  - test: 1000 examples, 1824273.3 bytes
- Download size: 19060329 bytes
- Dataset size: 36485466 bytes
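Dataset transformation
- Purpose: fine-tuning large language models (LLMs) to improve the quality of Python coding assistance.
- Structure:
  - Question-answer pairing: questions and answers are linked using `ParentId`.
  - Quality focus: only the top-rated answer for each question is retained.
  - HTML tag removal: all HTML tags in the content are removed.
  - Combined question field: each question's title and body are merged.
  - Filtering: entries with negative scores or without Python code structures are excluded.
- Final columns:
  - `score_question`
  - `score_answer`
  - `question`
  - `answer`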
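Llama2 transformation
- Format:
  `<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]`
  - `system_prompt`: gives context or instructions to the model.
  - `user_message`: the user's query following the system prompt, expecting a particular response from the model.
- Purpose: keep training consistent with what Llama2 expects, to improve fine-tuning quality.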
Original dataset
- Content: questions and answers from Stack Overflow with the `python` tag, covering August 2, 2008, to October 19, 2016.



