rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format

Name: rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format
Creator: rombodawg
Published: 2023-09-10 01:36:03
License: No description provided

Hugging Face. Updated 2023-09-10. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format

Resource description:

---
license: other
---

This is the LosslessMegaCodeTrainingV3_MINI dataset converted to guanaco format. Enjoy.

Original model card:

This is a new, experimental version of the LosslessMegacodeTraining series. It is like version 3, but uses only the most refined parts of the dataset. The content is roughly 80% coding instruction data and 20% non-coding instruction data, amounting to 650,000 lines of Evol-Instruct-formatted data. The 20% of non-coding instruction data is included to preserve logic and reasoning skills in the model while it trains on code. The lack of such skills has been observed to be a major issue with coding models such as Wizardcoder-15b and NewHope; training models on this dataset alleviates that issue while giving similar levels of coding knowledge.

This dataset is a combination of the following datasets:

- https://huggingface.co/datasets/rombodawg/Platypus_Evol
- https://huggingface.co/datasets/rombodawg/Rombodawgs_commitpackft_Evolinstruct_Converted
- https://huggingface.co/datasets/rombodawg/airoboros-2.1_general_purpose
- https://huggingface.co/datasets/shahules786/megacode-best
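The description says the data was converted to guanaco format but does not show a record. As a rough sketch only, the snippet below assumes the convention used by the openassistant-guanaco dataset, where each record is a single string with `### Human:` and `### Assistant:` turn markers. The marker strings, the `split_guanaco_turns` helper, and the sample record are all illustrative assumptions, not taken from this dataset.

```python
import re

# Assumed turn markers, following the openassistant-guanaco convention.
# The actual records in this dataset may use a different layout.
TURN_PATTERN = re.compile(r"### (Human|Assistant):")


def split_guanaco_turns(text):
    """Split a Guanaco-style string into (speaker, message) pairs."""
    parts = TURN_PATTERN.split(text)
    # parts[0] is any text before the first marker; after that,
    # speaker names and messages alternate.
    turns = []
    for i in range(1, len(parts) - 1, 2):
        turns.append((parts[i], parts[i + 1].strip()))
    return turns


# Hypothetical record, written by hand for illustration.
sample = (
    "### Human: Write a function that adds two numbers."
    "### Assistant: def add(a, b):\n    return a + b"
)

for speaker, message in split_guanaco_turns(sample):
    print(f"[{speaker}] {message}")
```

If the real records use different markers, only `TURN_PATTERN` would need to change.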

Provided by:

rombodawg

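Summary of original information

LosslessMegaCodeTrainingV3_MINI dataset

Overview

Version: Experimental; similar to version 3, but uses only the most refined parts of the dataset.
Composition: Roughly 80% coding instruction data and 20% non-coding instruction data.
Size: 650,000 lines of instruction-formatted data in total.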
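The stated composition (roughly 80% coding, 20% non-coding, 650,000 lines in total) implies the approximate counts below. The percentages are described as rough, so these are estimates rather than exact figures; the variable names are illustrative.

```python
TOTAL_ROWS = 650_000
CODING_PERCENT = 80  # "roughly 80%" per the description, so treat results as estimates

# Integer arithmetic avoids floating-point rounding noise.
coding_rows = TOTAL_ROWS * CODING_PERCENT // 100
non_coding_rows = TOTAL_ROWS - coding_rows

print(f"coding: ~{coding_rows:,}")          # coding: ~520,000
print(f"non-coding: ~{non_coding_rows:,}")  # non-coding: ~130,000
```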

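Purpose

Training goal: Preserve the model's logic and reasoning abilities while training its coding skills.
Problem addressed: The lack of logic and reasoning abilities observed in models such as Wizardcoder-15b and NewHope.

Data sources

Combined datasets: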
- Platypus_Evol
- Rombodawgs_commitpackft_Evolinstruct_Converted
- airoboros-2.1_general_purpose
- megacode-best
