
rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format

Source: Hugging Face · Updated 2023-09-10 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/rombodawg/LosslessMegaCodeTrainingV3_MINI_Guanaco_Format
Description:
License: other

This is the LosslessMegaCodeTrainingV3_MINI dataset converted to guanaco format. Enjoy.

Original model card:

This is a new, experimental version of the LosslessMegacodeTraining series. It is like version 3 but uses only the most refined parts of the dataset. The content of this dataset is roughly 80% coding instruction data and 20% non-coding instruction data, amounting to 650,000 evol instruction-formatted lines of data.

The purpose of including 20% non-coding instruction data is to preserve logic and reasoning skills within the model while training on coding. The lack of such skills has been observed to be a major issue with coding models such as Wizardcoder-15b and NewHope; training models on this dataset alleviates that issue while giving similar levels of coding knowledge.

This dataset is a combination of the following datasets:
  • https://huggingface.co/datasets/rombodawg/Platypus_Evol
  • https://huggingface.co/datasets/rombodawg/Rombodawgs_commitpackft_Evolinstruct_Converted
  • https://huggingface.co/datasets/rombodawg/airoboros-2.1_general_purpose
  • https://huggingface.co/datasets/shahules786/megacode-best
Provided by:
rombodawg
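Summary of original information

LosslessMegaCodeTrainingV3_MINI dataset

Overview

  • Version: experimental; similar to version 3 but using only the most refined parts of the dataset.
  • Composition: roughly 80% coding instruction data and 20% non-coding instruction data.
  • Size: 650,000 instruction-formatted lines of data in total.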
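
As a rough illustration of what those proportions imply, the sketch below turns the stated total and the approximate 80/20 split into line counts. The function name `split_counts` is hypothetical, and because the description only says "roughly 80%", the results are estimates rather than exact figures from the dataset.

```python
def split_counts(total_lines, coding_percent):
    """Return (coding, non_coding) line counts for a given percentage split."""
    # Integer arithmetic avoids floating-point rounding surprises.
    coding = total_lines * coding_percent // 100
    return coding, total_lines - coding


# 650,000 lines and an 80% coding share are taken from the description;
# the share is approximate, so treat the output as an estimate.
coding, non_coding = split_counts(650_000, 80)
print(coding, non_coding)
```

Under these assumptions, about 520,000 lines would be coding instructions and about 130,000 would be non-coding instructions.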

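Purpose

  • Training goal: preserve the model's logic and reasoning skills while training it on coding.
  • Problem addressed: the lack of logic and reasoning skills seen in models such as Wizardcoder-15b and NewHope.

Data sources

  • Combined datasets: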
    • Platypus_Evol
    • Rombodawgs_commitpackft_Evolinstruct_Converted
    • airoboros-2.1_general_purpose
    • megacode-best
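
The description states that the data has been converted to guanaco format but does not show a record. The sketch below assumes the convention used by the openassistant-guanaco dataset, in which each record is a single string with turns introduced by `### Human:` and `### Assistant:`. That marker layout, the helper name `split_guanaco_turns`, and the sample record are assumptions for illustration, not details confirmed by this dataset card.

```python
import re

# Assumed turn markers (openassistant-guanaco convention); not confirmed
# by this dataset's card.
TURN_PATTERN = re.compile(r"### (Human|Assistant):")


def split_guanaco_turns(text):
    """Split one guanaco-style record into (role, message) pairs."""
    # With one capture group, re.split yields
    # [prefix, role, body, role, body, ...].
    parts = TURN_PATTERN.split(text)
    turns = []
    for role, body in zip(parts[1::2], parts[2::2]):
        turns.append((role.lower(), body.strip()))
    return turns


# Hypothetical sample record, written only for this example.
sample = (
    "### Human: Write a function that adds two numbers."
    "### Assistant: def add(a, b):\n    return a + b"
)
for role, message in split_guanaco_turns(sample):
    print(role, "->", message)
```

A splitter this simple would mis-split a response whose own text contains one of the markers, so real records may need a stricter rule, such as only treating a marker as a turn boundary at the start of a line.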