shahules786/megacode-best
Hugging Face | Updated 2023-08-28 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/shahules786/megacode-best
Description:
Megacode-best is a filtered and deduplicated version of the megacode-2 dataset. Instructions are embedded with GTE-base and near-duplicates are removed by cosine similarity, which reduces the number of similar instructions in the dataset with the aim of avoiding overfitting and improving generalization. The dataset contains about 66k samples and is used for training the Open-Assistant Code Llama 2 model. Each record has two features: conversation, which holds the USER and ASSISTANT sides of the interaction, and source. There is a single train split with 66,951 examples and a total size of 376,370,658 bytes.
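The deduplication step described above can be sketched as a greedy filter over embedding vectors. This is a minimal illustration, not the dataset author's actual pipeline: the toy 3-dimensional vectors stand in for real GTE-base embeddings, and the `threshold=0.95` cutoff and the function names are assumptions chosen for the example, since the source does not state them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95):
    """Return indices of items to keep.

    An item is kept only if its similarity to every previously kept item
    is below `threshold` (an assumed value, not taken from the source).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy stand-ins for instruction embeddings (real ones would come from GTE-base).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # nearly parallel to item 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # nearly parallel to item 2
]
kept_indices = deduplicate(toy_embeddings, threshold=0.95)
print(kept_indices)
```

The pairwise comparison is quadratic in the number of kept items, so a dataset of this size would more likely use batched matrix operations or an approximate nearest-neighbor index; the greedy loop is only meant to show the idea.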
Provided by:
shahules786
Summary of original information
Dataset overview
Dataset information
Features
- conversation: list of samples, each with:
  - ASSISTANT: string
  - USER: string
- source: string
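The feature layout above suggests how a record might be read. The sketch below assumes that `conversation` is a list of turns, each a mapping with `USER` and `ASSISTANT` string keys, and that `source` is a plain string; the example record and the helper names are hypothetical and not taken from the dataset. The `load_train_split` helper uses the `datasets` library's `load_dataset` function and needs network access, so it is defined but not called here.

```python
def load_train_split():
    """Load the train split from the Hugging Face Hub (requires `datasets` and network)."""
    from datasets import load_dataset

    return load_dataset("shahules786/megacode-best", split="train")


def format_conversation(record):
    """Flatten a record's conversation turns into a USER/ASSISTANT transcript."""
    lines = []
    for turn in record["conversation"]:
        lines.append("USER: " + turn["USER"])
        lines.append("ASSISTANT: " + turn["ASSISTANT"])
    return "\n".join(lines)


# Hypothetical record shaped like the feature listing; not real dataset content.
example_record = {
    "conversation": [
        {
            "USER": "Write a function that adds two numbers.",
            "ASSISTANT": "def add(a, b):\n    return a + b",
        }
    ],
    "source": "hypothetical-source",
}

transcript = format_conversation(example_record)
print(transcript)
```

With the real data, `format_conversation(load_train_split()[0])` would apply the same formatting to the first training example, provided the assumed layout holds.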
Data splits
- train:
  - num_bytes: 376370658 bytes
  - num_examples: 66951 examples
Dataset size
- download_size: 88693772 bytes
- dataset_size: 376370658 bytes
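The size figures above can be put in more readable terms with a little arithmetic. The snippet uses only the numbers listed on this page; the variable names are illustrative.

```python
# Figures copied from the listing above.
dataset_size_bytes = 376370658
download_size_bytes = 88693772
num_examples = 66951

MIB = 1024 * 1024

# Size of the dataset in memory/on disk versus the compressed download.
dataset_size_mib = dataset_size_bytes / MIB
download_size_mib = download_size_bytes / MIB

# How much smaller the download is than the expanded dataset.
compression_ratio = dataset_size_bytes / download_size_bytes

# Average size of one example in the expanded dataset.
avg_bytes_per_example = dataset_size_bytes / num_examples

print(f"dataset:  {dataset_size_mib:.1f} MiB")
print(f"download: {download_size_mib:.1f} MiB")
print(f"ratio:    {compression_ratio:.2f}x")
print(f"average:  {avg_bytes_per_example:.0f} bytes per example")
```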



