Aratako/Magpie-Tanuki-Instruction-Selected-Evolved-26.5k
Source: Hugging Face. Updated: 2024-12-15. Indexed: 2024-12-21.
Download link:
https://hf-mirror.com/datasets/Aratako/Magpie-Tanuki-Instruction-Selected-Evolved-26.5k
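Overview:
Magpie-Tanuki-Instruction-Selected-Evolved-26.5k is a dataset of about 26,500 synthetic Japanese instructions. It was built by generating instructions with the Magpie method, computing vector representations of those instructions, clustering the vectors, sampling instructions from each cluster, and evolving the sampled instructions with the Evol-Instruct method. Its features are messages, instruction, embedding, ID, base instruction generation model, evolution history, evolution model, evolution generation, and base instruction. The dataset consists of a training split with 26,524 examples and a total size of 205,992,952 bytes. It is licensed under Apache 2.0 but is also subject to the Qwen license. The task category is text generation, the language is Japanese, and the size category is 10K<n<100K.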
This dataset contains approximately 26,500 synthetic Japanese instruction samples. It was built in five steps: about 100,000 instructions were generated with the Magpie method; the cl-nagoya/ruri-large model was used to obtain a vector representation of each instruction; the vectors were grouped into 20,000 clusters with the Mini Batch K-Means algorithm; up to 3 instructions were extracted from each cluster; and the Evol-Instruct method was applied to the extracted instructions using the Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8 model. The dataset features are messages (each with content and role), instruction, embedding, id, base_inst_gen_model, evol_history, evol_model, evol_generation, and base_instruction. The train split contains 26,524 samples, with a download size of 153,116,945 bytes and a dataset size of 205,992,952 bytes. The dataset's license is listed as other, the task category is text-generation, the language is Japanese, and the size category is 10K<n<100K.
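The "up to 3 instructions from each cluster" step can be illustrated with a minimal Python sketch. It assumes cluster labels have already been assigned to each instruction (for example by Mini Batch K-Means) and that members are drawn at random when a cluster holds more than the cap. The description does not say how instructions were chosen within a cluster, so the random draw, the function name `select_per_cluster`, and the `seed` parameter are all hypothetical.

```python
import random
from collections import defaultdict


def select_per_cluster(labels, max_per_cluster=3, seed=0):
    """Return sorted item indices, keeping at most `max_per_cluster` per cluster.

    `labels[i]` is the cluster id of item i. Random selection within a
    cluster is an assumption; the dataset description does not specify it.
    """
    rng = random.Random(seed)

    # Group item indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    selected = []
    for members in by_cluster.values():
        # Small clusters are kept whole; larger ones are down-sampled.
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        selected.extend(members)
    return sorted(selected)


# Toy example: cluster 0 has four items, cluster 1 has one, cluster 2 has two.
labels = [0, 0, 0, 0, 1, 2, 2]
picked = select_per_cluster(labels)
print(picked)
```

With 20,000 clusters and a cap of 3, this kind of selection yields at most 60,000 instructions, which is consistent with the 26,524 samples reported for the train split.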
Provider:
Aratako



