agentlans/rombodawg-Everything_Instruct
Hugging Face · Updated 2025-12-16 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/agentlans/rombodawg-Everything_Instruct
Description:
Everything-Instruct: Supervised Finetuning Dataset is a dataset for supervised fine-tuning of large language models, containing over 7,000,000 instruction-response pairs. It combines two source datasets and targets tasks such as code generation and debugging, creative writing, and instruction following. Supported languages include English, Urdu, Chinese, Korean, Latin, Russian, French, Spanish, German, Italian, Japanese, Portuguese, Turkish, Romanian, Finnish, Czech, Persian, Arabic, Ukrainian, Dutch, Polish, Serbo-Croatian, and Catalan. Row counts are unevenly distributed across languages: English has the most rows, followed by Urdu. Processing steps include deduplication, sensitive-data redaction, Unicode normalization, and language annotation. Known limitations include occasionally mangled code caused by preprocessing, irregular formatting, no multi-turn conversations, and instructions and inputs that are mostly in English. The dataset is licensed under Apache 2.0.
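The description lists Unicode normalization and deduplication among the processing steps but does not say how they were carried out. The following is only a minimal sketch of what such steps can look like, not the dataset author's actual pipeline. The field names `instruction` and `output`, the choice of NFC as the normalization form, and the helper names are all assumptions made for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and trim surrounding whitespace.

    NFC is an assumed choice; the listing does not name the form used.
    """
    return unicodedata.normalize("NFC", text).strip()


def dedupe_pairs(rows):
    """Keep the first occurrence of each normalized (instruction, output) pair."""
    seen = set()
    unique = []
    for row in rows:
        # Hypothetical field names; the real column names may differ.
        key = (normalize_text(row["instruction"]), normalize_text(row["output"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append({"instruction": key[0], "output": key[1]})
    return unique


# Toy rows: the first two differ only in Unicode composition and whitespace.
rows = [
    {"instruction": "Say cafe\u0301", "output": "caf\u00e9"},
    {"instruction": "Say caf\u00e9 ", "output": "cafe\u0301"},
    {"instruction": "Add 2 and 3", "output": "5"},
]
cleaned = dedupe_pairs(rows)
```

Normalizing before comparing means rows that look identical but are encoded differently collapse into one, which is why the two steps are usually applied in this order.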
Provider:
agentlans



