kshitijthakkar/loggenix_moe_mcs_v1
收藏Hugging Face2025-08-22 更新2025-10-25 收录
下载链接:
https://hf-mirror.com/datasets/kshitijthakkar/loggenix_moe_mcs_v1
下载链接
链接失效反馈官方服务:
资源简介:
这个数据集是一个合并的集合,包含多个指令遵循和对话数据集,格式化为用于语言模型的监督微调(SFT)。数据集的每个示例包含任务描述或类别、用户输入/问题、预期响应、带角色的聊天消息列表、格式化文本、格式化文本中的令牌数量、消息长度、轮数、原始数据集标识符以及示例是否具有预存在的聊天格式的标志。数据集统计显示总示例数和令牌数、令牌计数分布和来源数据集的贡献。数据集包括来自不同来源的示例,如HelpingAI、HuggingFaceTB、MegaScience、SynthLabsAI、argilla、kshitijthakkar和microsoft。数据格式详细描述,并提供使用示例。创建数据集时使用的配置也进行了描述,包括数据集ID、字段映射、任务映射、处理设置、分词器详细信息、分析设置和输出设置。README还包括一个引用部分,供用户适当地引用原始源数据集。
This dataset is a merged collection of multiple instruction-following and conversational datasets, formatted for supervised fine-tuning (SFT) of language models. Each example in the dataset includes a task description or category, user input/question, expected response, a list of chat messages with roles (system/user/assistant), tokenizer-formatted text, the number of tokens in the formatted text, message length, number of turns, original dataset identifier, and a flag indicating if the example had pre-existing chat format. The dataset statistics show the total number of examples and tokens, token count distribution, and contributions from source datasets. The dataset includes examples from various sources such as HelpingAI, HuggingFaceTB, MegaScience, SynthLabsAI, argilla, kshitijthakkar, and microsoft. The data format is described in detail, and usage examples are provided. The configuration used to create the dataset is also described, including dataset IDs, field mapping, task mapping, processing settings, tokenizer details, analysis settings, and output settings. The README also includes a citation section for users to appropriately cite the original source datasets.
提供机构:
kshitijthakkar



