afkfatih/turkish-distilled-5K
收藏Hugging Face2026-04-29 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/afkfatih/turkish-distilled-5K
下载链接
链接失效反馈官方服务:
资源简介:
这是一个土耳其语的蒸馏SFT数据集,包含清理过的单轮对话数据,遵循`messages`模式。数据集主要用于文本生成任务,包含5957个清理过的样本,其中5660个用于训练,297个用于测试。每个样本包含id、messages、source、task、language、quality_score和flags字段。消息格式严格遵循user和assistant的对话模式。数据集经过严格清理,移除无效模式、重复对、短回答、可能的回声样本等。质量报告显示,删除的样本主要因重复对、可能的回声和短回答。任务分布包括问答、多选问答、指令和跨语言问答。标志分布显示部分样本存在非土耳其字符、代码或URL。平均用户消息长度为9.57词,平均助手消息长度为16.44词。
This is a Turkish Distilled SFT Dataset containing cleaned, single-turn conversational data following the `messages` schema. The dataset is primarily used for text-generation tasks and includes 5957 cleaned samples, with 5660 for training and 297 for testing. Each sample contains fields such as id, messages, source, task, language, quality_score, and flags. The message format strictly follows a user and assistant dialogue pattern. The dataset has undergone rigorous cleaning, removing invalid schemas, duplicate pairs, short responses, potential echo samples, etc. The quality report indicates that deleted samples were mainly due to duplicate pairs, potential echoes, and short responses. Task distribution includes QA, MCQA, instruction, and crosslingual QA. Flag distribution shows some samples with non-Turkish characters, code, or URLs. The average user message length is 9.57 words, and the average assistant message length is 16.44 words.
提供机构:
afkfatih



