agentlans/openchat_sharegpt4_dataset
Source: Hugging Face. Updated 2025-12-16. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/agentlans/openchat_sharegpt4_dataset
Overview:
This dataset is a processed version of the sharegpt_clean.json file from openchat/openchat_sharegpt4_dataset. It is multilingual, covering English, Chinese, Korean, Spanish, French, Japanese, German, Portuguese, Russian, Italian, Indonesian, Vietnamese, Dutch, Polish, Turkish, Czech, Hebrew, and Ukrainian. Processing consisted of correcting the turn order of conversations, adding a language field, removing duplicates, redacting personal information such as URLs, email addresses, and phone numbers, and shuffling the data. The dataset is categorized under text-generation tasks. The README also includes a table of row counts per language: English has by far the most rows (54288), while every other language has fewer than 100 rows.
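Three of the processing steps described above (redaction, deduplication, shuffling) can be sketched as follows. This is only an illustrative sketch, not the dataset author's actual pipeline: the regular expressions, the `[URL]`/`[EMAIL]`/`[PHONE]` placeholders, the function names, and the ShareGPT-style `{"from": ..., "value": ...}` turn layout are all assumptions. Turn-order correction and language detection are left out.

```python
import random
import re

# Hypothetical patterns: the real redaction rules are not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace URLs, email addresses, and phone-like numbers with placeholders."""
    text = URL_RE.sub("[URL]", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def process(conversations, seed=0):
    """Redact every turn, drop exact duplicate conversations, then shuffle.

    `conversations` is assumed to be a list of conversations, each a list of
    {"from": speaker, "value": text} dictionaries.
    """
    seen = set()
    cleaned = []
    for conv in conversations:
        turns = [{"from": t["from"], "value": redact(t["value"])} for t in conv]
        # A conversation's identity is its full sequence of (speaker, text) pairs.
        key = tuple((t["from"], t["value"]) for t in turns)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(turns)
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(cleaned)
    return cleaned
```

Deduplicating after redaction means two conversations that differ only in a redacted detail collapse into one. Whether the original pipeline used this order is not stated.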
Provider:
agentlans



