Sonnet3.5-SlimOrcaDedupCleaned-20k

ModelScope community · Updated 2025-12-05 · Indexed 2025-03-29
Download link:
https://modelscope.cn/datasets/Gryphe/Sonnet3.5-SlimOrcaDedupCleaned-20k
Description:
Since the [original dataset](https://huggingface.co/datasets/Gryphe/Sonnet3.5-SlimOrcaDedupCleaned) is fairly large, I decided to create a filtered version containing the 20,000 most diverse assistant replies. Diversity was calculated through word usage, resulting in a filtered version with little repetition.

The following criteria were applied:

- No samples where the assistant reply contained Unicode, eliminating all entries related to translation.
- No "explain like I'm a five-year-old" entries, as I personally do not believe they add an awful lot to begin with.

Enjoy!
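The filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's actual script: the sample layout (`system`, `user`, `assistant` keys), the reading of "contained Unicode" as "contains non-ASCII characters", the substring check for five-year-old prompts, and the unique-word ratio used as the diversity score are all hypothetical choices made for the example.

```python
import re


def is_ascii_only(text: str) -> bool:
    # Assumption: "contained Unicode" is read as "contains non-ASCII characters".
    return text.isascii()


def is_eli5(sample: dict) -> bool:
    # Assumption: such entries are recognizable by a "five year old" phrase
    # in the system prompt. The real detection rule is not documented.
    return "five year old" in sample.get("system", "").lower()


def diversity_score(text: str) -> float:
    # Assumption: "diversity through word usage" is approximated here by the
    # ratio of unique words to total words. The author's metric may differ.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def filter_samples(samples: list[dict], keep: int = 20_000) -> list[dict]:
    # Apply the two exclusion criteria, then keep the highest-scoring replies.
    candidates = [
        s for s in samples
        if is_ascii_only(s["assistant"]) and not is_eli5(s)
    ]
    candidates.sort(key=lambda s: diversity_score(s["assistant"]), reverse=True)
    return candidates[:keep]
```

A per-reply ratio like this favors short replies and ignores overlap between replies, so a corpus-level measure (for example, greedily selecting replies that add the most unseen vocabulary) may be closer to what "little repetition" suggests.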
Provider:
maas
Created:
2025-03-27