Sonnet3.5-SlimOrcaDedupCleaned-20k
ModelScope community, updated 2025-12-05, indexed 2025-03-29
Download link:
https://modelscope.cn/datasets/Gryphe/Sonnet3.5-SlimOrcaDedupCleaned-20k
Description:
Since the [original dataset](https://huggingface.co/datasets/Gryphe/Sonnet3.5-SlimOrcaDedupCleaned) is fairly large, I decided to create a filtered version containing the 20,000 most diverse assistant replies. Diversity was measured through word usage, resulting in a filtered version with little repetition.
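The description does not say how word-usage diversity was scored, so the following is only a minimal sketch of one plausible reading: rank replies by the share of distinct words they contain and keep the top `k`. The names `distinct_word_ratio` and `most_diverse` are hypothetical, and the scoring rule is an assumption rather than the method actually used to build this dataset.

```python
import re


def distinct_word_ratio(text: str) -> float:
    """Share of distinct words in a reply (assumed diversity score)."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def most_diverse(replies: list[str], k: int) -> list[str]:
    """Keep the k replies with the highest distinct-word ratio."""
    return sorted(replies, key=distinct_word_ratio, reverse=True)[:k]
```

For the real dataset, `k` would be 20,000. A ratio like this favors short replies, so a production filter would likely need length normalization or a corpus-level measure.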
The following criteria were applied:
- No samples where the assistant reply contained non-ASCII Unicode characters, eliminating all entries related to translation.
- No "explain like I'm a five-year-old" entries, as I personally do not believe they add an awful lot to begin with.
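The two criteria above could be sketched as a single predicate. This is an illustration under stated assumptions, not the author's actual filter: "Unicode" is read as "any non-ASCII character", the five-year-old entries are assumed to be recognizable from wording in the system prompt, and `keep_sample`, `ELI5_PATTERN`, and the argument names are hypothetical.

```python
import re

# Assumed marker for "explain like I'm five" entries; the real wording may differ.
ELI5_PATTERN = re.compile(r"five[- ]year[- ]old|\beli5\b", re.IGNORECASE)


def keep_sample(system_prompt: str, assistant_reply: str) -> bool:
    """Return True if a sample passes both filtering criteria."""
    # Criterion 1: drop replies containing any non-ASCII character.
    if not assistant_reply.isascii():
        return False
    # Criterion 2: drop "explain like I'm five" style entries.
    if ELI5_PATTERN.search(system_prompt):
        return False
    return True
```

Note that an ASCII-only check is strict: it also rejects replies containing typographic quotes or accented names, which may or may not match what the original filter did.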
Enjoy!
Provider:
maas
Created:
2025-03-27



