smoltalk-semhashed

ModelScope community. Updated 2025-12-05; indexed 2025-03-22.
Download link:
https://modelscope.cn/datasets/mlabonne/smoltalk-semhashed
Resource description:
# SmolTalk SemHashed

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61b8e2ba285851687028d395/WEAtaqNwFfCifaDGOAKi4.png)

This is a near-deduplicated version of [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) created with the [semhash](https://github.com/MinishLab/semhash/tree/main) library.

Instead of MinHash deduplication, it uses embeddings generated with [minishlab/potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), a distilled version of [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5), and a threshold of 0.95 (see the [vicinity](https://github.com/MinishLab/vicinity) library).

❤️ Kudos to [minishlab](https://huggingface.co/minishlab) for this super cool stuff!
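To illustrate what embedding-based near-deduplication with a 0.95 threshold means, here is a minimal sketch in plain Python. It is not the semhash API and not the pipeline used to build this dataset. The function names, the hand-written toy vectors (standing in for potion-base-8M embeddings), the greedy keep-first strategy, and the `>=` comparison are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_deduplicate(records, embeddings, threshold=0.95):
    """Greedy near-deduplication (illustrative assumption, not semhash's algorithm).

    A record is kept only if its embedding has cosine similarity below
    `threshold` to every record that was already kept.
    """
    kept_indices = []
    for i, emb in enumerate(embeddings):
        is_duplicate = any(
            cosine_similarity(emb, embeddings[j]) >= threshold for j in kept_indices
        )
        if not is_duplicate:
            kept_indices.append(i)
    return [records[i] for i in kept_indices]


# Toy data: the vectors are made up by hand, not produced by a real model.
records = [
    "How do I reverse a list in Python?",
    "How can I reverse a list in Python?",  # near-duplicate of the first record
    "What is the capital of France?",
    "Explain gradient descent briefly.",
]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # cosine similarity to the first vector is about 0.999
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],    # similar to earlier vectors, but below the 0.95 threshold
]

kept = near_deduplicate(records, embeddings, threshold=0.95)
for text in kept:
    print(text)
```

In this sketch the second record is dropped because its vector is almost parallel to the first, while the fourth is kept because its highest similarity to a kept record is 0.8. The pairwise loop is quadratic in the number of records; for a dataset of this size a real pipeline would rely on a nearest-neighbor index rather than comparing every pair.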

Provider:
maas
Created:
2025-03-18