smoltalk-semhashed
ModelScope community. Updated: 2025-12-05. Indexed: 2025-03-22.
Download link:
https://modelscope.cn/datasets/mlabonne/smoltalk-semhashed
Description:
# SmolTalk SemHashed

This is a near-deduplicated version of [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) created with the [semhash](https://github.com/MinishLab/semhash/tree/main) library.
Instead of MinHash deduplication, it uses embeddings generated with [minishlab/potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), a distilled version of [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5), and a similarity threshold of 0.95 (see the [vicinity](https://github.com/MinishLab/vicinity) library).
❤️ Kudos to [minishlab](https://huggingface.co/minishlab) for this super cool stuff!
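
To make the idea of embedding-based near-deduplication concrete, here is a minimal sketch in plain Python. It is not the semhash implementation: the function names, the greedy keep-first strategy, the use of cosine similarity, and the toy three-dimensional vectors are all assumptions chosen for illustration. In practice the embeddings would come from a model such as potion-base-8M and the neighbor search would be approximate rather than a brute-force scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_by_embedding(records, embeddings, threshold=0.95):
    """Greedy near-deduplication (hypothetical helper, not a semhash API).

    A record is kept only if its embedding has similarity below
    `threshold` to every record that has already been kept.
    """
    kept_idx = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [records[i] for i in kept_idx]


# Toy data: the vectors are made up and stand in for real model embeddings.
records = [
    "How do I sort a list?",
    "How can I sort a list?",
    "What is the capital of France?",
]
embeddings = [
    [1.0, 0.0, 0.1],
    [0.99, 0.05, 0.1],  # nearly parallel to the first vector
    [0.0, 1.0, 0.0],
]

kept = dedup_by_embedding(records, embeddings, threshold=0.95)
print(kept)
```

With these toy vectors, the second record is dropped because its similarity to the first exceeds 0.95, while the third is kept. Lowering the threshold removes more aggressively; raising it keeps more near-duplicates.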
Provider:
maas
Created:
2025-03-18



