
MasonMac/WildChat-4.8M-EN-Semantic-Deduplicated

Source: Hugging Face. Updated 2025-08-23. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/MasonMac/WildChat-4.8M-EN-Semantic-Deduplicated
Description:
This is a collection of deduplicated English prompts, each shorter than roughly 2,000 tokens. Non-English prompts were removed by requiring that at least 80% of a prompt's characters be English letters. Duplicates were then removed using HashSet and MinHash. Embeddings were generated with Qwen3-8B-Embedding and clustered, and from each cluster only the prompt farthest from the centroid was kept. Clustering results are provided at four similarity thresholds: 0.7, 0.75, 0.8, and 0.9.
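The filtering and deduplication steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the dataset author's actual pipeline: the function names, the whitespace and case normalization, the 3-word shingles, the 64 hash seeds, and the 0.8 MinHash threshold are all assumptions made for the example. The embedding step is not reproduced; `farthest_from_centroid` simply takes precomputed vectors for one cluster.

```python
import hashlib
import math
import re


def is_english(text, threshold=0.8):
    """True if at least `threshold` of non-whitespace chars are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    letters = sum(1 for c in chars if c.isascii() and c.isalpha())
    return letters / len(chars) >= threshold


def exact_dedup(prompts):
    """Drop exact duplicates with a hash set (normalization is an assumption)."""
    seen, kept = set(), []
    for p in prompts:
        key = " ".join(p.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


def shingles(text, n=3):
    """Set of overlapping n-word shingles (n=3 is an assumed choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """MinHash signature: minimum salted hash of the shingles, per seed."""
    grams = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_dedup(prompts, threshold=0.8):
    """Keep a prompt only if it is not too similar to any prompt kept so far."""
    kept, sigs = [], []
    for p in prompts:
        sig = minhash_signature(p)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(p)
            sigs.append(sig)
    return kept


def farthest_from_centroid(vectors):
    """Index of the vector with the lowest cosine similarity to the centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    return min(range(len(vectors)), key=lambda i: cosine(vectors[i], centroid))
```

The pairwise comparison in `near_dedup` is quadratic and only suitable for small samples; at the scale of millions of prompts a locality-sensitive hashing index would normally be used instead.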
Provider:
MasonMac