MasonMac/WildChat-4.8M-EN-Semantic-Deduplicated
Source: Hugging Face | Updated: 2025-08-23 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/MasonMac/WildChat-4.8M-EN-Semantic-Deduplicated
Description:
This is a collection of deduplicated English prompts, each shorter than ~2000 tokens. Non-English prompts were removed by requiring that at least 80% of each prompt consist of English alphabet characters. Deduplication was performed using HashSet and MinHash. Embeddings were generated using Qwen3-8B-Embedding and used for clustering, keeping only the prompt farthest from the centroid. The dataset provides clustering results at four similarity thresholds: 0.7, 0.75, 0.8, and 0.9.
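The filtering and selection steps described above can be illustrated with a minimal sketch. This is not the dataset author's code. The function names are hypothetical, the letter ratio is assumed to be measured over non-whitespace characters, and "farthest from the centroid" is assumed to mean lowest cosine similarity to the cluster mean. MinHash near-duplicate detection and the embedding model itself are omitted.

```python
import math
import string

ASCII_LETTERS = set(string.ascii_letters)


def english_letter_ratio(text):
    """Fraction of non-whitespace characters that are ASCII letters.

    Assumption: the description does not say which characters form the
    denominator; non-whitespace characters are used here.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in ASCII_LETTERS for c in chars) / len(chars)


def filter_and_dedupe(prompts, min_ratio=0.8):
    """Keep prompts that pass the letter-ratio check, dropping exact duplicates."""
    seen = set()  # hash set of prompts already kept
    kept = []
    for prompt in prompts:
        key = prompt.strip()
        if english_letter_ratio(key) < min_ratio:
            continue  # treated as non-English
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        kept.append(key)
    return kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def farthest_from_centroid(vectors):
    """Index of the vector least similar (by cosine) to the cluster mean."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    sims = [cosine(v, centroid) for v in vectors]
    return min(range(len(vectors)), key=lambda i: sims[i])


if __name__ == "__main__":
    sample = ["Hello world", "Hello world", "你好,世界", "12345 67890 abc"]
    print(filter_and_dedupe(sample))
    print(farthest_from_centroid([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
```

In a real pipeline the toy vectors would be replaced by model embeddings, and clusters would be formed at one of the listed similarity thresholds before a representative is selected.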
Provider:
MasonMac



