xlr8harder/wildchat-filtered-prompts

Name: xlr8harder/wildchat-filtered-prompts
Creator: xlr8harder
Published: 2025-12-17 20:49:09
License: not specified

Source: Hugging Face (updated 2025-12-17, indexed 2025-12-20)

Download link:

https://hf-mirror.com/datasets/xlr8harder/wildchat-filtered-prompts


Description:

This dataset contains 328,875 unique user prompts extracted from WildChat and processed through a multi-stage pipeline that removes duplicates, spam, and low-quality content. The source dataset, allenai/WildChat, contains 3,185,010 conversation turns; processing reduces this by roughly 90%. The pipeline has five stages: extracting the first user message, basic filtering (length and language), fuzzy deduplication, N-gram spam detection, and cluster-based spam detection. The dataset is distributed as JSONL, one prompt per line.
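The first three stages described above can be illustrated with a minimal sketch. This is not the dataset author's code: the conversation layout (a list of turns with `role` and `content` keys), the function names, the length bounds, and the similarity threshold are all assumptions made for illustration, and the standard-library `difflib` comparison stands in for whatever fuzzy-deduplication method was actually used. Language filtering and the two spam-detection stages are not covered.

```python
import re
from difflib import SequenceMatcher


def first_user_message(conversation):
    """Return the content of the first user turn, or None if there is none.

    Assumes (hypothetically) that a conversation is a list of dicts
    with "role" and "content" keys.
    """
    for turn in conversation:
        if turn.get("role") == "user":
            return turn.get("content")
    return None


def normalize(text):
    """Lowercase the text and collapse runs of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def filter_prompts(conversations, min_chars=10, max_chars=2000, threshold=0.9):
    """Extract first user messages, apply a length filter, drop near-duplicates.

    The bounds and threshold are illustrative values, not the ones
    used to build the dataset.
    """
    kept = []       # original prompts that survive all checks
    kept_norm = []  # normalized forms used for similarity comparison
    for conversation in conversations:
        prompt = first_user_message(conversation)
        if prompt is None:
            continue
        prompt = prompt.strip()
        # Basic length filter.
        if not (min_chars <= len(prompt) <= max_chars):
            continue
        # Fuzzy deduplication: skip prompts too similar to one already kept.
        norm = normalize(prompt)
        if any(SequenceMatcher(None, norm, seen).ratio() >= threshold
               for seen in kept_norm):
            continue
        kept.append(prompt)
        kept_norm.append(norm)
    return kept
```

The pairwise comparison is quadratic in the number of kept prompts, so it only suits small samples; at the scale of millions of turns, an approximate method such as MinHash with locality-sensitive hashing would be the more likely choice, though the listing does not say which technique was used.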

Provider:

xlr8harder
