nothingiisreal/c2-logs-cleaned
Source: Hugging Face. Updated 2024-07-16; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/nothingiisreal/c2-logs-cleaned
Resource summary:
The dataset was first cleaned with a classifier trained by user maywell to remove refusals, then deduplicated and cleaned further. The filtering steps were: removing refusals; deduplication; removing non-English content; removing empty messages, mentions of proxy, cut-off messages, and anonymous conversations; and re-running the non-English filter. The data size after each filtering step is: Filter 0: 20GB, Filter 1: 4.68GB, Filter 2: 3.22GB, Filter 3: 1.38GB, Filter 4: 1.35GB.
Provider:
nothingiisreal
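Original information summary
Dataset overview
Data processing steps
- Filter 0: Remove refusals (and similar content).
- Filter 1: Deduplicate.
- Filter 2: Remove non-English content.
- Filter 3: Remove empty messages, mentions of proxy, cut-off messages (assistant only), and anonymous conversations.
- Filter 4: Re-run Filter 2, because some CoT junk confused Filter 2 on the first attempt.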
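The listing does not publish the filtering code, so the following is only a minimal sketch of what the deduplication step and part of Filter 3 might look like. It assumes each conversation is a list of `{"role", "content"}` dictionaries, which is a hypothetical schema, and it uses exact-match hashing and a plain substring check for "proxy"; the function names are illustrative and the dataset authors' actual method may differ.

```python
import hashlib
import json


def conversation_key(conversation):
    """Return a stable hash of a conversation's roles and stripped text."""
    payload = json.dumps(
        [(m["role"], m["content"].strip()) for m in conversation],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(conversations):
    """Keep the first occurrence of each exactly repeated conversation."""
    seen = set()
    unique = []
    for conv in conversations:
        key = conversation_key(conv)
        if key not in seen:
            seen.add(key)
            unique.append(conv)
    return unique


def has_empty_message(conversation):
    """True if any message is empty or whitespace only."""
    return any(not m["content"].strip() for m in conversation)


def mentions_proxy(conversation):
    """True if any message contains the word 'proxy' (assumed heuristic)."""
    return any("proxy" in m["content"].lower() for m in conversation)


def clean(conversations):
    """Deduplicate, then drop conversations with empty or proxy-related messages."""
    kept = deduplicate(conversations)
    return [
        c for c in kept
        if not has_empty_message(c) and not mentions_proxy(c)
    ]
```

Calling `clean(conversations)` on a list in the assumed schema returns the surviving conversations in their original order. Refusal detection, language filtering, and cut-off detection are left out because they depend on models or heuristics that the listing does not describe.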
Data size
- Filter 0: 20GB
- Filter 1: 4.68GB
- Filter 2: 3.22GB
- Filter 3: 1.38GB
- Filter 4: 1.35GB
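To make the effect of each step easier to compare, the sizes above can be turned into per-step reductions. This is a small arithmetic sketch using only the figures listed; the variable and function names are illustrative, and the percentages are relative to the size after the preceding filter, not to any raw size before Filter 0, which the listing does not give.

```python
# Sizes in GB after each filter, copied from the list above.
sizes_gb = {
    "Filter 0": 20.0,
    "Filter 1": 4.68,
    "Filter 2": 3.22,
    "Filter 3": 1.38,
    "Filter 4": 1.35,
}


def step_reductions(sizes):
    """Return (step, GB removed, percent removed) relative to the previous step."""
    names = list(sizes)
    rows = []
    for prev, cur in zip(names, names[1:]):
        removed = sizes[prev] - sizes[cur]
        rows.append((cur, round(removed, 2), round(100 * removed / sizes[prev], 1)))
    return rows


for step, removed_gb, removed_pct in step_reductions(sizes_gb):
    print(f"{step}: removed {removed_gb} GB ({removed_pct}% of the previous size)")

# Share of the post-Filter-0 data that survives through Filter 4.
retained_pct = round(100 * sizes_gb["Filter 4"] / sizes_gb["Filter 0"], 2)
print(f"Retained after Filter 4: {retained_pct}% of the Filter 0 size")
```

By this calculation deduplication (Filter 1) accounts for the largest drop, about 15.32GB, while the Filter 4 re-run removes only about 0.03GB.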



