Sao10K/c2-Logs-Filtered
收藏Hugging Face2024-06-20 更新2024-06-15 收录
下载链接:
https://hf-mirror.com/datasets/Sao10K/c2-Logs-Filtered
下载链接
链接失效反馈官方服务:
资源简介:
该数据集来源于多个日志文件,经过多个过滤步骤处理,最终生成了多个过滤后的数据集。过滤步骤包括去除非英语内容、去除代理提及、转换为多轮对话格式、确保最小轮次、去除不完整响应、去除拒绝和提及OpenAI/Claude的内容、验证格式等。数据集未经过清洗或去重处理,用户需自行处理。
This dataset is derived from multiple log files and has undergone several filtration steps to generate multiple filtered datasets. The filtration steps include removing non-English content, removing mentions of proxies, converting to a multiturn format, ensuring a minimum number of turns, removing incomplete responses, removing refusals and mentions of OpenAI/Claude, and verifying the format. The dataset is not cleaned or deduplicated, and users are expected to handle it themselves.
提供机构:
Sao10K
原始信息汇总
数据集概述
语言
- 英语 (en)
配置
- Part 2 - Filtered
- 数据文件: Part2/c2-filter9.json
- Part 3 - Filtered
- 数据文件: Part3/c2-filter9.json
- Part 4 - Filtered
- 数据文件: Part4/c2-filter9.json
- Part 5 - Filtered
- 数据文件: Part5/c2-filter9.json
标签
- 不适合所有观众 (not-for-all-audiences)
过滤步骤
- Raw - 原始日志合并
- Filter 1 - 仅包含Opus
- Filter 2 - 非英语/NaN/代理提及
- Filter 3 - 仅限少于8K Token
- Filter 4 - 转换为多轮格式
- Filter 5 - 确保至少3轮
- Filter 6 - 移除和过滤不完整响应
- Filter 7 - 移除拒绝/OpenAI/Claude提及
- Filter 8 - 有效模式检查/字符串检查
- Filter 9 - 空响应检查
附加脚本
- Response - 将最终轮转换为响应
- ShareGPT - 转换为ShareGPT格式
- Verify - 验证正确格式
数据统计
| 数据 | Part 2 | Part 3 | Part 4 | Part 5 | 总计 |
|---|---|---|---|---|---|
| Raw | 164133 | 335527 | 221434 | 115578 | 836672 |
| Filter 1 | 136630 | 280614 | 183252 | 95440 | 513119 |
| Filter 2 | 28016 | 55306 | 34578 | 17952 | 135852 |
| Filter 3 | 17415 | 32388 | 21622 | 11046 | 82471 |
| Filter 4 | 17415 | 32388 | 21622 | 11046 | 82471 |
| Filter 5 | 15511 | 29199 | 19157 | 9736 | 70303 |
| Filter 6 | 13931 | 26186 | 17359 | 8967 | 66443 |
| Filter 7 | 13617 | 25242 | 16771 | 8692 | 64322 |
| Filter 8 | 13617 | 25241 | 16771 | 8692 | 64321 |
| Filter 9 | 13611 | 25205 | 16763 | 8689 | 64268 |



