mpasila/4chan-sharegpt-5k-sample-16k-context
收藏Hugging Face2025-12-05 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/mpasila/4chan-sharegpt-5k-sample-16k-context
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
tags:
- 4chan
size_categories:
- 1K<n<10K
---
Data is taken from [adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered](https://huggingface.co/datasets/adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered) with these filters:
1. Each conversation has a minimum of 20 responses.
2. No empty values.
3. Max length is 16k tokens based on [unsloth/Ministral-3-8B-Base-2512](https://huggingface.co/unsloth/Ministral-3-8B-Base-2512).
```markdown
Token Count Statistics:
Total conversations: 3216
Total tokens: 2989177
Average tokens per conversation: 929.47
Median tokens per conversation: 814.0
Maximum tokens in a conversation: 3420
Minimum tokens in a conversation: 116
Token Distribution by Role:
System messages: 37099 tokens (1.24%)
Human messages: 1530310 tokens (51.20%)
Assistant messages: 1421768 tokens (47.56%)
Token Count Distribution:
0-512: 630 conversations (19.59%)
513-1024: 1507 conversations (46.86%)
1025-2048: 957 conversations (29.76%)
2049-4096: 122 conversations (3.79%)
4097-8192: 0 conversations (0.00%)
8193-16384: 0 conversations (0.00%)
16385+: 0 conversations (0.00%)
```
语言:
- 英语
标签:
- 4chan
规模类别:
- 1千 < 样本数 < 1万
---
本数据集取材自[adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered](https://huggingface.co/datasets/adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered),并施加以下过滤规则:
1. 每条对话至少包含20条回复。
2. 无空值数据。
3. 基于[unsloth/Ministral-3-8B-Base-2512](https://huggingface.co/unsloth/Ministral-3-8B-Base-2512)的模型设定,最大Token长度为16384。
markdown
Token 计数统计:
对话总数量:3216
总Token数:2989177
单条对话平均Token数:929.47
单条对话Token数中位数:814.0
单条对话最大Token数:3420
单条对话最小Token数:116
按角色划分的Token分布:
系统消息:37099 Token(占比1.24%)
人类消息:1530310 Token(占比51.20%)
助手消息:1421768 Token(占比47.56%)
Token 数量区间分布:
0-512:630条对话(占比19.59%)
513-1024:1507条对话(占比46.86%)
1025-2048:957条对话(占比29.76%)
2049-4096:122条对话(占比3.79%)
4097-8192:0条对话(占比0.00%)
8193-16384:0条对话(占比0.00%)
16385+:0条对话(占比0.00%)
提供机构:
mpasila



