
mpasila/4chan-sharegpt-5k-sample-16k-context

Source: Hugging Face | Updated: 2025-12-05 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/mpasila/4chan-sharegpt-5k-sample-16k-context
Description:
---
language:
- en
tags:
- 4chan
size_categories:
- 1K<n<10K
---

Data is taken from [adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered](https://huggingface.co/datasets/adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered) with these filters:

1. Each conversation has a minimum of 20 responses.
2. No empty values.
3. Max length is 16k tokens based on [unsloth/Ministral-3-8B-Base-2512](https://huggingface.co/unsloth/Ministral-3-8B-Base-2512).

```markdown
Token Count Statistics:
Total conversations: 3216
Total tokens: 2989177
Average tokens per conversation: 929.47
Median tokens per conversation: 814.0
Maximum tokens in a conversation: 3420
Minimum tokens in a conversation: 116

Token Distribution by Role:
System messages: 37099 tokens (1.24%)
Human messages: 1530310 tokens (51.20%)
Assistant messages: 1421768 tokens (47.56%)

Token Count Distribution:
0-512: 630 conversations (19.59%)
513-1024: 1507 conversations (46.86%)
1025-2048: 957 conversations (29.76%)
2049-4096: 122 conversations (3.79%)
4097-8192: 0 conversations (0.00%)
8193-16384: 0 conversations (0.00%)
16385+: 0 conversations (0.00%)
```
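
The three filters and the bucketed distribution above could be reproduced roughly as in the sketch below. It is illustrative only and is not the dataset author's script. It assumes the common ShareGPT layout (a list of messages with `from` and `value` keys), reads "20 responses" as 20 messages, takes "16k" to mean 16384 tokens, and uses a whitespace counter as a stand-in for the unsloth/Ministral-3-8B-Base-2512 tokenizer. The names `keep_conversation`, `bucket_label`, and `whitespace_token_count` are hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]      # assumed ShareGPT shape: {"from": ..., "value": ...}
Conversation = List[Message]


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption); a real run would use the model tokenizer."""
    return len(text.split())


def keep_conversation(
    conv: Conversation,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    min_responses: int = 20,   # filter 1 (interpreted as number of messages)
    max_tokens: int = 16384,   # filter 3 (assumed meaning of "16k")
) -> bool:
    """Return True if the conversation passes all three filters."""
    # Filter 1: enough messages in the thread.
    if len(conv) < min_responses:
        return False
    # Filter 2: no missing or whitespace-only values.
    if any(not (msg.get("value") or "").strip() for msg in conv):
        return False
    # Filter 3: total token count within the context limit.
    total = sum(count_tokens(msg["value"]) for msg in conv)
    return total <= max_tokens


# Same bucket edges as the "Token Count Distribution" table.
BUCKETS = [(0, 512), (513, 1024), (1025, 2048),
           (2049, 4096), (4097, 8192), (8193, 16384)]


def bucket_label(token_count: int) -> str:
    """Map a per-conversation token count to its distribution bucket."""
    for low, high in BUCKETS:
        if low <= token_count <= high:
            return f"{low}-{high}"
    return "16385+"
```

To get counts comparable to the published statistics, `count_tokens` would need to wrap the actual tokenizer; the whitespace counter is only there to keep the sketch runnable without downloads.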

Provider: mpasila