five

mpasila/BadVibesV1-16k-context

收藏
Hugging Face2025-12-12 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/mpasila/BadVibesV1-16k-context
下载链接
链接失效反馈
官方服务:
资源简介:
该数据集是一个组合数据集,包含了来自四个不同数据源的条目,并经过过滤和处理以适应ShareGPT格式。具体包括:3216条来自adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered的条目,19962条来自Fizzarolli/fse-raw-dump的条目,11547条来自R-Arfin/Depression的条目,以及5060条来自ShiniChien/creepypasta的条目。数据集总共有39785个条目,且每个对话的上下文长度不超过16k。此外,数据集还提供了详细的token统计信息,包括总token数、平均token数、中位数、最大值和最小值,以及按角色和范围的token分布情况。

This dataset is a combination of these datasets (which have been filtered/processed for ShareGPT format and made sure they dont exceed 16k context length based on [unsloth/Ministral-3-8B-Base-2512](https://huggingface.co/unsloth/Ministral-3-8B-Base-2512)s tokenizer): - 3216 entries from [adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered](https://huggingface.co/datasets/adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered) - 19962 entries from [Fizzarolli/fse-raw-dump](https://huggingface.co/datasets/Fizzarolli/fse-raw-dump) - 11547 entries from [R-Arfin/Depression](https://huggingface.co/datasets/R-Arfin/Depression) - 5060 entries from [ShiniChien/creepypasta](https://huggingface.co/datasets/ShiniChien/creepypasta). The data was also combined and shuffled. Total entries: 39785. Token Count Statistics: Total conversations: 39785, Total tokens: 114280013, Average tokens per conversation: 2872.44, Median tokens per conversation: 1842.0, Maximum tokens in a conversation: 16375, Minimum tokens in a conversation: 21. Token Distribution by Role: System messages: 655918 tokens (0.57%), Human messages: 2408497 tokens (2.11%), Assistant messages: 111215598 tokens (97.32%). Token Count Distribution: 0-512: 11628 conversations (29.23%), 513-1024: 3760 conversations (9.45%), 1025-2048: 5616 conversations (14.12%), 2049-4096: 8383 conversations (21.07%), 4097-8192: 7362 conversations (18.50%), 8193-16384: 3036 conversations (7.63%), 16385+: 0 conversations (0.00%).
提供机构:
mpasila
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作