adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered
收藏Hugging Face2024-10-11 更新2024-12-14 收录
下载链接:
https://hf-mirror.com/datasets/adamo1139/4chan_archive_ShareGPT_fixed_newlines_unfiltered
下载链接
链接失效反馈官方服务:
资源简介:
这是一个包含218,801,883个token(使用Mistral tokenizer)的数据集,数据来源于4chan。原始数据缺少换行和聊天结构,因此通过脚本引入了聊天结构,寻找回复链。接着,使用Hermes Llama 3.1 8B模型处理了2.5M个样本,其中包含大约3000个固定的提示token和动态添加的218M个token,输出了大约218M个token来生成这个数据集。作者计划进一步过滤数据集,以去除低质量数据,保留高质量的对话。
This is a dataset with 218,801,883 tokens (Mistral tokenizer) of data scraped off 4chan. Since the original data was missing newlines and chat structure, chat structure was introduced via a script that looked for reply chains. Then, it was processed using a Hermes Llama 3.1 8B model, with 2.5M samples containing approximately 3000 fixed prompt tokens and dynamically added 218M tokens from the dataset, resulting in an output of around 218M tokens to produce this dataset. The author plans to further filter the dataset to remove low-quality data and retain high-quality conversations.
提供机构:
adamo1139



