Reddit_Dirty_Writing_Prompts_ShareGPT
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/SicariusSicariiStuff/Reddit_Dirty_Writing_Prompts_ShareGPT
下载链接
链接失效反馈官方服务:
资源简介:
<div align="center">
<b style="font-size: 40px;">Reddit_Dirty_Writing_Prompts_ShareGPT</b>
</div>
# Dataset Details
This is the Reddit_Dirty_Writing_Prompts dataset, which I further cleaned and sorted. The original dataset can be found at: https://huggingface.co/datasets/nothingiisreal/Reddit-Dirty-And-WritingPrompts
- Additional meticulously cleaning performed
- ShareGPT Format
- Each entry contains the number of tokens in both LLAMA1 and LLAMA3 tokenizers
- Each entry contains the number of characters
- Longest entry in tokens: TOKENS_LLAMA1: 7545 \ TOKENS_LLAMA3: 6397
- Shortest entry in tokens: TOKENS_LLAMA1: 98 \ TOKENS_LLAMA3: 83
- Total_TOKENS_LLAMA1: 12874614 (12M), Total_TOKENS_LLAMA3: 10892913 (10M)
I hope this helps as many people as possible, let's make AI with less slop, and make AI accessible for everyone 🤗
# Reddit低俗写作提示ShareGPT数据集(Reddit_Dirty_Writing_Prompts_ShareGPT)
## 数据集详情
本数据集为经进一步精细化清洗与整理后的Reddit低俗写作提示(Reddit_Dirty_Writing_Prompts)数据集。原始数据集可于以下地址获取:https://huggingface.co/datasets/nothingiisreal/Reddit-Dirty-And-WritingPrompts
- 已执行精细化清洗操作
- 采用ShareGPT格式
- 每条数据均包含LLAMA1与LLAMA3分词器下的Token(Token)数量
- 每条数据均包含字符数统计
- 分词最长条目:LLAMA1分词计数:7545,LLAMA3分词计数:6397
- 分词最短条目:LLAMA1分词计数:98,LLAMA3分词计数:83
- LLAMA1总分词数:12874614(约1200万),LLAMA3总分词数:10892913(约1000万)
衷心希望本数据集能惠及更多用户,让我们携手打造更优质的人工智能,推动人工智能的全民普惠 🤗
提供机构:
maas
创建时间:
2025-11-20



