iosscu/Refined-Anime-Text
收藏Hugging Face2025-12-18 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/iosscu/Refined-Anime-Text
下载链接
链接失效反馈官方服务:
资源简介:
这是一份包含超过一百万条、约4400万个 GPT-4/3.5 token的、全新合成的文本数据集的动漫主题子集。该数据集此前从未公开发布过。由于社区对动漫文化的浓厚兴趣,且考虑到通识数据集中此类题材的代表性不足,以及原始文本中网络俚语和无关内容的泛滥而导致的低质量、难以清理的问题,我们决定发布这份子集供进一步研究。这份数据集旨在用于研究大型语言模型中网络亚文化的数据治理,并探索具有挑战性的 LLM 持续预训练问题,例如特定主题的知识蒸馏以及对未见知识的持续学习。
This is a subset of our novel synthetic dataset of anime-themed text, containing over 1M entries, ~440M GPT-4/3.5 tokens. This dataset has never been publicly released before. We are releasing this subset due to the communitys interest in anime culture, which is underrepresented in general-purpose datasets, and the low quality of raw text due to the prevalence of internet slang and irrelevant content, making it difficult to clean. This dataset is intended for research on data governance of internet subcultures in large language models and to explore challenging LLM continual pre-training problems such as knowledge distillation on specific topics and continual learning of unseen knowledge.
提供机构:
iosscu



