CausalLM/Refined-Anime-Text

Name: CausalLM/Refined-Anime-Text
Creator: CausalLM
Published: 2025-02-14 18:30:24
License: No description available

Source: Hugging Face | Updated: 2025-02-14 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/CausalLM/Refined-Anime-Text

Description:

This is the anime-themed subset of a novel synthetic text dataset. The subset contains over 1M entries and ~440M GPT-4/3.5 tokens (the Chinese-language version of this description on the source page gives roughly 44 million tokens). The dataset has never been publicly released before.

The creators are releasing this subset for further research for three reasons: the community's strong interest in anime culture, the underrepresentation of this subject in general-purpose datasets, and the low quality of the raw text, which is hard to clean because of the prevalence of internet slang and irrelevant content. The dataset is intended for research on data governance of internet subcultures in large language models, and for exploring challenging LLM continual pre-training problems such as knowledge distillation on specific topics and continual learning of unseen knowledge.

The data was created by taking web-scraped text (Wikipedia content is excluded from this subset), passing the full text of each web page through a large language model that supports long context windows (GPT-4-32k or GPT-3.5-16K, switched dynamically based on difficulty), and synthesizing a refined version. The dataset contains text in English and Chinese.
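The description does not say how "difficulty" was measured or how the switch between the two models was made. As a rough illustration only, the sketch below routes a page to one of the two model tiers using two assumed signals: an approximate token count and the share of symbol characters as a stand-in for noisy text. The function names, the 4-characters-per-token estimate, the 0.15 threshold, and the context limits are all hypothetical and are not taken from the dataset's actual pipeline.

```python
from typing import Optional

# Hypothetical context budgets in tokens, inferred only from the model names.
# The strings are plain labels for this sketch, not verified API identifiers.
CONTEXT_LIMITS = {"gpt-3.5-16k": 16_000, "gpt-4-32k": 32_000}


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def difficulty_score(text: str) -> float:
    """Hypothetical difficulty proxy: fraction of characters that are
    neither alphanumeric nor whitespace (symbols, punctuation, emoji)."""
    if not text:
        return 0.0
    noisy = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return noisy / len(text)


def choose_model(text: str, threshold: float = 0.15) -> Optional[str]:
    """Pick a model tier for one page, or None if the page is too long for both."""
    tokens = estimate_tokens(text)
    if tokens > CONTEXT_LIMITS["gpt-4-32k"]:
        return None  # would need chunking, which this sketch does not cover
    if tokens > CONTEXT_LIMITS["gpt-3.5-16k"]:
        return "gpt-4-32k"  # only the larger window fits the page
    if difficulty_score(text) > threshold:
        return "gpt-4-32k"  # noisy page: send to the stronger tier
    return "gpt-3.5-16k"


if __name__ == "__main__":
    for page in ["Hello world, this is clean text.", "!!!###$$$ lol ~~~ ???"]:
        print(repr(page), "->", choose_model(page))
```

A real pipeline would likely use an actual tokenizer and a better difficulty signal; the point here is only the shape of a per-page routing decision made before the refinement call.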

Provided by:

CausalLM

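Summary of original information

Dataset overview

Dataset status

This dataset is no longer available on HF Mirror.

How to obtain

Consider contacting users who have already downloaded the dataset to obtain a copy.

Notes

If you have a copy of this dataset, please do not re-upload it to HF Mirror.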
