Pile Uncopyrighted Dataset

Name: Pile Uncopyrighted Dataset
Creator: Hugging Face
License: No description available

Source: arXiv; indexed 2025-09-30

Download link:

https://huggingface.co/datasets/monology/pile-uncopyrighted

Description:

This dataset is a randomly selected subset of the Pile Uncopyrighted dataset. It is used to train sparse autoencoders (SAEs) in order to evaluate the performance of Matryoshka SAEs and TopK SAEs. It contains only non-copyrighted text, totals 50 million tokens, and the associated task is language modeling.
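As a rough illustration of how a subset like this might be drawn, the sketch below streams the source dataset and keeps documents until a token budget is reached. It is a minimal sketch under stated assumptions, not the procedure used to build this subset: the helper names, the `"text"` field, the shuffle seed and buffer size, and the whitespace-based token count (a crude stand-in for a real tokenizer) are all assumptions made for illustration.

```python
from typing import Iterable, Iterator


def take_token_budget(texts: Iterable[str], budget: int) -> Iterator[str]:
    """Yield texts until the running token count reaches `budget`.

    Tokens are approximated by whitespace splitting (an assumption);
    a real pipeline would count tokens with the model's tokenizer.
    """
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += len(text.split())
        yield text


def stream_pile_texts(seed: int = 0, buffer_size: int = 10_000) -> Iterator[str]:
    """Stream shuffled documents from the Hugging Face Hub (hypothetical helper).

    Requires the `datasets` package and network access. The "text" field
    name is assumed here and should be checked against the dataset viewer.
    """
    from datasets import load_dataset  # imported lazily so the module loads offline

    stream = load_dataset(
        "monology/pile-uncopyrighted", split="train", streaming=True
    )
    # Approximate random selection with a buffered shuffle over the stream.
    for record in stream.shuffle(seed=seed, buffer_size=buffer_size):
        yield record["text"]
```

With network access, a 50-million-token sample could then be iterated with `take_token_budget(stream_pile_texts(), 50_000_000)`. Because the budget is checked before each document, the final document may push the total slightly past the budget.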

Provided by:

Hugging Face
