danwil/owt-ngrams
Source: Hugging Face. Updated 2024-07-22. Indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/danwil/owt-ngrams
Description:
This dataset contains the most common contiguous token subsequences (n-grams) of length 1 to 6 in an open-source replication of the OpenWebText (OWT) dataset. The OWT replication was compiled by Aaron Gokaslan and Vanya Cohen of Brown University. The dataset details the number of each n-gram and the minimum number of times each sequence occurs in the ~9B-token dataset (its frequency). Notably, all individual tokens (1-grams) are included. The dataset was used to demonstrate that gpt2-small sparse autoencoders memorize the most commonly presented n-grams more precisely.
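To make "contiguous token subsequence" concrete, here is a minimal sketch of counting 1- to 6-grams over a list of token IDs and keeping only those above a frequency threshold. It is an illustration, not the procedure used to build this dataset: the function names `count_ngrams` and `filter_by_min_frequency`, the toy token IDs, and the threshold are all hypothetical, and the rule of keeping every 1-gram regardless of frequency is an assumption modeled on the note that all individual tokens are included.

```python
from collections import Counter


def count_ngrams(tokens, max_n=6):
    """Count every contiguous subsequence of length 1..max_n in `tokens`."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # A sequence of length L contains L - n + 1 windows of length n.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_by_min_frequency(counts, min_freq):
    """Keep n-grams seen at least `min_freq` times.

    Assumption: 1-grams are always kept, mirroring the description's
    statement that all individual tokens are included.
    """
    return {
        ngram: c
        for ngram, c in counts.items()
        if len(ngram) == 1 or c >= min_freq
    }


# Toy token IDs (hypothetical; real input would be tokenizer output).
toy_tokens = [1, 2, 1, 2, 3]
toy_counts = count_ngrams(toy_tokens)
toy_kept = filter_by_min_frequency(toy_counts, min_freq=2)
```

On a corpus of roughly 9B tokens, an in-memory `Counter` like this would not be practical; a real pipeline would need sharded or streaming counts, which this sketch does not attempt.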
Provided by:
danwil



