danwil/owt-ngrams

Name: danwil/owt-ngrams
Creator: danwil
Published: 2024-07-22 07:11:32
License: No description available

Hugging Face, updated 2024-07-22, indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/danwil/owt-ngrams

Resource description:

This dataset contains the most common contiguous token subsequences (n-grams) of length 1 to 6 in an open-source replication of the OpenWebText (OWT) dataset. The OWT replication was compiled by Aaron Gokaslan and Vanya Cohen of Brown University. The dataset lists the number of n-grams included and the minimum number of times each included sequence occurs in the ~9B-token dataset (its frequency). Notably, all individual tokens (1-grams) are included. The dataset was used to demonstrate that gpt2-small sparse autoencoders memorize the most frequently presented n-grams more precisely.
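To make "most common contiguous n-grams with a minimum frequency" concrete, here is a minimal Python sketch of how such counts could be produced from a token sequence. It is not the dataset author's pipeline. The function names `count_ngrams` and `most_common_ngrams`, the `min_count` threshold, and the toy token IDs are all hypothetical, chosen only for illustration. Keeping every 1-gram regardless of count mirrors the description's note that all individual tokens are included.

```python
from collections import Counter


def count_ngrams(tokens, max_n=6):
    """Count every contiguous n-gram of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def most_common_ngrams(tokens, max_n=6, min_count=2):
    """Keep all 1-grams, plus longer n-grams seen at least min_count times.

    min_count is a hypothetical cutoff standing in for the per-n
    minimum frequency mentioned in the dataset description.
    """
    counts = count_ngrams(tokens, max_n)
    return {
        gram: c
        for gram, c in counts.items()
        if len(gram) == 1 or c >= min_count
    }


# Toy token IDs (hypothetical, not real GPT-2 token IDs).
tokens = [5, 7, 5, 7, 9]
common = most_common_ngrams(tokens, max_n=3, min_count=2)
```

In this toy run, the bigram `(5, 7)` is kept because it occurs twice, while `(7, 5)` is dropped because it occurs once; the token `9` is kept even though it occurs only once, since 1-grams are never filtered. A corpus of roughly 9B tokens would need a streaming or sharded variant of this approach rather than a single in-memory `Counter`.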

Provided by:

danwil
