WebText Dataset

paperswithcode.com, indexed 2025-03-21

Download link:

https://paperswithcode.com/dataset/webtext

Resource description:

WebText is an internal OpenAI corpus built by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using karma as a heuristic indicator of whether other users found a link interesting, educational, or just funny. WebText contains the text subset of these 45 million links: over 8 million documents totaling 40 GB of text. All Wikipedia documents were removed from WebText because Wikipedia is a common data source for other datasets.
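
The selection rule described above (keep links with at least 3 karma, drop Wikipedia) can be illustrated with a minimal sketch. This is not OpenAI's pipeline: the `(url, karma)` input format, the function names, and the choice to apply the Wikipedia filter at the URL level are all assumptions made for illustration.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold stated in the description above


def is_wikipedia(url: str) -> bool:
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_links(submissions):
    """Keep non-Wikipedia URLs that received at least MIN_KARMA karma.

    `submissions` is a hypothetical iterable of (url, karma) pairs;
    the real scraper's input format is not described in the source.
    """
    return [
        url
        for url, karma in submissions
        if karma >= MIN_KARMA and not is_wikipedia(url)
    ]


if __name__ == "__main__":
    sample = [
        ("https://example.com/a", 3),
        ("https://example.com/b", 2),
        ("https://en.wikipedia.org/wiki/X", 10),
    ]
    print(select_links(sample))
```

Matching on the parsed hostname rather than on a substring of the URL avoids discarding unrelated pages that merely mention "wikipedia" in their path.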

Provided by:

Papers with Code

AI-collected summary

Dataset introduction