WebText Dataset
Source: paperswithcode.com (indexed 2025-03-21)
Download link:
https://paperswithcode.com/dataset/webtext
Resource description:
WebText is an internal OpenAI corpus built by scraping web pages, with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using the karma threshold as a heuristic indicator of whether other users found a link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText because Wikipedia is a common data source for other datasets.
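The link-selection heuristic described above can be illustrated with a small Python sketch. This is not OpenAI's actual pipeline, which is not shown here. The function names, the `(url, karma)` input format, the sample URLs, and the de-duplication step are all assumptions made for illustration. Only the threshold of 3 karma and the Wikipedia exclusion come from the description.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold stated in the dataset description


def is_wikipedia(url: str) -> bool:
    """Return True if the URL's host is wikipedia.org or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_links(submissions, min_karma=MIN_KARMA):
    """Keep outbound links that pass the karma threshold and are not Wikipedia.

    `submissions` is a hypothetical iterable of (url, karma) pairs.
    Dropping repeated URLs is an assumption of this sketch.
    """
    seen = set()
    kept = []
    for url, karma in submissions:
        if karma < min_karma:
            continue  # too little karma: treated as low interest
        if is_wikipedia(url):
            continue  # Wikipedia is excluded from the corpus
        if url in seen:
            continue  # skip URLs already selected
        seen.add(url)
        kept.append(url)
    return kept


# Made-up sample data, for illustration only.
sample = [
    ("https://example.com/article", 5),
    ("https://en.wikipedia.org/wiki/Reddit", 120),
    ("https://example.org/post", 2),
    ("https://example.com/article", 7),
    ("https://example.net/essay", 3),
]
selected = select_links(sample)
print(selected)
```

In this sketch the karma check acts as a cheap proxy for human curation: a link survives only if Reddit users rated it highly enough, regardless of what the linked page contains.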
Provided by:
Papers with Code