WebText

Source: Papers with Code (indexed 2024-05-15)

Download link:

https://paperswithcode.com/dataset/webtext


Description:

WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using that threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.
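The karma threshold described above can be illustrated with a minimal sketch. This is not OpenAI's actual pipeline: the `Submission` record, the `select_outbound_links` helper, and the sample URLs are hypothetical names introduced only for illustration. The only detail taken from the description is the cutoff of at least 3 karma.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Threshold taken from the description: links need at least 3 karma.
MIN_KARMA = 3


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for a Reddit post that links to an external page."""
    url: str
    karma: int


def select_outbound_links(
    submissions: Iterable[Submission], min_karma: int = MIN_KARMA
) -> List[str]:
    """Keep the URLs of submissions whose karma meets the threshold."""
    return [s.url for s in submissions if s.karma >= min_karma]


# Made-up sample data; example.com URLs are placeholders.
sample = [
    Submission("https://example.com/a", 12),
    Submission("https://example.com/b", 2),
    Submission("https://example.com/c", 3),
]

kept = select_outbound_links(sample)
print(kept)
```

With this sample, the post with 2 karma is dropped and the other two URLs are kept, since the comparison is inclusive at 3.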

