WebText
Source: Papers with Code. Indexed 2024-05-15.
Download link:
https://paperswithcode.com/dataset/webtext
Overview:
WebText is an internal OpenAI corpus built by scraping web pages, with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma (Reddit's user-awarded score). They treated this karma threshold as a heuristic indicator of whether other users found the link interesting, educational, or just funny.
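The karma-threshold heuristic described above can be illustrated with a minimal Python sketch. This is not OpenAI's actual pipeline: the `Submission` record, the `select_outbound_links` function, and the example URLs are all hypothetical, and the de-duplication step is an added assumption rather than something stated in this description. Only the threshold of 3 karma comes from the text.

```python
from dataclasses import dataclass

MIN_KARMA = 3  # threshold stated in the dataset description


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for a Reddit post that links to an external page."""
    url: str
    karma: int


def select_outbound_links(submissions, min_karma=MIN_KARMA):
    """Return URLs whose post reached min_karma, in first-seen order.

    De-duplication is an assumption made for this sketch.
    """
    seen = set()
    selected = []
    for post in submissions:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            selected.append(post.url)
    return selected


# Made-up example data; example.com URLs are placeholders.
posts = [
    Submission("https://example.com/a", 5),
    Submission("https://example.com/b", 2),  # below the threshold, dropped
    Submission("https://example.com/a", 3),  # duplicate URL, dropped
    Submission("https://example.com/c", 3),  # exactly at the threshold, kept
]

print(select_outbound_links(posts))
# ['https://example.com/a', 'https://example.com/c']
```

The comparison is inclusive (`>=`) because the description says "at least 3 karma". The selected URLs would then be fetched and cleaned in a separate step that this sketch does not cover.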



