
Query-Free OpenWebText - Part 1: Clean and Dirty Corpus

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/m2gdxppkvj
Description:
This dataset is derived from the original OpenWebText corpus and is used to investigate the effect of intrinsic query language complexity by isolating pre-training exposure bias (RQ1, RQ2). This is Part 1 of the Query-Free OpenWebText dataset.

The OpenWebText Base Filtered Corpus consists of two variants of equivalent size:

OpenWebText-clean: The corpus from which all documents containing explicit SQL, SPARQL, or Cypher syntax or keywords have been removed using rigorous filtering protocols. It is used to train the base unbiased T5 (uT5) model.

OpenWebText-dirty: The original, unfiltered version of the OpenWebText corpus. It retains naturally occurring query language examples and so represents a typical pre-training distribution, which allows the effects of pre-training bias to be benchmarked.

Dataset Format: Arrow
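The description does not spell out the filtering protocol, so the following is only a minimal sketch of how a keyword and syntax filter of this kind might look. The patterns, the case-sensitive matching, and the names `QUERY_PATTERNS`, `contains_query_syntax`, and `filter_clean` are all assumptions made for illustration, not the dataset authors' actual rules.

```python
import re

# Hypothetical patterns (assumption): the real filtering rules are not
# given in the dataset description. Matching is case-sensitive here to
# limit false positives on ordinary prose such as "select ... from".
QUERY_PATTERNS = [
    r"\bSELECT\b[\s\S]{0,200}?\bFROM\b",            # SQL: SELECT ... FROM
    r"\bINSERT\s+INTO\b",                           # SQL: INSERT INTO
    r"\bSELECT\s+(?:DISTINCT\s+)?\?\w+",            # SPARQL: SELECT ?var
    r"\bPREFIX\s+\w*:\s*<",                         # SPARQL: PREFIX ns: <iri>
    r"\bMATCH\s*\(\s*\w*\s*(?::\s*\w+)?\s*\)",      # Cypher: MATCH (n:Label)
]
QUERY_RE = re.compile("|".join(f"(?:{p})" for p in QUERY_PATTERNS))


def contains_query_syntax(text: str) -> bool:
    """Return True if the text matches any of the illustrative patterns."""
    return QUERY_RE.search(text) is not None


def filter_clean(documents: list[str]) -> list[str]:
    """Keep only documents with no detected query-language syntax."""
    return [doc for doc in documents if not contains_query_syntax(doc)]


if __name__ == "__main__":
    docs = [
        "We went for a walk in the park.",
        "Run SELECT name FROM users WHERE id = 1 to find the row.",
        "MATCH (n:Person) RETURN n",
    ]
    print(filter_clean(docs))
```

A filter like this trades recall for precision: a stricter protocol would likely add case-insensitive matching and more keywords, at the cost of removing more ordinary prose.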
Created:
2025-11-17