Query-Free OpenWebText - Part 1: Clean and Dirty Corpus

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/m2gdxppkvj

Description:

This dataset is derived from the original OpenWebText corpus and is used to investigate the effect of intrinsic query language complexity by isolating pre-training exposure bias (RQ1, RQ2). This is Part 1 of the Query-Free OpenWebText dataset.

The OpenWebText Base Filtered Corpus consists of two variants of equivalent size:

OpenWebText-clean: The corpus from which all documents containing explicit SQL, SPARQL, or Cypher syntax or keywords have been removed using rigorous filtering protocols. It is used to train the base unbiased T5 (uT5) model.

OpenWebText-dirty: The original, unfiltered version of the OpenWebText corpus. It retains natural occurrences of query language examples to represent a typical pre-training distribution, so that the effects of pre-training bias can be benchmarked.

Dataset format: Arrow
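The description does not spell out the filtering protocol, so the following is only a minimal sketch of how a keyword-based split into clean and dirty variants might look. The regular expressions, the `contains_query_syntax` helper, and the `split_corpus` function are all assumptions made for illustration, not the dataset authors' rules. The sketch also does not attempt to equalize the sizes of the two variants, and loose patterns like these will flag some ordinary prose (for example "select one from the list").

```python
import re

# Hypothetical patterns, chosen for illustration only. The dataset's real
# filtering rules are not given in the description.
QUERY_PATTERNS = [
    # SQL: SELECT ... FROM within a short window, plus common statements
    re.compile(r"\bSELECT\b[\s\S]{0,200}?\bFROM\b", re.IGNORECASE),
    re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE|DELETE\s+FROM)\b", re.IGNORECASE),
    # SPARQL: PREFIX declarations and SELECT ... WHERE { ... }
    re.compile(r"\bPREFIX\s+\w*:\s*<[^>]*>", re.IGNORECASE),
    re.compile(r"\bSELECT\b[\s\S]{0,200}?\bWHERE\s*\{", re.IGNORECASE),
    # Cypher: MATCH (node) ... RETURN
    re.compile(r"\bMATCH\s*\([^)]*\)[\s\S]{0,200}?\bRETURN\b", re.IGNORECASE),
]


def contains_query_syntax(text):
    """Return True if any of the assumed query-language patterns matches."""
    return any(pattern.search(text) for pattern in QUERY_PATTERNS)


def split_corpus(documents):
    """Return (clean, dirty): clean drops flagged documents, dirty keeps all."""
    documents = list(documents)
    clean = [doc for doc in documents if not contains_query_syntax(doc)]
    return clean, documents


if __name__ == "__main__":
    sample = [
        "The weather is nice today.",
        "Run SELECT name FROM users WHERE id = 1;",
        "MATCH (n:Person) RETURN n",
    ]
    clean, dirty = split_corpus(sample)
    print(len(clean), len(dirty))
```

In this sketch the dirty variant is simply the unmodified input, mirroring the description of OpenWebText-dirty as the unfiltered corpus.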

Created:

2025-11-17
