Query-Free OpenWebText - Part 1: Clean and Dirty Corpus
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/m2gdxppkvj
Description:
This dataset is derived from the original OpenWebText corpus and is used to investigate the effect of intrinsic query language complexity by isolating pre-training exposure bias (RQ1, RQ2).
This is Part 1 of the Query-Free OpenWebText dataset.
The OpenWebText Base Filtered Corpus consists of two variants of equivalent size:
OpenWebText-clean: The corpus from which all documents containing explicit SQL, SPARQL, or Cypher syntax/keywords have been removed using rigorous filtering protocols. This is used to train the base unbiased T5 (uT5) model.
OpenWebText-dirty: The original, unfiltered version of the OpenWebText corpus. This dataset retains natural occurrences of query language examples to represent a typical pre-training distribution, allowing for the benchmarking of pre-training bias effects.
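The listing does not publish the filtering protocol itself. As a rough illustration only, the clean/dirty split could be approximated with a keyword-pattern filter like the sketch below. The patterns, the names QUERY_PATTERNS, contains_query_language, and filter_clean, and the choice to match only upper-case keywords are all assumptions made for this example; they are not the dataset authors' actual rules.

```python
import re

# Hypothetical patterns, NOT the dataset's real protocol.
# Matching is case-sensitive on upper-case keywords (an assumption) so that
# ordinary prose such as "select a card from the deck" is not flagged.
QUERY_PATTERNS = [
    re.compile(r"\bSELECT\b[\s\S]{0,200}?\bFROM\b"),       # SQL: SELECT ... FROM
    re.compile(r"\bINSERT\s+INTO\b"),                       # SQL: INSERT INTO
    re.compile(r"\bPREFIX\s+\w*:\s*<"),                     # SPARQL: PREFIX ns: <iri>
    re.compile(r"\bSELECT\b[\s\S]{0,200}?\bWHERE\s*\{"),    # SPARQL: SELECT ... WHERE {
    re.compile(r"\bMATCH\s*\(\s*\w*\s*(:\w+)?\s*\)"),       # Cypher: MATCH (n:Label)
]


def contains_query_language(text: str) -> bool:
    """Return True if any hypothetical query-language pattern occurs in text."""
    return any(pattern.search(text) for pattern in QUERY_PATTERNS)


def filter_clean(documents: list[str]) -> list[str]:
    """Keep only documents with no detected query-language syntax."""
    return [doc for doc in documents if not contains_query_language(doc)]


# Toy stand-in for the unfiltered ("dirty") corpus.
dirty_docs = [
    "The weather was nice and we walked along the river.",
    "Run SELECT name FROM users WHERE id = 1; to fetch the row.",
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/> SELECT ?n WHERE { ?p foaf:name ?n }",
    "MATCH (n:Person) RETURN n.name",
    "Please select a card from the deck.",
]

clean_docs = filter_clean(dirty_docs)
```

A filter this crude will miss lower-case queries and may flag unrelated text, so it only conveys the general idea of document-level removal. Because filtering shrinks the corpus, producing two variants of equivalent size would also need a size-matching step (for example, sampling the unfiltered corpus down), which this sketch does not cover.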
Dataset Format: Arrow
Created:
2025-11-17



