shaguftakhan2k17/openwebtext2
Source: Hugging Face. Updated 2025-12-13; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/shaguftakhan2k17/openwebtext2
Description:
A cleaned version of OpenWebText2, produced by removing non-English, duplicated, copyrighted, and low-quality samples (too short, too many special characters, etc.). The dataset has also been decontaminated against several benchmarks (e.g., GLUE, SIQA, PIQA); this step removed 4,096 documents. It contains 13,071,217 samples in total, and the downloaded parquet files take up 34 GB. A model-filtered version with 12,804,779 samples is also provided: Qwen2.5-32B-Instruct generated language-quality annotations, and a RoBERT-large classifier was then used to filter out documents scored 1 or 2.
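As a rough illustration of the score-based filtering described above, the sketch below drops records scored 1 or 2 and works out how many documents the model filter removed, using the two sample counts quoted in the description. The record layout and the field names `text` and `score` are assumptions made for illustration; the description does not state the dataset's actual column names.

```python
# Minimal sketch of score-based filtering. The field names "text" and
# "score" are hypothetical; the real schema is not given in the description.
LOW_SCORES = {1, 2}  # scores removed in the model-filtered version


def keep(doc):
    """Return True if the document's quality score is not 1 or 2."""
    return doc["score"] not in LOW_SCORES


# Toy records standing in for annotated documents.
docs = [
    {"text": "A clear, well-formed paragraph.", "score": 5},
    {"text": "asdf qwer zxcv", "score": 1},
    {"text": "Mostly readable but clumsy text.", "score": 3},
    {"text": "click here click here", "score": 2},
]

kept = [d for d in docs if keep(d)]

# Sample counts quoted in the description.
total_samples = 13_071_217
filtered_samples = 12_804_779

removed = total_samples - filtered_samples
removed_pct = 100 * removed / total_samples

print(f"kept {len(kept)} of {len(docs)} toy records")
print(f"model filter removed {removed:,} documents ({removed_pct:.2f}%)")
```

By these counts the model filter removed 266,438 documents, roughly 2% of the cleaned set.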
Provider:
shaguftakhan2k17



