WanJuan2.0 (万卷-CC)

Name: WanJuan2.0 (万卷-CC)
Creator: OpenDataLab
Published: 2026-05-10 03:30:44
License: No description provided

OpenDataLab · Updated 2026-05-10 · Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/WanJuanCC

Overview:

WanJuan2.0 (WanJuan-CC) is a high-quality English web text dataset of 1T tokens extracted from CommonCrawl. In evaluations across the different dimensions of the Perspective API, WanJuan-CC scores as safer than a range of open-source English CommonCrawl corpora. Its usefulness is supported by perplexity (PPL) on four validation sets and accuracy on six downstream tasks. WanJuan-CC achieves competitive PPL across the validation sets, particularly on sets such as Tiny-Stories that demand greater linguistic fluency. In a comparison where 1B-parameter models were trained on WanJuan-CC and on datasets of the same type, with validation-set perplexity and downstream-task accuracy as the metrics, the experiments show that WanJuan-CC significantly improves performance on English text completion and general English ability tasks.
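Perplexity, used above as an evaluation metric, is conventionally defined as the exponential of the mean negative log-likelihood per token, so lower values mean the model finds the validation text more predictable. The listing does not say exactly how the WanJuan-CC evaluation computed it; the sketch below only illustrates the standard definition, and the function name `perplexity` and the example probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four consecutive tokens.
example = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(example))
```

A model that assigned probability 1 to every token would reach the minimum perplexity of 1; comparing values across datasets is only meaningful when the same tokenizer and validation text are used.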

Provider:

OpenDataLab

Created:

2024-01-15

Collected summary

Dataset introduction

Background and challenges

Background overview

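WanJuan2.0 (万卷-CC) is a high-quality English web text dataset derived from CommonCrawl. It contains about 100B tokens of plain text and is intended mainly for text pretraining and natural language processing tasks. A multi-step processing pipeline (for example deduplication and safety filtering) is used to ensure the safety and fluency of the data, and the dataset outperforms comparable open-source corpora on safety evaluations and downstream-task performance. The data is stored in JSON Lines format with score fields such as toxicity and fluency, is released under the CC BY 4.0 license, and is suited to large-model training and academic research.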

The content above was collected and summarized by 遇见数据集.
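The summary mentions JSON Lines storage with per-record score fields. As a rough sketch of how such a file might be read and filtered, the example below parses one JSON object per line and keeps records under a toxicity threshold and over a fluency threshold. The field names (`id`, `content`, `toxicity`, `fluency`), the score ranges, the thresholds, and the sample records are all assumptions made for illustration; check the actual schema in the dataset's documentation before relying on them.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line (JSON Lines)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_safe_fluent(records, max_toxicity=0.5, min_fluency=0.5):
    """Keep records whose assumed scores pass both thresholds.

    Assumes higher `toxicity` is worse and higher `fluency` is better;
    records missing either field are dropped.
    """
    for rec in records:
        toxicity = rec.get("toxicity", 1.0)
        fluency = rec.get("fluency", 0.0)
        if toxicity <= max_toxicity and fluency >= min_fluency:
            yield rec


# Hypothetical in-memory sample standing in for a downloaded .jsonl file.
sample = io.StringIO(
    '{"id": "a", "content": "A clean sentence.", "toxicity": 0.01, "fluency": 0.98}\n'
    '{"id": "b", "content": "noisy txt", "toxicity": 0.02, "fluency": 0.10}\n'
    '{"id": "c", "content": "An unsafe sentence.", "toxicity": 0.90, "fluency": 0.95}\n'
)
kept = [rec["id"] for rec in keep_safe_fluent(iter_records(sample))]
print(kept)
```

For a real file, passing an open file handle (`open(path, encoding="utf-8")`) to `iter_records` streams the records line by line, which avoids loading a large corpus shard into memory at once.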

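Community discussion

#Experience sharing

[Problem I ran into] • Symptom: the download link for this dataset no longer works. [Related information] • A similar file may be available at this link: https://www.selectdataset.com/dataset/3688356173feccbcf1f1e490ddc6bc72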
