Peihao/test-dateset

Name: Peihao/test-dateset
Creator: Peihao
Published: 2022-10-25 10:08:29
License: No description available

Hugging Face | Updated 2022-10-25 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Peihao/test-dateset

Resource overview:

The C4 dataset is a large-scale, cleaned version of the Common Crawl web-crawled corpus. It was prepared by AllenAI and is primarily used for pre-training language models and word representations. The dataset comprises four variants, `en`, `en.noblocklist`, `en.noclean`, and `realnewslike`, which differ in size (all are distributed as JSON). The dataset is in English, and each data instance contains three fields: `url`, `text`, and `timestamp`. It is constructed by extracting natural language text from Common Crawl, followed by deduplication and language detection. The dataset is released under the ODC-BY license, and its use is also subject to the Common Crawl terms of use.

Provider:

Peihao

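Summary of original information

Dataset overview

Dataset name

Name: C4
Pretty name: C4

Dataset attributes

Language: English (en)
License: ODC-BY
Multilinguality: multilingual
Size: 100M<n<1B
Source datasets: original
Task categories: text generation, fill-mask
Task IDs: language modeling, masked language modeling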
paperswithcode ID: c4

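Dataset description

Summary: C4 is a colossal, cleaned version of the Common Crawl web crawl corpus. It was prepared by AllenAI and comes in four variants:
- en: 305GB, JSON format
- en.noblocklist: 380GB, JSON format
- en.noclean: 2.3TB, JSON format
- realnewslike: 15GB, JSON format
Supported tasks: mainly pre-training language models and word representations.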

Dataset structure

Data instances: each instance contains the fields url, text, and timestamp.
Data fields:
- url: source URL, string
- text: text content, string
- timestamp: timestamp, string
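
The three fields above can be illustrated with a small parsing sketch. It assumes the records are stored one JSON object per line and that timestamps are ISO 8601 strings ending in "Z"; neither detail is stated here. The sample row, the `C4Record` class, and `parse_record` are hypothetical names introduced only for illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class C4Record:
    """One data instance with the three string fields listed above."""
    url: str
    text: str
    timestamp: str  # kept as a string, matching the field type in the listing

    def parsed_timestamp(self) -> datetime:
        # Assumption: timestamps are ISO 8601 with a trailing "Z" (UTC).
        return datetime.fromisoformat(self.timestamp.replace("Z", "+00:00"))


def parse_record(line: str) -> C4Record:
    """Parse one JSON row and check that every expected field is a string."""
    raw = json.loads(line)
    for field in ("url", "text", "timestamp"):
        if not isinstance(raw.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return C4Record(url=raw["url"], text=raw["text"], timestamp=raw["timestamp"])


# Hypothetical row, invented for illustration; it is not taken from the dataset.
sample_line = json.dumps({
    "url": "https://example.com/post",
    "text": "An example paragraph of web text.",
    "timestamp": "2019-04-25T12:57:54Z",
})
record = parse_record(sample_line)
print(record.url, record.parsed_timestamp().year)
```

Validating the field types up front means a malformed row fails at parse time rather than later in a training pipeline.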
Data splits:
- en: train 364868892, validation 364608
- en.noblocklist: train 393391519, validation 393226
- realnewslike: train 13799838, validation 13863
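
As a quick check on the split sizes above, the following sketch computes the ratio of validation to training examples for each listed variant. The numbers are copied from the list; the `SPLITS` dictionary and `validation_ratio` helper are illustrative names, not part of any library.

```python
# Split sizes (number of examples) copied from the list above.
SPLITS = {
    "en": {"train": 364_868_892, "validation": 364_608},
    "en.noblocklist": {"train": 393_391_519, "validation": 393_226},
    "realnewslike": {"train": 13_799_838, "validation": 13_863},
}


def validation_ratio(variant: str) -> float:
    """Return validation examples divided by training examples."""
    sizes = SPLITS[variant]
    return sizes["validation"] / sizes["train"]


ratios = {name: validation_ratio(name) for name in SPLITS}
for name, ratio in ratios.items():
    print(f"{name}: validation is {ratio:.4%} of the training set size")
```

For all three variants the validation split comes out at roughly 0.1% of the training split.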

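Dataset creation

Source data: about 750GB of English text from the public Common Crawl web scrape.
Data collection and normalization: the dataset is built with c4.py; langdetect is used to keep only pages classified as English with a probability of at least 99%.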
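
The 99% language threshold can be sketched as a simple filter. This is a minimal illustration and not the c4.py implementation: the detector is passed in as a function, and `toy_detector` is a hypothetical stand-in that scores text by its share of ASCII characters so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, List, Tuple

# A detector maps text to a list of (language code, probability) guesses.
Detector = Callable[[str], List[Tuple[str, float]]]


def is_english(text: str, detect: Detector, min_prob: float = 0.99) -> bool:
    """True if the detector gives English a probability of at least min_prob."""
    return any(lang == "en" and prob >= min_prob for lang, prob in detect(text))


def filter_english(texts: Iterable[str], detect: Detector,
                   min_prob: float = 0.99) -> List[str]:
    """Keep only the texts that pass the English threshold."""
    return [t for t in texts if is_english(t, detect, min_prob)]


def toy_detector(text: str) -> List[Tuple[str, float]]:
    """Hypothetical stand-in: uses the ASCII character share as the 'en' score."""
    if not text:
        return []
    ascii_share = sum(ch.isascii() for ch in text) / len(text)
    return [("en", ascii_share), ("other", 1.0 - ascii_share)]


pages = ["The quick brown fox jumps over the lazy dog.", "数据集的语言为英语。"]
kept = filter_english(pages, toy_detector)
print(kept)
```

If the langdetect package is installed, a real detector could presumably be plugged in by wrapping its `detect_langs` function, for example `lambda t: [(g.lang, g.prob) for g in detect_langs(t)]`.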

License information

License: ODC-BY
Terms of use: use of this dataset is also subject to the Common Crawl terms of use.

Citation information

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}
