Download link:

https://modelscope.cn/datasets/allenai/c4

Resource description:

# C4

## Dataset Description

- **Paper:** https://arxiv.org/abs/1910.10683

### Dataset Summary

A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org.

This is the processed version of [Google's C4 dataset](https://www.tensorflow.org/datasets/catalog/c4).

We prepared five variants of the data: `en`, `en.noclean`, `en.noblocklist`, `realnewslike`, and `multilingual` (mC4).

For reference, these are the sizes of the variants:

- `en`: 305GB
- `en.noclean`: 2.3TB
- `en.noblocklist`: 380GB
- `realnewslike`: 15GB
- `multilingual` (mC4): 9.7TB (108 subsets, one per language)

The `en.noblocklist` variant is exactly the same as the `en` variant, except that we turned off the so-called "badwords filter", which removes all documents that contain words from the lists at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.

#### How do I download this?

##### Using 🤗 Datasets

```python
from datasets import load_dataset

# English only
en = load_dataset("allenai/c4", "en")

# Other variants in English
en_noclean = load_dataset("allenai/c4", "en.noclean")
en_noblocklist = load_dataset("allenai/c4", "en.noblocklist")
realnewslike = load_dataset("allenai/c4", "realnewslike")

# Multilingual (108 languages)
multilingual = load_dataset("allenai/c4", "multilingual")

# One specific language
es = load_dataset("allenai/c4", "es")
```

Since this dataset is big, it is encouraged to load it in streaming mode using `streaming=True`, for example:

```python
from datasets import load_dataset

en = load_dataset("allenai/c4", "en", streaming=True)
```

You can also load and mix multiple languages:

```python
from datasets import concatenate_datasets, interleave_datasets, load_dataset

# split="train" yields a single iterable dataset rather than a dict of splits,
# which is what concatenate_datasets and interleave_datasets expect.
es = load_dataset("allenai/c4", "es", split="train", streaming=True)
fr = load_dataset("allenai/c4", "fr", split="train", streaming=True)

# Concatenate both datasets
concatenated = concatenate_datasets([es, fr])

# Or interleave them (alternates between one and the other)
interleaved = interleave_datasets([es, fr])
```

##### Using Dask

```python
import dask.dataframe as dd

# English training files only
df = dd.read_json("hf://datasets/allenai/c4/en/c4-train.*.json.gz")

# English only (all splits)
en_df = dd.read_json("hf://datasets/allenai/c4/en/c4-*.json.gz")

# Other variants in English
en_noclean_df = dd.read_json("hf://datasets/allenai/c4/en.noclean/c4-*.json.gz")
en_noblocklist_df = dd.read_json("hf://datasets/allenai/c4/en.noblocklist/c4-*.json.gz")
realnewslike_df = dd.read_json("hf://datasets/allenai/c4/realnewslike/c4-*.json.gz")

# Multilingual (108 languages)
multilingual_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-*.json.gz")

# One specific language
es_train_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es.*.json.gz")
es_valid_df = dd.read_json("hf://datasets/allenai/c4/multilingual/c4-es-validation.*.json.gz")
```

##### Using Git

```bash
git clone https://huggingface.co/datasets/allenai/c4
```

This will download 13TB to your local drive. If you want to be more precise about what you are downloading, follow these commands instead:

```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```

The `git clone` command in this variant downloads only the small stub files that Git LFS uses, so you can see all the filenames that exist. You can then convert the stubs into their real files with `git lfs pull --include "..."`. For example, if you wanted all the Dutch documents from the multilingual set, you would run

```bash
git lfs pull --include "multilingual/c4-nl.*.json.gz"
```

### Supported Tasks and Leaderboards

C4 and mC4 are mainly intended to pretrain language models and word representations.

### Languages

The `en`, `en.noclean`, `en.noblocklist` and `realnewslike` variants are in English.

The multilingual variant covers 108 languages, which are listed in the table below.

Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.

| language code   | language name        |
|:----------------|:---------------------|
| af              | Afrikaans            |
| am              | Amharic              |
| ar              | Arabic               |
| az              | Azerbaijani          |
| be              | Belarusian           |
| bg              | Bulgarian            |
| bg-Latn         | Bulgarian (Latin)    |
| bn              | Bangla               |
| ca              | Catalan              |
| ceb             | Cebuano              |
| co              | Corsican             |
| cs              | Czech                |
| cy              | Welsh                |
| da              | Danish               |
| de              | German               |
| el              | Greek                |
| el-Latn         | Greek (Latin)        |
| en              | English              |
| eo              | Esperanto            |
| es              | Spanish              |
| et              | Estonian             |
| eu              | Basque               |
| fa              | Persian              |
| fi              | Finnish              |
| fil             | Filipino             |
| fr              | French               |
| fy              | Western Frisian      |
| ga              | Irish                |
| gd              | Scottish Gaelic      |
| gl              | Galician             |
| gu              | Gujarati             |
| ha              | Hausa                |
| haw             | Hawaiian             |
| hi              | Hindi                |
| hi-Latn         | Hindi (Latin)        |
| hmn             | Hmong, Mong          |
| ht              | Haitian              |
| hu              | Hungarian            |
| hy              | Armenian             |
| id              | Indonesian           |
| ig              | Igbo                 |
| is              | Icelandic            |
| it              | Italian              |
| iw              | former Hebrew        |
| ja              | Japanese             |
| ja-Latn         | Japanese (Latin)     |
| jv              | Javanese             |
| ka              | Georgian             |
| kk              | Kazakh               |
| km              | Khmer                |
| kn              | Kannada              |
| ko              | Korean               |
| ku              | Kurdish              |
| ky              | Kyrgyz               |
| la              | Latin                |
| lb              | Luxembourgish        |
| lo              | Lao                  |
| lt              | Lithuanian           |
| lv              | Latvian              |
| mg              | Malagasy             |
| mi              | Maori                |
| mk              | Macedonian           |
| ml              | Malayalam            |
| mn              | Mongolian            |
| mr              | Marathi              |
| ms              | Malay                |
| mt              | Maltese              |
| my              | Burmese              |
| ne              | Nepali               |
| nl              | Dutch                |
| no              | Norwegian            |
| ny              | Nyanja               |
| pa              | Punjabi              |
| pl              | Polish               |
| ps              | Pashto               |
| pt              | Portuguese           |
| ro              | Romanian             |
| ru              | Russian              |
| ru-Latn         | Russian (Latin)      |
| sd              | Sindhi               |
| si              | Sinhala              |
| sk              | Slovak               |
| sl              | Slovenian            |
| sm              | Samoan               |
| sn              | Shona                |
| so              | Somali               |
| sq              | Albanian             |
| sr              | Serbian              |
| st              | Southern Sotho       |
| su              | Sundanese            |
| sv              | Swedish              |
| sw              | Swahili              |
| ta              | Tamil                |
| te              | Telugu               |
| tg              | Tajik                |
| th              | Thai                 |
| tr              | Turkish              |
| uk              | Ukrainian            |
| und             | Unknown language     |
| ur              | Urdu                 |
| uz              | Uzbek                |
| vi              | Vietnamese           |
| xh              | Xhosa                |
| yi              | Yiddish              |
| yo              | Yoruba               |
| zh              | Chinese              |
| zh-Latn         | Chinese (Latin)      |
| zu              | Zulu                 |

## Dataset Structure

### Data Instances

An example from the `en` config is:

```
{
  'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/',
  'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.',
  'timestamp': '2019-04-25T12:57:54Z'
}
```

### Data Fields

Each record has three fields:

- `url`: URL of the source page, as a string
- `text`: text content, as a string
- `timestamp`: timestamp, as a string

### Data Splits

Number of examples per split for the English variants:

| name           |     train | validation |
|:---------------|----------:|-----------:|
| en             | 364868892 |     364608 |
| en.noblocklist | 393391519 |     393226 |
| en.noclean     |         ? |          ? |
| realnewslike   |  13799838 |      13863 |

A train and validation split are also provided for the other languages, but their sizes are still to be added.

### Source Data

#### Initial Data Collection and Normalization

The C4 and mC4 datasets are collections of text sourced from the public Common Crawl web scrape. Heuristics are applied to extract only natural language (as opposed to boilerplate and other gibberish), in addition to extensive deduplication.
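
To make the three fields listed under Data Fields concrete, here is a minimal Python sketch that derives a few values from one record shaped like the example above. The helper name `summarize_record` and the keys it returns are illustrative assumptions, not part of the dataset or of any library. The only things taken from the card are that `url`, `text`, and `timestamp` are strings, and that the timestamp has the `2019-04-25T12:57:54Z` form shown in the example.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def summarize_record(record: dict) -> dict:
    """Derive convenience values from one C4-style record (hypothetical helper)."""
    # The example timestamp ends with "Z" (UTC), so parse that form explicitly.
    crawled_at = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "domain": urlparse(record["url"]).netloc,
        "crawled_at": crawled_at,
        # Paragraphs in the example text are separated by newline characters.
        "num_lines": len(record["text"].split("\n")),
        "num_chars": len(record["text"]),
    }


# Shortened copy of the example record from the card.
example = {
    "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/",
    "text": "Beginners BBQ Class Taking Place in Missoula!\n"
            "Do you want to get better at making delicious BBQ?",
    "timestamp": "2019-04-25T12:57:54Z",
}
summary = summarize_record(example)
print(summary["domain"], summary["crawled_at"].isoformat(), summary["num_lines"])
```

The same function can be applied to records yielded by a streaming dataset, since each record is a plain dictionary with these three keys.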
You can find the code that was used to build this dataset in [c4.py](https://github.com/tensorflow/datasets/blob/5952d3d60d60e1727786fa7a9a23d24bb463d4d6/tensorflow_datasets/text/c4.py) from TensorFlow Datasets.

The C4 dataset was explicitly designed to be English only: any page that was not given a probability of at least 99% of being English by [langdetect](https://github.com/Mimino666/langdetect) was discarded.

To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages.

### Licensing Information

We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.

### Acknowledgements

Big ups to the good folks at [Common Crawl](https://commoncrawl.org) whose data made this possible ([consider donating](http://commoncrawl.org/donate/)!), to Google for creating the code that curates and filters the data, and to Hugging Face, who had no issue with hosting these 3TB of data for public download!
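
As an illustration of the 99% English-probability rule described under Initial Data Collection and Normalization, the following Python sketch filters pages by a per-language probability. It is not the code from `c4.py`: the function `keep_english`, the `Detector` type, and the toy detector are all hypothetical, and the toy detector merely stands in for a real language identifier such as langdetect.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical detector type: maps a text to {language code: probability}.
Detector = Callable[[str], Dict[str, float]]


def keep_english(
    pages: Iterable[str], detect: Detector, min_prob: float = 0.99
) -> Iterator[str]:
    """Yield only the pages whose English probability is at least min_prob."""
    for page in pages:
        if detect(page).get("en", 0.0) >= min_prob:
            yield page


def toy_detect(text: str) -> Dict[str, float]:
    # Stand-in for a real language identifier; an assumption for illustration only.
    if "the" in text.lower().split():
        return {"en": 0.995}
    return {"en": 0.40, "es": 0.60}


pages = ["The class is free for spectators.", "La clase es gratis."]
kept = list(keep_english(pages, toy_detect))
print(kept)
```

Passing the detector in as a callable keeps the threshold logic separate from whichever language-identification library is actually used.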


Use cases: