singletongue/cc100-documents
Hugging Face dataset, updated 2025-11-11, indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/singletongue/cc100-documents
---
language:
- ar
- de
- en
- es
- fr
- it
- ja
- ko
- pt
- ru
- zh
language_bcp47:
- zh-Hans
- zh-Hant
multilinguality:
- multilingual
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
dataset_info:
- config_name: ar
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 30345714473
    num_examples: 15039879
  download_size: 15115802825
  dataset_size: 30345714473
- config_name: de
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 72775978455
    num_examples: 69023867
  download_size: 45855033322
  dataset_size: 72775978455
- config_name: en
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 327673321587
    num_examples: 247588106
  download_size: 206842211484
  dataset_size: 327673321587
- config_name: es
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 58353909888
    num_examples: 60542096
  download_size: 36639924361
  dataset_size: 58353909888
- config_name: fr
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 62167991656
    num_examples: 62112712
  download_size: 38687566994
  dataset_size: 62167991656
- config_name: it
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32862817488
    num_examples: 24674591
  download_size: 20840201122
  dataset_size: 32862817488
- config_name: ja
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 75607021621
    num_examples: 65613665
  download_size: 43271734711
  dataset_size: 75607021621
- config_name: ko
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 58828056524
    num_examples: 35678358
  download_size: 34859494100
  dataset_size: 58828056524
- config_name: pt
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 53502662440
    num_examples: 38999388
  download_size: 33751151154
  dataset_size: 53502662440
- config_name: ru
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 300870542621
    num_examples: 123181529
  download_size: 151288950598
  dataset_size: 300870542621
- config_name: zh-Hans
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 51117335603
    num_examples: 40002855
  download_size: 36115556249
  dataset_size: 51117335603
- config_name: zh-Hant
  features:
  - name: idx
    dtype: int64
  - name: start_ln
    dtype: int64
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 18091739883
    num_examples: 12328227
  download_size: 12884957758
  dataset_size: 18091739883
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
- config_name: de
  data_files:
  - split: train
    path: de/train-*
- config_name: en
  data_files:
  - split: train
    path: en/train-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
- config_name: it
  data_files:
  - split: train
    path: it/train-*
- config_name: ja
  data_files:
  - split: train
    path: ja/train-*
- config_name: ko
  data_files:
  - split: train
    path: ko/train-*
- config_name: pt
  data_files:
  - split: train
    path: pt/train-*
- config_name: ru
  data_files:
  - split: train
    path: ru/train-*
- config_name: zh-Hans
  data_files:
  - split: train
    path: zh-Hans/train-*
- config_name: zh-Hant
  data_files:
  - split: train
    path: zh-Hant/train-*
---
# cc100-documents
This dataset is a restructured version of the [CC-100](https://data.statmt.org/cc-100/) ([statmt/cc100](https://huggingface.co/datasets/statmt/cc100)) dataset.
In the original dataset, each instance corresponds to a single paragraph (or a document boundary).
In this version, the data has been restructured so that each instance corresponds to a single, complete document.
This document-level structure makes it more convenient for processing using the `map()` and `filter()` methods in the Hugging Face Datasets library.
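
As a rough illustration of document-level processing, the sketch below filters and annotates documents with `filter()` and `map()`. The helper names (`is_long_enough`, `add_stats`, `stream_filtered`), the three-paragraph threshold, and the choice of streaming mode are assumptions made for this example, not part of the dataset itself.

```python
from typing import Any, Dict, List


def is_long_enough(example: Dict[str, Any], min_paragraphs: int = 3) -> bool:
    """Keep documents with at least `min_paragraphs` newline-separated paragraphs."""
    return len(example["text"].split("\n")) >= min_paragraphs


def add_stats(example: Dict[str, Any]) -> Dict[str, int]:
    """Return extra columns; `map()` merges them into each example."""
    paragraphs = example["text"].split("\n")
    return {"num_paragraphs": len(paragraphs), "num_chars": len(example["text"])}


def stream_filtered(config: str = "en", limit: int = 5) -> List[Dict[str, Any]]:
    """Stream one configuration and return the first `limit` matching documents.

    Streaming is an assumption of this sketch: it avoids downloading a whole
    configuration (the `en` files alone are roughly 200 GB).
    """
    from datasets import load_dataset  # requires `pip install datasets`

    ds = load_dataset(
        "singletongue/cc100-documents", config, split="train", streaming=True
    )
    ds = ds.filter(is_long_enough).map(add_stats)
    return list(ds.take(limit))
```

Calling `stream_filtered("ja", limit=3)` would then yield the first three Japanese documents that have three or more paragraphs, each with the two added columns.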
## Languages
The following languages are currently available:
|Language |Number of examples|
|:-------------------------------|-----------------:|
|`ar`: Arabic | 15,039,879|
|`de`: German | 69,023,867|
|`en`: English | 247,588,106|
|`es`: Spanish | 60,542,096|
|`fr`: French | 62,112,712|
|`it`: Italian | 24,674,591|
|`ja`: Japanese | 65,613,665|
|`ko`: Korean | 35,678,358|
|`pt`: Portuguese | 38,999,388|
|`ru`: Russian | 123,181,529|
|`zh-Hans`: Chinese (Simplified) | 40,002,855|
|`zh-Hant`: Chinese (Traditional)| 12,328,227|
## Dataset Structure
### Data Instances
Each instance represents one document from the original CC-100 dataset, and instances appear in the original document order. The paragraphs of a document are concatenated into a single text, with the newline characters between them preserved.
Example from the `en` configuration:
```json
{
  "idx": 0,
  "start_ln": 1,
  "text": "Belmont Estate is on the market for $63 million and boasts roughly 22,000 square feet of luxurious..."
}
{
  "idx": 1,
  "start_ln": 8,
  "text": "Stay well hydrated—that means you should include about 48- 64 ounces of liquid (non-calorie) each day..."
}
```
### Data Fields
- **idx** (*int64*): The index of the instance, starting from `0`.
- **start_ln** (*int64*): The 1-based line number where the document begins in the original CC-100 text file.
- **text** (*string*): The complete text of the document.
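
The fields can be combined to locate a document in the original file. The sketch below assumes that every paragraph occupies exactly one line of the original CC-100 text file, so that a document covers `start_ln` through `start_ln` plus the number of newlines in `text`; the helper names `paragraphs` and `line_span` are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def paragraphs(example: Dict[str, Any]) -> List[str]:
    """Split a document back into its newline-separated paragraphs."""
    return example["text"].split("\n")


def line_span(example: Dict[str, Any]) -> Tuple[int, int]:
    """Return the (first, last) 1-based line numbers of a document.

    Assumes one paragraph per line in the original file.
    """
    first = example["start_ln"]
    last = first + example["text"].count("\n")
    return first, last


doc = {"idx": 0, "start_ln": 1, "text": "first paragraph\nsecond\nthird"}
print(paragraphs(doc))  # ['first paragraph', 'second', 'third']
print(line_span(doc))   # (1, 3)
```

Under the same assumption, and with a single blank line separating documents, the next document would start two lines after the `last` value returned here.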
## Dataset Creation Process
The dataset is created from the original CC-100 text files available at https://data.statmt.org/cc-100/.
The code used to create this dataset is available in the [GitHub repository](https://github.com/singletongue/cc100-documents).
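
For intuition only, the regrouping can be sketched as follows. This is not the code from the repository above; it is a minimal sketch that assumes the original files separate documents with a blank line and paragraphs with a single newline, and the function name `group_documents` is hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional


def group_documents(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Group paragraph lines into documents, using blank lines as boundaries."""
    idx = 0
    start_ln: Optional[int] = None
    buffer: List[str] = []

    for ln, line in enumerate(lines, start=1):  # 1-based line numbers
        line = line.rstrip("\n")
        if line == "":
            # Blank line: close the current document, if any.
            if buffer:
                yield {"idx": idx, "start_ln": start_ln, "text": "\n".join(buffer)}
                idx += 1
                buffer = []
            continue
        if not buffer:
            start_ln = ln  # first paragraph of a new document
        buffer.append(line)

    # Flush the last document when the file does not end with a blank line.
    if buffer:
        yield {"idx": idx, "start_ln": start_ln, "text": "\n".join(buffer)}


sample = ["para 1\n", "para 2\n", "\n", "para 3\n"]
for doc in group_documents(sample):
    print(doc)
```

On this sample, the first document starts at line 1 and contains two paragraphs, and the second starts at line 4, which mirrors how `idx` and `start_ln` relate in the dataset.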
## License
No intellectual property is claimed on the preparation of this corpus.
By using this dataset, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use) in respect of the content contained in the dataset.
Please refer to the following pages for license information on the original dataset:
- https://data.statmt.org/cc-100/
- https://huggingface.co/datasets/statmt/cc100
Provider: singletongue



