indonesian-nlp/mc4-id
收藏Hugging Face2022-10-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/indonesian-nlp/mc4-id
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
license:
- odc-by
multilinguality:
- monolingual
size_categories:
tiny:
- 1M<n<10M
small:
- 10M<n<100M
medium:
- 10M<n<100M
large:
- 10M<n<100M
full:
- 100M<n<1B
source_datasets:
- extended
task_categories:
- text-generation
task_ids:
- language-modeling
paperswithcode_id: mc4
pretty_name: mC4-id
---
# Dataset Card for Clean(maybe) Indonesia mC4
## Dataset Description
- **Original Homepage:** [HF Hub](https://huggingface.co/datasets/allenai/c4)
- **Paper:** [ArXiv](https://arxiv.org/abs/1910.10683)
### Dataset Summary
A thoroughly cleaned version of the Indonesia split of the multilingual colossal, cleaned version of Common Crawl's web crawl corpus (mC4). Based on the [Common Crawl dataset](https://commoncrawl.org). The original version was prepared by [AllenAI](https://allenai.org/), hosted at the address [https://huggingface.co/datasets/allenai/c4](https://huggingface.co/datasets/allenai/c4).
### Data Fields
The data contains the following fields:
- `url`: url of the source as a string
- `text`: text content as a string
- `timestamp`: timestamp of extraction as a string
### Data Splits
You can load any subset like this:
```python
from datasets import load_dataset
mc4_id_tiny = load_dataset("munggok/mc4-id", "tiny")
```
Since splits are quite large, you may want to traverse them using the streaming mode available starting from 🤗 Datasets v1.9.0:
```python
from datasets import load_dataset
mc4_id_full_stream = load_dataset("munggok/mc4-id", "full", split='train', streaming=True)
print(next(iter(mc4_id_full_stream))) # Prints the example presented above
```
## Dataset Creation
Refer to the original paper for more considerations regarding the choice of sources and the scraping process for creating `mC4`.
## Considerations for Using the Data
### Discussion of Biases
Despite the cleaning procedure aimed at removing vulgarity and profanity, it must be considered that model trained on this scraped corpus will inevitably reflect biases present in blog articles and comments on the Internet. This makes the corpus especially interesting in the context of studying data biases and how to limit their impacts.
## Additional Information
### Dataset Curators
Authors at AllenAI are the original curators for the `mc4` corpus.
### Licensing Information
AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
### Citation Information
If you use this dataset in your work, please cite us and the original mC4 authors as:
```
@inproceedings{xue-etal-2021-mt5,
title = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
author = "Xue, Linting and
Constant, Noah and
Roberts, Adam and
Kale, Mihir and
Al-Rfou, Rami and
Siddhant, Aditya and
Barua, Aditya and
Raffel, Colin",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.41",
doi = "10.18653/v1/2021.naacl-main.41",
pages = "483--498",
}
```
### Contributions
Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.
提供机构:
indonesian-nlp
原始信息汇总
数据集概述
数据集名称
- 名称: mC4-id
- 别名: Clean(maybe) Indonesia mC4
数据集描述
- 概述: 该数据集是多语言巨型清洗版Common Crawl网络爬虫语料库(mC4)的印尼语分割的彻底清洗版本。
- 来源: 基于Common Crawl dataset。
- 原始版本准备者: AllenAI。
- 原始版本托管地址: https://huggingface.co/datasets/allenai/c4。
数据集内容
- 数据字段:
url: 源url,字符串类型。text: 文本内容,字符串类型。timestamp: 提取时间戳,字符串类型。
数据集大小
- 大小分类:
tiny: 1M<n<10Msmall: 10M<n<100Mmedium: 10M<n<100Mlarge: 10M<n<100Mfull: 100M<n<1B
数据集使用
-
任务类别: 文本生成
-
任务ID: 语言建模
-
使用示例: python from datasets import load_dataset
mc4_id_tiny = load_dataset("munggok/mc4-id", "tiny")
数据集创建
- 创建过程: 参考原始论文了解关于数据源选择和抓取过程的更多考虑。
数据集注意事项
- 偏见讨论: 尽管清洗过程旨在移除粗俗和亵渎内容,但必须考虑模型训练时不可避免地会反映互联网博客文章和评论中存在的偏见。
数据集管理
- 数据集维护者: AllenAI的作者。
- 许可证: ODC-BY。
- 引用信息: 使用此数据集时,请引用原始mC4作者。



