
indonesian-nlp/mc4-id

Source: Hugging Face · Updated: 2022-10-25 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/indonesian-nlp/mc4-id
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
license:
- odc-by
multilinguality:
- monolingual
size_categories:
  tiny:
  - 1M<n<10M
  small:
  - 10M<n<100M
  medium:
  - 10M<n<100M
  large:
  - 10M<n<100M
  full:
  - 100M<n<1B
source_datasets:
- extended
task_categories:
- text-generation
task_ids:
- language-modeling
paperswithcode_id: mc4
pretty_name: mC4-id
---

# Dataset Card for Clean(maybe) Indonesia mC4

## Dataset Description

- **Original Homepage:** [HF Hub](https://huggingface.co/datasets/allenai/c4)
- **Paper:** [ArXiv](https://arxiv.org/abs/1910.10683)

### Dataset Summary

A thoroughly cleaned version of the Indonesian split of the multilingual colossal, cleaned version of Common Crawl's web crawl corpus (mC4). Based on the [Common Crawl dataset](https://commoncrawl.org). The original version was prepared by [AllenAI](https://allenai.org/) and is hosted at [https://huggingface.co/datasets/allenai/c4](https://huggingface.co/datasets/allenai/c4).

### Data Fields

The data contains the following fields:

- `url`: URL of the source, as a string
- `text`: text content, as a string
- `timestamp`: timestamp of extraction, as a string

### Data Splits

You can load any subset like this:

```python
from datasets import load_dataset

# Load the "tiny" configuration in full
mc4_id_tiny = load_dataset("munggok/mc4-id", "tiny")
```

Since the splits are quite large, you may want to traverse them using the streaming mode available from 🤗 Datasets v1.9.0 onward:

```python
from datasets import load_dataset

# Stream the "full" configuration instead of downloading it
mc4_id_full_stream = load_dataset("munggok/mc4-id", "full", split="train", streaming=True)
print(next(iter(mc4_id_full_stream)))  # Prints the first example in the stream
```

## Dataset Creation

Refer to the original paper for more considerations regarding the choice of sources and the scraping process used to create `mC4`.
## Considerations for Using the Data

### Discussion of Biases

Although the cleaning procedure aims to remove vulgarity and profanity, models trained on this scraped corpus will inevitably reflect the biases present in blog articles and comments on the Internet. This makes the corpus especially interesting for studying data biases and how to limit their impact.

## Additional Information

### Dataset Curators

Authors at AllenAI are the original curators of the `mc4` corpus.

### Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using it, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

### Citation Information

If you use this dataset in your work, please cite us and the original mC4 authors as:

```
@inproceedings{xue-etal-2021-mt5,
    title = "m{T}5: A Massively Multilingual Pre-trained Text-to-Text Transformer",
    author = "Xue, Linting and Constant, Noah and Roberts, Adam and Kale, Mihir and Al-Rfou, Rami and Siddhant, Aditya and Barua, Aditya and Raffel, Colin",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.41",
    doi = "10.18653/v1/2021.naacl-main.41",
    pages = "483--498",
}
```

### Contributions

Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.
Provider:
indonesian-nlp
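Summary of original information

Dataset overview

Dataset name

  • Name: mC4-id
  • Alias: Clean(maybe) Indonesia mC4

Dataset description

Dataset contents

  • Data fields:
    • url: URL of the source, as a string.
    • text: text content, as a string.
    • timestamp: extraction timestamp, as a string.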
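
The three fields listed above can be checked with a small schema test before records are passed to downstream code. The following is only a minimal sketch: the helper `is_valid_record` and the sample record (including its timestamp format) are hypothetical and are not taken from the dataset or its card; only the field names come from the description.

```python
from typing import Any, Mapping

# Field names come from the dataset description; everything else is illustrative.
EXPECTED_FIELDS = ("url", "text", "timestamp")


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Return True if every expected field is present and is a string."""
    return all(isinstance(record.get(name), str) for name in EXPECTED_FIELDS)


# Made-up record for illustration only; the timestamp format is an assumption.
sample = {
    "url": "https://example.com/artikel",
    "text": "Contoh teks berbahasa Indonesia.",
    "timestamp": "2020-01-01T00:00:00Z",
}
print(is_valid_record(sample))  # True
```

A check like this is cheap enough to run on every streamed record and fails fast if a field is missing or has an unexpected type.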

Dataset size

  • Size categories:
    • tiny: 1M<n<10M
    • small: 10M<n<100M
    • medium: 10M<n<100M
    • large: 10M<n<100M
    • full: 100M<n<1B
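
The size buckets above are written as open ranges on `n`, presumably the number of examples. As a rough sketch, they can be turned into numeric bounds for comparison; the parser `parse_size_category` and the suffix table are assumptions made for illustration, not part of any library.

```python
import re
from typing import Dict, Tuple

# Multipliers for the suffixes used in the bucket labels (assumed meaning).
_SUFFIX = {"K": 10**3, "M": 10**6, "B": 10**9}
_BUCKET = re.compile(r"^(\d+)([KMB])<n<(\d+)([KMB])$")


def parse_size_category(bucket: str) -> Tuple[int, int]:
    """Parse a label such as '1M<n<10M' into exclusive (lower, upper) bounds."""
    match = _BUCKET.match(bucket)
    if match is None:
        raise ValueError(f"unrecognised size bucket: {bucket!r}")
    lo_num, lo_suf, hi_num, hi_suf = match.groups()
    return int(lo_num) * _SUFFIX[lo_suf], int(hi_num) * _SUFFIX[hi_suf]


# Buckets exactly as listed in the dataset metadata.
SIZE_CATEGORIES: Dict[str, str] = {
    "tiny": "1M<n<10M",
    "small": "10M<n<100M",
    "medium": "10M<n<100M",
    "large": "10M<n<100M",
    "full": "100M<n<1B",
}

bounds = {name: parse_size_category(label) for name, label in SIZE_CATEGORIES.items()}
for name, (lower, upper) in bounds.items():
    print(f"{name}: {lower:,} < n < {upper:,}")
```

Note that `small`, `medium`, and `large` share one bucket in the metadata, so the labels alone do not distinguish their sizes.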

Dataset usage

  • Task category: text generation

  • Task ID: language modeling

  • Usage example (Python):

    from datasets import load_dataset

    mc4_id_tiny = load_dataset("munggok/mc4-id", "tiny")
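
For the larger configurations, downloading everything just to inspect a few records is wasteful. The sketch below previews the first few records in streaming mode. The helpers `take` and `preview_mc4_id` are hypothetical names introduced here; the `load_dataset` arguments mirror the ones on the dataset card. Whether the `munggok/mc4-id` identifier still resolves depends on the current state of the Hub and on the installed `datasets` version, so treat this as an illustration and not a guaranteed recipe.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records of a (possibly very long) iterable."""
    return list(islice(stream, n))


def preview_mc4_id(config: str = "tiny", n: int = 3) -> List[Dict[str, Any]]:
    """Stream one configuration and return its first n records.

    Requires the `datasets` package and network access, so it is defined
    here but not called at import time.
    """
    from datasets import load_dataset  # lazy import: `take` works without it

    stream = load_dataset("munggok/mc4-id", config, split="train", streaming=True)
    return take(stream, n)
```

Calling `preview_mc4_id("full", n=5)` would fetch only as much data as is needed to yield five records, while `take` can be reused on any other iterable of records.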

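Dataset creation

  • Creation process: refer to the original paper for more considerations regarding the choice of sources and the scraping process.

Considerations for using the data

  • Discussion of biases: although the cleaning procedure aims to remove vulgar and profane content, models trained on this corpus will inevitably reflect the biases present in blog articles and comments on the Internet.

Dataset administration

  • Dataset maintainers: authors at AllenAI.
  • License: ODC-BY.
  • Citation: when using this dataset, please cite the original mC4 authors.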
