
thegoodfellas/mc4-pt-cleaned

Hugging Face | Updated 2023-04-13 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/thegoodfellas/mc4-pt-cleaned
Resource description:
---
license: apache-2.0
task_categories:
- fill-mask
- text-generation
language:
- pt
size_categories:
- 10M<n<100M
---

## Description

This is a cleaned version of the PtBR section of AllenAI mC4. The original dataset can be found here: https://huggingface.co/datasets/allenai/c4

## Clean procedure

We applied the same cleaning procedure as explained here: https://gitlab.com/yhavinga/c4nlpreproc.git

The repository offers two strategies. The first one, found in the main.py file, uses pyspark to create a dataframe that can both clean the text and create a pseudo mix on the entire dataset. We found this strategy clever, but it is time- and resource-consuming.

To overcome this, we took the second approach, which runs the singlefile.py script on each file and parallelizes the runs with GNU parallel. We did the following:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "multilingual/c4-pt.*.json.gz"
# the shards pulled above live under multilingual/
cd multilingual
ls c4-pt* | parallel --gnu --jobs 96 --progress python ~/c4nlpreproc/singlefile.py {}
```

Be advised that you should install GNU parallel first if you want to reproduce this dataset or create one in a different language.

## Dataset Structure

We kept the same structure as the original, so it looks like this:

```
{
  'timestamp': '2020-02-22T22:24:31Z',
  'url': 'https://url here',
  'text': 'the content'
}
```

## Considerations for Using the Data

We did not perform any procedure to remove bad words, vulgarity, or profanity. Keep in mind that a model trained on this scraped corpus will inevitably reflect biases present in blog articles and comments on the Internet. This makes the corpus especially interesting in the context of studying data biases and how to limit their impacts.

Provider:
thegoodfellas
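Summary of original information

Dataset overview

License

  • Type: Apache-2.0

Task categories

  • fill-mask
  • text-generation

Language

  • Portuguese (pt)

Size

  • Range: 10M<n<100M

Dataset source

  • Original dataset: AllenAI mC4 PtBR section
  • Original dataset link: allenai/c4 (https://huggingface.co/datasets/allenai/c4)

Cleaning process

  • Cleaning method: the same cleaning procedure as c4nlpreproc (https://gitlab.com/yhavinga/c4nlpreproc.git)
  • First strategy: main.py uses pyspark to create a dataframe that both cleans the text and creates a pseudo mix of the entire dataset; clever, but time- and resource-consuming
  • Second strategy (the one used): run the singlefile.py script on each file, with the runs executed in parallel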
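
The second strategy amounts to running one cleaning job per shard and letting several jobs run at once. A minimal Python sketch of that pattern follows, assuming each shard is a gzipped JSON-lines file whose records carry a `text` field. `clean_text`, `process_shard`, and `process_all` are hypothetical names introduced here, and the whitespace normalisation only stands in for the real rules in singlefile.py, which are not reproduced.

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def clean_text(text: str) -> str:
    """Hypothetical placeholder: collapse runs of whitespace.

    This is NOT the c4nlpreproc cleaning logic, only a stand-in.
    """
    return " ".join(text.split())


def process_shard(src: Path) -> Path:
    """Clean one gzipped JSON-lines shard and write the result next to it."""
    dst = src.with_name(src.name.replace(".json.gz", ".clean.json.gz"))
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
            gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["text"] = clean_text(record["text"])
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
    return dst


def process_all(shards, jobs=4):
    """Run process_shard over every shard with at most `jobs` workers."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(process_shard, shards))
```

Threads keep the sketch portable; for CPU-heavy cleaning, a process pool (or GNU parallel, as in the README) is likely the better fit.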

Dataset structure

  • Structure: same as the original dataset
  • Example structure (JSON): { "timestamp": "2020-02-22T22:24:31Z", "url": "https://url here", "text": "the content" }
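
To make the record layout concrete, here is a small Python sketch that parses one record and checks it for the three fields listed above. It assumes records are stored as one JSON object per line and that timestamps always follow the `2020-02-22T22:24:31Z` form shown in the example; `parse_record` and `REQUIRED_KEYS` are hypothetical names, not part of the dataset tooling.

```python
import json
from datetime import datetime, timezone

# Field names taken from the example structure in the dataset card.
REQUIRED_KEYS = {"timestamp", "url", "text"}


def parse_record(line: str) -> dict:
    """Parse one JSON line and convert its timestamp to an aware datetime."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    # Assumed format: UTC timestamp with a literal trailing 'Z'.
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    record["timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return record
```

Failing loudly on a missing key is a design choice for the sketch; a bulk pipeline might prefer to log and skip such records instead.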

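Usage considerations

  • Content filtering: no procedure was applied to remove bad words, vulgarity, or profanity
  • Data bias: models trained on this corpus will reflect biases present in Internet blog articles and comments, which makes it suitable for studying data biases and how to limit their impact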
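
Because no profanity filtering was applied, downstream users who need it have to add their own. One possible approach, sketched below, is a whole-word blocklist filter. The function name `make_blocklist_filter` and the idea of a user-supplied term list are assumptions made for illustration; the dataset ships no such list, and word-level matching is only a crude first pass.

```python
import re


def make_blocklist_filter(blocked_terms):
    """Return a predicate that is True when a record's text has no blocked term.

    `blocked_terms` is a user-supplied list of words (hypothetical input).
    Matching is case-insensitive and limited to whole words.
    """
    if not blocked_terms:
        return lambda record: True
    alternatives = "|".join(re.escape(term) for term in blocked_terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return lambda record: pattern.search(record["text"]) is None
```

The predicate can be passed to the built-in `filter` over an iterable of records, for example `filter(make_blocklist_filter(terms), records)`.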

The information above is distilled from the README on the dataset's detail page.
