thegoodfellas/mc4-pt-cleaned

Name: thegoodfellas/mc4-pt-cleaned
Creator: thegoodfellas
Published: 2023-04-13 13:35:19
License: Apache-2.0

Source: Hugging Face. Updated 2023-04-13. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/thegoodfellas/mc4-pt-cleaned

Resource description:

---
license: apache-2.0
task_categories:
- fill-mask
- text-generation
language:
- pt
size_categories:
- 10M<n<100M
---

## Description

This is a cleaned version of the PtBR section of AllenAI's mC4. The original dataset can be found here: https://huggingface.co/datasets/allenai/c4

## Clean procedure

We applied the same cleaning procedure as explained here: https://gitlab.com/yhavinga/c4nlpreproc.git

The repository offers two strategies. The first one, found in the main.py file, uses pyspark to create a dataframe that can both clean the text and create a pseudo mix over the entire dataset. We found this strategy clever, but it is time- and resource-consuming.

To overcome this, we took the second approach, which consists of running the singlefile.py script on each shard and parallelizing the runs with GNU parallel. We did the following:

```
# Clone the repository without downloading the LFS payloads
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
# Fetch only the Portuguese shards
git lfs pull --include "multilingual/c4-pt.*.json.gz"
# Clean each Portuguese shard pulled above, 96 jobs at a time
ls multilingual/c4-pt.*.json.gz | parallel --gnu --jobs 96 --progress python ~/c4nlpreproc/singlefile.py {}
```

Be advised that you need to install GNU parallel first if you want to reproduce this dataset or create another one in a different language.

## Dataset Structure

We kept the same structure as the original, so each record looks like this:

```
{
  'timestamp': '2020-02-22T22:24:31Z',
  'url': 'https://url here',
  'text': 'the content'
}
```

## Considerations for Using the Data

We do not perform any procedure to remove bad words, vulgarity, or profanity. Keep in mind that a model trained on this scraped corpus will inevitably reflect biases present in blog articles and comments on the Internet. This makes the corpus especially interesting in the context of studying data biases and how to limit their impacts.

Provider:

thegoodfellas

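Summary of original information

Dataset overview

License

Type: Apache-2.0

Task categories

fill-mask
text-generation

Language

Portuguese (pt)

Size

Range: 10M<n<100M

Dataset source

Original dataset: AllenAI mC4 PtBR section
Original dataset link: https://huggingface.co/datasets/allenai/c4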

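Cleaning process

Cleaning method: the same cleaning procedure as described at https://gitlab.com/yhavinga/c4nlpreproc.git
Main strategy: main.py uses pyspark to create a dataframe that both cleans the text and creates a pseudo mix over the entire dataset
Alternative strategy: run the singlefile.py script on each shard in parallel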
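
The per-shard fan-out described above can also be sketched in Python for readers who do not have GNU parallel available. This is only an illustration under stated assumptions, not the authors' method: `run_one` and `run_all` are hypothetical helper names, and the sketch assumes the cleaning script takes a single shard path as its only argument, as the `parallel ... singlefile.py {}` invocation in the README suggests.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def run_one(script: str, shard: Path) -> int:
    """Run `python <script> <shard>` and return the process exit code."""
    # Assumption: the script accepts exactly one positional shard path.
    return subprocess.run([sys.executable, script, str(shard)]).returncode


def run_all(script: str, shards: List[Path], jobs: int = 4) -> Dict[str, int]:
    """Process every shard with at most `jobs` concurrent subprocesses."""
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = list(pool.map(lambda shard: run_one(script, shard), shards))
    return {str(shard): code for shard, code in zip(shards, codes)}
```

A call such as `run_all("singlefile.py", sorted(Path("multilingual").glob("c4-pt.*.json.gz")), jobs=96)` would mirror the job count used in the README; any shard whose exit code is non-zero can then be retried on its own.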

Dataset structure

Structure: same as the original dataset
Example structure (JSON): { "timestamp": "2020-02-22T22:24:31Z", "url": "https://url here", "text": "the content" }
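
To make the record structure concrete, here is a minimal sketch of reading one shard and checking that each record carries the three fields listed above. It assumes the cleaned shards keep the gzip-compressed, one-JSON-object-per-line layout of the original `c4-pt.*.json.gz` files, which the page does not state explicitly; `iter_records` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
import gzip
import json
from typing import Dict, Iterator

# Fields named in the dataset structure section.
REQUIRED_KEYS = ("timestamp", "url", "text")


def iter_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield one record per non-empty line of a gzip-compressed JSON-lines file."""
    # Assumption: one JSON object per line, UTF-8 encoded.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError("record is missing fields: {}".format(missing))
            yield record
```

Because the function is a generator, a shard can be streamed record by record without loading the whole file into memory.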

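Usage considerations

Content moderation: no procedure was applied to remove bad words, vulgarity, or profanity
Data bias: models trained on this corpus may reflect biases present in blog articles and comments on the Internet; the corpus is suited to studying data biases and how to limit their impact

The information above is distilled from the README on the dataset's detail page.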
