thegoodfellas/mc4-pt-cleaned
Source: Hugging Face. Updated: 2023-04-13. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/thegoodfellas/mc4-pt-cleaned
Resource description:
---
license: apache-2.0
task_categories:
- fill-mask
- text-generation
language:
- pt
size_categories:
- 10M<n<100M
---
## Description
This is a cleaned version of the PtBR (Portuguese) section of AllenAI's mC4. The original dataset can be found here: https://huggingface.co/datasets/allenai/c4
## Clean procedure
We applied the same cleaning procedure as described here: https://gitlab.com/yhavinga/c4nlpreproc.git
The repository offers two strategies. The first, found in the main.py file, uses pyspark to build a dataframe that both cleans the text and creates a pseudo mix over the entire dataset. We found this strategy clever, but it is time- and resource-consuming.
To overcome this, we took the second approach, which runs the singlefile.py script on each file and uses GNU parallel to process the files concurrently.
We did the following:
```
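# Clone without downloading the LFS payloads, then fetch only the Portuguese shards
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "multilingual/c4-pt.*.json.gz"
# Clean every shard in parallel (adjust --jobs to the number of available cores)
ls multilingual/c4-pt.*.json.gz | parallel --gnu --jobs 96 --progress python ~/c4nlpreproc/singlefile.py {}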
```
Note that you need to install GNU parallel first if you want to reproduce this dataset or create one for a different language.
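
For readers who prefer not to depend on GNU parallel, the same fan-out over shard files can be approximated in Python. This is a minimal sketch, not the procedure used to build this dataset: `clean_shard`, `clean_all`, and the `jobs` default are hypothetical names, and it assumes that `singlefile.py` accepts the shard path as its only argument, as the shell command suggests.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def clean_shard(shard: Path, script: Path) -> int:
    """Run the cleaning script on one shard and return its exit code."""
    # Assumption: the script takes the shard path as its only argument,
    # mirroring the `python singlefile.py {}` call in the shell pipeline.
    result = subprocess.run([sys.executable, str(script), str(shard)])
    return result.returncode


def clean_all(shard_dir: Path, script: Path, jobs: int = 8) -> dict:
    """Clean every Portuguese shard in `shard_dir`, `jobs` at a time."""
    shards = sorted(shard_dir.glob("c4-pt.*.json.gz"))
    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        codes = pool.map(lambda shard: clean_shard(shard, script), shards)
        # Map each shard name to the exit code of its cleaning run.
        return {shard.name: code for shard, code in zip(shards, codes)}
```

A call such as `clean_all(Path("c4/multilingual"), Path.home() / "c4nlpreproc" / "singlefile.py", jobs=96)` would mirror the shell pipeline; a non-zero value in the returned mapping marks a shard whose cleaning run failed.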
## Dataset Structure
We kept the same structure as the original dataset, so each record looks like this:
```
{
'timestamp': '2020-02-22T22:24:31Z',
'url': 'https://url here',
'text': 'the content'
}
```
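
To make the record layout concrete, the sketch below reads one shard and prints a short preview of each record. It assumes the cleaned shards keep the upstream layout of one JSON object per line in a `.json.gz` file; `iter_records` and `summarize` are hypothetical helper names, not part of this dataset or of any library.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzipped JSON-lines shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(record: dict, width: int = 60) -> str:
    """Format a record as 'timestamp  url  text-preview' for quick inspection."""
    # Truncate the text and flatten newlines so each record fits on one line.
    preview = record["text"][:width].replace("\n", " ")
    return f"{record['timestamp']}  {record['url']}  {preview}"
```

Iterating lazily keeps memory use flat, which matters for a corpus in the 10M to 100M record range: `for record in iter_records(shard_path): print(summarize(record))`.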
## Considerations for Using the Data
We did not apply any procedure to remove bad words, vulgarity, or profanity. Keep in mind that a model trained on this scraped corpus will inevitably reflect the biases present in blog articles and comments on the Internet. This makes the corpus especially interesting for studying data biases and how to limit their impact.
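
Because no such filtering ships with the dataset, downstream users who need it have to apply their own. The sketch below shows one simple option, a whole-word blocklist filter. It is an illustration only: `build_matcher` and `drop_flagged` are hypothetical names, the blocklist is supplied by the caller, and a word list is a blunt tool that neither catches every offensive passage nor addresses the broader biases described above.

```python
import re
from typing import Iterable, Iterator


def build_matcher(blocklist: Iterable[str]) -> re.Pattern:
    """Compile a case-insensitive whole-word pattern from a list of terms."""
    terms = sorted({re.escape(term) for term in blocklist if term})
    if not terms:
        # An empty blocklist yields a pattern that never matches.
        return re.compile(r"(?!)")
    return re.compile(r"\b(?:" + "|".join(terms) + r")\b", re.IGNORECASE)


def drop_flagged(records: Iterable[dict], matcher: re.Pattern) -> Iterator[dict]:
    """Yield only the records whose 'text' field contains no blocklisted term."""
    for record in records:
        if not matcher.search(record["text"]):
            yield record
```

Matching on word boundaries avoids discarding records where a blocklisted term only appears inside a longer, unrelated word.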
Provider:
thegoodfellas
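Summary of the original information
Dataset overview
License
- Type: Apache-2.0
Task categories
- fill-mask
- text-generation
Language
- Portuguese (pt)
Size
- Range: 10M<n<100M
Dataset source
- Original dataset: AllenAI mC4 PtBR section
- Original dataset link: AllenAI/c4
Cleaning process
- Cleaning method: the same cleaning procedure as the c4nlpreproc repository
- Main strategy: use pyspark in main.py to build a dataframe that both cleans the text and creates a pseudo mix of the dataset
- Alternative strategy: run the singlefile.py script in parallel
Dataset structure
- Structure: same as the original dataset
- Example structure: { 'timestamp': '2020-02-22T22:24:31Z', 'url': 'https://url here', 'text': 'the content' }
Usage considerations
- Content filtering: no procedure was applied to remove bad words, vulgarity, or profanity
- Data bias: trained models may reflect the biases present in Internet blog articles and comments; the corpus is suited to studying data biases and how to limit their impact
The information above was distilled from the README on the dataset's detail page.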



