hiroshi-matsuda-rit/filtered_mc4
Source: Hugging Face. Updated 2023-08-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hiroshi-matsuda-rit/filtered_mc4
Description:
---
pretty_name: filtered-mc4
license:
- odc-by
multilinguality:
- multilingual
---
# Dataset Card for filtered-mc4
See the original [mC4 dataset](https://huggingface.co/datasets/mc4) card for a description of the underlying data.
You can filter the mC4 dataset with any regular expressions by passing them as `reject_patterns`, like this:
```python
from datasets import load_dataset

dataset = load_dataset(
    'hiroshi-matsuda-rit/filtered_mc4',
    'ja',
    split='train',
    reject_patterns=[
        r"(セフレ|出会い?系|(?<!ユニ)セックス|ソープガイド)",
        r"[^\s]\ [^\s]+\ [^\s]",
    ],
    max_reject_pattern_occurence=3,
    streaming=True,
)
```
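To make the idea of pattern-based rejection concrete, here is a minimal local sketch in plain Python. It is not the loader's actual implementation: the helper names `count_rejects` and `keep_document` are hypothetical, and the rule that a document is dropped once the total number of matches exceeds the threshold is an assumption that this card does not confirm.

```python
import re
from typing import List, Pattern


def count_rejects(text: str, patterns: List[Pattern[str]]) -> int:
    """Count non-overlapping matches of all reject patterns in the text."""
    return sum(1 for pattern in patterns for _ in pattern.finditer(text))


def keep_document(text: str, patterns: List[Pattern[str]], max_occurrence: int) -> bool:
    """Keep the document while the match count stays within the threshold.

    Assumption: a document is rejected only when the total number of
    matches is strictly greater than max_occurrence.
    """
    return count_rejects(text, patterns) <= max_occurrence


# Illustrative patterns; the second uses a negative lookbehind,
# similar in spirit to (?<!ユニ)セックス in the example above.
patterns = [re.compile(r"spam"), re.compile(r"(?<!un)wanted")]

documents = [
    "spam spam",                # 2 matches
    "spam spam spam wanted",    # 4 matches
    "an unwanted guest",        # 0 matches (lookbehind blocks "unwanted")
]
kept = [doc for doc in documents if keep_document(doc, patterns, max_occurrence=3)]
```

With a threshold of 3, only the second document is dropped under this assumed rule, while the lookbehind keeps "unwanted" from counting as a match.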
### Citation Information
```
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
```
Provider:
hiroshi-matsuda-rit



