
hiroshi-matsuda-rit/filtered_mc4

Source: Hugging Face. Updated 2023-08-28. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hiroshi-matsuda-rit/filtered_mc4
Resource description:
---
pretty_name: filtered-mc4
license:
- odc-by
multilinguality:
- multilingual
---

# Dataset Card for filtered-mc4

See original [mC4 dataset](https://huggingface.co/datasets/mc4) descriptions.

You can apply any regular expression to the mC4 dataset like this:

```python
from datasets import load_dataset

dataset = load_dataset(
    'hiroshi-matsuda-rit/filtered_mc4',
    'ja',
    split='train',
    reject_patterns=[
        r"(セフレ|出会い?系|(?<!ユニ)セックス|ソープガイド)",
        r"[^\s]\ [^\s]+\ [^\s]",
    ],
    max_reject_pattern_occurence=3,
    streaming=True,
)
```

### Citation Information

```
@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}
```
Provider:
hiroshi-matsuda-rit
Summary of original information

Dataset Card for filtered-mc4

Basic information

  • Name: filtered-mc4
  • License: odc-by
  • Multilinguality: multilingual

Usage example

    from datasets import load_dataset

    dataset = load_dataset('hiroshi-matsuda-rit/filtered_mc4', 'ja', split='train', reject_patterns=[r"(セフレ|出会い?系|(?<!ユニ)セックス|ソープガイド)", r"[^\s]\ [^\s]+\ [^\s]"], max_reject_pattern_occurence=3, streaming=True)
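The dataset card does not spell out how `reject_patterns` and `max_reject_pattern_occurence` interact, so the following is only a minimal local sketch of one plausible reading: count every regular-expression match in a document and drop the document once the total exceeds the threshold. The helper names (`count_matches`, `should_reject`, `filter_texts`) are hypothetical and are not part of the dataset loader, and the "strictly greater than" comparison is an assumption. The two patterns are copied from the usage example above.

```python
import re
from typing import Iterable, Iterator, Sequence

# Patterns copied from the usage example above.
REJECT_PATTERNS = [
    r"(セフレ|出会い?系|(?<!ユニ)セックス|ソープガイド)",
    r"[^\s]\ [^\s]+\ [^\s]",
]


def count_matches(text: str, patterns: Sequence[re.Pattern]) -> int:
    """Return the total number of non-overlapping matches over all patterns."""
    return sum(1 for pattern in patterns for _ in pattern.finditer(text))


def should_reject(text: str, patterns: Sequence[re.Pattern], max_occurrence: int = 3) -> bool:
    """ASSUMPTION: a text is rejected when its match count exceeds max_occurrence."""
    return count_matches(text, patterns) > max_occurrence


def filter_texts(
    texts: Iterable[str],
    patterns: Sequence[str] = REJECT_PATTERNS,
    max_occurrence: int = 3,
) -> Iterator[str]:
    """Lazily yield only the texts that pass the (assumed) rejection rule."""
    compiled = [re.compile(p) for p in patterns]
    for text in texts:
        if not should_reject(text, compiled, max_occurrence):
            yield text
```

Because `filter_texts` is a generator, it can wrap any iterable of strings, including the `text` field of records read from a streaming dataset, without loading everything into memory. Note that the negative lookbehind `(?<!ユニ)` keeps the word ユニセックス ("unisex") from counting as a match.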

Citation information

@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}
