mC4
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/google-research/multilingual-t5
Description:
mC4 is a large-scale multilingual corpus. A 700GB subset of it was used to train a Transformer model for unsupervised Spanish poetry generation.
Provider:
mC4



