Spanish Corpus

Name: Spanish Corpus
Creator: Authors of the paper
License: No description available

Indexed from arXiv: 2025-09-30

Download link:

https://github.com/josecannete/spanish-corpora

Description:

This is a large corpus collected for training a Spanish BERT model. It contains text from Wikipedia and from several sources in the OPUS Project, including United Nations and government journals, TED Talks, subtitles, and news stories. The corpus updates the compilation released by Cardellino in 2016 and is comparable in size to the original BERT training corpus, at approximately 3 billion words. Its intended use is pre-training a BERT-based language model.
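To check the "approximately 3 billion words" figure against a local copy, one could stream the text files and count whitespace-delimited tokens. The following is a minimal sketch only: the function name `count_words` is hypothetical, and it assumes the corpus is stored as UTF-8 plain-text files, which this page does not confirm.

```python
def count_words(paths):
    """Count whitespace-delimited tokens across plain-text files.

    Files are read line by line, so memory use stays small even
    for multi-gigabyte inputs. Assumes UTF-8 text (an assumption,
    not a documented property of the corpus).
    """
    total = 0
    for path in paths:
        # errors="replace" avoids aborting on the occasional bad byte
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                total += len(line.split())
    return total
```

Splitting on whitespace is a rough proxy for word counts; a different tokenizer would give a somewhat different total.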

Provided by:

Authors of the paper
