spanish_billion_words

Name: spanish_billion_words
Creator: huggingface.co
License: no description available

Source: huggingface.co, indexed 2025-01-15

Download link:

https://huggingface.co/datasets/crscardellino/spanish_billion_words

Description:

An unannotated Spanish corpus of nearly 1.5 billion words, compiled from various web resources. These resources include the Spanish portions of SenSem, the Ancora Corpus, some OPUS Project corpora and Europarl, the Tibidabo Treebank, the IULA Spanish LSP Treebank, and dumps of the Spanish Wikipedia, Wikisource and Wikibooks. The corpus is distributed as 100 text files. Each line in these files is one of the corpus's 50 million sentences.
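The one-sentence-per-line layout described above can be consumed with a simple streaming reader. The following is a minimal sketch, not an official loader: it assumes the text files have already been downloaded and unpacked into a local directory, that they are UTF-8 encoded, and that whitespace splitting is an acceptable approximation of word boundaries. The function names `iter_sentences` and `corpus_stats` and the directory argument are hypothetical, chosen for illustration only.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_sentences(corpus_dir: str) -> Iterator[str]:
    """Yield one sentence per non-empty line from every file in corpus_dir.

    Assumption: files are plain text, UTF-8 encoded, one sentence per line.
    """
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        # Stream line by line so a large file is never held in memory at once.
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                sentence = line.strip()
                if sentence:
                    yield sentence


def corpus_stats(corpus_dir: str) -> Tuple[int, int]:
    """Return (sentence_count, word_count), splitting words on whitespace."""
    sentences = 0
    words = 0
    for sentence in iter_sentences(corpus_dir):
        sentences += 1
        words += len(sentence.split())
    return sentences, words
```

Calling `corpus_stats` on the unpacked directory gives a rough check against the figures quoted in the description (about 50 million sentences and nearly 1.5 billion words). Exact counts may differ depending on how words are tokenized.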

Provider:

huggingface.co
