CEREAL I, el Corpus del Español REAL

Source: NIAID Data Ecosystem. Indexed 2026-05-02.

Download link:

https://zenodo.org/record/11387863

Description:

Content: CEREAL v2 (visit the project website) is a document-level corpus of Spanish documents extracted from Colossal OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under the CC0 license, but we do not hold the copyright of the content text, which comes from OSCAR and therefore from Common Crawl.

The process used to build the corpus and its characteristics are described in: Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024. To reproduce the results of the paper, please use v1 of the corpus.

The corpus used to train the classifier and the sentence-level version of CEREAL is available at https://zenodo.org/records/11390829

Files Description: See the README.txt file

Created:

2025-02-07
