CEREAL I, el Corpus del Español REAL
Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/11387863
Resource description:
Content:
CEREAL v2 (visit the project website) is a document-level corpus of Spanish texts extracted from Colossal OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under the CC0 license, but we do not hold the copyright to the content text, which comes from OSCAR and therefore from Common Crawl.
The corpus construction process and the characteristics of the corpus are described in:
Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.
To reproduce the results of the paper, please use v1 of the corpus. The corpus used to train the classifier and the sentence-level version of CEREAL are available at https://zenodo.org/records/11390829
Files Description:
See the README.txt file
Created:
2025-02-07



