
CEREAL I, el Corpus del Español REAL

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/11387863
Description:
Content: CEREAL v2 (visit the project website) is a document-level corpus of Spanish documents extracted from Colossal OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under a CC0 license, but we do not hold the copyright of the content text, which comes from OSCAR and therefore from Common Crawl.

The process used to build the corpus, and the corpus characteristics, are described in: Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.

To reproduce the results of the paper, please use v1 of the corpus. The corpus used to train the classifier and the sentence-level version of CEREAL is available at https://zenodo.org/records/11390829

Files Description: See the README.txt file
Created:
2025-02-07