Catalan Textual Corpus

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/4519348

Description:

The Catalan Textual Corpus is a 1760-million-token web corpus of Catalan built from several sources. These are existing corpora, namely DOGC, CaWac (non-dedup version), Oscar (unshuffled version), Open Subtitles and Catalan Wikipedia, plus three new crawls: the Catalan General Crawling, obtained by crawling the 500 most popular .cat and .ad domains; the Catalan Government Crawling, obtained by crawling the .gencat domain and its subdomains, which belong to the Catalan Government; and the ACN corpus, with 220k news items from March 2015 until October 2020, crawled from the Catalan News Agency.

The corpus consists of 1,758,388,896 tokens, 73,172,152 sentences and 12,556,365 documents. Documents are separated by single new lines. These boundaries have been preserved as long as the license allowed it.

We license the actual packaging of these data under a Creative Commons Attribution-ShareAlike 4.0 International License.

Copyright (c) 2021 Text Mining Unit at BSC

If you use this resource in your work, please cite our latest paper:

@misc{armengolestape2021multilingual,
  title={Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan},
  author={Jordi Armengol{-}Estap{\'{e}} and Casimiro Pio Carrino and Carlos Rodriguez-Penagos and Ona de Gibert Bonet and Carme Armentano{-}Oller and Aitor Gonzalez{-}Agirre and Maite Melero and Marta Villegas},
  year={2021},
  eprint={2107.07903},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
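As a rough illustration of how the document boundaries described above might be used, the following Python sketch streams a text file and counts documents, sentences and tokens. It rests on several assumptions that the description does not confirm: that each sentence sits on its own line, that an empty line marks a document boundary, and that tokens can be approximated by whitespace splitting. The function names `iter_documents` and `corpus_stats` are hypothetical, and the counts it produces will not necessarily match the published figures.

```python
import io
from typing import Dict, Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of its non-empty lines.

    Assumption: an empty line marks a document boundary and every
    non-empty line holds one sentence.
    """
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # Empty line reached: close the current document.
            yield current
            current = []
    if current:
        # The last document may not be followed by an empty line.
        yield current


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count documents, sentences and whitespace-separated tokens."""
    docs = sents = tokens = 0
    for doc in iter_documents(lines):
        docs += 1
        sents += len(doc)
        # Whitespace splitting is only an approximation of tokenization.
        tokens += sum(len(sentence.split()) for sentence in doc)
    return {"documents": docs, "sentences": sents, "tokens": tokens}


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real corpus file.
    sample = io.StringIO(
        "Bon dia a tothom.\nAvui fa sol.\n\nAixò és un altre document.\n"
    )
    print(corpus_stats(sample))
```

Because the functions accept any iterable of lines, an open file handle can be passed directly (for example `open(path, encoding="utf-8")`, where `path` points to the downloaded file), so the corpus is never loaded into memory at once.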

Created:

2021-07-21
