
Catalan General Crawling

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/4636227
Description:
If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

The Catalan General Crawling Corpus is a 435-million-token web corpus of Catalan. It was obtained by crawling the 500 most popular .cat and .ad domains during July 2020. It consists of 434,817,705 tokens, 19,451,691 sentences and 1,016,114 documents. Documents are separated by single newlines. It is a subcorpus of the Catalan Textual Corpus.

We license the actual packaging of this data under an Attribution 4.0 International License.

Copyright (c) 2021 Text Mining Unit at BSC
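Since the description states that documents are separated by single newlines, a reader might iterate over the corpus one line at a time. The following is a minimal sketch under that reading, not the authors' tooling: the function names `iter_documents` and `corpus_stats` are hypothetical, and whitespace splitting is only a rough stand-in for whatever tokenizer produced the official counts, so the totals it reports should not be expected to match the figures above.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def iter_documents(stream: TextIO) -> Iterator[str]:
    """Yield one document per non-empty line of a text stream.

    Assumption: each line holds exactly one document, as suggested by
    "documents are separated by single newlines".
    """
    for line in stream:
        doc = line.strip()
        if doc:  # skip blank lines defensively
            yield doc


def corpus_stats(docs: Iterable[str]) -> Tuple[int, int]:
    """Return (document count, whitespace-token count).

    Assumption: whitespace tokenization, which is only an approximation
    of the tokenization behind the published token count.
    """
    n_docs = 0
    n_tokens = 0
    for doc in docs:
        n_docs += 1
        n_tokens += len(doc.split())
    return n_docs, n_tokens


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real corpus file.
    sample = io.StringIO(
        "Bon dia a tothom.\n"
        "Això és un altre document. Té dues frases.\n"
    )
    print(corpus_stats(iter_documents(sample)))
```

For the real corpus, the in-memory sample would be replaced by a file opened in text mode with UTF-8 encoding. Streaming line by line keeps memory use flat, which matters for a file with roughly a million documents.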
Created:
2022-10-06