CoWeSe (Corpus Web Salud Español)

Name: CoWeSe (Corpus Web Salud Español)
Creator: Text Mining Unit, Barcelona Supercomputing Center
Published: 2021-09-16 15:22:28
License: Not specified

Source: arXiv. Updated: 2021-09-16. Indexed: 2024-06-21.

Download link:

https://doi.org/10.5281/zenodo.4561970

Description:

CoWeSe (Corpus Web Salud Español, the Spanish Health Web Corpus) was created by the Text Mining Unit of the Barcelona Supercomputing Center and is described as the largest Spanish biomedical corpus to date. It contains approximately 750 million tokens, with a total size of 4.5 GB. The data was collected in 2020 through a large-scale crawl of 3,000 Spanish-language websites covering medicine, science, and medical journals, among other domains, and was then processed with a custom data cleaning pipeline to ensure quality. CoWeSe is intended primarily for Spanish biomedical natural language processing: it supports training domain-specific language models and generating word embeddings, and aims to address the scarcity of non-English biomedical data resources.
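The description mentions a custom cleaning pipeline but does not say what it does. As a rough illustration only, the sketch below shows the kind of steps such a pipeline often includes: whitespace normalization, dropping very short lines, and exact-duplicate removal. The function names `clean_lines` and `count_tokens`, the `min_words` threshold, and the whitespace-based token count are all assumptions made for this example, not the CoWeSe authors' actual method.

```python
import hashlib
import re
from typing import Iterable, Iterator

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def clean_lines(lines: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Yield normalized, de-duplicated lines (hypothetical cleaning step).

    min_words is an arbitrary illustrative threshold, not a value
    taken from the CoWeSe pipeline.
    """
    seen = set()
    for raw in lines:
        # Collapse whitespace runs and trim the ends.
        text = _WHITESPACE.sub(" ", raw).strip()
        # Drop short fragments such as menus or navigation labels.
        if len(text.split()) < min_words:
            continue
        # Case-insensitive exact-duplicate check via a hash of the line.
        digest = hashlib.sha1(text.lower().encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens (a crude proxy for corpus size)."""
    return sum(len(line.split()) for line in lines)


if __name__ == "__main__":
    sample = [
        "La  diabetes es una enfermedad crónica.",
        "Inicio | Contacto",
        "la diabetes es una enfermedad   crónica.",
        "El tratamiento depende del tipo de diabetes.",
    ]
    cleaned = list(clean_lines(sample))
    print(cleaned)
    print(count_tokens(cleaned))
```

Because `clean_lines` is a generator, it can be applied to a file object line by line without loading a multi-gigabyte corpus into memory, although the set of seen hashes still grows with the number of unique lines.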

Provider:

Text Mining Unit, Barcelona Supercomputing Center

Created:

2021-09-16
