SINAI/ALIA-biomedical
Source: Hugging Face | Updated: 2026-04-27 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/SINAI/ALIA-biomedical
Description:
The ALIA Biomedical Corpus is an open-access, transparent strategic data infrastructure that compiles and organizes an extensive collection of official documents and scientific texts from the Spanish biomedical domain. It is built from reliable, verifiable sources, which ensures data traceability and the lawful reuse of materials, and it is freely and openly available under the terms of its usage license. Its purpose is to provide a homogeneous, structured, and accessible document base for researchers, healthcare professionals, and computational linguists who analyze and work with medical, pharmacological, and clinical texts in Spanish.

The corpus takes an integrative approach, encompassing clinical guidelines, medical registries, scientific publications, and official health bulletins. It covers key areas such as pharmacology, epidemiology, public health, and specialized medical research. This diversity gives comprehensive coverage of the documents that regulate and record medical and scientific activity in the Spanish-speaking world.

With approximately 10 million instances and over 5.5 billion tokens, the corpus is an unprecedented resource for developing Large Language Models (LLMs) specialized in medicine, Natural Language Processing (NLP) tools for clinical language, and medical informatics research in Spanish. Its processed form facilitates advanced use in text mining, semantic modeling, information retrieval, and the construction of artificial intelligence systems for healthcare and life sciences.
Provided by:
SINAI
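
Given the size stated in the description (roughly 10 million instances), a reader may prefer to inspect a few records before downloading everything. The following is a minimal sketch of that idea, assuming the corpus can be read with the Hugging Face `datasets` library in streaming mode. The split name `"train"` is an assumption, not something this listing states, and the record fields are not documented here, so the sketch returns records unchanged. `take_first` and `preview_corpus` are hypothetical helper names introduced only for illustration.

```python
from itertools import islice


def take_first(records, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))


def preview_corpus(n=3, repo_id="SINAI/ALIA-biomedical", split="train"):
    """Stream the first n records without downloading the full corpus.

    Assumptions: the `datasets` package is installed, and the repository
    exposes a split called "train" (not confirmed by this listing).
    """
    # Imported lazily so that take_first works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of fetching all files.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return take_first(stream, n)


if __name__ == "__main__":
    for record in preview_corpus():
        # Field names are unknown here, so print each record as-is.
        print(record)
```

If the repository defines several configurations, `load_dataset` may also require a configuration name; the dataset card on Hugging Face is the place to check.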



