Medical Large Model Pre-training Dataset (医疗大模型预训练数据集)

Name: Medical Large Model Pre-training Dataset (医疗大模型预训练数据集)
Creator: 北方健康医疗大数据科技有限公司
Published: 2024-03-07 00:00:00
License: Not specified

Shandong Province Data Intellectual Property Deposit and Registration Platform (山东省数据知识产权存证登记平台). Updated 2024-03-07; indexed 2024-05-08.

Download link:

https://sddip.com/djgg/publicDetails/f58743aa77f04831a389d2ce2ee73bbb

Resource description:

This pre-training dataset is a medical text corpus built by our company for training large language models (LLMs). Its goal is to train a language model with strong comprehension of the medical domain, in support of more accurate diagnosis, better patient care, and more efficient healthcare delivery. Pre-training on large-scale medical text lets the model better understand domain-specific medical text and provide useful answers and guidance for medical questions. The dataset is intended to give medical-domain language models a structured, ordered, standardized, and labeled training foundation, improving their comprehension and applicability in medical scenarios. The dataset is on the order of 10 billion tokens.

Provider:

北方健康医疗大数据科技有限公司

The above content was collected and summarized by 遇见数据集.