nuvocare/MSD_manual_topics_user_base
Source: Hugging Face. Updated 2024-04-24; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/nuvocare/MSD_manual_topics_user_base
Resource description:
---
dataset_info:
  features:
  - name: User
    dtype: string
  - name: Category
    dtype: string
  - name: Language
    dtype:
      class_label:
        names:
          '0': english
          '1': french
          '2': german
          '3': spanish
  - name: Topic1
    dtype: string
  - name: Topic2
    dtype: string
  - name: Topic3
    dtype: string
  - name: Text
    dtype: string
  splits:
  - name: train
    num_bytes: 129319477.93694109
    num_examples: 81269
  - name: test
    num_bytes: 43107023.063058905
    num_examples: 27090
  download_size: 104135445
  dataset_size: 172426501.0
license: apache-2.0
task_categories:
- text-classification
- question-answering
- text-generation
language:
- en
- de
- fr
- es
tags:
- medical
- healthcare
size_categories:
- 10K<n<100K
---
# MSD_manual_topics_user_base
This dataset was built from the website https://www.msdmanuals.com/, which Merck & Co provides for a general audience.
The MSD Manual is an essential source of knowledge on symptoms, diseases, health and related topics. It is published in two distinct versions, one for professionals and one for patients.
The content carries the same labels in both versions but differs by user type: it is written to be easier to understand for patients and to give clear details for professionals. The manual is available in several languages.
This dataset focuses on Spanish, German, English and French content about health topics and symptoms. Each text is tagged with 2 to 3 medical topics and flagged by user type and language.
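As an illustration of the tagging described above, here is a minimal sketch of how a single row might be read. The label order is copied from the `class_label` names in the card metadata. The example row, its field values, and the helper names (`language_name`, `row_topics`, `load_split`) are hypothetical. The sketch also assumes that a missing third topic is stored as `None` or an empty string, which the card does not state.

```python
# Label order copied from the class_label names in the card metadata.
LANGUAGE_NAMES = ["english", "french", "german", "spanish"]


def language_name(label_id):
    """Map the integer stored in the Language column to its name."""
    return LANGUAGE_NAMES[label_id]


def row_topics(row):
    """Collect the non-empty topic tags of a row (Topic1 to Topic3).

    Assumption: an absent topic is None or an empty string.
    """
    return [row[key] for key in ("Topic1", "Topic2", "Topic3") if row.get(key)]


def load_split(split="train"):
    """Load one split from the Hugging Face Hub.

    Needs the `datasets` package and network access, so the import
    is kept local and the function is not called in this sketch.
    """
    from datasets import load_dataset

    return load_dataset("nuvocare/MSD_manual_topics_user_base", split=split)


# Hypothetical row with placeholder values, not taken from the dataset.
example_row = {
    "User": "placeholder user type",
    "Category": "placeholder category",
    "Language": 1,
    "Topic1": "topic a",
    "Topic2": "topic b",
    "Topic3": None,
    "Text": "placeholder text",
}

if __name__ == "__main__":
    print(language_name(example_row["Language"]), row_topics(example_row))
```

When the dataset is loaded through the `datasets` library, the `Language` feature should be a `ClassLabel`, whose `int2str` method performs the same decoding.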
It consists of roughly 21M words, or about 45M tokens with a BERT tokenizer.
In total, the dataset consists of 21k texts. The splits (train: 75%; test: 25%) are balanced across languages.
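The 75% / 25% proportions can be checked against the per-split figures in the card metadata. The sketch below uses only those figures. They count examples in the published splits, which may be counted differently from the text total quoted above. The helper name `split_shares` is hypothetical.

```python
# Split sizes copied from the dataset_info block of the card metadata.
SPLITS = {
    "train": {"num_examples": 81269, "num_bytes": 129319477.93694109},
    "test": {"num_examples": 27090, "num_bytes": 43107023.063058905},
}


def split_shares(splits, key="num_examples"):
    """Return each split's fraction of the total for the given key."""
    total = sum(info[key] for info in splits.values())
    return {name: info[key] / total for name, info in splits.items()}


total_examples = sum(info["num_examples"] for info in SPLITS.values())
total_bytes = sum(info["num_bytes"] for info in SPLITS.values())
shares = split_shares(SPLITS)

if __name__ == "__main__":
    print(total_examples)  # 108359
    for name, share in shares.items():
        print(f"{name}: {share:.2%}")
```

The byte counts of the two splits also add up to the `dataset_size` value given in the metadata (172426501.0).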
Use cases:
- Create adaptive agents for medical explanation
- Benchmark models' abilities to explain
- Create medical agents
- Fine-tune models
An instruct-based version is available here: https://huggingface.co/datasets/nuvocare/MSD_manual_topics_user_instruct
This dataset is built using the website https://www.msdmanuals.com/ provided by Merck & Co.
All credit for the content goes to the MSD organization.
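Provider:
nuvocare
Original information summary
Dataset overview
Dataset name
MSD_manual_topics_user_base
Dataset features
- User: string
- Category: string
- Language: class label (English, French, German, Spanish)
- Topic1: string
- Topic2: string
- Topic3: string
- Text: string
Dataset splits
- train: 81269 examples, 129319477.93694109 bytes
- test: 27090 examples, 43107023.063058905 bytes
Dataset size
- Download size: 104135445 bytes
- Dataset size: 172426501.0 bytes
License
apache-2.0
Task categories
- Text classification
- Question answering
- Text generation
Languages
- English
- German
- French
- Spanish
Tags
- Medical
- Healthcare
Size category
- 10K<n<100K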



