nuvocare/MSD_manual_topics_user_base

Name: nuvocare/MSD_manual_topics_user_base
Creator: nuvocare
Published: 2024-04-24 11:41:04
License: No description provided

Hugging Face | Updated 2024-04-24 | Indexed 2024-06-22

Download link:

https://hf-mirror.com/datasets/nuvocare/MSD_manual_topics_user_base

Resource description:

---
dataset_info:
  features:
  - name: User
    dtype: string
  - name: Category
    dtype: string
  - name: Language
    dtype:
      class_label:
        names:
          '0': english
          '1': french
          '2': german
          '3': spanish
  - name: Topic1
    dtype: string
  - name: Topic2
    dtype: string
  - name: Topic3
    dtype: string
  - name: Text
    dtype: string
  splits:
  - name: train
    num_bytes: 129319477.93694109
    num_examples: 81269
  - name: test
    num_bytes: 43107023.063058905
    num_examples: 27090
  download_size: 104135445
  dataset_size: 172426501.0
license: apache-2.0
task_categories:
- text-classification
- question-answering
- text-generation
language:
- en
- de
- fr
- es
tags:
- medical
- healthcare
size_categories:
- 10K<n<100K
---

# MSD_manual_topics_user_base

This dataset was built from the website https://www.msdmanuals.com/, provided by Merck & Co for a broad audience.

The MSD manual is an essential source of knowledge on symptoms, diseases, health and related topics. It is made accessible to both professionals and patients through two distinct versions. The content carries the same labels in both versions but differs by user type, to ease understanding for patients or to give clear details to professionals.

The manual is available in several languages. This dataset focuses on Spanish, German, English and French content about health topics and symptoms. The content is tagged with 2 to 3 medical topics and flagged by user type and language.

It consists of roughly 21M words, representing 45M tokens with a BERT tokenizer. In total, the dataset consists of 21k texts. The splits (train: 75%; test: 25%) are made so that languages are equally balanced.

Use cases:

- Create adaptive agents for medical explanation
- Benchmark models' abilities to explain
- Create medical agents
- Fine-tune models

An instruct-based version is available here: https://huggingface.co/datasets/nuvocare/MSD_manual_topics_user_instruct

This dataset is built using the website https://www.msdmanuals.com/, provided by Merck & Co. All credit for the content goes to the MSD organization.

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

Provider:

nuvocare

Summary of original information

Dataset overview

Dataset name

MSD_manual_topics_user_base

Dataset features

User: string
Category: string
Language: class label (english, french, german, spanish)
Topic1: string
Topic2: string
Topic3: string
Text: string
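
Because Language is stored as an integer class label rather than a string, rows need decoding before they are readable. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable under the ID shown above. The helper names `language_name` and `load_split` are hypothetical, and the label order is copied from the dataset card metadata.

```python
# Label order copied from the `class_label` names in the dataset card.
LANGUAGE_NAMES = ["english", "french", "german", "spanish"]
DATASET_ID = "nuvocare/MSD_manual_topics_user_base"


def language_name(label: int) -> str:
    """Map the integer stored in the Language column to its name."""
    if not 0 <= label < len(LANGUAGE_NAMES):
        raise ValueError(f"unknown Language label: {label}")
    return LANGUAGE_NAMES[label]


def load_split(split: str = "train"):
    """Download one split (hypothetical helper; needs network access)."""
    # Imported lazily so the decoding helper works without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


if __name__ == "__main__":
    rows = load_split("test")
    first = rows[0]
    # Each row is a dict keyed by the feature names listed above.
    print(language_name(first["Language"]), first["Category"])
```

Keeping the label list in one place means a filter such as "only French rows" can compare against `LANGUAGE_NAMES.index("french")` instead of a bare integer.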

Dataset splits

train: 81269 examples, 129319477.93694109 bytes
test: 27090 examples, 43107023.063058905 bytes
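
The example counts above can be checked against the 75% / 25% split stated in the dataset description. A short sketch of that arithmetic, using only the two counts listed here:

```python
# Example counts taken from the split listing above.
TRAIN_EXAMPLES = 81269
TEST_EXAMPLES = 27090

total = TRAIN_EXAMPLES + TEST_EXAMPLES
train_share = TRAIN_EXAMPLES / total
test_share = TEST_EXAMPLES / total

print(f"total examples: {total}")          # total examples: 108359
print(f"train share: {train_share:.2%}")   # train share: 75.00%
print(f"test share: {test_share:.2%}")     # test share: 25.00%
```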

Dataset size

Download size: 104135445 bytes
Dataset size: 172426501.0 bytes
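
As a consistency check, the per-split byte counts should add up to the reported dataset size, and the download size can be compared with it. A small sketch using only the figures listed on this page; reading the ratio as a compression factor is an assumption, since the page does not say how the download is packaged:

```python
# Byte counts copied from the split and size listings on this page.
TRAIN_BYTES = 129319477.93694109
TEST_BYTES = 43107023.063058905
DATASET_BYTES = 172426501.0
DOWNLOAD_BYTES = 104135445

split_sum = TRAIN_BYTES + TEST_BYTES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

# The two splits should account for the whole dataset size.
print(f"sum of splits: {split_sum:.1f} bytes")
# Assumed to reflect compression of the downloaded files.
print(f"download / dataset size: {download_ratio:.3f}")
```

The download is roughly 60% of the in-memory dataset size.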

License

apache-2.0

Task categories

Text classification
Question answering
Text generation

Languages

English
German
French
Spanish

Size category

10K<n<100K
