Name: 22yuan/MMedC
Creator: 22yuan
Published: 2026-01-26 07:19:06
License: No description available

Download link:

https://hf-mirror.com/datasets/22yuan/MMedC


Resource description:

---
license: cc-by-nc-sa-4.0
language:
- en
- zh
- ja
- fr
- ru
- es
- ar
- de
tags:
- medical
size_categories:
- 10B<n<100B
---

# MMedC

[💻Github Repo](https://github.com/MAGIC-AI4Med/MMedLM) [🖨️arXiv Paper](https://arxiv.org/abs/2402.13963)

The official pre-training dataset for "Towards Building Multilingual Language Model for Medicine".

## News

- We have added Arabic and German corpora to MMedC.

## Introduction

This repo contains MMedC, a multilingual medical corpus with 25.5 billion tokens. The table below breaks the corpus down by language and data source (values in billions of tokens).

| Language | Family        | Filtered Content | Textbooks | Websites | Small-scale Dataset | Total |
|----------|---------------|------------------|-----------|----------|---------------------|-------|
| English  | Indo-European | 6.56             | 4.00      | 0.00     | 0.00                | 10.56 |
| Spanish  | Indo-European | 3.98             | 0.31      | 0.05     | 0.02                | 4.35  |
| French   | Indo-European | 1.90             | 0.02      | 0.00     | 0.17                | 2.10  |
| Russian  | Indo-European | 1.29             | 0.40      | 0.00     | 0.00                | 1.69  |
| Chinese  | Sino-Tibetan  | 3.34             | 1.21      | 0.00     | 0.19                | 4.74  |
| Japanese | Japonic       | 1.93             | 0.00      | 0.10     | 0.01                | 2.05  |
| Arabic   | Afro-Asiatic  | 0.64             | 0.00      | 0.00     | 0.00                | 0.64  |
| German   | Indo-European | 1.54             | 0.00      | 0.00     | 0.00                | 1.54  |

- The English textbooks are not included in this repo due to copyright issues. For this 4B-token portion of the English corpus, please refer to [PMC-LLaMA](https://github.com/chaoyi-wu/PMC-LLaMA).

You can download the MMedC.zip file to access all the data. The data are saved in txt format, and the zip file contains four folders corresponding to the four types of data sources: filtered content, medical websites, medical textbooks, and small-scale datasets. Please refer to our paper for details.

You can use the following method to obtain the paths to all txt files in the directory. Afterward, you can read these txt files and customize subsequent operations.
```python
import os

txt_root = "PATH/TO/MMEDC"  # root folder of the unzipped MMedC data
txt_paths = []
for root, dirs, files in os.walk(txt_root):
    # Skip any directory whose path contains 'cultural_filtered_data_used'.
    if 'cultural_filtered_data_used' not in root:
        for file in files:
            if file.endswith('.txt'):
                txt_paths.append(os.path.join(root, file))
```

Our [GitHub](https://github.com/MAGIC-AI4Med/MMedLM) provides a data collection pipeline as well as our data preprocessing code.

## News

- [2024.2.21] Our pre-print paper is released on arXiv. Dive into our findings [here](https://arxiv.org/abs/2402.13963).
- [2024.2.20] We release [MMedLM](https://huggingface.co/Henrychur/MMedLM) and [MMedLM 2](https://huggingface.co/Henrychur/MMedLM2). With auto-regressive continued training on MMedC, these models achieve superior performance compared to all other open-source models, even rivaling GPT-4 on MMedBench.
- [2023.2.20] We release [MMedC](https://huggingface.co/datasets/Henrychur/MMedC), a multilingual medical corpus containing 25.5B tokens.
- [2023.2.20] We release [MMedBench](https://huggingface.co/datasets/Henrychur/MMedBench), a new multilingual medical multi-choice question-answering benchmark with rationale. Check out the leaderboard [here](https://henrychur.github.io/MultilingualMedQA/).

## Evaluation on MMedBench

The further-pretrained MMedLM 2 shows strong performance in the medical domain across different languages.

| Method          | Size | Year    | MMedC | MMedBench | English | Chinese | Japanese | French | Russian | Spanish | Avg.  |
|-----------------|------|---------|-------|-----------|---------|---------|----------|--------|---------|---------|-------|
| GPT-3.5         | -    | 2022.12 | ✗     | ✗         | 56.88   | 52.29   | 34.63    | 32.48  | 66.36   | 66.06   | 51.47 |
| GPT-4           | -    | 2023.3  | ✗     | ✗         | 78.00   | 75.07   | 72.91    | 56.59  | 83.62   | 85.67   | 74.27 |
| Gemini-1.0 pro  | -    | 2024.1  | ✗     | ✗         | 53.73   | 60.19   | 44.22    | 29.90  | 73.44   | 69.69   | 55.20 |
| BLOOMZ          | 7B   | 2023.5  | ✗     | trainset  | 43.28   | 58.06   | 32.66    | 26.37  | 62.89   | 47.34   | 45.10 |
| InternLM        | 7B   | 2023.7  | ✗     | trainset  | 44.07   | 64.62   | 37.19    | 24.92  | 58.20   | 44.97   | 45.67 |
| Llama 2         | 7B   | 2023.7  | ✗     | trainset  | 43.36   | 50.29   | 25.13    | 20.90  | 66.80   | 47.10   | 42.26 |
| MedAlpaca       | 7B   | 2023.3  | ✗     | trainset  | 46.74   | 44.80   | 29.64    | 21.06  | 59.38   | 45.00   | 41.11 |
| ChatDoctor      | 7B   | 2023.4  | ✗     | trainset  | 43.52   | 43.26   | 25.63    | 18.81  | 62.50   | 43.44   | 39.53 |
| PMC-LLaMA       | 7B   | 2023.4  | ✗     | trainset  | 47.53   | 42.44   | 24.12    | 20.74  | 62.11   | 43.29   | 40.04 |
| Mistral         | 7B   | 2023.10 | ✗     | trainset  | 61.74   | 71.10   | 44.72    | 48.71  | 74.22   | 63.86   | 60.73 |
| InternLM 2      | 7B   | 2024.2  | ✗     | trainset  | 57.27   | 77.55   | 47.74    | 41.00  | 68.36   | 59.59   | 58.59 |
| MMedLM (Ours)   | 7B   | -       | ✗     | trainset  | 49.88   | 70.49   | 46.23    | 36.66  | 72.27   | 54.52   | 55.01 |
| MMedLM 2 (Ours) | 7B   | -       | ✗     | trainset  | 61.74   | 80.01   | 61.81    | 52.09  | 80.47   | 67.65   | 67.30 |

- GPT and Gemini models are evaluated in a zero-shot setting through their APIs.
- Open-source models are first trained on the trainset of MMedBench before evaluation.

## Contact

If you have any questions, please feel free to contact qiupengcheng@pjlab.org.cn.
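## Example: iterating over the corpus files

The Introduction says the txt files can be read for custom downstream processing once their paths are collected. The following is a minimal sketch of one way to do that, not the authors' pipeline. The helper names `list_txt_paths`, `iter_lines`, and `count_characters` are hypothetical, and the UTF-8 encoding is an assumption, since the file encoding is not stated here.

```python
import os
from typing import Iterator, List


def list_txt_paths(txt_root: str) -> List[str]:
    """Collect .txt paths under txt_root, skipping 'cultural_filtered_data_used'."""
    txt_paths = []
    for root, _dirs, files in os.walk(txt_root):
        if "cultural_filtered_data_used" in root:
            continue
        for name in sorted(files):
            if name.endswith(".txt"):
                txt_paths.append(os.path.join(root, name))
    return txt_paths


def iter_lines(txt_paths: List[str], encoding: str = "utf-8") -> Iterator[str]:
    """Yield lines one at a time so large files are never fully loaded."""
    for path in txt_paths:
        # Assumption: files are UTF-8; undecodable bytes are replaced, not raised.
        with open(path, "r", encoding=encoding, errors="replace") as f:
            for line in f:
                yield line


def count_characters(txt_root: str) -> int:
    """Total number of characters across all selected .txt files."""
    return sum(len(line) for line in iter_lines(list_txt_paths(txt_root)))
```

Calling `count_characters("PATH/TO/MMEDC")` on the unzipped folder gives a quick sanity check that the download is complete. Replacing `len(line)` with a tokenizer call would give a token count instead.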
## Citation

```
@misc{qiu2024building,
      title={Towards Building Multilingual Language Model for Medicine},
      author={Pengcheng Qiu and Chaoyi Wu and Xiaoman Zhang and Weixiong Lin and Haicheng Wang and Ya Zhang and Yanfeng Wang and Weidi Xie},
      year={2024},
      eprint={2402.13963},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Use cases: