five

22yuan/MMedC

收藏
Hugging Face2026-01-26 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/22yuan/MMedC
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-nc-sa-4.0 language: - en - zh - ja - fr - ru - es - ar - de tags: - medical size_categories: - 10B<n<100B --- # MMedC [💻Github Repo](https://github.com/MAGIC-AI4Med/MMedLM) [🖨️arXiv Paper](https://arxiv.org/abs/2402.13963) The official pre-training dataset for "Towards Building Multilingual Language Model for Medicine". ## News - We add Arabic and German corpus to MMedC. ## Introduction This repo contains MMedC, a multilingual medical corpus with 25.5 billion tokens. | Language | Family | Filtering Content | Textbooks | Websites | Small-scale Dataset | TotAmt | |-----------|---------------|-------------------|-----------|----------|---------------------|--------| | English | Indo-European | 6.56 | 4.00 | 0.00 | 0.00 | 10.56 | | Spanish | Indo-European | 3.98 | 0.31 | 0.05 | 0.02 | 4.35 | | French | Indo-European | 1.90 | 0.02 | 0.00 | 0.17 | 2.10 | | Russian | Indo-European | 1.29 | 0.40 | 0.00 | 0.00 | 1.69 | | Chinese | Sino-Tibetan | 3.34 | 1.21 | 0.00 | 0.19 | 4.74 | | Japaneses | Sino-Tibetan | 1.93 | 0.00 | 0.10 | 0.01 | 2.05 | | Arabic | Afro-Asiatic | 0.64 | 0.00 | 0.00 | 0.00 | 0.64 | | German | Indo-European | 1.54 | 0.00 | 0.00 | 0.00 | 1.54 | - English Textbooks is not included in this repo due to copyright issues. For this part of 4B English corpus, please refer to [PMC-LLaMA](https://github.com/chaoyi-wu/PMC-LLaMA) You can download the MMedC.zip file to access all the data. The data are saved in txt format, and the zip file contains four folders corresponding to four types of data sources: filtering content, medical websites, medical textbooks, and small-scale datasets. Please refer to our paper for details. You can use the following method to obtain the paths to all txt files in the directory. Afterward, you can read these txt files and customize subsequent operations. ```python import os txt_root = "PATH/TO/MMEDC" txt_paths = [] for root, dirs, files in os.walk(txt_root): if 'cultural_filtered_data_used' not in root: for file in files: if file.endswith('.txt'): txt_paths.append(os.path.join(root, file)) ``` Our [GitHub](https://github.com/MAGIC-AI4Med/MMedLM) provides a data collection pipeline as well as our data preprocessing code. ## News [2024.2.21] Our pre-print paper is released ArXiv. Dive into our findings [here](https://arxiv.org/abs/2402.13963). [2024.2.20] We release [MMedLM](https://huggingface.co/Henrychur/MMedLM) and [MMedLM 2](https://huggingface.co/Henrychur/MMedLM2). With an auto-regressive continues training on MMedC, these models achieves superior performance compared to all other open-source models, even rivaling GPT-4 on MMedBench. [2023.2.20] We release [MMedC](https://huggingface.co/datasets/Henrychur/MMedC), a multilingual medical corpus containing 25.5B tokens. [2023.2.20] We release [MMedBench](https://huggingface.co/datasets/Henrychur/MMedBench), a new multilingual medical multi-choice question-answering benchmark with rationale. Check out the leaderboard [here](https://henrychur.github.io/MultilingualMedQA/). ## Evaluation on MMedBench The further pretrained MMedLM 2 showcast it's great performance in medical domain across different language. | Method | Size | Year | MMedC | MMedBench | English | Chinese | Japanese | French | Russian | Spanish | Avg. | |------------------|------|---------|-----------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------| | GPT-3.5 | - | 2022.12 | &#10007; | &#10007; | 56.88 | 52.29 | 34.63 | 32.48 | 66.36 | 66.06 | 51.47 | | GPT-4 | - | 2023.3 | &#10007; | &#10007; | 78.00 | 75.07 | 72.91 | 56.59 | 83.62 | 85.67 | 74.27 | | Gemini-1.0 pro | - | 2024.1 | &#10007; | &#10007; | 53.73 | 60.19 | 44.22 | 29.90 | 73.44 | 69.69 | 55.20 | | BLOOMZ | 7B | 2023.5 | &#10007; | trainset | 43.28 | 58.06 | 32.66 | 26.37 | 62.89 | 47.34 | 45.10 | | InternLM | 7B | 2023.7 | &#10007; | trainset | 44.07 | 64.62 | 37.19 | 24.92 | 58.20 | 44.97 | 45.67 | | Llama\ 2 | 7B | 2023.7 | &#10007; | trainset | 43.36 | 50.29 | 25.13 | 20.90 | 66.80 | 47.10 | 42.26 | | MedAlpaca | 7B | 2023.3 | &#10007; | trainset | 46.74 | 44.80 | 29.64 | 21.06 | 59.38 | 45.00 | 41.11 | | ChatDoctor | 7B | 2023.4 | &#10007; | trainset | 43.52 | 43.26 | 25.63 | 18.81 | 62.50 | 43.44 | 39.53 | | PMC-LLaMA | 7B | 2023.4 | &#10007; | trainset | 47.53 | 42.44 | 24.12 | 20.74 | 62.11 | 43.29 | 40.04 | | Mistral | 7B | 2023.10 | &#10007; | trainset | 61.74 | 71.10 | 44.72 | 48.71 | 74.22 | 63.86 | 60.73 | | InternLM\ 2 | 7B | 2024.2 | &#10007; | trainset | 57.27 | 77.55 | 47.74 | 41.00 | 68.36 | 59.59 | 58.59 | | MMedLM~(Ours) | 7B | - | &#10007; | trainset | 49.88 | 70.49 | 46.23 | 36.66 | 72.27 | 54.52 | 55.01 | | MMedLM\ 2~(Ours) | 7B | - | &#10007; | trainset | 61.74 | 80.01 | 61.81 | 52.09 | 80.47 | 67.65 | 67.30 | - GPT and Gemini is evluated under zero-shot setting through API - Open-source models first undergo training on the trainset of MMedBench before evaluate. ## Contact If you have any question, please feel free to contact qiupengcheng@pjlab.org.cn. ## Citation ``` @misc{qiu2024building, title={Towards Building Multilingual Language Model for Medicine}, author={Pengcheng Qiu and Chaoyi Wu and Xiaoman Zhang and Weixiong Lin and Haicheng Wang and Ya Zhang and Yanfeng Wang and Weidi Xie}, year={2024}, eprint={2402.13963}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```
提供机构:
22yuan
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作