
Public Biomedical Dataset

Source: arXiv, indexed 2025-09-30
Download link:
https://huggingface.co/facebook/wmt19-en-de
Description:
This dataset is built on approximately 16,000 German abstracts retrieved from PubMed and is extended with English-translated PubMed abstracts and clinical notes from the MIMIC-III database. To ensure translation quality, a physician evaluated the output of the candidate translation models, and the best-performing model was selected. After translation, the dataset contains approximately 45 million documents. It is intended for pre-training language models.
Provided by:
PubMed, MIMIC-III