Medex
Source: arXiv | Updated: 2025-08-15 | Indexed: 2025-11-27
Download link:
https://hf-mirror.com/datasets/medexanon/Medex
Description:
Medex is a large-scale dataset of medically relevant entities (small molecules, proteins, diseases, genes, and others) and associated information extracted from publicly available or licensable literature. It comprises over 2 million unique entities and over 200 million unique passages, and contains 32.3 million pairs, each linking a natural-language fact to the corresponding entity representation (e.g., a SMILES string or a RefSeq ID). Medex was built using recent advances in large language models (LLMs) and cross-modal language modeling: therapeutic entities are identified in relevant passages, and the information about them is summarized into concise facts, turning unstructured academic literature into labeled pairs of therapeutics-related entities and facts. The dataset is described as well suited to building models with strong prior knowledge, and it can be used to constrain molecular optimization algorithms so that they propose safer and more efficacious molecules.
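To make the pairing described above concrete, here is a minimal Python sketch of how fact and entity-representation pairs might be held in memory and grouped by entity. The `FactPair` class, its field names, and the `group_facts_by_entity` helper are assumptions made for illustration; the actual Medex schema is not given in this listing and may differ. The sample records are toy examples, not rows from the dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FactPair:
    """One (entity representation, fact) pair. Field names are hypothetical."""
    entity_type: str      # e.g. "small_molecule" or "protein"
    representation: str   # e.g. a SMILES string or a RefSeq ID
    fact: str             # concise natural-language fact about the entity


def group_facts_by_entity(pairs):
    """Collect every fact recorded for each distinct entity representation."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.representation].append(pair.fact)
    return dict(grouped)


# Toy records invented for illustration; they are not drawn from Medex.
pairs = [
    FactPair("small_molecule", "CC(=O)OC1=CC=CC=C1C(=O)O",
             "Aspirin irreversibly inhibits cyclooxygenase enzymes."),
    FactPair("small_molecule", "CC(=O)OC1=CC=CC=C1C(=O)O",
             "Aspirin is used as an antiplatelet agent."),
    FactPair("protein", "NP_000537.3",
             "p53 acts as a tumor suppressor."),
]
facts = group_facts_by_entity(pairs)
```

Grouping by representation reflects the counts in the description: with far more pairs than unique entities, a single entity will typically carry many facts.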
Provided by:
University of Pennsylvania
Created:
2025-08-15



