five

Diverse-Expression Program

Source: Science Data Bank. Updated: 2025-02-25. Indexed: 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=fad6e0981bfb4336945a56d4f60b35cc
Resource description:
Text-guided molecule generation aims to generate molecules that match a given text description. Mainstream methods typically represent molecules as SMILES strings and model them with diffusion models or autoregressive architectures. However, SMILES is a one-to-many representation: a single molecule can be written as many different strings. This diversity forces existing methods to rely on complex model architectures and larger training datasets to improve performance, which reduces the efficiency of training and generation. In this paper, we propose a Text-Guided Diverse-Expression Diffusion (TGDD) model for molecule generation. TGDD combines SMILES and SELFIES into a novel Diverse-Expression molecular representation, enabling precise molecule mapping from natural language. By leveraging this representation, TGDD simplifies the segmented diffusion generation process, achieving faster training and lower memory consumption while also aligning more closely with natural language. TGDD outperforms both TGM-LDM and the autoregressive model MolT5-Base on most evaluation metrics.
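
The one-to-many property of SMILES mentioned in the description can be illustrated with a small sketch. This is not the TGDD representation and not part of the dataset. It is a toy example that assumes a tiny SMILES subset (single-letter atoms, implicit single bonds, parenthesised branches, no rings), and the helper names `parse_acyclic_smiles` and `canonical_key` are hypothetical. In practice a cheminformatics toolkit would handle the full grammar.

```python
def parse_acyclic_smiles(smiles):
    """Parse a toy SMILES subset into (atoms, edges).

    Assumed subset: single uppercase-letter atoms (C, N, O, ...),
    implicit single bonds, and branches in parentheses. No rings,
    charges, aromatic atoms, or explicit bond symbols.
    """
    atoms, edges = [], []
    branch_stack = []   # atoms to return to when a branch closes
    prev = None         # index of the atom the next atom bonds to
    for ch in smiles:
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                edges.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, edges


def canonical_key(atoms, edges):
    """Return a string that is identical for isomorphic labelled trees.

    Each rooted encoding sorts its child encodings, and the smallest
    encoding over all possible roots is taken as the key. This is valid
    only for acyclic molecules, which is all the toy parser produces.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    def encode(node, parent):
        children = sorted(encode(n, node) for n in adjacency[node] if n != parent)
        return atoms[node] + "(" + "".join(children) + ")"

    return min(encode(root, None) for root in adjacency)


# Three different strings that all describe ethanol.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
ethanol_keys = {canonical_key(*parse_acyclic_smiles(s)) for s in ethanol_variants}

# Dimethyl ether has the same atoms but a different structure.
ether_key = canonical_key(*parse_acyclic_smiles("COC"))

print(len(ethanol_keys))          # 1: all three strings map to one molecule
print(ether_key in ethanol_keys)  # False
```

The three ethanol strings collapse to a single key, while dimethyl ether (`COC`) does not. A model trained on raw SMILES sees such equivalent strings as distinct targets, which is the one-to-many mapping the description refers to.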

Provided by:
Hanyu Jiang; Xiangjie Kong; Wenchao Weng; Università degli Studi di Enna Kore
Created:
2025-02-18