CGsmiles: A Versatile Line Notation for Molecular Representations across Multiple Resolutions

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://figshare.com/articles/dataset/CGsmiles_A_Versatile_Line_Notation_for_Molecular_Representations_across_Multiple_Resolutions/28652625

Description:

Coarse-grained (CG) models simplify molecular representations by grouping multiple atoms into effective particles, enabling faster simulations and reducing the chemical compound space compared to atomistic methods. Additionally, models with chemical specificity, such as Martini, may extrapolate to cases where experimental data are scarce, making CG methods highly promising for high-throughput (HT) screening and chemical space exploration. Yet no rigorous data format exists for the crucial aspect of describing how the atoms are grouped (i.e., the mapping). As CG models advance toward true HT capabilities, the lack of mapping and indexing capabilities for the growing number of CG molecules poses a significant barrier. To address this, we introduce CGsmiles, a versatile line notation inspired by the popular Simplified Molecular Input Line Entry System (SMILES) and BigSMILES. CGsmiles encodes the molecular graph and particle (atom) properties independent of their resolution and incorporates a framework that allows seamless conversion between coarse- and fine-grained resolutions. By specifying fragments that describe how each particle is represented at the next finer resolution (e.g., CG particles to atoms), CGsmiles can represent multiple resolutions and their hierarchical relationships in a single string. In this paper, we present the CGsmiles syntax and analyze a benchmark set of 407 molecules from the Martini force field. We highlight key features missing in existing notations that are essential for accurately describing CG models. To demonstrate the utility of CGsmiles beyond simulations, we construct two simple machine-learning models for predicting partition coefficients, both trained on CGsmiles-indexed data and leveraging information from both CG and atomistic resolutions. Finally, we briefly discuss the applicability of CGsmiles to polymers, which particularly benefit from the multiresolution nature of the notation.

Created:

2025-04-14
