
MolGrapher-Synthetic-300K

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-01-25.
Download link:
https://modelscope.cn/datasets/ds4sd/MolGrapher-Synthetic-300K
Description:
# MolGrapher-Synthetic-300K

MolGrapher-Synthetic-300K is the synthetic dataset introduced in [MolGrapher: Graph-based Visual Recognition of Chemical Structures](https://github.com/DS4SD/MolGrapher). The dataset is built from molecule SMILES retrieved from the PubChem database. Training images are then generated from the SMILES using the molecule drawing library RDKit.

The synthetic training set is augmented at multiple levels:

- Molecule level: Molecules are randomly transformed by: (1) displaying explicit hydrogens, (2) reducing the size of bonds connected to explicit hydrogens, (3) displaying explicit methyls, (4) displaying explicit carbons, (5) selecting a molecular conformation, (6) removing implicit hydrogens of atom labels, (7) rotating triple bonds, (8) displaying explicit carbons connected to triple bonds, adding artificial superatom groups with (9) single or (10) multiple attachment points, (11) displaying wedge bonds using solid or dashed bonds, and (12) displaying single bonds as wavy bonds.
- Rendering level: The rendering parameters used in RDKit are randomly set: (1) the bond width, (2) the font, (3) the font size, (4) the atom label padding, (5) the molecule rotation, which does not rotate atom labels, (6) the display of atom indices and (7) their font size, (8) the hand-drawing style, (9) the charge positions, (10) the display of encircled charges and (11) their size, and (12) the display of aromatic cycles using circles.

Together with the training images, we generate the graph ground truth, i.e., the graph connectivity, the atom and bond labels, and their positions. RDKit allows metadata to be embedded in the generated image; this metadata stores the mapping between atom indices and their positions in the image. At the same time, we store a MolFile containing the graph connectivity information and the class of each atom and bond. By combining the two, the graph ground truth can be re-created.

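The combination of MolFile connectivity and atom positions described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the dataset's own reader: it assumes the MolFile is in V2000 fixed-width format, and that the atom-index-to-position mapping has already been extracted from the image metadata into a dictionary. The function names, the `ATOM_POSITIONS` dictionary, and the ethanol sample with its pixel coordinates are all hypothetical.

```python
def parse_molfile_v2000(molblock):
    """Return (atom_symbols, bonds) parsed from a V2000 MolFile string.

    Bonds are (begin, end, order) tuples using 0-based atom indices.
    """
    lines = molblock.splitlines()
    counts = lines[3]  # fourth line: fixed-width counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    # The atom symbol occupies columns 32-34 (1-based) of each atom line.
    symbols = [line[31:34].strip() for line in atom_lines]
    # MolFile atom numbers are 1-based, so shift them to 0-based indices.
    bonds = [
        (int(line[0:3]) - 1, int(line[3:6]) - 1, int(line[6:9]))
        for line in bond_lines
    ]
    return symbols, bonds


def build_graph(molblock, positions):
    """Merge MolFile labels and connectivity with per-atom image positions."""
    symbols, bonds = parse_molfile_v2000(molblock)
    nodes = [
        {"index": i, "label": symbol, "position": positions[i]}
        for i, symbol in enumerate(symbols)
    ]
    edges = [{"begin": b, "end": e, "order": order} for b, e, order in bonds]
    return {"nodes": nodes, "edges": edges}


# Hypothetical sample: ethanol (SMILES "CCO") with made-up pixel positions.
ETHANOL_MOLBLOCK = "\n".join([
    "ethanol",
    "  hypothetical example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])
ATOM_POSITIONS = {0: (120, 260), 1: (256, 260), 2: (324, 142)}

graph = build_graph(ETHANOL_MOLBLOCK, ATOM_POSITIONS)
for node in graph["nodes"]:
    print(node)
for edge in graph["edges"]:
    print(edge)
```

Keeping the parser separate from `build_graph` means the position source can be swapped without touching the MolFile handling. How the positions are actually stored in the image metadata is not specified here.
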
See [dataset_explorer.ipynb](https://huggingface.co/datasets/ds4sd/MolGrapher-Synthetic-300K/blob/main/dataset_explorer.ipynb) for examples of reading samples.
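
The rendering-level randomization listed in the description amounts to drawing one value per parameter for every generated image. The sketch below shows that idea with the standard library only. Every option name, candidate value, range, and probability in it is a hypothetical placeholder, since the description does not state the actual settings, and mapping the sampled values onto RDKit drawing options is deliberately left out.

```python
import random

# All option names, candidate values, ranges, and probabilities below are
# hypothetical placeholders; the dataset description does not state them.
FONTS = ["font_a", "font_b", "font_c"]
CHARGE_POSITIONS = ["right", "top_right"]


def sample_rendering_params(rng):
    """Draw one random setting for each of the 12 rendering parameters."""
    show_indices = rng.random() < 0.1
    encircle_charges = rng.random() < 0.1
    return {
        "bond_width": rng.randint(1, 4),
        "font": rng.choice(FONTS),
        "font_size": rng.randint(12, 30),
        "atom_label_padding": rng.uniform(0.0, 0.3),
        # Rotation applies to the molecule; atom labels stay upright.
        "rotation_degrees": rng.uniform(0.0, 360.0),
        "show_atom_indices": show_indices,
        # Dependent parameters are only sampled when their feature is on.
        "atom_index_font_size": rng.randint(8, 14) if show_indices else None,
        "hand_drawn": rng.random() < 0.1,
        "charge_position": rng.choice(CHARGE_POSITIONS),
        "encircle_charges": encircle_charges,
        "encircled_charge_size": rng.uniform(0.5, 1.5) if encircle_charges else None,
        "aromatic_circles": rng.random() < 0.1,
    }


# A seeded generator makes each image's parameters reproducible.
params = sample_rendering_params(random.Random(0))
for name, value in params.items():
    print(f"{name}: {value}")
```

Passing the random generator in explicitly, instead of using the module-level functions, lets each sample be regenerated from its seed.
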

Provider:
maas
Created:
2025-01-20