
HDBSCAN clustering and decision tree classification results for atoms and atom types in the extended version of the MATTS2021 data bank

DataCite Commons. Updated: 2026-02-19. Indexed: 2026-05-04
Download link:
https://repod.icm.edu.pl/citation?persistentId=doi:10.18150/6BUP53
Description:
The repository contains datasets and figures from an analysis of atoms and atom types in the new developer version of the MATTS data bank, carried out with machine learning methods (HDBSCAN clustering and decision-tree classification). The data bank file is available on GitHub: https://github.com/discamb-project/MATTS/tree/MATTS2021ext

The data used in the analysis include multipole model parameters calculated for atoms and atom types belonging to one of the topological subgroups, evaluated for the chosen local coordinate system (LCS) types and for the different possible orientations within these LCS types (Rybicka et al., 2026a). The topological subgroups are defined by (a) the number of first neighbors and planarity, which together define a topological group (4n, 3n, 3p, 2p, 1p, 1x), and (b) the chemical element (H, C, N, O, F, P, S, Cl, Br). Thus, the 15 topological subgroups are: 4n-C, 4n-N, 4n-P, 4n-S, 3n-N, 3n-S, 3p-C, 3p-N, 2p-N, 2p-O, 2p-S, 1p-F, 1p-Cl, 1p-Br, and one subgroup combining 1x-H and 1p-H. Files specific to a topological group or subgroup include the group or subgroup name in the filename. The term “1p-halogens” refers to the combined set of 1p-F, 1p-Cl, and 1p-Br. The “1x-H and 1p-H” subgroup was excluded from clustering but included in the classification.

Compared to the published MATTS2021 data bank (Jha et al., 2022; Rybicka et al., 2022), the new version of the data bank is built with an extended set of model molecules and a changed refinement procedure. The multipole model for the model molecules was refined without applying symmetry constraints to the refined parameters, using the def2-TZVP/B3LYP level of theory and Su-Coppens-Macchi radial functions instead of the 6-31G**/B3LYP level of theory and Clementi-Roetti radial functions.
The reasoning behind the adjustment of the basis set and radial functions is explained in a publication by Ignat’ev & Dominiak (2024).

For each subgroup, atom type definitions from the data bank were used, with minor adjustments of the description of the neighbors to assess the correctness of the LCS rotation process. Multipole model parameters for every considered LCS orientation of each atom were generated using the bankMaker utility from the DiSCaMB library (Chodkiewicz et al., 2018), local scripts, and a dataset of refined multipolar models. All symmetries in the atom type definitions were set to “no”, which prevents the enforcement of any symmetry higher than 1 and allows all multipolar functions (all Plm) to be populated. The procedure is the same as described in Rybicka et al. (2026a, 2026b). The resulting sets of multipole model parameters were combined into subgroup- and LCS-type-specific files to be used in clustering and classification.

HDBSCAN clustering was applied to identify groups of atoms with similar multipole parameter distributions, and the results were evaluated by checking cluster assignments, statistics, atom type co-occurrence within clusters, and cluster validation metrics. Results from this analysis are available as *.ods files in the /Clustering-analysis directory. Decision tree classification was used to evaluate the separability of atom types based on selected multipole features, with interpretation including per-class performance metrics, confusion matrices, and evaluation of the classification predictions. The confusion matrices and textual representations of the decision trees obtained in this analysis are available as *.png files in the /Confusion-matrices directory and *.txt files in the /Decision-trees directory.

Each symmetry, together with its associated LCS orientation, leads to a specific set of multipolar functions that should vanish, i.e. their populations (Plm values) should be equal to zero (Kurki-Suonio, 1977). Electron density pseudosymmetry was assigned to each obtained cluster and to the instances of a given atom type within a cluster. The pseudosymmetry was assigned based only on multipole parameters, using symmetry selection rules (Kurki-Suonio, 1977) and comparing the mean and sample standard deviation (ssd) of Plm with a zero-value threshold specific to each topological subgroup, to establish which Plm can be considered effectively zero. The procedure for finding the zero-value thresholds is described in Rybicka et al. (2026a).

The READ_ME.odt file provides a structured representation of the contents of each folder and a detailed description of the deposited files.

References:
Chodkiewicz, M. L., Migacz, S., Rudnicki, W., Makal, A., Kalinowski, J. A., Moriarty, N. W., Grosse-Kunstleve, R. W., Afonine, P. V., Adams, P. D. & Dominiak, P. M. (2018). J. Appl. Crystallogr. 51, 193–199.
Ignat’ev, V. & Dominiak, P. M. (2024). J. Appl. Crystallogr. 57, 1884–1895. https://doi.org/10.1107/S1600576724009841
Jha, K. K., Gruza, B., Sypko, A., Kumar, P., Chodkiewicz, M. L. & Dominiak, P. M. (2022). J. Chem. Inf. Model. 62, 3752–3765.
Kurki-Suonio, K. (1977). Isr. J. Chem. 16, 115–123.
Rybicka, P. M., Kulik, M., Chodkiewicz, M. L. & Dominiak, P. M. (2022). J. Chem. Inf. Model. 62, 3766–3783.
Rybicka, P. M., Kulik, M., Ignat’ev, V. & Dominiak, P. M. (2026a). ChemRxiv. January. https://doi.org/10.26434/chemrxiv.10001488
Rybicka, P., Kulik, M., Ignat’ev, V. & Dominiak, P. (2026b). Datasets and scripts for pseudosymmetry and local coordinate system analysis of atoms and atom types in MATTS2021. RepOD, V1. https://doi.org/10.18150/1MEFPJ
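
The pseudosymmetry assignment described above can be illustrated with a minimal Python sketch. It is not the deposited code. It assumes that a Plm counts as effectively zero when both the absolute mean and the sample standard deviation over a cluster stay within the threshold; the exact criterion is the one given in Rybicka et al. (2026a). Only one selection rule is shown, for a mirror plane perpendicular to the local z axis: a spherical harmonic changes sign under z -> -z when l + |m| is odd, so those Plm must vanish. The function names, the Plm values, and the threshold are hypothetical.

```python
from statistics import mean, stdev


def effectively_zero(values, threshold):
    """Assumed criterion: |mean| and sample standard deviation are both within the threshold."""
    return abs(mean(values)) <= threshold and stdev(values) <= threshold


def must_vanish_mirror_perp_z(l, m):
    """For a mirror plane perpendicular to z, functions with odd l + |m| must vanish."""
    return (l + abs(m)) % 2 == 1


def compatible_with_mirror(plm_samples, threshold):
    """plm_samples maps (l, m) to the Plm values of all atoms in one cluster.

    A signed m stands for the +/- component of the real spherical harmonic.
    """
    return all(
        effectively_zero(values, threshold)
        for (l, m), values in plm_samples.items()
        if must_vanish_mirror_perp_z(l, m)
    )


# Hypothetical cluster of three atoms (illustrative numbers, not taken from the data bank).
cluster = {
    (1, 0): [0.002, -0.001, 0.001],  # l + |m| odd: must vanish
    (1, 1): [0.080, 0.075, 0.082],   # l + |m| even: allowed
    (2, 0): [0.050, 0.048, 0.052],   # l + |m| even: allowed
    (2, 1): [0.001, 0.000, -0.002],  # l + |m| odd: must vanish
}
THRESHOLD = 0.01  # hypothetical zero-value threshold for one topological subgroup

is_mirror_compatible = compatible_with_mirror(cluster, THRESHOLD)
```

Other point-group symmetries would need their own sets of vanishing (l, m) pairs from the selection rules of Kurki-Suonio (1977); the threshold comparison itself would stay the same under the assumption stated above.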
Provider:
RepOD
Created:
2026-01-27