five

molecular solubility datasets

DataCite Commons | Updated 2023-05-30 | Indexed 2025-04-16
Download link:
https://ieee-dataport.org/documents/molecular-solubility-datasets
Description:
We collated small-molecule solubility data from a range of databases and literature sources. Some of these sources provided only molecular names, without SMILES notation. For those, we retrieved the SMILES strings and molecular weights from PubChem, DrugBank, https://www.wikiarabic.org/, and https://www.sigmaaldrich.com/US/en. For data selection, Canonical SMILES was preferred over Isomeric SMILES when both were available. For data retention, we kept only experimental results measured at temperatures between 20 and 25 degrees Celsius, and we standardized the units to mol/L. Data cleaning involved several steps. We removed entries containing non-numerical data or other errors, as well as drug molecules without identifiable SMILES. We removed duplicate data, that is, entries representing the same molecule, whether recorded with identical SMILES or with different SMILES formats. Among such duplicates, entries whose solubility values differed by more than 0.03 were eliminated; when the difference was 0.03 or less, the mean was computed as the final solubility value. Finally, we removed molecules containing only two atoms, molecules with a molecular weight exceeding 500, and those with logS below -8.
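The filtering and duplicate-merging rules described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `clean_solubility`, the record keys (`smiles`, `temp_c`, `logs`, `mol_weight`, `n_atoms`), and the `max_spread` parameter are all hypothetical. It also assumes that SMILES strings have already been canonicalized upstream, and that the 0.03 threshold is applied to logS values; the description does not state which scale the threshold refers to.

```python
from collections import defaultdict
from statistics import mean


def clean_solubility(records, max_spread=0.03):
    """Apply the described cleaning rules to an iterable of dict records.

    Hypothetical record keys: smiles (assumed already canonical),
    temp_c (degrees Celsius), logs (log10 of solubility in mol/L),
    mol_weight, n_atoms.
    """
    # Keep only measurements taken at 20-25 degrees Celsius,
    # grouped by molecule so duplicates can be reconciled.
    groups = defaultdict(list)
    for rec in records:
        if 20 <= rec["temp_c"] <= 25:
            groups[rec["smiles"]].append(rec)

    cleaned = []
    for smiles, rows in groups.items():
        values = [row["logs"] for row in rows]
        # Assumption: the 0.03 threshold is compared on the logS scale.
        if max(values) - min(values) > max_spread:
            continue  # duplicates disagree too much: drop the molecule
        logs = mean(values)  # otherwise average the duplicates

        first = rows[0]
        # Final exclusions: exactly two atoms, MW above 500, logS below -8.
        if first["n_atoms"] == 2:
            continue
        if first["mol_weight"] > 500:
            continue
        if logs < -8:
            continue

        cleaned.append({
            "smiles": smiles,
            "logs": logs,
            "mol_weight": first["mol_weight"],
            "n_atoms": first["n_atoms"],
        })
    return cleaned
```

Keeping the threshold as a parameter makes it easy to rerun the sketch on the mol/L scale instead, should that turn out to be the intended reading.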

Provider:
IEEE DataPort
Created:
2023-05-30