PhDat
Source: DataCite Commons. Updated 2025-10-24; indexed 2025-09-08.
Download link:
https://figshare.com/articles/dataset/PhDat/29071202
Description:
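**PhDat: a dataset of phase behaviour for liquids. As of October 2025 it is limited to data for surfactant/water binary mixtures.**
The data is provided as a JSON file. Once loaded, the data for any of the included surfactants can be retrieved by record index or by SMILES string.
The dataset is built from a grid of sample points extracted from each phase diagram image, using the composition and temperature ranges of the diagram and a specified grid resolution. PhDat currently uses a common resolution of 1 °C and 1 wt %. Each sample point is assigned a probability of being in a particular phase state $i$ according to $P(i) = e^{-d(i)/2}$, where $d(i)$ is the minimum distance from the sample point to phase $i$.
If a sample point lies inside phase $i$, its distance to that phase is zero, so $P(i) = 1$. If a sample point lies on a phase boundary, it is equally likely to be in either of the adjacent phases. Finally, for each sample point, all probabilities below the threshold of $10^{-3}$ are set to zero for simplicity and the remaining probabilities are normalised to sum to one. The result is a final output matrix, and therefore a JSON file, in which each temperature and composition point is associated with a vector of probabilities, one for each phase state found in the phase diagram.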
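The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build PhDat: the function name, the dictionary input and the example distances are all assumptions, and the description does not say in which units $d(i)$ is measured, so the distances are treated here as plain numbers.

```python
import math

def phase_probabilities(distances, threshold=1e-3):
    """Turn {phase label: d(i)} into normalised phase probabilities.

    `distances` and `threshold` are illustrative names, not PhDat identifiers.
    """
    # P(i) = exp(-d(i) / 2); a point inside phase i has d(i) = 0, so P(i) = 1.
    raw = {phase: math.exp(-d / 2.0) for phase, d in distances.items()}
    # Probabilities below the threshold are set to zero.
    kept = {phase: (p if p >= threshold else 0.0) for phase, p in raw.items()}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("every probability fell below the threshold")
    # Renormalise so the remaining probabilities sum to one.
    return {phase: p / total for phase, p in kept.items()}

# Made-up distances: a point inside L1, and a point on the L1/H1 boundary.
inside = phase_probabilities({"L1": 0.0, "H1": 2.0, "La": 20.0})
boundary = phase_probabilities({"L1": 0.0, "H1": 0.0, "La": 20.0})
print(inside)
print(boundary)
```

In the boundary case both adjacent phases have distance zero, so each ends up with probability 0.5, while the distant phase falls below the threshold and is set to zero.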
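**Phase states**
A total of 118 unique phase states (covering both one- and two-phase regions) have been identified; each defines the phase present in a region of a phase diagram. In single-phase regions these are entries such as Isotropic ($L_1$), Hexagonal ($H_1$) and Lamellar ($L_\alpha$). For two-phase regions, a phase state such as $W+L_1$ describes a region where water and the $L_1$ phase coexist in a phase-separated state.
**Single-phase region label descriptions**
$W$ - Water or sub-micellar solution
$E$ - Ice
$L_1, L_2$ - Isotropic micellar solutions (normal and reversed)
$H_1, H_2$ - Hexagonal phases (normal and reversed)
$I_1, I_2$ - Cubic micellar phases
$V_1, V_2, V_{2i}, V_{2p}$ - Bicontinuous cubic phases
$L_\alpha, L_\beta, L_{\alpha 1}, L_{\alpha 2}$ (labels La, Lb, La1, La2) - Lamellar phases, $L_\alpha$ and $L_\beta$ (liquid and gel)
$X, X_1, X_2$ - Solid surfactant phases (differing by hydration state)
$L_3$ - Sponge phase
$N_1$ - Nematic liquid of rod- or worm-like micelles
$M_1$ - 2D monoclinic phase
$P_b$ - Hydrated bilayer-based rippled phase
$S, S_a, S_b$ - Non-crystalline solid
$T_1$ - Tetragonal phase
$U$ - Unmeasured, unknown and/or unclear region or phase state
**Two-phase region label descriptions**
Two-phase region labels are composed of the single-phase labels. A two-phase coexistence region is indicated by a `+` between the corresponding phases, with the phase labels ordered alphabetically for convenience (e.g. $W+X_1$ rather than $X_1+W$). Not all combinations occur, since two-phase regions must lie between single-phase regions, and in any given phase diagram the phase sequence is strictly ordered.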
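Because two-phase labels are built from single-phase labels joined by `+`, they are easy to take apart and rebuild. The sketch below is one hedged way to do so: both helper names are hypothetical, and plain Python string sorting is only an assumed reading of "ordered alphabetically", so the labels stored in the dataset remain authoritative.

```python
def split_label(label):
    """Return the single-phase labels that make up a phase-state label."""
    return [part.strip() for part in label.split("+")]

def canonical_label(label):
    """Rebuild a label with its components in sorted order (assumed convention)."""
    return "+".join(sorted(split_label(label)))

print(split_label("H1"))        # a single-phase label has one component
print(canonical_label("X1+W"))  # reordered to W+X1, as in the example above
```

Splitting labels this way is useful, for instance, when summing the probabilities of every phase state in which a given single phase takes part.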
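**Data set structure**
The JSON file is structured as a list of records, indexed by a data record entry number. Each record contains data from one unique source, organised as a dictionary comprising:
- the SMILES string
- the state of the diagram (complete, or incomplete if some areas are unknown)
- the name of the chemical compound
- the source (e.g. the citation reference to the paper) and its figure location in the source (e.g. the figure number or page number)
- the purity of the chemical (if given)
- the measurement methodology (if given)
- the surfactant type (non-ionic, cationic, anionic, zwitterionic or mixed)
- the solvent (water in all cases)
- the labels used in the original source and the labels assigned in this dataset
- the keys for the data (header names)
- the values (phase state probabilities) as a list for all data keys
Composition is always given as wt % (weight percent) of the molecule, such that 0 wt % is pure solvent (water in the current PhDat release) and 100 wt % is the pure molecule of interest (surfactants in the current release). Reading each column of the value list alongside the data keys therefore gives complete information on each discretised point of the diagram: its composition, its temperature and the probability (as a fraction) of each phase state. Note that the same compound can have multiple records if more than one source exists for its phase diagram, so SMILES strings should not be assumed to be unique.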
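The record layout described above suggests the following access pattern, shown here as a sketch only. The field names (`smiles`, `keys`, `values`), the column-per-key layout of `values` and the toy records with made-up numbers are all assumptions for illustration; check the real JSON file or the DataExplorer for the actual names before reusing this.

```python
from collections import defaultdict

# Hypothetical field names; replace with the keys used in the real file.
SMILES_FIELD, KEYS_FIELD, VALUES_FIELD = "smiles", "keys", "values"

# Toy records with made-up values, standing in for the loaded JSON list.
records = [
    {"smiles": "CCCCCCCCOCCO", "keys": ["composition", "temperature", "L1", "H1"],
     "values": [[10, 11], [25, 25], [1.0, 0.7], [0.0, 0.3]]},
    {"smiles": "CCCCCCCCCCOCCO", "keys": ["composition", "temperature", "L1"],
     "values": [[5], [30], [1.0]]},
    {"smiles": "CCCCCCCCOCCO", "keys": ["composition", "temperature", "L1"],
     "values": [[50], [40], [1.0]]},
]

def index_by_smiles(records):
    """Map each SMILES string to ALL record indices that carry it."""
    index = defaultdict(list)
    for record_id, record in enumerate(records):
        index[record[SMILES_FIELD]].append(record_id)
    return dict(index)

def iter_points(record):
    """Yield one {key: value} dict per discretised point of the diagram."""
    # Assumes one column of values per data key, all of equal length.
    for row in zip(*record[VALUES_FIELD]):
        yield dict(zip(record[KEYS_FIELD], row))

by_smiles = index_by_smiles(records)
points = list(iter_points(records[0]))
print(by_smiles["CCCCCCCCOCCO"])
print(points[1])
```

Mapping each SMILES string to a list of indices, rather than to a single record, reflects the note that one compound may appear in several records from different sources.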
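**Further details**
PhDat has been developed by the STFC Hartree Centre and is made available under the CC-BY 4.0 license. The phase diagrams were digitised using CurveClaw, a bespoke program for the semi-automated extraction of phase diagram data into digital (numerical) form. CurveClaw is available under the BSD 2-clause license from GitHub.
A bespoke `DataExplorer` has been created for PhDat. It demonstrates how to extract data from the database or interrogate its contents, and is also available from GitHub.
Further details of the data collection process can be found in the associated publication contained within this project.
The authors of PhDat are happy to receive feedback on the data set, as well as additional data, which will be incorporated into the main data set after evaluation. Feedback or additional data can be sent to felix.rummel@stfc.ac.uk and richard.anderson@stfc.ac.uk.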
Provider:
figshare
Created:
2025-06-03



