Process Understandability for DFG Notation

Name: Process Understandability for DFG Notation
Creator: Mendeley Data
Published: 2025-05-01 05:24:22
License: No description available

Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-05-17.

Download link:

https://data.mendeley.com/datasets/2247f6kygy

Description:

This data set consists of 3000 JPEG files and an MS Excel file that contains the attributes of each JPEG file. It was prepared to measure process understandability and structuredness for diagrams in Directly-follows Graph (DFG) notation. The main input for creating a DFG is a transition matrix built from event logs. Disregarding the number of times a loop between nodes repeats, and including self-loops, there can be at most n^2 distinct process diagrams with n nodes. For a 10-step process, the number of all possible process diagrams is therefore 100 under these conditions. From a practical point of view, a diagram with fewer than 3 nodes is usually not considered a process, and creating process diagrams with more than 43 nodes causes system performance problems, so the number of nodes in the data set was restricted to the range 3 to 43. The complete set therefore contains 27429 diagrams (the sum of n^2 for n from 3 to 43). Among all these possible process diagrams, 3000 were randomly generated with Python. The visual data set sample corresponds to 10.94% of the possible diagrams in the selected range and can represent all possible process diagrams in the defined universe. For simplicity, all edges have the same thickness, activity names are standardized (Activity1, Activity2, etc.), artificial start / end nodes are not included, and diagrams are drawn from left to right. Starting from 3 nodes, the diagram generation algorithm determined the number of arcs and created the transition matrix randomly. A process diagram was then created from this information, and the numbers of nodes, arcs and self-loops were saved in the MS Excel spreadsheet.
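The universe size and sample share quoted above can be checked with a few lines of Python. This is only an arithmetic sketch that takes the description's stated bound of n^2 diagrams per node count at face value; the function name `universe_size` is hypothetical and not part of the data set.

```python
def universe_size(min_nodes: int = 3, max_nodes: int = 43) -> int:
    """Sum the stated bound of n^2 diagrams over every node count in the range."""
    return sum(n * n for n in range(min_nodes, max_nodes + 1))


total = universe_size()            # 27429 for the range 3..43
sample_share = 3000 / total * 100  # percentage covered by the 3000 sampled diagrams

print(total)                   # 27429
print(round(sample_share, 2))  # 10.94
```

Calling `universe_size(10, 10)` returns 100, which matches the 10-step example in the description.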
From the recorded numbers of nodes, arcs and self-loops, the following columns were obtained:
- Number of Nodes (Size)
- Number of Arcs
- Total Number of Elements (including all nodes and arcs)
- Number of Self Loops (arcs starting and ending at the same node)
- % of All Possible Behaviors (Density)
- Arcs per Node (CNC: Coefficient of Network Connectivity)
- Arcs per Node Excluding Self Loops (CNCX: Coefficient of Network Connectivity Excluding Self Loops)
- Logarithm of Arcs per Node (LogCNC)
- Logarithm of Arcs per Node Excluding Self Loops (LogCNCX)

In addition to the input variables above, a process expert made an initial evaluation of whether a given diagram is structured or not; the result is given in the Classification column. In the evaluation phase, together with the adjectives structured / unstructured, simple / complex, spaghetti-like, easy / hard to understand and the metrics used in the literature, 3 other guiding criteria were applied:
- Is it possible to follow the flow and read it as a process diagram?
- What would be the effort to make the process diagram more structured?
- Would this process diagram be an acceptable output for a customer?

The Classification column is the output variable on a 0-1 scale, where 0 is structured and 1 is unstructured.
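The generation step and the derived columns can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the boolean matrix representation, uniform sampling of arcs, density computed as arcs divided by n^2, and the use of the natural logarithm are all assumptions made for illustration, since the description does not specify them.

```python
import math
import random


def random_transition_matrix(n_nodes, n_arcs, seed=None):
    """Hypothetical generator: an n x n boolean matrix with exactly n_arcs arcs.

    Cell [i][j] is True when an arc runs from node i to node j; diagonal
    cells are self-loops. Uniform sampling is an assumption.
    """
    if not 0 <= n_arcs <= n_nodes * n_nodes:
        raise ValueError("n_arcs must be between 0 and n_nodes ** 2")
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_nodes) for j in range(n_nodes)]
    chosen = set(rng.sample(cells, n_arcs))
    return [[(i, j) in chosen for j in range(n_nodes)] for i in range(n_nodes)]


def dfg_metrics(matrix):
    """Derive the spreadsheet-style columns from a boolean transition matrix."""
    n = len(matrix)
    arcs = sum(cell for row in matrix for cell in row)
    self_loops = sum(matrix[i][i] for i in range(n))
    cnc = arcs / n                 # arcs per node
    cncx = (arcs - self_loops) / n  # arcs per node, self-loops excluded
    return {
        "nodes": n,
        "arcs": arcs,
        "total_elements": n + arcs,
        "self_loops": self_loops,
        "density_pct": 100 * arcs / (n * n),  # assumed: share of the n^2 possible arcs
        "cnc": cnc,
        "cncx": cncx,
        # Natural log is an assumption; the logarithm is undefined when the ratio is 0.
        "log_cnc": math.log(cnc) if cnc > 0 else None,
        "log_cncx": math.log(cncx) if cncx > 0 else None,
    }


# A 3-node example: arcs 0->1, 1->2, 2->0 and a self-loop on node 2.
example = [
    [False, True, False],
    [False, False, True],
    [True, False, True],
]
print(dfg_metrics(example))
print(dfg_metrics(random_transition_matrix(5, 7, seed=1))["arcs"])
```

For the 3-node example the sketch reports 4 arcs, 1 self-loop and 7 total elements, so CNC is 4/3 and CNCX is 1. The values in the actual spreadsheet may differ if the authors used another logarithm base or density definition.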

Provider: Mendeley Data
Created: 2024-11-19
