
Synthetic Event Streams

Indexed from ieee-dataport.org on 2025-01-21
Download link:
https://ieee-dataport.org/open-access/synthetic-event-streams
Resource description:
Overview
As stated in the Process Mining Manifesto, comparing different Process Mining tools and techniques remains difficult. Any exploratory comparison starts with the datasets employed, which should represent well the different behaviours that data might exhibit. Although many event logs are available, most of them were not created for online scenarios. One of the challenges in Process Mining research is therefore to provide reliable benchmark datasets that represent online settings. We contribute to this aim by proposing a benchmark dataset composed of 942 event streams with concept drift. The event streams explore different characteristics of an online scenario, such as drift types, perspectives, sizes and noise percentages.

Description
This package contains 942 synthetic event streams that simulate concept drift in business processes. Each stream has exactly one drift. The streams vary in size, drift type, drift perspective and noise percentage. Each event in the stream contains four main attributes: case identification, event name, event start time and event completion time.

All event streams share a few common characteristics:
(i) The arrival rate of cases is fixed at 20 minutes, i.e. every 20 minutes an event from a new case arrives in the stream.
(ii) The time between events of the same case follows a normal distribution. For baseline behavior, the mean was set to 30 minutes and the standard deviation to 3 minutes; for drifted behavior, the mean and standard deviation were 5 and 0.5 minutes, respectively.
(iii) For time drifts, the model used within a single event stream stays the same, i.e. the drift happens only in the time perspective; this way, we avoid introducing other factors.
(iv) All drifts were created with 100, 500 and 1000 cases.
(v) Noise was introduced in the event stream for all the trace drifts. We chose to introduce noise in the form of anomalous cases.
The anomalies consisted of removing either the first or the last half of the trace. Different percentages of anomalous cases were then applied (5%, 10%, 15% and 20%) relative to the total stream size. Note that standard cases were swapped for anomalous ones, which preserves the event stream size.

We explored four different types of drift to compose the dataset of event streams:
Sudden drift: the first half of the stream is composed of the baseline model, and the second half is composed of the drifted model. The same idea applies to trace and time drifts (for time drifts the change is only in the time distribution and not in the actual model).
Recurring drift: for stream sizes of 100 traces, cases follow the division 33-33-34. The first and last concepts are the baseline, and the middle one is the drifted behavior, i.e. the baseline behavior starts the stream, fades after 33 traces and reappears for the last 34 traces, indicating a recurring characteristic; the same applies to time drifts. For 500 and 1000 traces, the division is 167-167-166 and 330-330-340, respectively.
Gradual drift: one concept slowly takes the place of another. Here, 20% of the stream was dedicated to the transition between concepts.
Incremental drift: for the trace perspective, an intermediate model between the baseline and the drifted model is required, since the process change is incremental. Only complex change patterns were used, because intermediate models can be created from them; for simple change patterns this is not possible, since the simple change is already the final form of the drift. Here, 20% of the stream log was dedicated to the intermediate behavior, so the division was 40-20-40 (baseline, intermediate model, incremental drift). The other sizes follow the same proportion. For incremental time drifts, all change patterns were used, since the incremental drift was applied to the time perspective, regardless of the model.
Accordingly, the transition state (20% of the stream log) was subdivided into four parts, in which the time distribution decreases by 5 minutes from one part to the next, following the incremental change of time.
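The segment divisions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator used to build the dataset: the function and label names are hypothetical, the gradual transition window is assumed to sit in the middle of the stream (the description only gives its 20% share), and the incremental time steps are read as means of 25, 20, 15 and 10 minutes between the 30-minute baseline and the 5-minute drifted behavior.

```python
from typing import List, Tuple

# Recurring-drift divisions exactly as listed in the description.
RECURRING_SPLITS = {100: (33, 33, 34), 500: (167, 167, 166), 1000: (330, 330, 340)}

BASELINE_MEAN_MIN = 30.0  # baseline mean time between events (minutes)
DRIFTED_MEAN_MIN = 5.0    # drifted mean time between events (minutes)


def segment_layout(drift_type: str, n_cases: int) -> List[Tuple[str, int]]:
    """Return (segment label, number of cases) pairs for one stream."""
    if drift_type == "sudden":
        half = n_cases // 2
        return [("baseline", half), ("drifted", n_cases - half)]
    if drift_type == "recurring":
        if n_cases not in RECURRING_SPLITS:
            raise ValueError(f"no recurring division listed for {n_cases} cases")
        first, middle, last = RECURRING_SPLITS[n_cases]
        return [("baseline", first), ("drifted", middle), ("baseline", last)]
    if drift_type in ("gradual", "incremental"):
        middle = n_cases * 20 // 100  # 20% of the stream
        first = (n_cases - middle) // 2  # assumption: window is centered
        label = "transition" if drift_type == "gradual" else "intermediate"
        return [("baseline", first), (label, middle),
                ("drifted", n_cases - first - middle)]
    raise ValueError(f"unknown drift type: {drift_type}")


def incremental_time_means(n_cases: int) -> List[float]:
    """Mean inter-event time per case for an incremental time drift.

    Assumption: the four transition parts use means of 25, 20, 15 and
    10 minutes, i.e. 5 minutes less in each successive part.
    """
    (_, first), (_, middle), (_, last) = segment_layout("incremental", n_cases)
    part = middle // 4
    means = [BASELINE_MEAN_MIN] * first
    for step in range(1, 5):
        # The last part absorbs any remainder of the integer division.
        size = part if step < 4 else middle - 3 * part
        means += [BASELINE_MEAN_MIN - 5.0 * step] * size
    means += [DRIFTED_MEAN_MIN] * last
    return means
```

Under these assumptions, a 100-case incremental time drift yields 40 cases at a 30-minute mean, four groups of 5 cases at 25, 20, 15 and 10 minutes, and 40 cases at 5 minutes. The recurring divisions are kept as a lookup table because the listed values (for example 330-330-340) do not follow a single rounding rule.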

Provider:
IEEE Dataport