Claims Management Log Dataset with Digital Documents
Source: doi.org. Indexed on 2025-03-22.
Download link:
http://doi.org/10.17632/kdcspz6xtn.1
Description:
This is an event log dataset from a real-world claims management process of a mid-sized German insurance company. It is used in the article "Utilizing the Omnipresent: Incorporating Digital Documents into Predictive Process Monitoring Using Deep Neural Networks".
This event log is special in that it associates individual events with external context information in the form of digital documents with multiple pages. These digital documents were either received or produced during each event. The event log ("process_log.csv") is provided as a CSV file and contains the following attributes:
* timestamp: timestamp of the event
* instance_id: unique identifier of the process instance
* state: event type as integer
* state_name: event type
* type: damage type (instance outcome) as integer
* type_name: damage type (instance outcome)
* file_name: file name of the associated digital document
* n_pages: number of pages in the associated document
* time_since_last_event: elapsed seconds since the last event occurred
* log_time_since_last_event: natural logarithm of elapsed seconds since the last event occurred
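The two elapsed-time attributes can be illustrated with a minimal sketch. It assumes that elapsed time is measured between consecutive events of the same process instance and that the first event of an instance has no predecessor; how the published log actually fills that first value is not stated here, so the sketch leaves it as NaN. The sample rows are invented and only reuse the column names listed above.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for "process_log.csv"; all values are invented.
SAMPLE = """timestamp,instance_id,state
2020-01-01 08:00:00,A,0
2020-01-01 08:00:30,A,1
2020-01-01 09:00:30,A,2
2020-01-02 10:00:00,B,0
2020-01-02 10:01:00,B,1
"""


def add_elapsed_time(log: pd.DataFrame) -> pd.DataFrame:
    """Derive the two elapsed-time columns from the timestamps."""
    out = log.sort_values(["instance_id", "timestamp"]).copy()
    # Difference to the previous event within the same instance (NaT for the first event).
    delta = out.groupby("instance_id")["timestamp"].diff()
    out["time_since_last_event"] = delta.dt.total_seconds()
    # Assumption: the first event of an instance stays NaN; the dataset may handle it differently.
    out["log_time_since_last_event"] = np.log(out["time_since_last_event"])
    return out


log = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["timestamp"])
log = add_elapsed_time(log)
print(log[["instance_id", "time_since_last_event", "log_time_since_last_event"]])
```

For the real file, the same function could be applied after `pd.read_csv("process_log.csv", parse_dates=["timestamp"])`, provided the timestamp column parses as a date.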
The original digital documents are stored in PDF format and contain up to ten pages. For reasons of data privacy, they cannot be published directly. However, to let researchers use this data, the digital documents are published as feature vectors extracted by established pretrained neural networks (feature extractors). While these feature vectors cannot be used to reconstruct the source document, they contain meaningful information that can be used in applications such as Predictive Process Monitoring (PPM).
Feature vectors are extracted using four models:
* VGG-16 pretrained on the ImageNet dataset
* VGG-16 pretrained on the RVL-CDIP dataset
* BERT pretrained on German texts
* LayoutXLM pretrained on multilingual document data
They are stored as zipped NumPy arrays (e.g., "features/vgg_rvl.zip"). The file name (the file_name attribute of the event log) serves as the unique key to link the digital documents to their corresponding events in the log. Details and references are provided in the associated article.
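Linking feature vectors to events by file name might look like the sketch below. The internal layout of the archives is not described here, so the sketch assumes one `.npy` member per document, named after the document's file name, with one row per page. The archive is built in memory with invented names (`doc_001`, `doc_002`) and a hypothetical feature width of 4, so the code runs without the dataset.

```python
import io
import zipfile

import numpy as np


def build_demo_archive() -> bytes:
    """Create an in-memory zip that mimics the ASSUMED layout: one .npy per document."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, n_pages in [("doc_001", 2), ("doc_002", 1)]:
            arr_buf = io.BytesIO()
            # Hypothetical shape: (pages, feature dimension); real dimensions depend on the extractor.
            np.save(arr_buf, np.full((n_pages, 4), float(n_pages)))
            zf.writestr(f"{name}.npy", arr_buf.getvalue())
    return buf.getvalue()


def load_features(archive: bytes) -> dict:
    """Map each document key (member name without folder and extension) to its array."""
    features = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for member in zf.namelist():
            if member.endswith(".npy"):
                key = member.rsplit("/", 1)[-1][: -len(".npy")]
                features[key] = np.load(io.BytesIO(zf.read(member)))
    return features


features = load_features(build_demo_archive())

# Invented events that carry the linking key in "file_name".
events = [
    {"instance_id": "A", "file_name": "doc_001"},
    {"instance_id": "B", "file_name": "doc_002"},
]
linked = [(e["instance_id"], features[e["file_name"]].shape) for e in events]
print(linked)
```

If the real archives use a different member naming scheme or a single combined array, only `load_features` would need to change; the lookup by `file_name` stays the same.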
Finally, we also publish the exact data splits ("folds_and_splits.csv") that were used for model evaluation in the associated article.
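Applying a published split to the event log could be sketched as follows. The column layout of "folds_and_splits.csv" is not documented here, so the columns `fold` and `split` and the values `train` and `test` are assumptions made purely for illustration, as are all sample rows; only `instance_id` comes from the attribute list above.

```python
import io

import pandas as pd

# Invented stand-ins; the real layout of "folds_and_splits.csv" may differ.
SPLITS_CSV = """instance_id,fold,split
A,0,train
B,0,test
A,1,test
B,1,train
"""
LOG_CSV = """instance_id,state
A,0
A,1
B,0
"""


def select_events(log: pd.DataFrame, splits: pd.DataFrame, fold: int, split: str) -> pd.DataFrame:
    """Return the events whose process instance belongs to the given fold and split."""
    mask = (splits["fold"] == fold) & (splits["split"] == split)
    ids = splits.loc[mask, "instance_id"]
    return log[log["instance_id"].isin(ids)]


splits = pd.read_csv(io.StringIO(SPLITS_CSV))
log = pd.read_csv(io.StringIO(LOG_CSV))

train_fold0 = select_events(log, splits, fold=0, split="train")
test_fold0 = select_events(log, splits, fold=0, split="test")
print(len(train_fold0), len(test_fold0))
```

Selecting by process instance rather than by individual event keeps all events of one claim in the same partition, which is a common choice in process monitoring; whether the published splits follow it should be checked against the file itself.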
Provider:
Mendeley Data



