
A digital forensics corpus representing the view of academics and practitioners 1999-2021

Source: DataCite Commons. Updated: 2024-05-28. Indexed: 2024-07-13.
Download link:
https://discovery.dundee.ac.uk/en/datasets/a-digital-forensics-corpus-representing-the-view-of-academics-and
Description:
A significant challenge in digital forensics is the lack of a framework for common language and knowledge. This creates barriers to communication, collaboration and knowledge sharing amongst stakeholders. Methods for creating a comprehensive set of common terms on a topic include Natural Language Processing (NLP) and Generative Artificial Intelligence (GenAI) algorithms. The effectiveness of these algorithms depends on the coverage, quality and quantity of the training corpus. As far as we know, no such corpus is readily available for training these algorithms. This dataset is a digital forensics practice and research corpus, validated by practitioners working in this domain. The corpus is ready for training new generations of NLP and GenAI algorithms. The associated paper also presents a systematic method of sharing a training corpus, in which the data structure, such as folder and file names, makes it convenient to interact with the data programmatically.

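The description says that folder and file names are meant to make the corpus easy to work with programmatically. As a minimal sketch of what that could look like, the Python below groups documents by their top-level folder and reads them in turn. The actual layout of the corpus is not given on this page, so the plain-text `.txt` extension, the UTF-8 encoding, and the helper names `index_corpus` and `read_documents` are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path


def index_corpus(root):
    """Group text files under `root` by the name of their top-level folder.

    Assumption: documents are stored as .txt files; the real corpus may
    use other extensions or a different folder hierarchy.
    """
    root = Path(root)
    index = defaultdict(list)
    for path in sorted(root.rglob("*.txt")):
        relative = path.relative_to(root)
        # Files sitting directly in the root are grouped under ".".
        group = relative.parts[0] if len(relative.parts) > 1 else "."
        index[group].append(path)
    return dict(index)


def read_documents(paths, encoding="utf-8"):
    """Yield (file stem, text) pairs for each path.

    Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
    """
    for path in paths:
        yield path.stem, path.read_text(encoding=encoding, errors="replace")
```

Calling `index_corpus` on the directory where the download was unpacked returns a dictionary mapping each top-level folder name to its files, which can then be passed to `read_documents` to feed a tokenizer or training pipeline.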
Provider:
University of Dundee
Created:
2024-05-24