NapierOne Mixed File Dataset
Source: registry.opendata.aws (indexed 2025-03-25)
Download link:
https://registry.opendata.aws/napierone/
Description:
NapierOne is a modern cybersecurity mixed file dataset, primarily aimed at, but not limited to, ransomware detection and forensic analysis. The dataset contains over 500,000 distinct files, representing 44 distinct popular file types. It was designed to address the known deficiency in research reproducibility and to improve consistency by facilitating research replication and repeatability. The dataset was inspired by the Govdocs1 dataset, and it is intended that ‘NapierOne’ be used as a complement to this original dataset. An investigation was performed with the goal of determining the common file types currently in use. No specific research was found that explicitly provided this information, so an alternative consensus approach was employed. This involved combining the findings from multiple sources of file type usage into an overall ranked list. Then, for each of the common file types identified, 5,000 real-world example files were gathered and a specific data subset was created. In some circumstances, multiple data subsets were created for a specific file type, each subset representing a specific characteristic of that file type. For example, there are multiple data subsets for the ZIP file type, with each subset containing examples of a specific compression method. Ransomware execution tends to produce files that have high entropy, so examples of file types that naturally have this attribute are also present. The resulting dataset comprises more than 90 separate data subsets divided between 44 distinct file types, resulting in over 500,000 unique files in total. Currently, the dataset contains examples of the following file types: APK, BIN, BMP, CSS, CSV, DOC, DOCX, DWG, ELF, EPS, EPUB, EXE, GIF, GZIP, HTML, ICS, JS, JPG, JSON, MKV, MP3, MP4, ODS, OXPS, PDF, PNG, PPT, PPTX, PS1, RAR, SVG, TAR, TIF, TXT, WEBP, XLS, XLSX, XML, ZIP, ZLIB, 7Zip.
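The description notes that ransomware output tends to have high entropy and that some file types naturally share this property. As a rough illustration of what "entropy" means here, the sketch below computes the Shannon entropy of a file's bytes in bits per byte. It is a minimal example and not part of the dataset's tooling; the function names `shannon_entropy` and `file_entropy` are hypothetical, and the dataset description does not specify any particular entropy measure or threshold.

```python
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value present in the data
    # Sum p * log2(1/p) over the observed byte values, where p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def file_entropy(path: str) -> float:
    """Hypothetical helper: entropy of a whole file read into memory."""
    return shannon_entropy(Path(path).read_bytes())
```

Under this measure, a file consisting of one repeated byte scores 0, and a file in which all 256 byte values occur equally often scores 8. Compressed or encrypted content tends to score close to 8, which is why naturally high-entropy file types are useful as hard negative examples when evaluating entropy-based ransomware detectors.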
Provider:
School of Computing at Edinburgh Napier University



