Venafi/Machine-Identity-Spectra

Name: Venafi/Machine-Identity-Spectra
Creator: Venafi
Published: 2023-09-17 17:21:41
License: apache-2.0

Source: Hugging Face. Updated 2023-09-17. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Venafi/Machine-Identity-Spectra

Resource description:

---
license: apache-2.0
task_categories:
- feature-extraction
language:
- en
tags:
- certificates
- machine identity
- security
size_categories:
- 10M<n<100M
pretty_name: Machine Identity Spectra Dataset
configs:
- config_name: sample_data
  data_files: Data/CertificateFeatures-sample.parquet
---

# Machine Identity Spectra Dataset

<img src="https://huggingface.co/datasets/Venafi/Machine-Identity-Spectra/resolve/main/VExperimentalSpectra.svg" alt="Spectra Dataset" width="250">

## Summary

Venafi is excited to release the Machine Identity Spectra large dataset. This collection contains features extracted from more than 19 million certificates discovered over HTTPS (port 443) on the public internet between July 20 and July 26, 2023. The features are a combination of X.509 certificate features, RFC5280 compliance checks, and other attributes intended for clustering, feature analysis, and as a base for supervised learning tasks (labels not included). Some rows contain NaN values and may require additional pre-processing for certain tasks.

This project is part of Venafi Athena. Venafi is committed to enabling the data science community to increase the adoption of machine learning techniques to identify machine identity threats and solutions.

Phillip Maraveyias at Venafi is the lead researcher for this dataset.

## Data Structure

The extracted features are contained in the Data folder as certificateFeatures.csv.gz. The uncompressed data is approximately 10GB and contains 98 extracted features for approximately 19 million certificates. A description of the features and their expected data types is contained in the base folder as features.csv.

The Data folder also contains a 500k-row sample of the data in parquet format. This sample is displayed in the Data Viewer for easy visual inspection of the dataset.

## Clustering and PCA Example

To demonstrate a potential use of the data, clustering and Principal Component Analysis (PCA) were conducted on the binary features in the dataset. KMeans clustering was used to generate 10 clusters, and PCA was conducted with the top 3 components preserved. The primary interest here is visualizing the data and better understanding how it may be used, so the choice of 10 clusters is mostly illustrative.

The top three PCA components accounted for approximately 61%, 10%, and 6% of the explained variance (77% of the overall data variance in total). Plots of the first 2 components in 2D space and the top 3 components in 3D space, grouped into the 10 clusters, are shown below.

### Clusters in 2 Dimensions

![](ClusterAnalysis/clusters2d.png)

### Clusters in 3 Dimensions

![](ClusterAnalysis/clusters3d.png)

## Contact

Please contact athena-community@venafi.com if you have any questions about this dataset.

## References and Acknowledgement

The following papers provided inspiration for this project:

- Li, J.; Zhang, Z.; Guo, C. Machine Learning-Based Malicious X.509 Certificates’ Detection. Appl. Sci. 2021, 11, 2164. https://doi.org/10.3390/app11052164
- Liu, J.; Luktarhan, N.; Chang, Y.; Yu, W. Malcertificate: Research and Implementation of a Malicious Certificate Detection Algorithm Based on GCN. Appl. Sci. 2022, 12, 4440. https://doi.org/10.3390/app12094440

Provider:

Venafi

Summary of original information

Machine Identity Spectra Dataset

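Overview

Venafi has released the Machine Identity Spectra large dataset. It contains features extracted from more than 19 million certificates discovered over HTTPS (port 443) on the public internet between July 20 and July 26, 2023. The features include X.509 certificate features, RFC5280 compliance checks, and other attributes, and are intended for clustering, feature analysis, and as a base for supervised learning tasks (labels not included). Some rows may contain NaN values and may therefore require additional pre-processing.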

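Data structure

The extracted features are stored in the Data folder as certificateFeatures.csv.gz. The uncompressed data is approximately 10GB and contains 98 extracted features for approximately 19 million certificates. Feature descriptions and expected data types are given in features.csv in the base folder. The Data folder also contains a 500k-row sample in parquet format, which can be inspected visually in the Data Viewer.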
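Since the full file is roughly 10GB uncompressed, one option is to stream it rather than load it at once. The following is a minimal sketch using only the Python standard library that counts missing values per column. The column names in the demo (`key_size`, `is_self_signed`) are hypothetical stand-ins, not taken from features.csv, and the spellings treated as missing (empty string or `nan`, case-insensitive) are an assumption about how the file encodes them.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter

# Assumed spellings of a missing value (compared after strip + lowercase).
MISSING = {"", "nan"}


def count_missing(path, max_rows=None):
    """Stream a gzipped CSV and count missing values per column."""
    missing = Counter()
    n_rows = 0
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first line holds the column names
        for row in reader:
            if max_rows is not None and n_rows >= max_rows:
                break
            n_rows += 1
            for column, value in zip(header, row):
                if value.strip().lower() in MISSING:
                    missing[column] += 1
    return n_rows, dict(missing)


# Demo on a tiny synthetic file (hypothetical columns, NOT real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, mode="wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["key_size", "is_self_signed"])
        writer.writerows([["2048", "0"], ["nan", "1"], ["4096", ""]])
    n_rows, missing = count_missing(demo_path)

print(n_rows, missing)
```

Pointing `count_missing` at a local copy of certificateFeatures.csv.gz with a small `max_rows` gives a quick view of which features need imputation or filtering before a full pass.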

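Clustering and PCA example

To demonstrate a potential use of the dataset, clustering and principal component analysis (PCA) were run on its binary features. KMeans produced 10 clusters, chosen mainly to visualize the data and better understand how it may be used, and PCA kept the top 3 components. These three components account for approximately 61%, 10%, and 6% of the explained variance (77% of the overall data variance in total). The clusters are plotted on the first two components in 2D and on the top three components in 3D.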

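Clusters in 2 dimensions (figure: ClusterAnalysis/clusters2d.png)

Clusters in 3 dimensions (figure: ClusterAnalysis/clusters3d.png)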
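The analysis described above can be approximated with scikit-learn. This is a sketch under stated assumptions, not the authors' actual pipeline: the source does not say whether KMeans was fit on the raw binary features or on the PCA projection, so this version fits on the raw features and uses PCA only for plotting coordinates. The input is synthetic binary data, and the helper name `cluster_and_project` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_and_project(binary_features, n_clusters=10, n_components=3, seed=0):
    """Cluster rows with KMeans and project them onto the top principal components."""
    X = np.asarray(binary_features, dtype=float)
    # Assumption: clusters are fit on the original features, not on the projection.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    pca = PCA(n_components=n_components, random_state=seed)
    projected = pca.fit_transform(X)
    return labels, projected, pca.explained_variance_ratio_


# Synthetic stand-in for the binary certificate features (NOT real data):
# 10 random binary prototypes, sampled 1000 times with 5% of bits flipped.
rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(10, 20))
rows = prototypes[rng.integers(0, 10, size=1000)]
flip = rng.random(rows.shape) < 0.05
X_demo = np.where(flip, 1 - rows, rows)

labels, projected, ratios = cluster_and_project(X_demo)
print(projected.shape, np.round(ratios, 3))
```

The first two columns of `projected` give the 2D scatter coordinates and all three give the 3D ones, with `labels` supplying the color of each point. On the real data, rows with NaN values would need to be dropped or imputed first, since neither estimator accepts them.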

Contact

Please contact athena-community@venafi.com with any questions about this dataset.
