Venafi/Machine-Identity-Spectra
Source: Hugging Face. Updated 2023-09-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Venafi/Machine-Identity-Spectra
Resource description:
---
license: apache-2.0
task_categories:
- feature-extraction
language:
- en
tags:
- certificates
- machine identity
- security
size_categories:
- 10M<n<100M
pretty_name: Machine Identity Spectra Dataset
configs:
- config_name: sample_data
  data_files: Data/CertificateFeatures-sample.parquet
---
# Machine Identity Spectra Dataset
<img src="https://huggingface.co/datasets/Venafi/Machine-Identity-Spectra/resolve/main/VExperimentalSpectra.svg" alt="Spectra Dataset" width="250">
## Summary
Venafi is excited to release the Machine Identity Spectra large dataset.
This collection of data contains extracted features from 19m+ certificates discovered over HTTPS (port 443) on the
public internet between July 20 and July 26, 2023.
The features are a combination of X.509 certificate features, RFC 5280 compliance checks,
and other attributes intended to be used for clustering, feature analysis, and as a basis for supervised learning tasks (labels not included).
Some rows may contain NaN values and may therefore require additional pre-processing for certain tasks.
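
As a minimal sketch of the kind of pre-processing this may involve, the snippet below drops columns that are mostly missing and imputes the rest. It assumes pandas is installed. The column names, the 50% threshold, and the `summarize_and_clean` helper are illustrative assumptions, not part of the dataset or its tooling.

```python
import numpy as np
import pandas as pd


def summarize_and_clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns that are mostly missing, then impute what remains."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    keep = missing_share[missing_share <= max_missing].index
    cleaned = df[keep].copy()
    for col in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            # Numeric columns: fill gaps with the column median.
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
        else:
            # Non-numeric columns: fill gaps with an explicit marker.
            cleaned[col] = cleaned[col].fillna("unknown")
    return cleaned


# Toy frame with hypothetical column names (not the dataset's real schema).
toy = pd.DataFrame({
    "key_size": [2048, np.nan, 4096, 2048],
    "is_self_signed": [0.0, 1.0, np.nan, 0.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "issuer_country": ["US", None, "DE", "US"],
})
cleaned = summarize_and_clean(toy)
```

Median imputation is only one option. For binary flags it may be more appropriate to treat a missing value as its own category, depending on the task.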
This project is part of Venafi Athena. Venafi is committed to enabling the data science community to increase the adoption of machine learning techniques
to identify machine identity threats and solutions.
Phillip Maraveyias at Venafi is the lead researcher for this dataset.
## Data Structure
The extracted features are contained in the Data folder as certificateFeatures.csv.gz. The uncompressed data is
approximately 10GB and contains 98 extracted features for approximately 19m certificates. A description of the features
and expected data types is contained in the base folder as features.csv.
The Data folder also contains a 500k row sample of the data in parquet format. This is displayed in the Data Viewer
for easy visual inspection of the dataset.
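
Because the full file is large once uncompressed, it can be convenient to stream it rather than load it all at once. The sketch below uses only the Python standard library to read a gzip-compressed CSV row by row and count empty or `nan` cells per column. The `count_missing` helper, the UTF-8 encoding, and the demo column names are assumptions made for illustration; the demo runs on a tiny temporary file rather than the real data.

```python
import csv
import gzip
import tempfile
from collections import Counter
from pathlib import Path


def count_missing(path, limit=None):
    """Stream a gzip-compressed CSV and count empty or 'nan' cells per column."""
    missing = Counter()
    rows = 0
    # Text-mode gzip.open decompresses lazily, so memory use stays small.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if limit is not None and rows >= limit:
                break
            rows += 1
            for name, value in row.items():
                if name is None:
                    continue  # surplus cells beyond the header are ignored
                if not isinstance(value, str) or value.strip().lower() in ("", "nan"):
                    missing[name] += 1
    return rows, missing


# Demo on a tiny file with hypothetical columns.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv.gz"
    with gzip.open(demo, "wt", newline="", encoding="utf-8") as out:
        csv.writer(out).writerows(
            [["key_size", "is_ca"], ["2048", "0"], ["", "1"], ["4096", "nan"]]
        )
    n_rows, n_missing = count_missing(demo)
```

Against a local copy of the dataset, a call such as `count_missing("Data/certificateFeatures.csv.gz", limit=100_000)` would inspect only the first 100,000 rows.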
## Clustering and PCA Example
To demonstrate a potential use of the data, KMeans clustering and Principal Component Analysis (PCA) were
conducted on the binary features in the dataset. KMeans was used to generate 10 clusters, and PCA was run with the top 3 components preserved.
The main goal here is to visualize the data and better understand how it may be used, so the choice of 10 clusters is mostly
for illustrative purposes.
The top three PCA components accounted for approximately 61%, 10%, and 6% of the total explained variance
(for a total of 77% of the overall data variance). Plots of the first 2 components in 2D space and top 3 components in
3D space grouped into the 10 clusters are shown below.
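
A comparable analysis can be sketched with scikit-learn as shown below. This is not the code used to produce the results above. It assumes NumPy and scikit-learn are installed, uses a synthetic binary matrix in place of the real feature columns, and assumes clustering is run on the original features with PCA used only for projection; the source does not specify that order.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the binary feature columns: 2,000 rows x 20 flags,
# each column switched on with its own random probability.
X = (rng.random((2000, 20)) < rng.random(20)).astype(float)

# Group the rows into 10 clusters, matching the illustrative choice above.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project onto the top 3 principal components for 2D and 3D plotting.
pca = PCA(n_components=3)
components = pca.fit_transform(X)

# Share of variance captured by each retained component.
ratios = pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

Plotting `components[:, 0]` against `components[:, 1]`, colored by `labels`, gives the 2D view; adding `components[:, 2]` gives the 3D view. The variance shares on synthetic data will differ from the figures reported for the real dataset.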
### Clusters in 2 Dimensions

### Clusters in 3 Dimensions

## Contact
Please contact athena-community@venafi.com if you have any questions about this dataset.
## References and Acknowledgement
The following papers provided inspiration for this project:
- Li, J.; Zhang, Z.; Guo, C. Machine Learning-Based Malicious X.509 Certificates’ Detection. Appl. Sci. 2021, 11, 2164. https://doi.org/10.3390/app11052164
- Liu, J.; Luktarhan, N.; Chang, Y.; Yu, W. Malcertificate: Research and Implementation of a Malicious Certificate Detection Algorithm Based on GCN. Appl. Sci. 2022, 12, 4440. https://doi.org/10.3390/app12094440



