hendriow/mh100k
Source: Hugging Face. Updated 2025-11-27; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/hendriow/mh100k
Resource overview:
---
task_categories:
- tabular-classification
- other
tags:
- security
- android
- malware
- cybersecurity
- int8
size_categories:
- 100K<n<1M
license: cc-by-4.0
language:
- en
pretty_name: MH-100K Android Malware Dataset
---
# MH-100K: An innovative Android Malware Dataset for advanced research
## Dataset Summary
**MH-100K** is a large-scale dataset for Android malware detection research. It contains **101,975** Android applications (APKs) collected between **2010 and 2022**, providing a diverse set of samples to study malware evolution over more than a decade.
The dataset features high-dimensional tabular data representing the static analysis of these applications. It includes permissions, API calls, and intents, along with extensive metadata and detection labels derived from VirusTotal.
## Dataset Structure
The repository contains the dataset in a consolidated, efficient format:
- **`mh100.parquet`**: The main dataset file containing the feature matrix and metadata for all 101,975 instances. Stored in `int8` format for efficiency.
- **`mh100-labels.csv`**: Contains the label information (Malware vs Benign) and VirusTotal metadata.
- **`feature_names.csv`**: A mapping file that lists the names of the features corresponding to the columns in the feature matrix.
## How to Use
You can download the dataset files from the Hugging Face Hub with the `huggingface_hub` library and read them with `pandas`.
### Quick Load
```python
from huggingface_hub import hf_hub_download
import pandas as pd
# 1. Download the specific file to your cache
file_path = hf_hub_download(
    repo_id="hendriow/mh100k",
    filename="mh100.parquet",
    repo_type="dataset"
)
# 2. Read it directly into a dataframe
df = pd.read_parquet(file_path)
df.info()
```
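Once the frame is loaded, it can help to separate the `int8` feature columns from any metadata columns before modelling. The sketch below is a minimal illustration on a tiny synthetic frame. It assumes that metadata columns (for example a hash or package name) are stored with a non-`int8` dtype, which this card does not confirm, and the helper name `split_features_and_metadata` and the demo column names are hypothetical.

```python
import pandas as pd


def split_features_and_metadata(df: pd.DataFrame):
    """Split a frame into int8 feature columns and everything else.

    Assumption: metadata columns are NOT stored as int8.
    """
    feature_cols = df.select_dtypes(include=["int8"]).columns
    features = df[feature_cols]
    metadata = df.drop(columns=feature_cols)
    return features, metadata


# Synthetic stand-in for the real data; column names are made up.
demo = pd.DataFrame({
    "sha256": ["aa", "bb"],
    "perm_a": pd.Series([1, 0], dtype="int8"),
    "api_b": pd.Series([0, 1], dtype="int8"),
})

features, metadata = split_features_and_metadata(demo)
print(features.shape, list(metadata.columns))
# Bytes used by the feature cells alone (index excluded).
print(features.memory_usage(index=False).sum(), "bytes of feature data")
```

On the real file, the same call would be `split_features_and_metadata(df)`. Inspect `metadata.columns` first to confirm the dtype assumption holds.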
### Loading with Feature Names
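Since the dataset is high-dimensional (>10k features), the columns in the parquet file might be indexed rather than named. You can map them back to their real names (e.g., `android.permission.INTERNET`) using the `feature_names.csv` file. The companion files are downloaded the same way as the main file; the snippet below fetches `mh100-labels.csv`, and `feature_names.csv` can be fetched by changing `filename`.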
```python
from huggingface_hub import hf_hub_download
import pandas as pd
# 1. Download the labels file to your local cache
csv_path = hf_hub_download(
    repo_id="hendriow/mh100k",
    filename="mh100-labels.csv",
    repo_type="dataset"
)
# 2. Read into a DataFrame
labels_df = pd.read_csv(csv_path)
labels_df.head()
```
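The renaming step itself can be sketched as follows. This is a minimal illustration, not the authors' procedure. It assumes `feature_names.csv` yields one name per feature column, in the same order as the columns of the feature matrix; the actual layout of that file is not described here, so reading the names out of it is left to the reader. The helper `apply_feature_names` is hypothetical, and the demo uses a tiny synthetic frame.

```python
import pandas as pd


def apply_feature_names(df: pd.DataFrame, names: list) -> pd.DataFrame:
    """Return a new frame whose columns are renamed positionally to `names`.

    Raises ValueError when the counts differ, so a mismatch is never silent.
    """
    if len(names) != df.shape[1]:
        raise ValueError(f"expected {df.shape[1]} names, got {len(names)}")
    return df.set_axis(names, axis=1)


# Synthetic stand-in for an index-labelled feature matrix.
demo = pd.DataFrame([[1, 0], [0, 1]], columns=["0", "1"], dtype="int8")
demo_names = ["android.permission.INTERNET", "android.permission.CAMERA"]

named = apply_feature_names(demo, demo_names)
print(list(named.columns))
```

The explicit length check matters with this many columns, because a positional rename that is off by one would mislabel every feature after the gap.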
## Dataset Description
The **MH-100K** dataset is a comprehensive collection of Android malware information, comprising 101,975 samples.
- **Data Type:** Tabular (Int8)
- **Time Period:** 2010 - 2022
- **Source:** Samples randomly selected from AndroZoo.
### Features and Attributes
- SHA256 hash (used as the APK's unique identifier)
- File name
- Package name
- Android's official compilation API
- 166 permissions
- 24,417 API calls
- 250 intents
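The counts above give a sense of why `int8` storage matters. The following back-of-the-envelope sketch sums the three feature groups and estimates the size of a dense in-memory matrix. It assumes every permission, API call, and intent is one column of the feature matrix, which is an inference from the list rather than something stated here.

```python
# Counts taken from this card.
N_SAMPLES = 101_975
N_PERMISSIONS = 166
N_API_CALLS = 24_417
N_INTENTS = 250

# Assumption: each listed feature is one column of the matrix.
n_features = N_PERMISSIONS + N_API_CALLS + N_INTENTS

int8_bytes = N_SAMPLES * n_features   # one byte per int8 cell
int64_bytes = int8_bytes * 8          # eight bytes per int64 cell

print(f"feature columns: {n_features:,}")
print(f"dense int8 matrix:  {int8_bytes / 1e9:.2f} GB")
print(f"dense int64 matrix: {int64_bytes / 1e9:.2f} GB")
```

Under these assumptions the matrix has 24,833 feature columns and needs roughly 2.5 GB as `int8`, so avoid operations that silently upcast the frame to a wider dtype.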
### About VirusTotal API
The [VirusTotal API](https://developers.virustotal.com/reference/overview) was central to building this dataset. VirusTotal aggregates the verdicts of many antivirus engines on submitted files and URLs. Each API request returns a JSON report, which is used to categorize the APK according to its perceived threat.
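To illustrate how a JSON report could be turned into a label, here is a minimal sketch. It assumes a report shaped like a VirusTotal API v3 file object, where `data.attributes.last_analysis_stats.malicious` holds the number of engines flagging the file. The threshold `HYPOTHETICAL_MIN_DETECTIONS` and the function names are illustrative only; the labelling rule actually used for MH-100K is not stated here, and the sample report is synthetic.

```python
# Hypothetical threshold, NOT the rule used to build MH-100K.
HYPOTHETICAL_MIN_DETECTIONS = 1


def count_malicious(report: dict) -> int:
    """Read the number of engines that flagged the file as malicious."""
    stats = report["data"]["attributes"]["last_analysis_stats"]
    return int(stats.get("malicious", 0))


def label_apk(report: dict, min_detections: int = HYPOTHETICAL_MIN_DETECTIONS) -> str:
    """Return 'malware' if enough engines flagged the file, else 'benign'."""
    return "malware" if count_malicious(report) >= min_detections else "benign"


# Synthetic report fragment for demonstration.
sample_report = {
    "data": {
        "attributes": {
            "last_analysis_stats": {"malicious": 7, "undetected": 55}
        }
    }
}

print(count_malicious(sample_report), label_apk(sample_report))
```

Raising `min_detections` trades recall for fewer false positives, so any threshold should be checked against the label columns shipped in `mh100-labels.csv` rather than taken from this sketch.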
## Citation
If you use this dataset in your research, please cite the original authors:
> @article{bragancca2023android,
title={Android malware detection with MH-100K: An innovative dataset for advanced research},
author={Bragan{\c{c}}a, Hendrio and Rocha, Vanderson and Barcellos, Lucas and Souto, Eduardo and Kreutz, Diego and Feitosa, Eduardo},
journal={Data in Brief},
volume={51},
pages={109750},
year={2023},
publisher={Elsevier}
}
> @inproceedings{bragancca2023capturing,
title={Capturing the behavior of android malware with mh-100k: A novel and multidimensional dataset},
author={Bragan{\c{c}}a, Hendrio and Rocha, Vanderson and Barcellos, Lucas Vilanova and Souto, Eduardo and Kreutz, Diego and Feitosa, Eduardo},
booktitle={Simp{\'o}sio Brasileiro de Seguran{\c{c}}a da Informa{\c{c}}{\~a}o e de Sistemas Computacionais (SBSeg)},
pages={510--515},
year={2023},
organization={SBC}
}
Provided by:
hendriow



