leachl/obfuscated-exebench
Hugging Face | Updated 2026-03-26 | Indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/leachl/obfuscated-exebench
---
license: mit
task_categories:
- translation
language:
- en
tags:
- assembly
- deobfuscation
- llvm-ir
- aarch64
- binary-analysis
- reverse-engineering
- code
- obfuscation
- tigress
- exebench
pretty_name: Obfuscated ExeBench (AArch64)
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_flatten
    path: data/test_flatten-*
  - split: test_encode_arithmetic
    path: data/test_encode_arithmetic-*
  - split: test_combined
    path: data/test_combined-*
dataset_info:
  features:
  - name: fname
    dtype: string
  - name: func_def
    dtype: string
  - name: technique
    dtype: string
  - name: clean_asm
    dtype: string
  - name: obfuscated_asm
    dtype: string
  - name: clean_ir
    dtype: string
  - name: obfuscated_c
    dtype: string
  - name: tigress_seed
    dtype: int64
  - name: exebench_split
    dtype: string
  - name: c_deps
    dtype: string
  - name: func_c_signature
    dtype: string
  - name: cpp_wrapper
    dtype: string
  - name: dummy_funcs
    dtype: string
  - name: io_pairs
    dtype: string
  splits:
  - name: train
    num_bytes: 10559367144
    num_examples: 980000
  - name: test_flatten
    num_bytes: 130049242
    num_examples: 3909
  - name: test_encode_arithmetic
    num_bytes: 126010317
    num_examples: 3908
  - name: test_combined
    num_bytes: 131099302
    num_examples: 3909
  download_size: 1192158603
  dataset_size: 10946526005
---
# Obfuscated ExeBench (AArch64)
A large-scale dataset of **obfuscated AArch64 assembly** functions, each paired with its **clean LLVM IR**, original **C source code**, and **clean assembly**. It is designed for training neural models that **deobfuscate** and **lift** obfuscated binary code.
## Dataset Summary
| Property | Value |
|---|---|
| **Training samples** | 980,000 (from ExeBench `train_synth_compilable`) |
| **Test samples** | 11,726 (3,909 Flatten / 3,908 EncodeArithmetic / 3,909 Combined) |
| **Test source** | ExeBench `test_synth` (unseen functions) |
| **Architecture** | AArch64 (ARM64) |
| **Obfuscator** | [Tigress 4.0.11](https://tigress.wtf/) |
| **Compiler** | `aarch64-linux-gnu-gcc 15.2.0` (`-S -O0 -std=c11 -w`) |
| **Techniques** | Control-Flow Flattening, Arithmetic Encoding, Combined |
| **Format** | Parquet with Snappy compression |
## Splits
| Split | Rows | Source | Techniques |
|---|---|---|---|
| `train` | 980,000 | ExeBench `train_synth_compilable` | All three (balanced ⅓ each) |
| `test_flatten` | 3,909 | ExeBench `test_synth` | Flatten only |
| `test_encode_arithmetic` | 3,908 | ExeBench `test_synth` | EncodeArithmetic only |
| `test_combined` | 3,909 | ExeBench `test_synth` | Flatten+EncodeArithmetic |
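The split sizes above can be cross-checked against the card metadata with a few lines of arithmetic. The sketch below copies the row and byte counts from the YAML header; the names `SPLITS` and `DOWNLOAD_SIZE` are introduced here only for illustration.

```python
# Row and byte counts copied from `dataset_info.splits` in the YAML header.
SPLITS = {
    "train": {"rows": 980_000, "bytes": 10_559_367_144},
    "test_flatten": {"rows": 3_909, "bytes": 130_049_242},
    "test_encode_arithmetic": {"rows": 3_908, "bytes": 126_010_317},
    "test_combined": {"rows": 3_909, "bytes": 131_099_302},
}
DOWNLOAD_SIZE = 1_192_158_603  # `download_size` from the same header

# Total rows across the three test splits.
test_rows = sum(v["rows"] for k, v in SPLITS.items() if k.startswith("test_"))

# Sum of per-split byte counts; this should equal `dataset_size` in the header.
total_bytes = sum(v["bytes"] for v in SPLITS.values())

# Ratio of `dataset_size` to `download_size`.
ratio = total_bytes / DOWNLOAD_SIZE

print(f"test rows: {test_rows:,}")
print(f"total bytes: {total_bytes:,}")
print(f"dataset_size / download_size: {ratio:.2f}")
```

The per-split byte counts sum to the `dataset_size` value in the header, and the download is roughly nine times smaller than that figure.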
## Columns
| Column | Type | Description |
|---|---|---|
| `fname` | `string` | Function name |
| `func_def` | `string` | Original C source code of the function |
| `technique` | `string` | Obfuscation technique applied |
| `clean_asm` | `string` | Clean AArch64 assembly from ExeBench (`angha_gcc_arm_O0`) |
| `obfuscated_asm` | `string` | Obfuscated AArch64 assembly (after Tigress → GCC) |
| `clean_ir` | `string` | Clean LLVM IR from ExeBench (`angha_clang_ir_O0`) |
| `obfuscated_c` | `string` | Tigress-obfuscated C source (target function only, runtime stripped) |
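| `tigress_seed` | `int64` | Random seed used for Tigress (for reproducibility) |
| `exebench_split` | `string` | Source ExeBench split name |
| `c_deps` | `string` | C dependencies |
| `func_c_signature` | `string` | C signature of the function |
| `cpp_wrapper` | `string` | C++ wrapper code |
| `dummy_funcs` | `string` | Dummy functions |
| `io_pairs` | `string` | Input/output pairs |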
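As a quick sanity check when consuming rows, the feature list from the YAML header can be turned into a small schema validator. This is a minimal sketch: `EXPECTED_FEATURES` and `check_row` are hypothetical names, and the mapping of `string`/`int64` to Python `str`/`int` is an assumption about how rows appear once loaded as plain dictionaries.

```python
# Column names and dtypes as listed under `dataset_info.features`.
# `tigress_seed` is the only non-string field.
EXPECTED_FEATURES = {
    "fname": str,
    "func_def": str,
    "technique": str,
    "clean_asm": str,
    "obfuscated_asm": str,
    "clean_ir": str,
    "obfuscated_c": str,
    "tigress_seed": int,
    "exebench_split": str,
    "c_deps": str,
    "func_c_signature": str,
    "cpp_wrapper": str,
    "dummy_funcs": str,
    "io_pairs": str,
}


def check_row(row: dict) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for name, py_type in EXPECTED_FEATURES.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif row[name] is not None and not isinstance(row[name], py_type):
            found = type(row[name]).__name__
            problems.append(f"{name}: expected {py_type.__name__}, got {found}")
    return problems
```

Null values are tolerated here because the card does not say whether every column is populated for every row.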
## Usage
```python
from datasets import load_dataset
# Training data (all techniques)
train = load_dataset("leachl/obfuscated-exebench", split="train", streaming=True)
# Test sets (one per technique, from unseen test_synth functions)
test_flat = load_dataset("leachl/obfuscated-exebench", split="test_flatten")
test_ea = load_dataset("leachl/obfuscated-exebench", split="test_encode_arithmetic")
test_comb = load_dataset("leachl/obfuscated-exebench", split="test_combined")
```
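For a deobfuscation-and-lifting setup, each row is typically reduced to an input/target pair. The following is a sketch under stated assumptions, not the authors' training pipeline: the helper names (`to_pair`, `only_technique`, `preview_train`) are hypothetical, and it assumes the `technique` column holds the labels shown in the techniques table (for example `Flatten`).

```python
from itertools import islice
from typing import Iterable, Iterator


def to_pair(row: dict) -> dict:
    """Map one dataset row to an (input, target) pair: obfuscated asm -> clean IR."""
    return {
        "input": row["obfuscated_asm"],
        "target": row["clean_ir"],
        "technique": row["technique"],
    }


def only_technique(rows: Iterable[dict], technique: str) -> Iterator[dict]:
    """Lazily keep rows whose `technique` label matches exactly."""
    return (r for r in rows if r["technique"] == technique)


def preview_train(n: int = 3, technique: str = "Flatten") -> list:
    """Stream the first n matching training rows (needs network and `datasets`)."""
    # Imported lazily so the helpers above stay usable offline.
    from datasets import load_dataset

    stream = load_dataset("leachl/obfuscated-exebench", split="train", streaming=True)
    return [to_pair(r) for r in islice(only_technique(stream, technique), n)]
```

Because the training split is streamed, `preview_train()` reads only as many rows as it needs to find `n` matches rather than downloading the full split.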
## Obfuscation Techniques
Each function is independently obfuscated with one of three Tigress transformations:
| Technique | Tigress Flag | Description |
|---|---|---|
| `Flatten` | `--Transform=Flatten` | Control-flow flattening — replaces structured control flow with a switch-in-a-loop dispatcher |
| `EncodeArithmetic` | `--Transform=EncodeArithmetic` | Replaces simple arithmetic/boolean expressions with equivalent but more complex mixed Boolean-arithmetic (MBA) expressions |
| `Flatten+EncodeArithmetic` | Both transforms | Combined: flattening + arithmetic encoding applied sequentially |
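To make the two transformations concrete, here is a hand-written illustration in Python. It is not Tigress output and does not reproduce Tigress's actual encodings; it only shows one well-known MBA identity for addition and a switch-in-a-loop rewrite of a simple loop. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wrap-around


def add_plain(x: int, y: int) -> int:
    return (x + y) & MASK32


def add_mba(x: int, y: int) -> int:
    # Identity: x + y == (x ^ y) + 2 * (x & y).
    # XOR is the carry-less sum; AND marks the bit positions that carry.
    return ((x ^ y) + 2 * (x & y)) & MASK32


def sum_to_plain(n: int) -> int:
    total, i = 0, 1
    while i <= n:
        total += i
        i += 1
    return total


def sum_to_flat(n: int) -> int:
    # Same computation with flattened control flow: every basic block
    # becomes a numbered state, and a dispatcher loop picks the next one.
    total, i, state = 0, 0, 0
    while True:
        if state == 0:      # entry block
            total, i = 0, 1
            state = 1
        elif state == 1:    # loop condition
            state = 2 if i <= n else 3
        elif state == 2:    # loop body
            total += i
            i += 1
            state = 1
        else:               # exit block
            return total
```

In the combined setting, both kinds of rewrite appear in the same function, so the dispatcher's state updates and the arithmetic inside each block are obscured together.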
## Tigress Runtime
A representative Tigress runtime (~480 KB, ~7400 lines of C) is stored in `tigress_runtime.c`.
The `obfuscated_c` column contains **only** the target function body (runtime stripped).
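Since the runtime is stripped from `obfuscated_c`, recompiling a sample requires putting the two back together. The sketch below assumes that prepending the contents of `tigress_runtime.c` to the function yields a compilable translation unit; the card does not confirm this, and some functions may also need the `c_deps` column. The helper names are hypothetical, while the compiler name and flags are copied from the summary table.

```python
from pathlib import Path

# Compiler and flags copied from the "Compiler" row of the summary table.
GCC = "aarch64-linux-gnu-gcc"
FLAGS = ["-S", "-O0", "-std=c11", "-w"]


def write_translation_unit(runtime_c: str, obfuscated_c: str, out_dir: Path) -> Path:
    """Write runtime + function into one .c file (assumption: plain concatenation)."""
    src = out_dir / "unit.c"
    src.write_text(runtime_c + "\n\n" + obfuscated_c + "\n", encoding="utf-8")
    return src


def gcc_command(src: Path) -> list:
    """Build the argument list that emits AArch64 assembly next to the source."""
    return [GCC, *FLAGS, str(src), "-o", str(src.with_suffix(".s"))]
```

With the cross-compiler installed, the command can be run as `subprocess.run(gcc_command(src), check=True)`; the resulting `.s` file would contain the runtime's code as well as the target function.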
## License
MIT — same as the underlying ExeBench dataset.
Provider: leachl



