
leachl/obfuscated-exebench

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/leachl/obfuscated-exebench
Description:
---
license: mit
task_categories:
- translation
language:
- en
tags:
- assembly
- deobfuscation
- llvm-ir
- aarch64
- binary-analysis
- reverse-engineering
- code
- obfuscation
- tigress
- exebench
pretty_name: Obfuscated ExeBench (AArch64)
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test_flatten
    path: data/test_flatten-*
  - split: test_encode_arithmetic
    path: data/test_encode_arithmetic-*
  - split: test_combined
    path: data/test_combined-*
dataset_info:
  features:
  - name: fname
    dtype: string
  - name: func_def
    dtype: string
  - name: technique
    dtype: string
  - name: clean_asm
    dtype: string
  - name: obfuscated_asm
    dtype: string
  - name: clean_ir
    dtype: string
  - name: obfuscated_c
    dtype: string
  - name: tigress_seed
    dtype: int64
  - name: exebench_split
    dtype: string
  - name: c_deps
    dtype: string
  - name: func_c_signature
    dtype: string
  - name: cpp_wrapper
    dtype: string
  - name: dummy_funcs
    dtype: string
  - name: io_pairs
    dtype: string
  splits:
  - name: train
    num_bytes: 10559367144
    num_examples: 980000
  - name: test_flatten
    num_bytes: 130049242
    num_examples: 3909
  - name: test_encode_arithmetic
    num_bytes: 126010317
    num_examples: 3908
  - name: test_combined
    num_bytes: 131099302
    num_examples: 3909
  download_size: 1192158603
  dataset_size: 10946526005
---

# Obfuscated ExeBench (AArch64)

A large-scale dataset of **obfuscated AArch64 assembly** functions paired with their **clean LLVM IR**, original **C source code**, and **clean assembly**. Designed for training neural models that can **deobfuscate** and **lift** obfuscated binary code.
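The split statistics in the card metadata can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the `num_examples`, `num_bytes`, `download_size`, and `dataset_size` values from the front matter; the variable names are illustrative and nothing is downloaded.

```python
# Split statistics copied from the dataset card's front matter.
SPLITS = {
    "train": {"num_examples": 980_000, "num_bytes": 10_559_367_144},
    "test_flatten": {"num_examples": 3_909, "num_bytes": 130_049_242},
    "test_encode_arithmetic": {"num_examples": 3_908, "num_bytes": 126_010_317},
    "test_combined": {"num_examples": 3_909, "num_bytes": 131_099_302},
}
DOWNLOAD_SIZE = 1_192_158_603
DATASET_SIZE = 10_946_526_005

# Total number of test rows across the three test splits.
test_rows = sum(
    s["num_examples"] for name, s in SPLITS.items() if name.startswith("test_")
)

# The per-split byte counts should add up to the reported dataset size.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Ratio of the reported dataset size to the reported download size.
size_ratio = DATASET_SIZE / DOWNLOAD_SIZE

print(f"test rows: {test_rows}")
print(f"split bytes match dataset_size: {total_bytes == DATASET_SIZE}")
print(f"dataset_size / download_size: {size_ratio:.2f}")
```

The three test splits sum to 11,726 rows, and the per-split byte counts add up exactly to `dataset_size`, so the reported figures are internally consistent.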
## Dataset Summary

| Property | Value |
|---|---|
| **Training samples** | ~980,000 (from ExeBench `train_synth_compilable`) |
| **Test samples** | 11,726 (3,909 Flatten / 3,908 EncodeArithmetic / 3,909 Combined) |
| **Test source** | ExeBench `test_synth` (unseen functions) |
| **Architecture** | AArch64 (ARM64) |
| **Obfuscator** | [Tigress 4.0.11](https://tigress.wtf/) |
| **Compiler** | `aarch64-linux-gnu-gcc 15.2.0` (`-S -O0 -std=c11 -w`) |
| **Techniques** | Control-Flow Flattening, Arithmetic Encoding, Combined |
| **Format** | Parquet with Snappy compression |

## Splits

| Split | Rows | Source | Techniques |
|---|---|---|---|
| `train` | ~980,000 | ExeBench `train_synth_compilable` | All three (balanced, ⅓ each) |
| `test_flatten` | 3,909 | ExeBench `test_synth` | Flatten only |
| `test_encode_arithmetic` | 3,908 | ExeBench `test_synth` | EncodeArithmetic only |
| `test_combined` | 3,909 | ExeBench `test_synth` | Flatten+EncodeArithmetic |

## Columns

| Column | Type | Description |
|---|---|---|
| `fname` | `string` | Function name |
| `func_def` | `string` | Original C source code of the function |
| `technique` | `string` | Obfuscation technique applied |
| `clean_asm` | `string` | Clean AArch64 assembly from ExeBench (`angha_gcc_arm_O0`) |
| `obfuscated_asm` | `string` | Obfuscated AArch64 assembly (after Tigress → GCC) |
| `clean_ir` | `string` | Clean LLVM IR from ExeBench (`angha_clang_ir_O0`) |
| `obfuscated_c` | `string` | Tigress-obfuscated C source (target function only, runtime stripped) |
| `tigress_seed` | `int64` | Random seed used for Tigress (for reproducibility) |
| `exebench_split` | `string` | Source ExeBench split name |
| `c_deps` | `string` | C dependencies |
| `func_c_signature` | `string` | C signature of the function |
| `cpp_wrapper` | `string` | C++ wrapper code |
| `dummy_funcs` | `string` | Dummy functions |
| `io_pairs` | `string` | Input/output pairs |

## Usage

```python
from datasets import load_dataset

# Training data (all techniques), streamed to avoid downloading everything up front
train = load_dataset("leachl/obfuscated-exebench", split="train", streaming=True)

# Test sets (one per technique, from unseen test_synth functions)
test_flat = load_dataset("leachl/obfuscated-exebench", split="test_flatten")
test_ea = load_dataset("leachl/obfuscated-exebench", split="test_encode_arithmetic")
test_comb = load_dataset("leachl/obfuscated-exebench", split="test_combined")
```

## Obfuscation Techniques

Each function is independently obfuscated with one of three Tigress transformations:

| Technique | Tigress Flag | Description |
|---|---|---|
| `Flatten` | `--Transform=Flatten` | Control-flow flattening: replaces structured control flow with a switch-in-a-loop dispatcher |
| `EncodeArithmetic` | `--Transform=EncodeArithmetic` | Replaces simple arithmetic/boolean expressions with equivalent but more complex mixed Boolean-arithmetic (MBA) expressions |
| `Flatten+EncodeArithmetic` | Both transforms | Combined: flattening and arithmetic encoding applied sequentially |

## Tigress Runtime

A representative Tigress runtime (~480 KB, ~7400 lines of C) is stored in `tigress_runtime.c`. The `obfuscated_c` column contains **only** the target function body (runtime stripped).

## License

MIT, the same as the underlying ExeBench dataset.
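For deobfuscation or lifting experiments, the rows are typically consumed as (input, target) pairs. A minimal sketch follows, assuming `obfuscated_asm` is the model input and `clean_ir` the target, and assuming the `technique` column holds the names listed in the Obfuscation Techniques table (the card does not confirm the exact strings). The helper `iter_pairs` and the placeholder rows are hypothetical, not part of the dataset or of the `datasets` library.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def iter_pairs(
    rows: Iterable[Dict[str, object]],
    technique: Optional[str] = None,
) -> Iterator[Tuple[object, object]]:
    """Yield (obfuscated_asm, clean_ir) pairs, optionally for one technique."""
    for row in rows:
        # Skip rows of other techniques when a filter is requested.
        if technique is not None and row["technique"] != technique:
            continue
        yield row["obfuscated_asm"], row["clean_ir"]


# Tiny stand-in rows with placeholder contents (not real dataset rows).
sample_rows = [
    {"technique": "Flatten", "obfuscated_asm": "<asm A>", "clean_ir": "<ir A>"},
    {"technique": "EncodeArithmetic", "obfuscated_asm": "<asm B>", "clean_ir": "<ir B>"},
]

flatten_pairs = list(iter_pairs(sample_rows, technique="Flatten"))
print(flatten_pairs)
```

Because the helper only needs an iterable of dict-like rows, the streaming `train` object from the Usage section could be passed in directly, for example `iter_pairs(train, technique="Flatten")`.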
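To make the two transformations concrete, here is a hand-written Python illustration of the ideas behind them. It is not Tigress output (Tigress operates on C and produces far more elaborate code); the function names are made up for this sketch. Flattening is shown as a dispatcher loop driven by a state variable, and arithmetic encoding as the well-known MBA identity `x + y == (x ^ y) + 2 * (x & y)`.

```python
def gcd_structured(a: int, b: int) -> int:
    """Euclid's algorithm with ordinary structured control flow."""
    while b != 0:
        a, b = b, a % b
    return a


def gcd_flattened(a: int, b: int) -> int:
    """Same logic, flattened into a dispatcher loop over basic blocks."""
    state = 0
    while True:
        if state == 0:      # block 0: loop condition
            state = 1 if b != 0 else 2
        elif state == 1:    # block 1: loop body
            a, b = b, a % b
            state = 0
        else:               # block 2: exit
            return a


def add_plain(x: int, y: int) -> int:
    """Plain addition."""
    return x + y


def add_mba(x: int, y: int) -> int:
    """Addition rewritten as a mixed Boolean-arithmetic expression."""
    return (x ^ y) + 2 * (x & y)


print(gcd_structured(48, 18), gcd_flattened(48, 18))
print(add_plain(13, 29), add_mba(13, 29))
```

Both variants of each function compute the same result; only the shape of the code changes, which is the property a deobfuscation model has to see through.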

Provider: leachl