ISAdetect dataset (ISAdetect binary file and object code dataset)
OpenDataLab. Updated 2026-05-24. Indexed 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/ISAdetect_dataset
Description:
This repository contains two datasets: one with the original binaries together with the code sections extracted from them ("full dataset"), and one with the code sections only ("code-only dataset"). The code sections are extracted by carving out the sections of each binary that are marked as executable. The binaries were scraped from Debian repositories.
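The extraction step can be illustrated with a minimal sketch. It assumes the binaries are ELF files (the usual format for Debian packages, though not stated above) and treats a section as code when its header carries the `SHF_EXECINSTR` flag. The function name `extract_code` is hypothetical, and this is not necessarily how the authors' own tooling performs the extraction.

```python
import struct

SHF_EXECINSTR = 0x4  # section flag: contains executable machine instructions
SHT_NOBITS = 8       # section type: occupies no bytes in the file (e.g. .bss)


def extract_code(data: bytes) -> bytes:
    """Concatenate the contents of all executable sections of an ELF image.

    Assumption: the input is a well-formed ELF file with fewer than
    0xff00 sections (extended section numbering is not handled).
    """
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little, 2 = big

    # Fields that follow the 16-byte e_ident array in the ELF header.
    ehdr_fmt = endian + ("HHIQQQIHHHHHH" if is64 else "HHIIIIIHHHHHH")
    fields = struct.unpack_from(ehdr_fmt, data, 16)
    shoff, shentsize, shnum = fields[5], fields[10], fields[11]

    # Section header layout; name, type, flags, addr, offset, size come first.
    shdr_fmt = endian + ("IIQQQQIIQQ" if is64 else "10I")
    chunks = []
    for i in range(shnum):
        sh = struct.unpack_from(shdr_fmt, data, shoff + i * shentsize)
        sh_type, sh_flags, sh_offset, sh_size = sh[1], sh[2], sh[4], sh[5]
        if sh_flags & SHF_EXECINSTR and sh_type != SHT_NOBITS:
            chunks.append(data[sh_offset:sh_offset + sh_size])
    return b"".join(chunks)
```

Calling `extract_code(open(path, "rb").read())` on a binary returns the bytes of its executable sections in section-header order. Whether that order matches the concatenation order used in the published files is not something this sketch can confirm.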
Two CSV files are also available, one computed from the full binaries and one from the code sections only. Each contains 293 features extracted from approximately 3,000 binaries per architecture. These features can be used to train classifiers.
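As a rough sketch of how such a feature file might be loaded for classifier training, the helper below reads a CSV into feature rows and labels. The column layout of the published CSVs is not described here, so everything about it is an assumption: that the file has a header row, that one column holds the architecture name (its header is passed in as `label_column`), and that every other column is numeric. The function names are hypothetical.

```python
import csv
from collections import Counter


def load_features(csv_path, label_column):
    """Read a feature CSV into (feature_names, rows, labels).

    Assumptions: the file has a header row, `label_column` names the
    column with the architecture label, and all other columns are numeric.
    """
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        feature_names = [c for c in reader.fieldnames if c != label_column]
        rows, labels = [], []
        for record in reader:
            rows.append([float(record[c]) for c in feature_names])
            labels.append(record[label_column])
    return feature_names, rows, labels


def samples_per_label(labels):
    """Count how many samples each architecture label has."""
    return Counter(labels)
```

After loading, `samples_per_label(labels)` gives a quick check against the figure of roughly 3,000 binaries per architecture, and `len(feature_names)` can be compared against the stated 293 features.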
This dataset contains thousands of binaries across the following 23 architectures: alpha, amd64, arm64, armel, armhf, hppa, i386, ia64, m68k, mips, mips64el, mipsel, powerpc, powerpcspe, powerpc64, powerpc64el, riscv, s390, s390x, sh4, sparc, sparc64, and x32. There are 98,500 binaries in total, with approximately 27 GB (uncompressed) of binaries and approximately 15 GB (uncompressed) of code segments extracted from the binaries.
Both datasets store binaries in directories named after their respective architectures. Each file in a directory is named with the MD5 hash of the original binary; a file whose hash name ends with ".code" contains only the concatenation of all code sections of that binary. Each architecture folder also contains a JSON file named after the architecture, e.g., amd64.json for the amd64 architecture. The structure of the JSON files is described below using JSON Schema-like notation.
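The directory layout described above (not the JSON structure, which this sketch skips) can be walked with a few lines of Python. It assumes that an original binary is stored under its bare MD5 hash with no extension and that its code file is `<md5>.code`. The helper names `index_dataset` and `misnamed_binaries` are hypothetical, and the per-architecture JSON file is ignored.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_dataset(root: Path) -> dict:
    """Map architecture -> {md5: {"binary": Path or None, "code": Path or None}}."""
    index = {}
    for arch_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        entries = {}
        for f in sorted(arch_dir.iterdir()):
            if not f.is_file() or f.suffix == ".json":
                continue  # skip the per-architecture metadata file
            if f.suffix == ".code":
                slot = entries.setdefault(f.stem, {"binary": None, "code": None})
                slot["code"] = f
            else:
                slot = entries.setdefault(f.name, {"binary": None, "code": None})
                slot["binary"] = f
        index[arch_dir.name] = entries
    return index


def misnamed_binaries(index: dict) -> list:
    """Return binaries whose file name does not match their MD5 digest."""
    return [
        entry["binary"]
        for entries in index.values()
        for name, entry in entries.items()
        if entry["binary"] is not None and md5_of(entry["binary"]) != name
    ]
```

In the code-only dataset every entry would have `"binary"` set to `None`, so `misnamed_binaries` returns an empty list there. On the full dataset it serves as a basic integrity check after download.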
This work is based on the studies by John Clemens (2015) "Automatic Classification of Object Code Using Machine Learning" and De Nicolao, Pietro et al. (2018) "ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation".
This dataset was published as part of the following papers:
1. Sami Kairajärvi, Andrei Costin, and Timo Hämäläinen. 2020. ISAdetect: Usable Automatic Detection of ISA (CPU Architecture and Endianness) for Executable Binaries and Object Code. In *Proceedings of the 10th ACM Conference on Data and Application Security and Privacy (CODASPY '20)*, March 16–18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3374664.3375742
2. Kairajärvi, Sami, Andrei Costin, and Timo Hämäläinen. "Toward Usable Automatic Detection of CPU Architecture and Endianness for Arbitrary Binaries and Object Code Sequences." arXiv preprint arXiv:1908.05459 (2019).
3. Kairajärvi, Sami. "Automatic Architecture and Endianness Identification Using Binary File Contents." (2019).
The code associated with this dataset is available at https://github.com/kairis/isadetect
Changelog:
- Version 6 - 29.3.2020: Added Weka models
- Version 5 - 17.1.2020: Cleaned up the dataset
- Version 4 - 13.1.2020: Initial release
Provider:
OpenDataLab
Created:
2022-08-11
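Summary
Dataset introduction

Background and challenges
Background overview
The dataset contains approximately 98,500 binaries for 23 CPU architectures, scraped from Debian repositories. It is split into two subsets (the full dataset and the code-only dataset), and 293 features were extracted for training classifiers that automatically detect the ISA (CPU architecture and endianness).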



