
ISAdetect binary file and object code dataset

Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-28
Download link:
https://etsin.fairdata.fi/dataset/9f6203f5-2360-426f-b9df-052f3f936ed2
Description:
This repository holds two datasets: one with both the original binaries and the code sections extracted from them (“full dataset”), and one with only the code sections (“only code sections”). The code sections were extracted by carving out sections of the binary that were marked as executable. The binaries were scraped from Debian repositories.

There are also two CSV files available, one for full binaries and one for only code sections, which include the 293 features extracted from about 3000 binaries per architecture. These features can be used to train classifiers.

The dataset consists of thousands of binaries for the following 23 architectures: alpha, amd64, arm64, armel, armhf, hppa, i386, ia64, m68k, mips, mips64el, mipsel, powerpc, powerpcspe, powerpc64, powerpc64el, riscv, s390, s390x, sh4, sparc, sparc64 and x32. There are 98 500 binary files, about 27 gigabytes (uncompressed) of binary files and about 15 gigabytes (uncompressed) of only code sections from those binary files.

Both datasets hold the binaries in directories named by the architecture. The files inside the folders are named as MD5 hashes of the original binary files, and a hash file ending with “.code” contains only the concatenation of all code sections of the original binary file. Each architecture folder also holds a JSON file named after the architecture, e.g. amd64 holds amd64.json.
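The directory layout described above (files named by the MD5 hash of the original binary, with a matching “.code” file next to each) can be sanity-checked after download. The following is a minimal sketch, not part of the ISAdetect tooling: the function names `md5_of_file` and `check_architecture_dir` are hypothetical, and it assumes the “full dataset” variant, where each architecture folder contains both the original binaries and their “.code” files.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_architecture_dir(arch_dir):
    """Check one architecture folder of the full dataset (hypothetical helper).

    Returns two sorted lists of file names:
    - binaries whose name does not equal the MD5 of their content
    - binaries that have no '<md5>.code' file next to them
    """
    arch_dir = Path(arch_dir)
    mismatched = []
    missing_code = []
    for entry in sorted(arch_dir.iterdir()):
        # Skip the per-architecture JSON file and the code-section files.
        if not entry.is_file() or entry.suffix in {".json", ".code"}:
            continue
        if md5_of_file(entry) != entry.name:
            mismatched.append(entry.name)
        if not (arch_dir / (entry.name + ".code")).exists():
            missing_code.append(entry.name)
    return mismatched, missing_code
```

Calling `check_architecture_dir("amd64")` on an extracted folder would return two empty lists if every binary matches its name and has a code-section file. For the “only code sections” dataset the MD5 check does not apply, because the hash in the file name refers to the original binary rather than to the “.code” file.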
The structure of the JSON file is as follows (described in a JSON Schema-like notation):

"architecture": { "type": "string", "description": "Name of the architecture" },
"code_sections": { "type": "array", "items": { "type": "string" }, "description": "Names of the code sections that were used" },
"endianness": { "type": "string", "enum": [ "big", "little" ], "description": "Endianness of the binary file" },
"filehash": { "type": "string", "description": "MD5 hash of the original binary file" },
"fileinfo": { "type": "string", "description": "Output of running Linux 'file' command on the original binary file" },
"filename": { "type": "string", "description": "Path where the binary file was located in the Debian package" },
"filesize": { "type": "integer", "description": "File size of the original binary file in bytes" },
"wordsize": { "type": "integer", "description": "Word size of the binary file" },
"deb_package": { "type": "string", "description": "Name of the Debian package from which the binary file was extracted" },
"only_code": { "type": "string", "description": "Name of the file which holds only the executable code sections of the original binary file. Should be the MD5 sum with .code extension" },
"only_code_size": { "type": "integer", "description": "File size of the 'only code' file in bytes" }

This work is based on work by John Clemens, 2015, “Automatic classification of object code using machine learning” and De Nicolao, Pietro et al., 2018, “ELISA: ELiciting ISA of Raw Binaries for Fine-Grained Code and Data Separation”.

This dataset is released as part of the following papers:

Sami Kairajärvi, Andrei Costin, and Timo Hämäläinen. 2020. ISAdetect: Usable automated detection of ISA (CPU architecture and endianness) for executable binary files and object code. In Tenth ACM Conference on Data and Application Security and Privacy (CODASPY’20), March 16–18, 2020, New Orleans, LA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3374664.3375742

Kairajärvi, Sami, Andrei Costin, and Timo Hämäläinen. "Towards usable automated detection of CPU architecture and endianness for arbitrary binary files and object code sequences." arXiv preprint arXiv:1908.05459 (2019).

Kairajärvi, Sami. "Automatic identification of architecture and endianness using binary file contents." (2019).

The code associated with this dataset can be found at https://github.com/kairis/isadetect
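The metadata fields listed in the JSON description above can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper `record_problems` and the sample record are hypothetical (the values are illustrative placeholders, not taken from the dataset), and the sketch validates a single metadata object only, since the description does not state how the records are arranged at the top level of each architecture's JSON file.

```python
import json

# Field names and types as given in the dataset description.
EXPECTED_FIELDS = {
    "architecture": str,
    "code_sections": list,
    "endianness": str,
    "filehash": str,
    "fileinfo": str,
    "filename": str,
    "filesize": int,
    "wordsize": int,
    "deb_package": str,
    "only_code": str,
    "only_code_size": int,
}


def record_problems(record):
    """Return a list of human-readable problems found in one metadata record."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")

    endianness = record.get("endianness")
    if isinstance(endianness, str) and endianness not in ("big", "little"):
        problems.append("endianness must be 'big' or 'little'")

    sections = record.get("code_sections")
    if isinstance(sections, list) and not all(isinstance(s, str) for s in sections):
        problems.append("code_sections must contain only strings")

    # The description says only_code should be the MD5 sum plus '.code'.
    filehash = record.get("filehash")
    only_code = record.get("only_code")
    if isinstance(filehash, str) and isinstance(only_code, str):
        if only_code != filehash + ".code":
            problems.append("only_code does not equal filehash + '.code'")
    return problems


# Hypothetical record with placeholder values, for illustration only.
SAMPLE_JSON = """
{
  "architecture": "amd64",
  "code_sections": [".text"],
  "endianness": "little",
  "filehash": "d41d8cd98f00b204e9800998ecf8427e",
  "fileinfo": "ELF 64-bit LSB executable, x86-64",
  "filename": "usr/bin/example",
  "filesize": 4096,
  "wordsize": 64,
  "deb_package": "example",
  "only_code": "d41d8cd98f00b204e9800998ecf8427e.code",
  "only_code_size": 1024
}
"""
sample_record = json.loads(SAMPLE_JSON)
```

Here `record_problems(sample_record)` returns an empty list, while a record with a missing field or an unexpected endianness value yields one message per problem.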
Author:
Sami Kairajärvi
Release date:
2023-10-10
Created:
2023-10-10