
theelderemo/linux-asm-pairs

Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/theelderemo/linux-asm-pairs
Description:
---
license: gpl-2.0
task_categories:
- text-generation
language:
- en
- asm
tags:
- assembly
- linux-kernel
- compiler
- bfc-bic
- objdump
- decompilation
- bug-fix
- text2text-generation
pretty_name: Linux Kernel BFC/BIC Assembly→Explanation Pairs
size_categories:
- 100K<n<1M
---

# Linux Kernel Assembly → Explanation Dataset

A dataset of disassembled Linux kernel functions paired with structured natural-language explanations, built from the [Linux Commits Dataset (Zenodo 10654193)](https://zenodo.org/records/10654193).

Each row corresponds to a single compiled function extracted from a `.c` or `.h` file touched by a **Bug-Fix Commit (BFC)** or **Bug-Introducing Commit (BIC)** in the Linux kernel git history. Source files are compiled with `gcc -O2 -g -fno-inline -fno-omit-frame-pointer` and disassembled with `objdump -d -S` (AT&T syntax, source-interleaved).

Intended for training or evaluating models on **assembly code understanding**: specifically, generating natural-language explanations from disassembled Linux kernel functions.

---

## Schema

| Column | Type | Description |
|---|---|---|
| `bfc_hash` | `string` | Hash of the bug-fix commit in the pair |
| `bic_hash` | `string` | Hash of the bug-introducing commit in the pair |
| `commit_hash` | `string` | The specific commit this function was extracted from |
| `commit_type` | `string` | `"bfc"` (bug-fix commit) or `"bic"` (bug-introducing commit) |
| `filename` | `string` | Source file path within the kernel tree (e.g. `kernel/sched/core.c`) |
| `func_name` | `string` | Name of the disassembled function |
| `asm` | `string` | `objdump -d -S` output for one function (AT&T syntax, source lines interleaved) |
| `explanation` | `string` | Structured description: commit type, function name, author, date, commit message, and role |
| `commit_message` | `string` | First 512 characters of the commit message |

---

## Usage

```python
from datasets import load_dataset

ds = load_dataset("theelderemo/linux-asm-pairs", split="train")

# Index a single row first; ds["asm"] alone would return the entire column.
print(ds[0]["asm"])
print(ds[0]["explanation"])
```

Filter to only bug-fix commits:

```python
bfc_only = ds.filter(lambda x: x["commit_type"] == "bfc")
```

---

## Data Pipeline

1. BFC/BIC pairs sourced from `bfc_bic.csv` in [Zenodo record 10654193](https://zenodo.org/records/10654193)
2. Full diffs fetched from a local bare clone of the Linux kernel git repo (snapshot: 2023-11-12)
3. Changed `.c` / `.h` files compiled with `gcc -O2 -g -fno-inline -fno-omit-frame-pointer`
4. Functions extracted via `objdump -d -S --no-show-raw-insn`
5. Explanations constructed from Perceval-parsed commit metadata

---

## License

Source assembly is derived from the Linux kernel, which is licensed under **GPL-2.0**. This dataset inherits that license. See [kernel.org/doc/html/latest/process/license-rules.html](https://www.kernel.org/doc/html/latest/process/license-rules.html) for details.

---

## Citation

If you use this dataset, please also cite the upstream source:

```bibtex
@dataset{linux_commits_2023,
  title = {Linux Commits Dataset},
  year  = {2023},
  url   = {https://zenodo.org/records/10654193}
}

@dataset{christopher_dickinson_2026,
  author    = { Christopher Dickinson },
  title     = { linux-asm-pairs (Revision 9ea25c3) },
  year      = 2026,
  url       = { https://huggingface.co/datasets/theelderemo/linux-asm-pairs },
  doi       = { 10.57967/hf/8453 },
  publisher = { Hugging Face }
}
```
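
---

## Example: pairing BIC and BFC rows

The schema suggests that the bug-introducing and bug-fixing versions of a function can be lined up through the shared `bfc_hash` and `bic_hash` columns. The sketch below is one possible way to do that, not part of the dataset's own tooling. It assumes (the card does not confirm this) that a function touched by both commits appears as two rows with the same `bfc_hash`, `bic_hash`, `filename`, and `func_name`, differing only in `commit_type`. The helper name `pair_bic_bfc` and the toy rows are hypothetical.

```python
from collections import defaultdict


def pair_bic_bfc(rows):
    """Return one record per function that has both a 'bic' and a 'bfc' row.

    Assumption: (bfc_hash, bic_hash, filename, func_name) identifies the same
    function across the two commits of a pair. The dataset card does not
    guarantee this, so treat the result as a heuristic join.
    """
    grouped = defaultdict(dict)
    for row in rows:
        key = (row["bfc_hash"], row["bic_hash"], row["filename"], row["func_name"])
        # Store the disassembly under its commit type ("bic" or "bfc").
        grouped[key][row["commit_type"]] = row["asm"]

    pairs = []
    for (bfc_hash, bic_hash, filename, func_name), versions in grouped.items():
        # Keep only functions present on both sides of the pair.
        if "bic" in versions and "bfc" in versions:
            pairs.append(
                {
                    "bfc_hash": bfc_hash,
                    "bic_hash": bic_hash,
                    "filename": filename,
                    "func_name": func_name,
                    "bic_asm": versions["bic"],
                    "bfc_asm": versions["bfc"],
                }
            )
    return pairs


# Toy rows with made-up hashes and placeholder assembly text.
toy_rows = [
    {"bfc_hash": "aaa", "bic_hash": "bbb", "commit_hash": "bbb", "commit_type": "bic",
     "filename": "kernel/sched/core.c", "func_name": "demo_fn", "asm": "buggy asm"},
    {"bfc_hash": "aaa", "bic_hash": "bbb", "commit_hash": "aaa", "commit_type": "bfc",
     "filename": "kernel/sched/core.c", "func_name": "demo_fn", "asm": "fixed asm"},
    {"bfc_hash": "ccc", "bic_hash": "ddd", "commit_hash": "ccc", "commit_type": "bfc",
     "filename": "demo/other.c", "func_name": "other_fn", "asm": "only the fix"},
]

pairs = pair_bic_bfc(toy_rows)
for p in pairs:
    print(p["filename"], p["func_name"])
```

Because the helper only reads dictionary keys, it should also accept the rows yielded by iterating over the loaded `ds` object, although grouping the full split in memory may be slow. Functions that exist on only one side of a pair (such as `other_fn` in the toy data) are dropped.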

Provider:
theelderemo