theelderemo/linux-asm-pairs
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/theelderemo/linux-asm-pairs
---
license: gpl-2.0
task_categories:
- text-generation
language:
- en
- asm
tags:
- assembly
- linux-kernel
- compiler
- bfc-bic
- objdump
- decompilation
- bug-fix
- text2text-generation
pretty_name: Linux Kernel BFC/BIC Assembly→Explanation Pairs
size_categories:
- 100K<n<1M
---
# Linux Kernel Assembly → Explanation Dataset
A dataset of disassembled Linux kernel functions paired with structured natural-language explanations, built from the [Linux Commits Dataset (Zenodo 10654193)](https://zenodo.org/records/10654193).
Each row corresponds to a single compiled function extracted from a `.c` or `.h` file touched by a **Bug-Fix Commit (BFC)** or **Bug-Introducing Commit (BIC)** in the Linux kernel git history. Source files are compiled with `gcc -O2 -g -fno-inline -fno-omit-frame-pointer` and disassembled with `objdump -d -S` (AT&T syntax, source-interleaved).
The dataset is intended for training or evaluating models on **assembly code understanding**, specifically generating natural-language explanations from disassembled Linux kernel functions.
---
## Schema
| Column | Type | Description |
|---|---|---|
| `bfc_hash` | `string` | Hash of the bug-fix commit in the pair |
| `bic_hash` | `string` | Hash of the bug-introducing commit in the pair |
| `commit_hash` | `string` | The specific commit this function was extracted from |
| `commit_type` | `string` | `"bfc"` (bug-fix commit) or `"bic"` (bug-introducing commit) |
| `filename` | `string` | Source file path within the kernel tree (e.g. `kernel/sched/core.c`) |
| `func_name` | `string` | Name of the disassembled function |
| `asm` | `string` | `objdump -d -S` output for one function (AT&T syntax, source lines interleaved) |
| `explanation` | `string` | Structured description: commit type, function name, author, date, commit message, and role |
| `commit_message` | `string` | First 512 characters of the commit message |
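
The schema can also be written down as a typed record, which is handy for sanity-checking rows before training. The following is a minimal sketch, not part of the dataset tooling: the names `AsmPairRow`, `EXPECTED_COLUMNS`, and `check_row` are hypothetical, and the checks only encode what the table above states (nine string columns, two allowed `commit_type` values, a 512-character cap on `commit_message`).

```python
from typing import TypedDict


class AsmPairRow(TypedDict):
    """One dataset row, mirroring the schema table (all columns are strings)."""
    bfc_hash: str
    bic_hash: str
    commit_hash: str
    commit_type: str  # "bfc" or "bic"
    filename: str
    func_name: str
    asm: str
    explanation: str
    commit_message: str


# Column names in the order they appear in the schema table.
EXPECTED_COLUMNS = tuple(AsmPairRow.__annotations__)


def check_row(row: dict) -> list[str]:
    """Return a list of problems found in one row; an empty list means none."""
    problems = []
    for col in EXPECTED_COLUMNS:
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], str):
            problems.append(f"{col} is not a string")
    if row.get("commit_type") not in ("bfc", "bic"):
        problems.append("commit_type must be 'bfc' or 'bic'")
    message = row.get("commit_message")
    if isinstance(message, str) and len(message) > 512:
        problems.append("commit_message longer than 512 characters")
    return problems
```

Calling `check_row(ds[0])` on a loaded row should return an empty list if the row matches the documented schema.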
---
## Usage
```python
from datasets import load_dataset

# Load the train split, then print the first row's disassembly and explanation.
ds = load_dataset("theelderemo/linux-asm-pairs", split="train")
print(ds[0]["asm"])
print(ds[0]["explanation"])
```
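
For the stated use case (generating explanations from assembly), each row can be turned into a prompt/target pair. This is only an illustrative sketch: the function name `to_training_pair`, the prompt wording, and the `max_asm_chars` cutoff are assumptions, not something the dataset prescribes.

```python
def to_training_pair(row: dict, max_asm_chars: int = 8000) -> dict:
    """Build a prompt/target pair from one row (prompt template is hypothetical)."""
    asm = row["asm"]
    # Very long functions are cut so the prompt stays within a rough size budget.
    if len(asm) > max_asm_chars:
        asm = asm[:max_asm_chars] + "\n[truncated]"
    prompt = (
        "Explain the following disassembled Linux kernel function "
        f"`{row['func_name']}` from `{row['filename']}`.\n\n{asm}\n"
    )
    return {"prompt": prompt, "target": row["explanation"]}
```

With the `datasets` library, `ds.map(to_training_pair)` would add `prompt` and `target` columns alongside the existing ones.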
Filter to only bug-fix commits:
```python
bfc_only = ds.filter(lambda x: x["commit_type"] == "bfc")
```
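
Because every row carries both `bfc_hash` and `bic_hash`, the bug-introducing and bug-fixing versions of a function can be lined up for comparison. The sketch below assumes that some functions appear once per commit type under the same pair of hashes, file, and name; the card does not guarantee this, so unmatched rows are simply dropped. `pair_functions` is a hypothetical helper name.

```python
from collections import defaultdict


def pair_functions(rows):
    """Map (bfc_hash, bic_hash, filename, func_name) -> (bic_row, bfc_row).

    Only keys that have both a "bic" and a "bfc" row are returned. If several
    rows share a key and commit type, the last one seen wins.
    """
    groups = defaultdict(dict)
    for row in rows:
        key = (row["bfc_hash"], row["bic_hash"], row["filename"], row["func_name"])
        groups[key][row["commit_type"]] = row
    return {
        key: (sides["bic"], sides["bfc"])
        for key, sides in groups.items()
        if "bic" in sides and "bfc" in sides
    }
```

Iterating over `pair_functions(ds)` then yields the assembly before and after the fix as `bic_row["asm"]` and `bfc_row["asm"]`.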
---
## Data Pipeline
1. BFC/BIC pairs sourced from `bfc_bic.csv` in [Zenodo record 10654193](https://zenodo.org/records/10654193)
2. Full diffs fetched from a local bare clone of the Linux kernel git repo (snapshot: 2023-11-12)
3. Changed `.c` / `.h` files compiled with `gcc -O2 -g -fno-inline -fno-omit-frame-pointer`
4. Functions extracted via `objdump -d -S --no-show-raw-insn`
5. Explanations constructed from Perceval-parsed commit metadata
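
Steps 3 and 4 can be sketched as follows. This is not the script used to build the dataset, which is not published in this card. The `gcc` and `objdump` flags are the ones listed above; the `-c`/`-o` arguments, the helper names, and the header regex are assumptions. Real kernel files also need include paths and configuration that this sketch leaves out.

```python
import re

GCC_FLAGS = ["-O2", "-g", "-fno-inline", "-fno-omit-frame-pointer"]
OBJDUMP_FLAGS = ["-d", "-S", "--no-show-raw-insn"]


def build_commands(src: str, obj: str) -> tuple[list[str], list[str]]:
    """Return (compile, disassemble) argv lists, e.g. for subprocess.run."""
    compile_cmd = ["gcc", *GCC_FLAGS, "-c", src, "-o", obj]
    disasm_cmd = ["objdump", *OBJDUMP_FLAGS, obj]
    return compile_cmd, disasm_cmd


# objdump starts each function with a line like "0000000000000000 <name>:".
_FUNC_HEADER = re.compile(r"^[0-9a-f]+ <([^>]+)>:$")


def split_functions(objdump_text: str) -> dict[str, str]:
    """Split objdump output into {function name: text of that function}."""
    functions: dict[str, str] = {}
    name, lines = None, []

    def flush():
        if name is not None:
            functions[name] = "\n".join(lines).rstrip()

    for line in objdump_text.splitlines():
        match = _FUNC_HEADER.match(line)
        if match:
            flush()
            name, lines = match.group(1), [line]
        elif line.startswith("Disassembly of section"):
            # A section banner ends the current function.
            flush()
            name, lines = None, []
        elif name is not None:
            lines.append(line)
    flush()
    return functions
```

Each value of the returned dictionary corresponds to what the `asm` column holds for one function, under the assumptions above.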
---
## License
Source assembly is derived from the Linux kernel, which is licensed under **GPL-2.0**. This dataset inherits that license. See [kernel.org/doc/html/latest/process/license-rules.html](https://www.kernel.org/doc/html/latest/process/license-rules.html) for details.
---
## Citation
If you use this dataset, please also cite the upstream source:
```bibtex
@dataset{linux_commits_2023,
title = {Linux Commits Dataset},
year = {2023},
url = {https://zenodo.org/records/10654193}
}
@dataset{christopher_dickinson_2026,
author = { Christopher Dickinson },
title = { linux-asm-pairs (Revision 9ea25c3) },
year = 2026,
url = { https://huggingface.co/datasets/theelderemo/linux-asm-pairs },
doi = { 10.57967/hf/8453 },
publisher = { Hugging Face }
}
```
Provider: theelderemo



