
NickIBrody/x86-asm-instructions-23k

Source: Hugging Face. Updated: 2026-04-20. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/NickIBrody/x86-asm-instructions-23k
Resource description:
---
license: other
task_categories:
- text-generation
language:
- en
tags:
- code
- assembly
- x86
- x86-64
- instruction-tuning
- code-generation
size_categories:
- 10K<n<100K
pretty_name: x86 ASM Instructions - 23K
---

# x86 ASM Instructions - 23K

Instruction-tuning dataset of x86 and x86_64 assembly snippets paired with short natural-language instructions or comments.

## Dataset Summary

| Field | Value |
|---|---|
| Total rows | 23,104 |
| Split(s) | train |
| Format | JSONL (`instruction`, `output`, `system`) |
| Instruction average length | 47.0 characters |
| Instruction max length | 413 characters |
| Output average length | 514.25 characters |
| Output max length | 15,805 characters |
| System prompt values | 1 |

## Format

Each example contains:

```json
{
  "instruction": "Recursive factorial. cdecl calling convention.",
  "output": "factorial_rec_cdecl:\n enter 0, 0\n ...",
  "system": "As an x86/x86_64 assembly expert, you are capable of writing optimized low-level code from specifications."
}
```

## Recommended Use

- supervised fine-tuning for x86 or x86_64 assembly generation
- low-level code completion experiments
- instruction-to-assembly research with additional downstream filtering

## Important Limitations

- the dataset currently exposes only a `train` split
- instructions vary in quality and appear to include extracted comments, partial descriptions, and terse labels rather than consistently written prompts
- the dataset appears to mix multiple domains including general assembly routines, DOS-era utilities, boot-related code, and potentially malware-adjacent samples
- examples are standalone snippets and often rely on missing surrounding context such as memory layout, calling convention, macro definitions, or target platform
- the `system` field is constant across rows and may be better injected by the training template instead of consumed as sample data

## Safety Note

This dataset appears to include low-level code related to file infection, boot manipulation, destructive disk access, or other malware-adjacent behavior in at least some samples. Downstream users should review and filter the corpus before production use, safety-sensitive training, or redistribution.

## Provenance And Licensing

This release does not currently document per-example provenance fields such as repository, source file, commit, or source license. Because of that, this dataset should not be treated as a clean permissively licensed corpus without further legal and provenance review.

## Missing Metadata

The current release still lacks:

- repository / file / commit provenance
- per-example source license metadata
- domain labels such as `bootloader`, `dos`, `utility`, `malware`, or `reversing`
- validation and test splits
- contamination analysis against downstream benchmarks

## Suggested Next Steps

To make this dataset more useful, the next revision should add:

1. deterministic `validation` and `test` splits
2. provenance fields for each sample
3. filtering or explicit labeling for malware-adjacent content
4. instruction-quality filtering for short or fragmentary prompts
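
Two of the suggested steps, deterministic splits and filtering of short or fragmentary prompts, could be prototyped downstream along the following lines. This is a minimal sketch, not the dataset author's method: the function names, the 90/5/5 split ratio, and the length thresholds are all illustrative assumptions, and only the `instruction`, `output`, and `system` field names come from the dataset card. The constant `system` field is dropped here on the assumption that the training template will inject it.

```python
import hashlib
import json


def assign_split(instruction: str, output: str,
                 val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a sample to a split using a hash of its content.

    The same sample always lands in the same split, regardless of
    row order. The 5% / 5% defaults are assumptions, not card values.
    """
    key = f"{instruction}\x00{output}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def is_fragmentary(instruction: str,
                   min_chars: int = 15, min_words: int = 3) -> bool:
    """Flag very short prompts; both thresholds are hypothetical."""
    text = instruction.strip()
    return len(text) < min_chars or len(text.split()) < min_words


def prepare(rows):
    """Filter fragmentary prompts and bucket the rest into splits."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        if is_fragmentary(row["instruction"]):
            continue
        # Keep only the per-sample fields; "system" is constant.
        sample = {"instruction": row["instruction"], "output": row["output"]}
        splits[assign_split(**sample)].append(sample)
    return splits


def read_jsonl(path):
    """Load one JSON object per non-empty line from a local JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `prepare(read_jsonl(path))` on a local copy of the JSONL file would return a dictionary with `train`, `validation`, and `test` lists. A length filter of this kind only catches terse labels; it does not address the malware-adjacent content or provenance gaps noted above, which still need separate review.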
Provider:
NickIBrody