shivank21/icml_obf_split

Name: shivank21/icml_obf_split
Creator: shivank21
Published: 2026-04-25 11:25:22
License: No description available

Hugging Face | Updated: 2026-04-25 | Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/shivank21/icml_obf_split

Resource overview:

---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: task_id
    dtype: string
  - name: source
    dtype: string
  - name: difficulty
    dtype: string
  - name: title
    dtype: string
  - name: description
    dtype: string
  - name: tags
    list: string
  - name: rating
    dtype: int64
  - name: examples
    list:
    - name: cpu_sys_us
      dtype: int64
    - name: cpu_user_us
      dtype: int64
    - name: input
      dtype: string
    - name: measure_error
      dtype: string
    - name: op_count
      dtype: int64
    - name: output
      dtype: string
    - name: status
      dtype: string
    - name: tc_difficulty
      dtype: string
    - name: wall_ns
      dtype: int64
  - name: tests
    list:
    - name: cpu_sys_us
      dtype: int64
    - name: cpu_user_us
      dtype: int64
    - name: input
      dtype: string
    - name: measure_error
      dtype: string
    - name: op_count
      dtype: int64
    - name: output
      dtype: string
    - name: status
      dtype: string
    - name: tc_difficulty
      dtype: string
    - name: wall_ns
      dtype: int64
  - name: synthetic_tests
    list:
    - name: cpu_sys_us
      dtype: int64
    - name: cpu_user_us
      dtype: int64
    - name: input
      dtype: string
    - name: measure_error
      dtype: string
    - name: op_count
      dtype: int64
    - name: output
      dtype: string
    - name: status
      dtype: string
    - name: tc_difficulty
      dtype: string
    - name: wall_ns
      dtype: int64
  - name: method
    dtype: string
  - name: logic_type
    dtype: string
  - name: transform_status
    dtype: string
  - name: retries
    dtype: int64
  - name: pair_verified
    dtype: bool
  - name: conversion_quality
    dtype: string
  - name: paradigm_reason
    dtype: string
  - name: original_passed
    dtype: bool
  - name: original_num_passed
    dtype: int64
  - name: original_total
    dtype: int64
  - name: original_failures
    list:
    - name: actual
      dtype: string
    - name: case_index
      dtype: int64
    - name: error
      dtype: string
    - name: expected
      dtype: string
    - name: status
      dtype: string
  - name: converted_passed
    dtype: bool
  - name: converted_num_passed
    dtype: int64
  - name: converted_total
    dtype: int64
  - name: converted_failures
    list:
    - name: actual
      dtype: string
    - name: case_index
      dtype: int64
    - name: error
      dtype: string
    - name: expected
      dtype: string
    - name: status
      dtype: string
  - name: iterative_solution
    dtype: string
  - name: recursive_solution
    dtype: string
  - name: iterative_solution_obfuscated
    dtype: string
  - name: recursive_solution_obfuscated
    dtype: string
  - name: rename_map
    dtype: string
  - name: iterative_solution_fullobf
    dtype: string
  - name: recursive_solution_fullobf
    dtype: string
  - name: fullobf_token_map
    dtype: string
  - name: fullobf_status
    dtype: string
  - name: fullobf_iter_passed
    dtype: bool
  - name: fullobf_rec_passed
    dtype: bool
  - name: fullobf_iter_num_passed
    dtype: int64
  - name: fullobf_rec_num_passed
    dtype: int64
  - name: fullobf_iter_total
    dtype: int64
  - name: fullobf_rec_total
    dtype: int64
  splits:
  - name: train
    num_bytes: 28531889
    num_examples: 2000
  - name: test
    num_bytes: 14265944
    num_examples: 1000
  download_size: 16139283
  dataset_size: 42797833
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
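As a hedged illustration of how the nested schema above might be read, the sketch below computes pass rates from the `*_num_passed` / `*_total` field pairs and tallies the `status` values in a test-case list. The record literal and the status strings (`"ok"`, `"timeout"`) are invented placeholders, and the meaning of each field is inferred from its name rather than confirmed by the dataset author.

```python
from collections import Counter

# Prefixes for which the schema lists both `<prefix>_num_passed` and `<prefix>_total`.
PASS_PREFIXES = ("original", "converted", "fullobf_iter", "fullobf_rec")


def pass_rate(record, prefix):
    """Return passed/total for one solution variant, or None if no tests were run."""
    total = record[f"{prefix}_total"]
    if not total:
        return None
    return record[f"{prefix}_num_passed"] / total


def status_counts(cases):
    """Count the `status` values in a list of test-case dicts."""
    return Counter(case["status"] for case in cases)


# Hypothetical record: the values are made up purely to exercise the helpers.
sample_record = {
    "original_num_passed": 4, "original_total": 4,
    "converted_num_passed": 3, "converted_total": 4,
    "fullobf_iter_num_passed": 4, "fullobf_iter_total": 4,
    "fullobf_rec_num_passed": 2, "fullobf_rec_total": 4,
    "tests": [
        {"input": "1\n", "output": "1\n", "status": "ok"},
        {"input": "2\n", "output": "2\n", "status": "ok"},
        {"input": "3\n", "output": "6\n", "status": "timeout"},
    ],
}

rates = {prefix: pass_rate(sample_record, prefix) for prefix in PASS_PREFIXES}
counts = status_counts(sample_record["tests"])
print(rates)
print(counts)
```

The same helpers should apply to `examples` and `synthetic_tests`, since the schema gives all three lists identical sub-fields.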

Provider:

shivank21

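Collected summary

Dataset introduction

Construction

This dataset, icml_obf_split, is built for research on code obfuscation and program understanding. Its data comes from online judge systems and covers 3,000 programming problems, split into 2,000 training examples and 1,000 test examples. Each record contains the original problem information, several groups of test cases, and matching iterative and recursive solutions. A systematic obfuscation pipeline then produces two obfuscated versions: one that only renames variables and functions (partial obfuscation), and one that additionally transforms control flow and data structures on top of the renaming (full obfuscation). For every problem, the build records the transformation status, pass counts, and failure details, so that the reference and obfuscated solutions can be checked for equivalence on the test cases. The result is data in which code complexity varies in a controlled way.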
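The renaming-only (partial) obfuscation described above can be sketched with Python's standard `ast` module. This is a toy illustration, not the dataset's actual pipeline: it assumes the solutions are Python, and the `Renamer` class, the `obfuscate` function, and the `v0`, `v1`, ... naming scheme are all hypothetical. It also ignores imports, keyword arguments, and `global`/`nonlocal` statements, which a real tool would have to handle.

```python
import ast
import builtins


class Renamer(ast.NodeTransformer):
    """Rename user-defined identifiers to v0, v1, ... while leaving builtins alone."""

    def __init__(self):
        self.mapping = {}  # original name -> obfuscated name

    def _new_name(self, old):
        if hasattr(builtins, old):  # keep range, print, len, ...
            return old
        if old not in self.mapping:
            self.mapping[old] = f"v{len(self.mapping)}"
        return self.mapping[old]

    def visit_FunctionDef(self, node):
        node.name = self._new_name(node.name)
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._new_name(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._new_name(node.id)
        return node


def obfuscate(source):
    """Return (obfuscated_source, rename_map) for a self-contained snippet."""
    renamer = Renamer()
    tree = renamer.visit(ast.parse(source))
    return ast.unparse(tree), renamer.mapping  # ast.unparse needs Python 3.9+


SOURCE = """
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

obfuscated, mapping = obfuscate(SOURCE)
print(obfuscated)
print(mapping)
```

Returning the mapping alongside the code mirrors the idea of storing a `rename_map` next to each obfuscated solution, although the format the dataset uses for that field is not documented here.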

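Usage

To use the dataset, load it with the Hugging Face Datasets library, for example: `from datasets import load_dataset; dataset = load_dataset('shivank21/icml_obf_split')`. After loading, each sample exposes the problem description, test cases, solution variants, and obfuscation metadata by key. For code generation, `description` and `iterative_solution_obfuscated` can be paired as input and output for training. For code repair research, the error information in `original_failures` can be combined with `iterative_solution` to build repair targets. The `fullobf_token_map` field can support reverse-engineering experiments, and `synthetic_tests` supports adversarial robustness evaluation. The training split is recommended for fine-tuning and the test split for performance validation.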
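A minimal sketch of the loading and pairing steps described above follows. The repository ID comes from the listing itself; the helper names (`load_splits`, `to_generation_pair`, `to_repair_example`), the output keys, and the demo record are hypothetical. The download is wrapped in a function so that the pairing helpers can be tried offline.

```python
def load_splits(repo_id="shivank21/icml_obf_split"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # imported lazily so the helpers below work offline
    return load_dataset(repo_id)


def to_generation_pair(record):
    """Pair the problem description with the renamed iterative solution."""
    return {
        "prompt": record["description"],
        "completion": record["iterative_solution_obfuscated"],
    }


def to_repair_example(record):
    """Collect recorded failure messages and use the iterative solution as the target."""
    return {
        "errors": [failure["error"] for failure in record["original_failures"]],
        "target": record["iterative_solution"],
    }


# Hypothetical record with only the fields these helpers touch.
demo_record = {
    "description": "Print the sum of the integers from 1 to n.",
    "iterative_solution": "n = int(input())\nprint(sum(range(1, n + 1)))\n",
    "iterative_solution_obfuscated": "a = int(input())\nprint(sum(range(1, a + 1)))\n",
    "original_failures": [
        {"case_index": 2, "error": "TimeoutError", "expected": "10", "actual": "", "status": "timeout"},
    ],
}

pair = to_generation_pair(demo_record)
repair = to_repair_example(demo_record)
print(pair["prompt"])
print(repair["errors"])
```

With network access, `splits = load_splits()` followed by `[to_generation_pair(r) for r in splits["train"]]` would apply the same mapping to the 2,000 training records.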

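Background and challenges

Background overview

The icml_obf_split dataset grew out of research on code intelligence and programming language understanding. As deep learning has spread to tasks such as code generation and program repair, how well models understand code logic has become a central bottleneck. The dataset was built to study how code obfuscation affects reasoning about algorithmic logic. Its core research question is whether a model can still correctly understand iterative and recursive logic after the code has undergone complex obfuscating transformations. By systematically applying operations such as variable renaming and control-flow reordering, it produces paired original and obfuscated code samples for evaluating and improving model robustness at the code level. It is intended to support benchmarking and model evaluation in code intelligence and to encourage more robust code understanding models.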

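Current challenges

The dataset faces challenges on two fronts. At the level of the research problem, it targets the fragility of current code intelligence models when reasoning about code logic. Existing models usually perform well on standard code, but their reasoning accuracy often drops sharply under obfuscation techniques such as identifier substitution and control-flow flattening, which exposes a weak grasp of semantic invariance. The construction process is just as demanding: generating obfuscated samples that are diverse and realistic without changing program semantics requires carefully designed transformation strategies and verification mechanisms. Ensuring that the original and obfuscated code are strictly equivalent in logic, and building test cases thorough enough to validate each transformation, are further core difficulties. Together these challenges define the dataset's role in research on robust code understanding.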
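One way to approximate the equivalence check discussed above is to run both versions of a program on the same inputs and compare their output with the expected output. The sketch below assumes the solutions are Python programs that read stdin and write stdout, which this listing does not confirm; `run_program`, `equivalence_failures`, and the two sample programs are hypothetical. Passing a finite test suite is evidence of equivalence, not a proof.

```python
import subprocess
import sys


def run_program(source, stdin_text, timeout=5.0):
    """Run a Python source string in a subprocess and return its stripped stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.stdout.strip()


def equivalence_failures(source_a, source_b, tests):
    """Return the test cases on which either program deviates from the expected output."""
    failures = []
    for index, case in enumerate(tests):
        expected = case["output"].strip()
        actual_a = run_program(source_a, case["input"])
        actual_b = run_program(source_b, case["input"])
        if actual_a != expected or actual_b != expected:
            failures.append({
                "case_index": index,
                "expected": expected,
                "actual": [actual_a, actual_b],
            })
    return failures


# Hypothetical original and renamed programs that both sum 1..n.
ORIGINAL = "n = int(input())\ntotal = 0\nfor i in range(1, n + 1):\n    total += i\nprint(total)\n"
RENAMED = "a = int(input())\nb = 0\nfor c in range(1, a + 1):\n    b += c\nprint(b)\n"
TESTS = [
    {"input": "4\n", "output": "10\n"},
    {"input": "1\n", "output": "1\n"},
]

failures = equivalence_failures(ORIGINAL, RENAMED, TESTS)
print(failures)
```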

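Common scenarios

Classic use cases

In program analysis and code intelligence, the icml_obf_split dataset offers a standardized benchmark for evaluating and improving code obfuscation and deobfuscation techniques. It contains a large set of programming problems with matching iterative and recursive solutions, along with obfuscated versions of varying strength produced by transforming identifiers and structure. Researchers can use it to train and test how well models understand obfuscated code, for example by judging whether two programs are logically equivalent, recovering the original semantics, or measuring robustness to obfuscation. The dataset is split into training and test sets and includes detailed metadata such as execution status and pass counts, which makes side-by-side model comparison and reproducible research easier.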
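The robustness evaluation mentioned above can be summarized by comparing accuracy on original code with accuracy on its obfuscated counterpart. The sketch below is one simple way to do that; the function names and the boolean outcome lists are hypothetical, and the numbers are made up for illustration rather than measured on this dataset.

```python
def accuracy(outcomes):
    """Fraction of True values in a non-empty list of per-problem outcomes."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def robustness_gap(original_outcomes, obfuscated_outcomes):
    """Accuracy on original code minus accuracy on obfuscated code for the same problems."""
    if len(original_outcomes) != len(obfuscated_outcomes):
        raise ValueError("both lists must cover the same problems")
    return accuracy(original_outcomes) - accuracy(obfuscated_outcomes)


# Made-up per-problem correctness flags for one hypothetical model.
on_original = [True, True, True, False]
on_obfuscated = [True, False, True, False]

gap = robustness_gap(on_original, on_obfuscated)
print(f"accuracy drop under obfuscation: {gap:.2f}")
```

A gap near zero would suggest the model's predictions do not depend on identifier names; a large positive gap would suggest they do.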

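Academic problems addressed

Researchers have long struggled to quantify how code obfuscation affects program understanding and automated analysis, and to develop models that withstand such interference. The icml_obf_split dataset addresses this gap by systematically providing original code, obfuscated code, and the corresponding execution results, so that researchers can rigorously examine how far an obfuscation strategy preserves or breaks code logic. It can support work in code representation learning, program repair, and code clone detection, and its fullobf (full obfuscation) variant serves as a demanding test of a model's deeper semantic understanding, providing a data foundation for more robust code understanding systems.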

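Practical applications

In practice, the icml_obf_split dataset most directly serves software security. For example, security analysis tools can use models trained on it to help identify obfuscated malicious code, with the goal of recognizing the underlying logic even when an attacker hides intent by renaming variables or inserting dead branches. In code plagiarism detection, code copyright protection, and automated code review systems, the dataset can also help tools better discriminate obfuscated code, which supports a healthier and more reliable software ecosystem.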

Recent research on the dataset