
SeifElden2342532/Code-Optimization

Source: Hugging Face. Updated: 2026-03-31. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SeifElden2342532/Code-Optimization
Resource description:
## Dataset Description

This dataset contains 3,000 pairs of Python code snippets designed to demonstrate common performance bottlenecks and their optimized counterparts. Each entry includes complexity analysis (Time and Space) and a technical explanation for the optimization.

- **Total Samples:** 3,000
- **Language:** Python 3.x
- **Focus:** Performance engineering, Refactoring, and Algorithmic Efficiency.

### Dataset Summary

The dataset is structured to help train or fine-tune models on code refactoring and performance-aware programming. It covers a range of categories from simple built-in function replacements to advanced memory management techniques.

## Dataset Structure

### Data Instances

A typical instance includes the original "slow" code and the refactored "fast" code:

| Field | Type | Description |
| :--- | :--- | :--- |
| `id` | int64 | Unique identifier for the pair. |
| `category` | string | The high-level domain (e.g., `lists`, `heap`, `concurrency`). |
| `subcategory` | string | The specific optimization pattern (e.g., `nlargest_variant`). |
| `original_code` | string | Python code with suboptimal performance. |
| `optimized_code` | string | Python code after applying optimizations. |
| `original_time_complexity` | string | Big O notation for the original snippet. |
| `optimized_time_complexity` | string | Big O notation for the optimized snippet. |
| `optimization_explanation` | string | Technical justification for the improvement. |

### Data Fields

* **`category`**: 18 distinct categories including `lists`, `strings`, `searching`, `dicts`, `math`, and `io`.
* **`complexity`**: Detailed mapping of performance gains, often showing transitions like $O(n^2) \rightarrow O(n)$.

## Dataset Creation

### Diversity and Limitations

* **Template-Based Generation:** The dataset exhibits high semantic redundancy. While every snippet is unique, many entries are variations of the same logic with different constants (e.g., 400 entries for `nlargest_variant`).
* **Category Imbalance:**
  * **Heavily Represented:** `lists` (22.6%), `heap` (13.7%), `strings` (12.1%).
  * **Underrepresented:** `concurrency` (0.3%), `graphs` (0.2%), `dp` (0.2%).
* **Complexity:** Most optimizations focus on leveraging Python’s C-level built-ins (like `map`, `zip`, `itertools`) rather than high-level architectural changes.

## Use Cases

1. **Code Refactoring Models:** Training LLMs to recognize and fix suboptimal Python patterns.
2. **Performance Benchmarking:** Testing the ability of static analysis tools to suggest performance improvements.
3. **Educational Tools:** Building "Linting" style assistants that explain *why* certain code is slow.
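
To make the card more concrete, here is a minimal sketch of what an `original_code` / `optimized_code` pair in the `nlargest_variant` subcategory might look like, plus a small helper for checking category imbalance. The function names (`top_k_original`, `top_k_optimized`, `category_shares`) and the sample rows are hypothetical illustrations, not content copied from the dataset.

```python
import heapq
from collections import Counter


def top_k_original(values, k):
    # Hypothetical "slow" snippet: sorts the whole list, O(n log n),
    # even though only the k largest items are needed.
    return sorted(values, reverse=True)[:k]


def top_k_optimized(values, k):
    # Hypothetical "fast" snippet: heapq.nlargest keeps a heap of size k,
    # giving O(n log k) time and O(k) extra space.
    return heapq.nlargest(k, values)


def category_shares(rows):
    # Fraction of rows per `category`, useful for spotting the kind of
    # imbalance the card reports. `rows` is any iterable of dicts that
    # carry a "category" key, as in the schema table.
    counts = Counter(row["category"] for row in rows)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Made-up rows for illustration only; real rows carry all schema fields.
sample_rows = [
    {"category": "lists"},
    {"category": "lists"},
    {"category": "heap"},
    {"category": "dp"},
]

print(top_k_original([5, 1, 9, 3, 7], 2))
print(top_k_optimized([5, 1, 9, 3, 7], 2))
print(category_shares(sample_rows))
```

Because many entries are template variations of the same logic, computing shares per `subcategory` in the same way may help when deduplicating or re-weighting samples before fine-tuning.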
Provider:
SeifElden2342532