
KernelBook

ModelScope community (魔搭社区); updated 2025-12-05; indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/GPUMODE/KernelBook
Description:
## Overview

`dataset_permissive{.json/.parquet}` is a curated collection of paired PyTorch programs and equivalent Triton code (generated by TorchInductor) that can be used to train models to translate PyTorch code into Triton code.

The Triton code was generated with **PyTorch 2.5.0**, so we recommend using that version of PyTorch when evaluating or running the Triton code.

## Dataset Creation

The dataset was created through the following process:

1. **Repository Collection**: PyTorch repositories were collected from GitHub using repositories (and associated hashes) from the [Stack v1](https://huggingface.co/datasets/bigcode/the-stack).
2. **PyTorch Module Extraction**: We extracted the PyTorch code from the repositories and separated it into individual `torch.nn` modules with the appropriate dependencies.
3. **Creating Unit Tests**: We created unit tests for each module to ensure that the code worked as expected. Code for which we could not create unit tests was removed.
4. **Extracting Triton Code**: We used `torch.compile` to produce Triton code from the PyTorch code.
5. **Transforming Triton Code**: We transformed the Triton code into a form resembling the format used in [KernelBench](https://github.com/ScalingIntelligence/KernelBench).
6. **Metadata Enrichment**: Each repository entry was enriched with metadata such as license information, star count, and commit SHA.
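Because the Overview ties the generated Triton code to PyTorch 2.5.0, it can help to check the installed version before running any kernels. The following is a minimal sketch, not part of the dataset tooling: the helper names `parse_release` and `installed_torch_matches` are hypothetical, and the check compares only the numeric release part of the version string (so a local build tag such as `+cu121` is ignored).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Version named in the Overview; everything else here is illustrative.
EXPECTED_RELEASE = (2, 5, 0)


def parse_release(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string such as '2.5.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.5') is treated as 0.
    return tuple(int(group) if group else 0 for group in match.groups())


def installed_torch_matches(expected: Tuple[int, int, int] = EXPECTED_RELEASE) -> Optional[bool]:
    """Return True/False if torch is installed, or None if it is not installed."""
    try:
        found = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
    return parse_release(found) == expected


if __name__ == "__main__":
    status = installed_torch_matches()
    if status is None:
        print("torch is not installed")
    elif status:
        print("torch matches the recommended 2.5.0 release")
    else:
        print("torch differs from the recommended 2.5.0 release")
```

Reading the version through `importlib.metadata` avoids importing `torch` itself, so the check stays cheap and also works in environments where PyTorch is absent.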
The scripts used to build the dataset can be found [here](https://github.com/pytorch-labs/popcorn-kernels/tree/main/github_pytorch_index).

## Data Structure

Each entry in the dataset contains the following fields:

| Field | Description |
|-------|-------------|
| `repo_name` | The name of the repository in the format `username/repository` |
| `licenses` | List of licenses associated with the repository |
| `stars` | Number of GitHub stars the repository has |
| `sha` | The commit SHA hash used for version reference |
| `repo_link` | Direct link to the repository at the specific commit (GitHub URL) |
| *Additional fields* | The dataset may contain other repository-specific information |

## File Formats

The dataset is available in two formats:

1. **JSON**: `dataset_permissive.json` - A human-readable format that can be easily parsed by most programming languages.
2. **Parquet**: `dataset_permissive.parquet` - A columnar storage format optimized for analytics and big data processing.

## Usage Examples

### Loading the Dataset in Python

#### Using JSON:

```python
import json

# Load the JSON version
with open('dataset_permissive.json', 'r') as f:
    repos = json.load(f)

# Example: Print the first 5 repository names
for repo in repos[:5]:
    print(repo['repo_name'])
```

#### Using Parquet:

```python
import pandas as pd

# Load the Parquet version
df = pd.read_parquet('dataset_permissive.parquet')

# Example: Get repositories with more than 1000 stars
popular_repos = df[df['stars'] > 1000]
print(f"Number of popular repositories: {len(popular_repos)}")
```

## License Information

`dataset_permissive` contains only repositories with permissive licenses, including but not limited to:

- MIT License
- Apache License 2.0
- BSD Licenses (various)
- Mozilla Public License
- Unlicense
- zlib License

The dataset itself is provided for research and development purposes. Users should still verify the license of individual repositories before using their code in production or commercial settings.
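Since users are asked to verify licenses themselves, a small filter over the `licenses` field can serve as a first pass. The sketch below is illustrative only: the license identifier strings in `ASSUMED_PERMISSIVE_IDS` are an assumption (the dataset card lists license names, not the exact strings stored in the field), the sample records are made up, and `is_permissive` is a hypothetical helper.

```python
# Assumed identifier strings; inspect the real `licenses` values before relying on these.
ASSUMED_PERMISSIVE_IDS = {
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "MPL-2.0",
    "Unlicense",
    "Zlib",
}


def is_permissive(entry, allowed=ASSUMED_PERMISSIVE_IDS):
    """Return True only if the entry lists at least one license and all are allowed."""
    raw = entry.get("licenses")
    licenses = [] if raw is None else list(raw)
    return len(licenses) > 0 and all(lic in allowed for lic in licenses)


# Made-up records that use the field names from the Data Structure table.
sample_entries = [
    {"repo_name": "example-user/repo-a", "licenses": ["MIT"], "stars": 12},
    {"repo_name": "example-user/repo-b", "licenses": ["GPL-3.0"], "stars": 40},
    {"repo_name": "example-user/repo-c", "licenses": ["MIT", "Apache-2.0"], "stars": 3},
    {"repo_name": "example-user/repo-d", "licenses": [], "stars": 0},
]

kept = [entry["repo_name"] for entry in sample_entries if is_permissive(entry)]
print(kept)
```

Entries with an empty or missing license list are rejected rather than accepted, which is the conservative choice when the goal is to flag anything that needs manual review.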
## Citation

```
@software{kernelbook2025,
  title={KernelBook},
  author={Paliskara, Sahan and Saroufim, Mark},
  year={2025},
  month={5},
  url={https://huggingface.co/datasets/GPUMODE/KernelBook},
}
```

Provider:
maas
Created:
2025-07-10
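Dataset introduction
Background overview
KernelBook is a dataset of PyTorch programs paired with equivalent Triton code, intended for training code-translation models. It is available in JSON and Parquet formats and contains only repositories under permissive licenses, which makes it suitable for research and development use.
The summary above was collected and generated by 遇见数据集.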