
categorized_triton_data_permissive

ModelScope Community · Updated 2026-01-02 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/GPUMODE/categorized_triton_data_permissive
Resource overview:
## Dataset Description

This dataset contains code snippets from Triton-based projects across GitHub, filtered to include only repositories with permissive licenses (MIT, Apache, BSD, etc.). Each entry in the dataset includes:

- Triton code snippet
- Repository information
- File path
- Commit hash
- Direct GitHub URL to the source code
- License information
- Categorization of the code functionality

## Dataset Creation

The dataset was created by:

1. Collecting Triton code snippets from public GitHub repositories
2. Categorizing the code snippets based on functionality (using Claude)
3. Filtering to keep only snippets from repositories with permissive licenses, using a custom `should_keep_license` function

## License Information

This dataset is released under the MIT License. However, each code snippet in the dataset comes from a repository with its own specific license (all permissive). The license type for each snippet is included in the dataset.

Permissive licenses included in this dataset:

- MIT
- BSD
- APACHE
- CC0

## Format and Usage

The dataset is provided in two formats:

- JSON format (`permissive_triton_dataset.json`)
- Parquet format (`permissive_triton_dataset.parquet`)

### Sample Data Structure

```json
{
  "uuid": "...",
  "file_name": "example_triton_file.py",
  "repo_name": "username/repo",
  "file_path": "path/to/file.py",
  "commit_hash": "abcdef123456",
  "starcount": 42,
  "input": "@triton.jit\ndef example_kernel(...):\n ...",
  "category": {
    "Functionality": ["Category1", "Category2"]
  },
  "licenses": ["MIT"],
  "github_url": "https://github.com/username/repo/blob/abcdef123456/path/to/file.py"
}
```

### Field Descriptions

| Field | Description |
|-------|-------------|
| `uuid` | Unique identifier for the entry in the dataset |
| `file_name` | Name of the source code file |
| `repo_name` | GitHub repository name in the format "username/repo" |
| `file_path` | Path to the file within the repository |
| `commit_hash` | Git commit hash for the specific version of the file |
| `starcount` | Number of stars the repository had at the time of data collection |
| `input` | The actual Triton code snippet |
| `category` | Categorization of the code functionality (labeled using Claude) |
| `licenses` | List of permissive license types applicable to this code |
| `github_url` | Direct URL to view the file on GitHub at the specific commit |

#### Category Types

We consider categories in the following domains: Functionality, Data Type, Performance Objective, Parallelization Strategy, and Memory Access Pattern. We optionally add labels in each of these domains per entry to describe the data (using Claude).

### Loading the Dataset

```python
# Using JSON
import json

with open('permissive_triton_dataset.json', 'r') as f:
    dataset = json.load(f)

# Using Parquet
import pandas as pd

df = pd.read_parquet('permissive_triton_dataset.parquet')
```
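Once the JSON file is loaded, entries can be narrowed down using the fields described in the table. The following is a minimal sketch, assuming the loaded data is a list of dictionaries shaped like the sample entry. The two records are hard-coded placeholders for illustration, and the helper names `filter_records` and `license_counts` are hypothetical, not part of the dataset or its tooling.

```python
from collections import Counter

# Placeholder records shaped like the sample entry in the dataset card.
# In practice these would come from json.load(...) on the JSON file.
records = [
    {
        "uuid": "a",
        "repo_name": "username/repo",
        "starcount": 42,
        "category": {"Functionality": ["Category1", "Category2"]},
        "licenses": ["MIT"],
    },
    {
        "uuid": "b",
        "repo_name": "username/other_repo",
        "starcount": 7,
        "category": {},  # labels are optional, so a domain may be absent
        "licenses": ["BSD", "APACHE"],
    },
]


def filter_records(entries, license_name=None, domain=None, label=None, min_stars=0):
    """Return the entries that match every criterion that was supplied."""
    kept = []
    for entry in entries:
        if entry.get("starcount", 0) < min_stars:
            continue
        if license_name is not None and license_name not in entry.get("licenses", []):
            continue
        if domain is not None and label is not None:
            labels = (entry.get("category") or {}).get(domain, [])
            if label not in labels:
                continue
        kept.append(entry)
    return kept


def license_counts(entries):
    """Count how often each license name appears across all entries."""
    return Counter(name for entry in entries for name in entry.get("licenses", []))


mit_only = filter_records(records, license_name="MIT")
print([entry["uuid"] for entry in mit_only])
print(license_counts(records))
```

Because category labels are optional, the lookup treats a missing domain as an empty list rather than raising a `KeyError`.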

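The dataset card names a custom `should_keep_license` function for the license-filtering step but does not show its implementation. The sketch below is only a guess at what such a filter could look like, not the authors' code. It assumes each entry carries a list of license names, and that an entry is kept only when every listed license matches the four names the card enumerates. The name `should_keep_license_sketch` and the `PERMISSIVE_LICENSES` set are hypothetical.

```python
# Hypothetical allow-list built from the license names listed in the card.
PERMISSIVE_LICENSES = {"MIT", "BSD", "APACHE", "CC0"}


def should_keep_license_sketch(licenses):
    """Keep an entry only if it has at least one license and all are permissive.

    This is an illustrative assumption; the real filter may normalize or
    match license names differently.
    """
    if not licenses:
        return False
    normalized = {str(name).strip().upper() for name in licenses}
    return normalized <= PERMISSIVE_LICENSES


print(should_keep_license_sketch(["MIT"]))
print(should_keep_license_sketch(["MIT", "GPL"]))
```

Requiring all licenses to be permissive is the conservative choice for multi-licensed repositories. A filter that keeps an entry when any one license is permissive would retain more data at the cost of including code with mixed terms.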
Provider: maas
Created: 2025-07-10