categorized_triton_data_permissive

Name: categorized_triton_data_permissive
Creator: maas
Published: 2026-01-02 16:40:22
License: No description provided

ModelScope community · Updated 2026-01-02 · Indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/GPUMODE/categorized_triton_data_permissive


Resource overview:

## Dataset Description

This dataset contains code snippets from Triton-based projects across GitHub, filtered to include only repositories with permissive licenses (MIT, Apache, BSD, etc.).

Each entry in the dataset includes:

- Triton code snippet
- Repository information
- File path
- Commit hash
- Direct GitHub URL to the source code
- License information
- Categorization of the code functionality

## Dataset Creation

The dataset was created by:

1. Collecting Triton code snippets from public GitHub repositories
2. Categorizing the code snippets based on functionality (using Claude)
3. Filtering to keep only snippets from repositories with permissive licenses, using a custom `should_keep_license` function

## License Information

This dataset is released under the MIT License. However, each code snippet in the dataset comes from a repository with its own specific license (all permissive). The license type for each snippet is included in the dataset.

Permissive licenses included in this dataset:

- MIT
- BSD
- APACHE
- CC0

## Format and Usage

The dataset is provided in two formats:

- JSON format (`permissive_triton_dataset.json`)
- Parquet format (`permissive_triton_dataset.parquet`)

### Sample Data Structure

```json
{
  "uuid": "...",
  "file_name": "example_triton_file.py",
  "repo_name": "username/repo",
  "file_path": "path/to/file.py",
  "commit_hash": "abcdef123456",
  "starcount": 42,
  "input": "@triton.jit\ndef example_kernel(...):\n ...",
  "category": {
    "Functionality": ["Category1", "Category2"]
  },
  "licenses": ["MIT"],
  "github_url": "https://github.com/username/repo/blob/abcdef123456/path/to/file.py"
}
```

### Field Descriptions

| Field | Description |
|-------|-------------|
| `uuid` | Unique identifier for the entry in the dataset |
| `file_name` | Name of the source code file |
| `repo_name` | GitHub repository name in the format "username/repo" |
| `file_path` | Path to the file within the repository |
| `commit_hash` | Git commit hash for the specific version of the file |
| `starcount` | Number of stars the repository had at the time of data collection |
| `input` | The actual Triton code snippet |
| `category` | Categorization of the code functionality (labeled using Claude) |
| `licenses` | List of permissive license types applicable to this code |
| `github_url` | Direct URL to view the file on GitHub at the specific commit |

#### Category Types

Categories fall into five domains: Functionality, Data Type, Performance Objective, Parallelization Strategy, and Memory Access Pattern. Each entry optionally carries labels in any of these domains to describe the snippet; the labels were assigned using Claude.

### Loading the Dataset

```python
# Using JSON
import json

with open('permissive_triton_dataset.json', 'r') as f:
    dataset = json.load(f)

# Using Parquet (requires pandas with a Parquet engine such as pyarrow)
import pandas as pd

df = pd.read_parquet('permissive_triton_dataset.parquet')
```
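Once loaded, entries can be filtered by license and category. The sketch below works on a list of dicts shaped like the sample record above, as `json.load` might return it (the card does not state the top-level JSON structure, so the list form is an assumption). The card names a `should_keep_license` function but does not publish its body, so the version here is a hypothetical stand-in that matches license names against the four families listed above. `has_label` and the two records are likewise made up for illustration.

```python
# Permissive license families named in the dataset card.
PERMISSIVE_FAMILIES = ("MIT", "BSD", "APACHE", "CC0")


def should_keep_license(licenses):
    """Hypothetical stand-in for the card's filter (its real body is not published).

    Keeps an entry only if it lists at least one license and every listed
    license starts with one of the permissive family names.
    """
    if not licenses:
        return False
    return all(
        str(lic).upper().startswith(PERMISSIVE_FAMILIES) for lic in licenses
    )


def has_label(entry, domain, label):
    """Return True if `label` appears under `domain` in the entry's category dict."""
    category = entry.get("category") or {}
    return label in (category.get(domain) or [])


# Two made-up records following the sample data structure.
dataset = [
    {
        "uuid": "a",
        "repo_name": "username/repo",
        "licenses": ["MIT"],
        "category": {"Functionality": ["Category1", "Category2"]},
    },
    {
        "uuid": "b",
        "repo_name": "username/other",
        "licenses": ["GPL-3.0"],
        "category": {"Functionality": ["Category2"]},
    },
]

permissive = [e for e in dataset if should_keep_license(e["licenses"])]
selected = [e for e in permissive if has_label(e, "Functionality", "Category1")]
print([e["uuid"] for e in selected])  # → ['a']
```

Since the published dataset is already filtered, the license check should be a no-op on the real files; it is shown mainly to illustrate how the `licenses` field can be used to narrow the data further, for example to MIT-only entries.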


Provider:

maas

Created:

2025-07-10
