categorized_triton_data_permissive
Source: ModelScope (魔搭社区). Updated 2026-01-02; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/GPUMODE/categorized_triton_data_permissive
Summary:
## Dataset Description
This dataset contains code snippets from Triton-based projects across GitHub, specifically filtered to include only repositories with permissive licenses (MIT, Apache, BSD, etc.). Each entry in the dataset includes:
- Triton code snippet
- Repository information
- File path
- Commit hash
- Direct GitHub URL to the source code
- License information
- Categorization of the code functionality
## Dataset Creation
The dataset was created by:
1. Collecting Triton code snippets from public GitHub repositories
2. Categorizing the code snippets based on functionality (using Claude)
3. Filtering to keep only snippets from repositories with permissive licenses using a custom `should_keep_license` function
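The card does not show the implementation of `should_keep_license`. As a rough sketch only, a filter of this kind might look like the following. It assumes that license labels are plain strings and that an entry is kept only when every listed license belongs to one of the permissive families named in this card. The `PERMISSIVE` set and the prefix-matching rule are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: the real should_keep_license is not published in this card.
PERMISSIVE = {"MIT", "BSD", "APACHE", "CC0"}  # license families named in this card


def should_keep_license(licenses):
    """Return True when the list is non-empty and every label is permissive.

    Assumption: labels are matched case-insensitively by prefix, so variants
    such as "BSD-3-Clause" or "Apache-2.0" match their family name.
    """
    if not licenses:
        return False
    return all(
        any(label.upper().startswith(family) for family in PERMISSIVE)
        for label in licenses
    )


# Made-up entries standing in for collected snippets.
entries = [
    {"repo_name": "username/repo", "licenses": ["MIT"]},
    {"repo_name": "username/other", "licenses": ["GPL-3.0"]},
]
kept = [e for e in entries if should_keep_license(e["licenses"])]
```

Requiring every license to be permissive is the conservative choice for repositories that list more than one license; whether the original function does this is not stated.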
## License Information
This dataset is released under the MIT License. However, each code snippet in the dataset comes from a repository with its own specific license (all permissive). The license type for each snippet is included in the dataset.
Permissive licenses included in this dataset:
- MIT
- BSD
- APACHE
- CC0
## Format and Usage
The dataset is provided in two formats:
- JSON format (`permissive_triton_dataset.json`)
- Parquet format (`permissive_triton_dataset.parquet`)
### Sample Data Structure
```json
{
  "uuid": "...",
  "file_name": "example_triton_file.py",
  "repo_name": "username/repo",
  "file_path": "path/to/file.py",
  "commit_hash": "abcdef123456",
  "starcount": 42,
  "input": "@triton.jit\ndef example_kernel(...):\n ...",
  "category": {
    "Functionality": ["Category1", "Category2"]
  },
  "licenses": ["MIT"],
  "github_url": "https://github.com/username/repo/blob/abcdef123456/path/to/file.py"
}
```
### Field Descriptions
| Field | Description |
|-------|-------------|
| `uuid` | Unique identifier for the entry in the dataset |
| `file_name` | Name of the source code file |
| `repo_name` | GitHub repository name in format "username/repo" |
| `file_path` | Path to the file within the repository |
| `commit_hash` | Git commit hash for the specific version of the file |
| `starcount` | Number of stars the repository had at the time of data collection |
| `input` | The actual Triton code snippet |
| `category` | Categorization of the code functionality (labeled using Claude) |
| `licenses` | List of permissive license types applicable to this code |
| `github_url` | Direct URL to view the file on GitHub at the specific commit |
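Judging from the sample entry, `github_url` appears to be composed from `repo_name`, `commit_hash`, and `file_path`. The sketch below rebuilds the URL from those three fields and compares it with the stored value, which can serve as a quick consistency check. The helper name `build_github_url` is hypothetical, and the URL pattern is inferred from the single sample rather than documented.

```python
def build_github_url(entry):
    """Rebuild the GitHub blob URL from repo name, commit hash, and file path.

    Assumption: every entry follows the pattern seen in the sample entry.
    """
    return (
        f"https://github.com/{entry['repo_name']}"
        f"/blob/{entry['commit_hash']}/{entry['file_path']}"
    )


# The placeholder values from the sample data structure in this card.
sample = {
    "repo_name": "username/repo",
    "file_path": "path/to/file.py",
    "commit_hash": "abcdef123456",
    "github_url": "https://github.com/username/repo/blob/abcdef123456/path/to/file.py",
}

url_matches = build_github_url(sample) == sample["github_url"]
```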
#### Category Types
We consider categories in the following domains: Functionality, Data Type, Performance Objective, Parallelization Strategy, and Memory Access Pattern.
Each entry may optionally carry labels in any of these domains to describe the data (labeled using Claude).
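Because labels are optional, code that reads `category` should tolerate missing domains. A minimal sketch of counting labels within one domain follows. It assumes the domain keys are spelled exactly as listed above; only `"Functionality"` is confirmed by the sample entry. The entries and label names here are made up.

```python
from collections import Counter

# Made-up entries; only the "Functionality" key appears in the card's sample.
entries = [
    {"category": {"Functionality": ["Category1", "Category2"]}},
    {"category": {"Functionality": ["Category1"], "Data Type": ["Category3"]}},
    {"category": {}},  # an entry with no labels at all
]


def count_labels(entries, domain):
    """Count how often each label occurs in one domain, skipping absent domains."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.get("category", {}).get(domain, []))
    return counts


functionality_counts = count_labels(entries, "Functionality")
```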
### Loading the Dataset
```python
# Using JSON
import json

with open('permissive_triton_dataset.json', 'r') as f:
    dataset = json.load(f)

# Using Parquet (requires a Parquet engine such as pyarrow)
import pandas as pd

df = pd.read_parquet('permissive_triton_dataset.parquet')
```
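Once loaded, entries can be filtered on the documented fields. The sketch below assumes that the JSON file's top level is a list of entry dictionaries, which the card does not state explicitly. It uses a small in-memory list in place of the real file so that it runs on its own, and the helper name `select_entries` is hypothetical.

```python
# Stand-in for the result of json.load(f); the values are made up.
dataset = [
    {"repo_name": "username/repo", "starcount": 42, "licenses": ["MIT"]},
    {"repo_name": "username/small", "starcount": 3, "licenses": ["MIT"]},
    {"repo_name": "username/other", "starcount": 120, "licenses": ["BSD"]},
]


def select_entries(dataset, license_name, min_stars=0):
    """Keep entries that list the given license and have at least min_stars stars."""
    return [
        entry
        for entry in dataset
        if license_name in entry["licenses"] and entry["starcount"] >= min_stars
    ]


popular_mit = select_entries(dataset, "MIT", min_stars=10)
```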
Provider:
maas
Created:
2025-07-10



