BytedTsinghua-SIA/CUDA-Agent-Ops-6K
Source: Hugging Face. Updated: 2026-02-27. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BytedTsinghua-SIA/CUDA-Agent-Ops-6K
Resource description:
---
license: cc-by-4.0
pretty_name: CUDA-Agent-Ops-6K
size_categories:
- 1K<n<10K
task_categories:
- text-generation
language:
- en
---
# CUDA-Agent-Ops-6K
CUDA-Agent-Ops-6K is a curated training dataset for CUDA kernel generation and optimization.
It is released as part of the CUDA-Agent project:
- Project Page: https://CUDA-Agent.github.io/
- GitHub Repo: https://github.com/BytedTsinghua-SIA/CUDA-Agent
## Dataset Summary
CUDA-Agent-Ops-6K contains **6,000 synthesized operator-level training tasks** designed for large-scale agentic RL training. It is intended to provide diverse, executable CUDA-oriented training tasks while reducing the risk of contamination with the KernelBench evaluation set.
## Why this dataset
High-quality CUDA training data is scarce. Manual expert annotation of high-performance kernels is expensive and hard to scale.
CUDA-Agent-Ops-6K addresses this bottleneck with a scalable synthesis-and-filtering pipeline that produces training tasks with controlled difficulty and execution-verified reliability.
## Data Construction Pipeline
The dataset is built in three stages:
1. Seed problem crawling
   - Mine reference operators from `torch` and `transformers`
   - Represent each task as runnable PyTorch operator logic
2. LLM-based combinatorial synthesis
   - Compose multiple operators into fused tasks (up to 5 sampled operators)
   - Increase task diversity and optimization complexity beyond single-op patterns
3. Execution-based filtering and decontamination
   - Keep tasks executable in both eager mode and `torch.compile`
   - Remove stochastic operators for reproducibility
   - Remove degenerate outputs (e.g., constant/indistinguishable outputs)
   - Keep runtime in a controlled range (1 ms to 100 ms in eager mode)
   - Remove tasks highly similar to KernelBench test cases
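The stage 3 criteria can be read as a set of independent rejection rules. The sketch below is only an illustration of that reading, not the project's actual filtering code: the `TaskReport` class, its field names, and the inclusive treatment of the 1 ms and 100 ms bounds are all assumptions, and the measurements themselves (running the task in eager mode and under `torch.compile`, timing it, comparing it with KernelBench) are assumed to have been collected elsewhere.

```python
from dataclasses import dataclass

# Runtime window stated in the dataset card (eager mode, milliseconds).
# Treating both bounds as inclusive is an assumption.
MIN_RUNTIME_MS = 1.0
MAX_RUNTIME_MS = 100.0


@dataclass
class TaskReport:
    """Hypothetical per-task measurements; every field name is illustrative."""
    runs_in_eager: bool
    runs_with_compile: bool
    uses_stochastic_op: bool
    output_is_degenerate: bool
    eager_runtime_ms: float
    too_similar_to_kernelbench: bool


def rejection_reasons(report: TaskReport) -> list[str]:
    """Return every stage 3 rule that the task violates (empty list = keep)."""
    reasons = []
    if not report.runs_in_eager:
        reasons.append("fails in eager mode")
    if not report.runs_with_compile:
        reasons.append("fails under torch.compile")
    if report.uses_stochastic_op:
        reasons.append("contains a stochastic operator")
    if report.output_is_degenerate:
        reasons.append("degenerate output")
    if not (MIN_RUNTIME_MS <= report.eager_runtime_ms <= MAX_RUNTIME_MS):
        reasons.append("eager runtime outside 1 ms to 100 ms")
    if report.too_similar_to_kernelbench:
        reasons.append("too similar to a KernelBench test case")
    return reasons


def keep_task(report: TaskReport) -> bool:
    """A task survives filtering only if no rule rejects it."""
    return not rejection_reasons(report)
```

Collecting all reasons rather than stopping at the first failure makes it easier to see which rule removes the most candidates, which may be useful when tuning a pipeline of this kind.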
## Data Format
The description below is based on a direct inspection of the current `data.parquet` file in the dataset repository.
The current release contains **6,000 rows** with **3 string columns**:
- `ops`: operator/module descriptor string.
  - For most `torch#N` rows, this is a JSON-like list string of operators. Example: `["nn.BatchNorm3d", "torch.diag", "torch.max", "nn.Parameter"]`
  - For `transformers` rows, this can be a single module identifier string (e.g., `MPNetLMHead_2`).
- `data_source`: source tag string.
  - Observed patterns: `torch#N` (where `N` matches the number of ops in `ops`) and `transformers`.
- `code`: runnable Python/PyTorch task code for the synthesized operator problem.
No null values were found in these three columns in the current file.
This means each training sample can be viewed as:
- an operator/module descriptor (`ops`)
- its provenance/source marker (`data_source`)
- the executable task implementation (`code`)
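Because `ops` is stored as a string in either of two shapes, a consumer usually has to normalize it before use. The following is a minimal sketch under stated assumptions: the helper names `parse_ops` and `source_matches_ops` are hypothetical, and the fallback to Python literal syntax is a guess at what "JSON-like" might mean for rows that are not strict JSON.

```python
import ast
import json
import re

# Matches the observed `torch#N` source tag and captures N.
TORCH_SOURCE = re.compile(r"^torch#(\d+)$")


def parse_ops(ops: str) -> list[str]:
    """Return the operator names encoded in an `ops` string."""
    text = ops.strip()
    if text.startswith("["):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # Assumption: non-JSON lists use Python literal syntax
            # (for example single-quoted strings).
            parsed = ast.literal_eval(text)
        return [str(item) for item in parsed]
    # Not a list: treat the whole string as one module identifier.
    return [text]


def source_matches_ops(data_source: str, ops: str) -> bool:
    """Check that N in `torch#N` equals the number of parsed operators."""
    match = TORCH_SOURCE.match(data_source)
    if match is None:
        # `transformers` rows carry no count to verify; other tags are unexpected.
        return data_source == "transformers"
    return int(match.group(1)) == len(parse_ops(ops))


example_ops = '["nn.BatchNorm3d", "torch.diag", "torch.max", "nn.Parameter"]'
print(parse_ops(example_ops))
print(source_matches_ops("torch#4", example_ops))
```

Assuming pandas and a parquet engine such as pyarrow are installed, the rows themselves could be obtained with `pandas.read_parquet("data.parquet")` and these helpers applied to the `ops` and `data_source` columns.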
## Citation
If you use this dataset, please cite both the CUDA-Agent project and the dataset release.
```bibtex
@misc{cuda_agent_2026,
title = {CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation},
author = {Dai, Weinan and Wu, Hanlin and Yu, Qiying and Gao, Huan-ang and Li, Jiahao and Jiang, Chengquan and Lou, Weiqiang and Song, Yufan and Yu, Hongli and Chen, Jiaze and Ma, Wei-Ying and Zhang, Ya-Qin and Liu, Jingjing and Wang, Mingxuan and Liu, Xin and Zhou, Hao},
year = {2026},
howpublished = {Project page and technical report}
}
@misc{cuda_agent_ops_6k_2026,
title = {CUDA-Agent-Ops-6K: Training Dataset for CUDA-Agent},
author = {{BytedTsinghua-SIA}},
year = {2026},
howpublished = {Hugging Face dataset},
url = {https://huggingface.co/datasets/BytedTsinghua-SIA/CUDA-Agent-Ops-6K}
}
```
Provided by:
BytedTsinghua-SIA



