BytedTsinghua-SIA/CUDA-Agent-Ops-6K

Name: BytedTsinghua-SIA/CUDA-Agent-Ops-6K
Creator: BytedTsinghua-SIA
Published: 2026-02-27 19:56:56
License: cc-by-4.0

Hugging Face | Updated 2026-02-27 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/BytedTsinghua-SIA/CUDA-Agent-Ops-6K

Resource description:

---
license: cc-by-4.0
pretty_name: CUDA-Agent-Ops-6K
size_categories:
- 1K<n<10K
task_categories:
- text-generation
language:
- en
---

# CUDA-Agent-Ops-6K

CUDA-Agent-Ops-6K is a curated training dataset for CUDA kernel generation and optimization. It is released as part of the CUDA-Agent project:

- Project Page: https://CUDA-Agent.github.io/
- Github Repo: https://github.com/BytedTsinghua-SIA/CUDA-Agent

## Dataset Summary

CUDA-Agent-Ops-6K contains **6,000 synthesized operator-level training tasks** designed for large-scale agentic RL training. It is intended to provide diverse and executable CUDA-oriented training tasks and reduce contamination risk against KernelBench evaluation.

## Why this dataset

High-quality CUDA training data is scarce. Manual expert annotation of high-performance kernels is expensive and hard to scale. CUDA-Agent-Ops-6K addresses this bottleneck with a scalable synthesis-and-filtering pipeline that produces training tasks with controlled difficulty and better reliability.

## Data Construction Pipeline

The dataset is built with three stages:

1. Seed problem crawling
   - Mine reference operators from `torch` and `transformers`
   - Represent each task as runnable PyTorch operator logic
2. LLM-based combinatorial synthesis
   - Compose multiple operators into fused tasks (up to 5 sampled operators)
   - Increase task diversity and optimization complexity beyond single-op patterns
3. Execution-based filtering and decontamination
   - Keep tasks executable in both eager mode and `torch.compile`
   - Remove stochastic operators for reproducibility
   - Remove degenerate outputs (e.g., constant/indistinguishable outputs)
   - Keep runtime in a controlled range (1ms-100ms in eager mode)
   - Remove tasks highly similar to KernelBench test cases

## Data Format

I pulled the dataset repository and inspected the current `data.parquet` file directly. The current release contains **6000 rows** with **3 string columns**:

- `ops`: operator/module descriptor string.
  For most `torch#N` rows, this is a JSON-like list string of operators.
  Example: `["nn.BatchNorm3d", "torch.diag", "torch.max", "nn.Parameter"]`
  For `transformers` rows, this can be a single module identifier string (e.g., `MPNetLMHead_2`).
- `data_source`: source tag string.
  Observed patterns: `torch#N` (where `N` matches the number of ops in `ops`) and `transformers`
- `code`: runnable Python/PyTorch task code for the synthesized operator problem

No null values were found in these three columns in the current file.

This means each training sample can be viewed as:

- an operator/module descriptor (`ops`)
- its provenance/source marker (`data_source`)
- the executable task implementation (`code`)

## Citation

If you use this dataset, please cite both the CUDA-Agent project and the dataset release.

```bibtex
@misc{cuda_agent_2026,
  title        = {CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation},
  author       = {Dai, Weinan and Wu, Hanlin and Yu, Qiying and Gao, Huan-ang and Li, Jiahao and Jiang, Chengquan and Lou, Weiqiang and Song, Yufan and Yu, Hongli and Chen, Jiaze and Ma, Wei-Ying and Zhang, Ya-Qin and Liu, Jingjing and Wang, Mingxuan and Liu, Xin and Zhou, Hao},
  year         = {2026},
  howpublished = {Project page and technical report}
}

@misc{cuda_agent_ops_6k_2026,
  title        = {CUDA-Agent-Ops-6K: Training Dataset for CUDA-Agent},
  author       = {{BytedTsinghua-SIA}},
  year         = {2026},
  howpublished = {Hugging Face dataset},
  url          = {https://huggingface.co/datasets/BytedTsinghua-SIA/CUDA-Agent-Ops-6K}
}
```
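As a usage note for the Data Format description above, here is a minimal sketch of how one row's `ops` and `data_source` strings might be interpreted. The helper names `parse_ops` and `op_count_matches` are hypothetical and not part of the dataset or the CUDA-Agent code; only the column names, the `torch#N` and `transformers` tags, and the example values come from the card. The sketch assumes the "JSON-like" list strings parse as strict JSON and falls back to treating the value as a single identifier when they do not. Loading `data.parquet` itself is left out.

```python
import json
import re

# Matches source tags of the form `torch#N` described in the dataset card.
_TORCH_TAG = re.compile(r"^torch#(\d+)$")


def parse_ops(ops: str) -> list[str]:
    """Return the operator/module names encoded in one `ops` string.

    Hypothetical helper: assumes list-valued descriptors are valid JSON.
    Anything that does not parse as a JSON list is treated as a single
    module identifier (the form the card describes for `transformers` rows).
    """
    try:
        parsed = json.loads(ops)
    except json.JSONDecodeError:
        return [ops]
    if isinstance(parsed, list):
        return [str(op) for op in parsed]
    return [ops]


def op_count_matches(ops: str, data_source: str) -> bool:
    """Check that N in `torch#N` equals the number of parsed operators."""
    match = _TORCH_TAG.match(data_source)
    if match is None:
        return True  # non-`torch#N` tags encode no operator count
    return int(match.group(1)) == len(parse_ops(ops))


# Example values taken from the dataset card.
torch_row = {
    "ops": '["nn.BatchNorm3d", "torch.diag", "torch.max", "nn.Parameter"]',
    "data_source": "torch#4",
}
transformers_row = {"ops": "MPNetLMHead_2", "data_source": "transformers"}

for row in (torch_row, transformers_row):
    names = parse_ops(row["ops"])
    consistent = op_count_matches(row["ops"], row["data_source"])
    print(row["data_source"], names, consistent)
```

A check like `op_count_matches` is one way to confirm the observed `torch#N` pattern on a freshly downloaded copy before relying on it in a training pipeline.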

Provider:

BytedTsinghua-SIA
