cudaLLM-data

Name: cudaLLM-data
Creator: maas
Published: 2025-12-04 16:44:35
License: No description provided

Source: ModelScope community. Updated 2025-12-04; indexed 2025-08-09.

Download link:

https://modelscope.cn/datasets/ByteDance-Seed/cudaLLM-data

Description:

## CudaLLM Dataset

A high-quality dataset of PyTorch operator test cases, designed to benchmark and evaluate the capabilities of LLMs in generating optimized CUDA kernels. This dataset provides pairs of problems (standard PyTorch `nn.Module` implementations) and solutions (performance-optimized versions using custom CUDA kernels). It is a valuable resource for research in AI for HPC, code generation, and compiler optimization. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.

* SFT Dataset: `sft_cuda_llm_r1.parquet`
* RL Dataset: `rl_cuda_llm_0424.parquet`

### ✨ Key Features

- Diverse Operator Coverage: Includes a wide range of operators from `torch.ops.aten`, `torch.nn`, and `torch.nn.functional`.
- Rigorous Validation: Every test case undergoes a multi-stage validation process, including Abstract Syntax Tree (AST) analysis for structural correctness and dynamic execution in a CUDA environment to ensure numerical stability.
- Realistic Combinations: Operator sequences are sampled not just randomly, but also based on statistical analysis of real-world usage in the HuggingFace Transformers library.

### 🛠️ Dataset Generation Workflow

The dataset was created through a systematic, four-stage pipeline to ensure its quality and relevance.

1. Operator Sampling: We use a stratified sampling strategy to select operators. This includes all basic PyTorch ops, common dual-operator pairs from Transformers, and more complex random combinations.
2. LLM-based Code Generation: We use an LLM, guided by sophisticated prompt engineering, to generate initial PyTorch code (`problem.py`) for each operator sequence.
3. Static Checking: An AST-based checker verifies that the generated code's computation flow precisely matches the sampled operator sequence.
4. Dynamic Validation: The code is executed with random tensors on a CUDA device to filter out any samples that produce numerical errors (NaN/Inf), ensuring robustness.

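The static-checking step (step 3) can be illustrated with Python's standard `ast` module. The sketch below is not the dataset authors' checker: the helper names `called_operators` and `matches_sequence`, the sample module, and the simplification of comparing only the final attribute name of each call (so `torch.relu` and `F.relu` both count as `relu`) are assumptions made for illustration.

```python
import ast

# Hypothetical problem.py-style source; it is only parsed, never executed,
# so torch does not need to be installed to run this sketch.
SAMPLE = '''
import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, x):
        y = torch.relu(x)
        return torch.softmax(y, dim=-1)
'''


class _CallCollector(ast.NodeVisitor):
    """Collects call names in approximate evaluation order."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        # Visit children first: nested calls (arguments, method chains)
        # are evaluated before the enclosing call.
        self.generic_visit(node)
        func = node.func
        if isinstance(func, ast.Attribute):
            self.calls.append(func.attr)   # torch.relu -> "relu"
        elif isinstance(func, ast.Name):
            self.calls.append(func.id)     # relu(x)    -> "relu"


def called_operators(source, method="forward"):
    """Return the names of calls made inside the first `method` found."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == method:
            collector = _CallCollector()
            collector.visit(node)
            return collector.calls
    return []


def matches_sequence(source, expected):
    """True if the calls in `forward` equal the sampled operator sequence."""
    return called_operators(source) == list(expected)


if __name__ == "__main__":
    print(called_operators(SAMPLE))
    print(matches_sequence(SAMPLE, ["relu", "softmax"]))
```

A production checker would presumably also resolve import aliases, handle operators expressed as `nn.Module` layers in `__init__`, and account for control flow; this sketch covers only straight-line calls in `forward`.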
Finally, all valid samples are standardized and deduplicated.
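The description does not say how samples are standardized or deduplicated. One plausible approach, sketched here purely as an assumption, is to normalize each source file by round-tripping it through the AST (which drops comments, redundant parentheses, and formatting differences) and then keep one sample per content hash. The function names `normalize` and `deduplicate` are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import hashlib


def normalize(source):
    """Standardize formatting by parsing and re-emitting the code.

    Comments and layout are not part of the AST, so they disappear.
    """
    return ast.unparse(ast.parse(source))


def deduplicate(samples):
    """Return normalized sources, keeping the first of each duplicate group."""
    seen = set()
    unique = []
    for source in samples:
        normalized = normalize(source)
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(normalized)
    return unique


if __name__ == "__main__":
    a = "def f(x):\n    return x + 1\n"
    b = "def f(x):  # same logic, different layout\n    return (x + 1)\n"
    c = "def f(x):\n    return x + 2\n"
    print(len(deduplicate([a, b, c])))
```

This treats two samples as duplicates only when their syntax trees are identical; samples that differ in variable names or operator order would still be kept as distinct.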

Provider: maas

Created: 2025-08-06
