hybrid-diff-ar/stack-v2-sparse-classes-75kplus
Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/hybrid-diff-ar/stack-v2-sparse-classes-75kplus
Description:
Stack v2 Sparse Python Classes 75kplus is a dataset of 75,829 samples for hybrid diffusion + autoregressive code generation experiments. It is extracted from the Python subset of bigcode/the-stack-v2-dedup using an AST-level class filter. Each sample is a single Python class, with fields including a natural-language prompt, class and method signatures, and method bodies. The data is split into training (74,829 samples), validation (500 samples), and test (500 samples) sets. The filters applied include: each class has 2 to 6 methods, every method has a non-empty docstring, and every method body has 3 to 30 non-empty lines of code.
Provided by:
hybrid-diff-ar



