db-d2/primevul-codebert-embeddings

Name: db-d2/primevul-codebert-embeddings
Creator: db-d2
Published: 2026-04-08 02:53:16
License: MIT

Source: Hugging Face. Updated 2026-04-08; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/db-d2/primevul-codebert-embeddings

Description:

---
license: mit
task_categories:
- text-classification
tags:
- code
- vulnerability-detection
- embeddings
- codebert
- positive-unlabeled-learning
language:
- code
size_categories:
- 100K<n<1M
---

# PrimeVul Embeddings for PU Learning

Pre-extracted [CLS] token embeddings from two code models for all functions in the PrimeVul v0.1 vulnerability detection dataset, plus the raw PrimeVul v0.1 JSONL source files.

## CodeBERT Embeddings (root .npz files)

Each .npz file contains frozen CodeBERT embeddings (768-dimensional vectors) for C/C++ functions, along with their labels and CWE type annotations. These were extracted once using a frozen CodeBERT model and are used for downstream PU (positive-unlabeled) learning experiments without requiring GPU access.

| File | Functions | Vulnerable | Shape |
|------|-----------|-----------|-------|
| train.npz | 175,797 | 4,862 (2.77%) | (175797, 768) |
| valid.npz | 23,948 | 593 (2.48%) | (23948, 768) |
| test.npz | 24,788 | 549 (2.21%) | (24788, 768) |
| test_paired.npz | 870 | 435 (50%) | (870, 768) |

Arrays in each .npz:

- embeddings: (N, 768) float32 -- CodeBERT [CLS] token vectors
- labels: (N,) int32 -- 0 = benign, 1 = vulnerable
- cwe_types: (N,) U20 string -- CWE category (e.g., "CWE-119") or "unknown"
- idxs: (N,) int64 -- original PrimeVul record index for traceability

### How to load

```python
import numpy as np

data = np.load("train.npz")
X = data["embeddings"]    # (175797, 768)
y = data["labels"]        # (175797,)
cwes = data["cwe_types"]  # (175797,)
```

No special flags needed. All arrays use standard numpy dtypes (float32, int32, U20, int64).

## VulBERTa Embeddings (vulberta/ folder)

Same format as CodeBERT but extracted from claudios/VulBERTa-mlm, a RoBERTa model pretrained on C/C++ vulnerability code. Same functions, same labels, same idxs -- only the embedding vectors differ.
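
Since the CodeBERT and VulBERTa files are described as covering the same functions with the same labels and idxs, it may be worth confirming that alignment before combining the two feature sets. The following is a minimal sketch rather than part of the dataset card: the helper names `check_aligned` and `load_both` are hypothetical, and concatenating the two embeddings is just one possible way to use them together.

```python
import numpy as np


def check_aligned(a, b):
    """Return True if two loaded splits match row for row on labels and idxs."""
    return (
        a["embeddings"].shape[0] == b["embeddings"].shape[0]
        and np.array_equal(a["labels"], b["labels"])
        and np.array_equal(a["idxs"], b["idxs"])
    )


def load_both(codebert_path, vulberta_path):
    """Load one split from each model and concatenate features if aligned."""
    cb = np.load(codebert_path)
    vb = np.load(vulberta_path)
    if not check_aligned(cb, vb):
        raise ValueError("splits are not row-aligned")
    # Columns 0..767 come from CodeBERT, columns 768..1535 from VulBERTa.
    X = np.concatenate([cb["embeddings"], vb["embeddings"]], axis=1)
    return X, cb["labels"]
```

Assuming the file layout described in this card, `X, y = load_both("train.npz", "vulberta/train.npz")` would give a feature matrix with 1,536 columns per function.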

| File | Functions | Shape |
|------|-----------|-------|
| vulberta/train.npz | 175,797 | (175797, 768) |
| vulberta/valid.npz | 23,948 | (23948, 768) |
| vulberta/test.npz | 24,788 | (24788, 768) |
| vulberta/test_paired.npz | 870 | (870, 768) |

VulBERTa embeddings have higher L2 magnitude (~27 vs ~21 for CodeBERT) but the same 768 dimensions. Load the same way: np.load("vulberta/train.npz").

## Raw PrimeVul v0.1 data (raw/ folder)

The raw/ folder contains the original PrimeVul v0.1 JSONL files from the PrimeVul project. Each line is a JSON object with fields including func (source code), target (0/1 label), cwe (list of CWE strings), cve (CVE identifier), and project metadata.

| File | Records |
|------|---------|
| raw/primevul_train.jsonl | 175,797 |
| raw/primevul_valid.jsonl | 23,948 |
| raw/primevul_test.jsonl | 24,788 |
| raw/primevul_train_paired.jsonl | 9,724 |
| raw/primevul_valid_paired.jsonl | 870 |
| raw/primevul_test_paired.jsonl | 870 |

## Extraction details

### CodeBERT

- Model: microsoft/codebert-base (RoBERTa architecture, 125M parameters)
- Extraction: frozen model, [CLS] token from final layer
- Tokenization: max_length=512, truncation=True, padding=max_length
- Source data: PrimeVul v0.1 (chronological train/valid/test splits)
- Extracted on: Google Colab, A100 GPU, ~23 minutes for all splits

### VulBERTa

- Model: claudios/VulBERTa-mlm (RoBERTa architecture, 125M parameters, pretrained on C/C++ vulnerability code)
- Extraction: frozen model, [CLS] token from final layer
- Tokenization: max_length=512, truncation=True, padding=max_length
- Source data: PrimeVul v0.1 (same functions as CodeBERT)
- Extracted on: Google Colab, A100 GPU, ~23 minutes for all splits

## Citation

If you use this data, please cite the PrimeVul dataset:

```bibtex
@article{ding2024primevul,
  title={Vulnerability Detection with Code Language Models: How Far Are We?},
  author={Ding, Yangruibo and Fu, Yanjun and Ibrahim, Omniyyah and Sitawarin, Chawin and Chen, Xinyun and Alomair, Basel and Wagner, David and Ray, Baishakhi and Chen, Yizheng},
  journal={arXiv preprint arXiv:2403.18624},
  year={2024}
}
```

## License

MIT (same as PrimeVul)
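
To inspect the source code behind the embeddings, the JSONL files in the raw/ folder can be read one line at a time. The sketch below uses only the Python standard library; the helper names `iter_records` and `summarize` are hypothetical, and the only field it relies on is `target`, as listed in the raw data section. How `idxs` maps onto individual raw records is not spelled out in this card, so the sketch does not attempt that join.

```python
import json


def iter_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return (total records, records with target == 1) for one file."""
    total = 0
    vulnerable = 0
    for rec in iter_records(path):
        total += 1
        if rec.get("target") == 1:
            vulnerable += 1
    return total, vulnerable
```

Running `summarize("raw/primevul_train.jsonl")` would be expected to report the record count listed for that file in the raw data table; streaming the file this way avoids holding every function body in memory at once.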
