mohameddhameem/CDNA-CoDET-M4

Name: mohameddhameem/CDNA-CoDET-M4
Creator: mohameddhameem
Published: 2026-04-04 12:37:54
License: No description available

Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/mohameddhameem/CDNA-CoDET-M4


Resource description:

---
language:
- en
- code
license: mit
pretty_name: 'CDNA-CoDET-M4: Code Authorship Attribution via Code Property Graphs (Enhanced)'
task_categories:
- text-classification
size_categories:
- 1M<n<10M
multilinguality:
- multilingual
annotations_creators:
- no-annotation
source_datasets:
- DaniilOr/CoDET-M4
---

# CDNA-CoDET-M4: Code Authorship Attribution via Code Property Graphs (Enhanced)

## Dataset Summary

**Note**: This is an enhanced version of the original [CoDET-M4 dataset](https://huggingface.co/datasets/DaniilOr/CoDET-M4) by DaniilOr, extended with Code Property Graph (CPG) representations by the CodeDNA team at Singapore Management University.

We built this dataset to tackle LLM code authorship attribution: identifying which AI model wrote a given piece of code. Most approaches analyze the raw source code as a token sequence; we found that modeling the structure of the code captures deeper, more reliable stylistic fingerprints. To do this, we converted the code snippets into **Code Property Graphs (CPGs)**, which combine Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), and Program Dependence Graphs (PDGs).

![Code to Graph Extraction](docs/images/page_7_image_3.png)

The dataset contains ~77K Python and Java samples spanning six classes: five LLMs (GPT-4o, CodeLlama, Llama 3.1, Nxcode, CodeQwen 1.5) and a human-written baseline. It is provided in two ready-to-use configurations: `hetero` (complete graph structures) and `scalar` (22 pre-computed structural metrics).
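To make the idea of combining AST, CFG, and PDG edges concrete, here is a minimal sketch of a typed-edge multigraph for one toy function. It is an illustration only, not the Joern output stored in this dataset: the syntax edges come from Python's standard `ast` module, while `CFG_EDGES`, `PDG_EDGES`, and all function names are hypothetical and hand-written for this example. Nodes are identified by source line number.

```python
import ast
from collections import Counter

SOURCE = (
    "def clamp(x, lo):\n"
    "    if x < lo:\n"
    "        x = lo\n"
    "    return x\n"
)

def statement_ast_edges(source):
    """Return (parent_line, child_line, 'ast') edges between statements."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        if not isinstance(parent, ast.stmt):
            continue
        for child in ast.iter_child_nodes(parent):
            if isinstance(child, ast.stmt):
                edges.append((parent.lineno, child.lineno, "ast"))
    return edges

# Hand-written for this toy function only (assumption): a real extractor
# derives control-flow and dependence edges automatically.
CFG_EDGES = [(2, 3, "cfg"), (2, 4, "cfg"), (3, 4, "cfg")]
PDG_EDGES = [(2, 3, "pdg"), (3, 4, "pdg")]

def build_toy_cpg(source):
    """Merge the three edge sets into one typed multigraph (an edge list)."""
    return statement_ast_edges(source) + CFG_EDGES + PDG_EDGES

cpg = build_toy_cpg(SOURCE)
edge_type_counts = Counter(kind for _, _, kind in cpg)
# Edge types that connect the `if` (line 2) to the assignment (line 3).
parallel = sorted(kind for src, dst, kind in cpg if (src, dst) == (2, 3))
print(edge_type_counts)
print(parallel)
```

The same pair of nodes (lines 2 and 3) is linked by a syntax edge, a control-flow edge, and a dependence edge at once, which is why a CPG is modeled as a multigraph rather than a simple graph.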
## Dataset Details

### Languages & Task

- **Programming Languages**: Python, Java
- **Task**: Multi-class code authorship attribution (6-way classification)
- **Classes**: GPT-4o, CodeLlama, Llama 3.1, Nxcode, CodeQwen 1.5, Human
- **Unit of Analysis**: Individual functions (not full repositories)

### Configurations

CDNA-CoDET-M4 is published as a single Hugging Face dataset with two configurations:

#### **hetero** – Heterogeneous Code Property Graphs

Complete graph-structured representations combining AST, CFG, and PDG information extracted via Joern static analysis.

- **Graph Representation**: Heterogeneous multigraph (multiple edge types: control flow, data flow, syntax)
- **Node Features**: Syntactic type, source code token sequences, semantic role
- **Edge Features**: Edge type labels (e.g., `controlFlow`, `dataFlow`, `call`)
- **Examples**: 33,342 samples
- **Data Size**: ~11 GB (dataset), ~23 GB (uncompressed)

#### **scalar** – Aggregated Structural Features

Preprocessed summary statistics derived from CPGs, suitable for traditional ML baselines and interpretability studies.

- **Feature Set**: 22 hand-crafted structural measures (graph density, cyclomatic complexity, fan-in/fan-out, etc.)
- **Format**: Tabular (numeric vectors alongside code text and metadata)
- **Examples**: 77,332 samples
- **Data Size**: ~26 GB (dataset), ~54 GB (uncompressed)

### Data Splits

Both configurations follow the same split structure:

| Split | Count | Purpose |
|-------|-------|---------|
| train | ~20,000 per language | Model training |
| validation | ~2,000 per language | Hyperparameter tuning |
| test | ~17,700 per language | Final evaluation |

**Note**: Exact counts are language-dependent due to filtering during CPG extraction (see Dataset Creation below).
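The `scalar` configuration lists graph density and cyclomatic complexity among its measures. The sketch below shows the textbook definitions of both on a toy control-flow graph. The exact formulas used to produce the dataset's columns are not documented here, so treat these as assumptions: density is taken as the directed-graph ratio of edges to possible ordered node pairs, and complexity as McCabe's M = E - N + 2P. The function names are illustrative.

```python
def graph_density(num_nodes, num_edges):
    """Directed-graph density: edges divided by the n*(n-1) possible ordered pairs."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))

def cyclomatic_complexity(cfg_edges, num_components=1):
    """McCabe's M = E - N + 2P, computed from a control-flow edge list."""
    nodes = {node for edge in cfg_edges for node in edge}
    return len(cfg_edges) - len(nodes) + 2 * num_components

# Toy CFG for a function with a single `if`:
# entry -> if -> (then | return), then -> return
toy_cfg = [
    ("entry", "if"),
    ("if", "then"),
    ("if", "return"),
    ("then", "return"),
]

toy_nodes = {node for edge in toy_cfg for node in edge}
print(cyclomatic_complexity(toy_cfg))                      # one decision point
print(round(graph_density(len(toy_nodes), len(toy_cfg)), 3))
```

With four nodes and four edges, the single branch yields a complexity of 2, and the density is 4 / (4 × 3) ≈ 0.333.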
### Data Format & Schema

**Common Fields (both configurations)**:

| Field | Type | Description |
|-------|------|-------------|
| `idx` | int64 | Unique sample identifier |
| `hash` | string | Content hash (enables deduplication tracking) |
| `target` | string | Source LLM or "human" |
| `model` | string | Full model identifier (e.g., "GPT-4o-turbo", "CodeLlama-13B") |
| `language` | string | "python" or "java" |
| `split` | string | "train", "validation", or "test" |
| `source` | string | Generation context or data source |
| `code` | string | Raw source code (UTF-8) |
| `graphml` | string | GraphML-serialized CPG (hetero) or empty (scalar) |

**Additional Scalar Features** (scalar configuration):

| Field | Type | Description |
|-------|------|-------------|
| `cyclomatic_complexity` | float | McCabe complexity metric |
| `lines_of_code` | int | Physical LOC |
| `fan_in` | int | Distinct data sources |
| `fan_out` | int | Distinct data targets |
| `graph_density` | float | Edge-to-possible-edge ratio |
| *(+17 additional measures)* | n/a | See schema metadata |

### Data Format

- **File Format**: Apache Arrow (columnar, native to Hugging Face Datasets)
- **Shards**: 22 Arrow shards (hetero), 54 Arrow shards (scalar) for distributed loading
- **Compression**: LZ4 (optional, configurable at load time)

## Dataset Creation

### Code Generation & Collection

1. **LLM Prompting**: Five state-of-the-art code-generation models were prompted independently to generate single-function implementations for diverse domains, alongside a human-written reference set:
   - GPT-4o (OpenAI, multimodal, instruction-tuned)
   - CodeLlama (Meta, code-specific pretraining)
   - Llama 3.1 (Meta, general-purpose with code capability)
   - Nxcode (NxCode-34B, instruction-tuned variant)
   - CodeQwen 1.5 (Alibaba, multilingual, code-centric)
   - Human reference samples (collected from open-source repositories)
2. **Sampling Strategy**: Prompts span 5–10 distinct algorithmic domains (sorting, graph traversal, string manipulation, etc.) to ensure diversity and prevent simple memorization.
3. **Filtering**: Samples that failed parsing or produced syntax errors were excluded. C++ samples were deferred due to Joern frontend limitations.

### Code Property Graph Extraction

1. **Tool**: Joern static analysis framework (v2.x)
2. **Extraction Steps**:
   - **Parsing**: Language-specific AST construction
   - **Control Flow**: CFG edges (conditional branches, loops)
   - **Data Flow**: PDG edges (variable dependencies, definitions, uses)
   - **Heterogeneous Integration**: Multi-edge-type graph merging
3. **Normalization**:
   - Node feature standardization (TF-IDF or learned embeddings for text tokens)
   - Edge type stratification (categorical encoding)
   - Graph size capping (functions with more than 2,000 nodes were excluded to avoid memory overflow)

### Preprocessing & Feature Engineering

1. **Scalar Feature Computation** (for scalar configuration):
   - McCabe cyclomatic complexity
   - Halstead volume metrics
   - Fan-in/fan-out (data dependency analysis)
   - Graph density, diameter, average degree
   - Lexical complexity measures
2. **Train/Val/Test Split**: Stratified random split (80/10/10) per language per model, ensuring no data leakage.
3. **Serialization**:
   - **Hetero**: GraphML format (XML-based, lossless graph representation)
   - **Scalar**: Arrow native numeric columns

### Data Quality & Provenance

- **No Manual Annotation**: Ground truth is deterministic (the source of each sample is known)
- **Deduplication**: Duplicate code snippets removed via content hashing
- **Bias Mitigation**: Stratified sampling per model to avoid class imbalance
- **Reproducibility**: Random seeds fixed; prompts versioned

## Intended Use & Limitations

### Primary Use Cases

1. **Research on LLM code fingerprinting**: Train and benchmark attribution models across diverse architectures
2. **Comparative analysis**: Study which structural features best discriminate between LLMs
3. **Structural understanding**: Analyze how different LLMs produce distinct CPG patterns
4. **Baseline establishment**: Provide reference results for future work in code authorship

### Documented Limitations

1. **C++ Support**: Not currently included due to Joern static analyzer limitations. Future versions may cover additional languages.
2. **Single-Function Scope**: The dataset contains isolated functions, not full repositories or multi-file projects. Authorship patterns in large codebases may differ significantly.
3. **Synthetic Data Origin**: All code is either LLM-generated or drawn from human-written open-source projects; none of it is adversarially crafted or obfuscated. Performance on naturally written industrial code remains uncertain.
4. **Domain Shift**: Test distributions are held-out LLM samples from the same problem domains as training. Cross-domain or cross-language generalization is not directly assessed.
5. **Training Data Overlap**: Some LLMs (e.g., CodeQwen, Nxcode) may share instruction-tuning corpora, leading to high correlation and classification confusion. This is documented but not filtered.
6. **Granularity Sensitivity**: Prior work (the CoDET-M4 baseline) reports ~8.6× accuracy variance depending on whether classification is per-function or per-class. This dataset is function-level only.
7. **Evaluation Methodology**: Baseline comparisons are macro-averaged F1 scores. Class-imbalanced datasets or rare-model scenarios may exhibit different behavior.

### Out of Scope

- **Production Forensics**: This dataset is not validated for deployment in real-world source-code investigation or legal evidence contexts. Additional domain adaptation and validation are essential.
- **Adversarial Robustness**: Not tested against code obfuscation, style transfer, or intentional model-spoofing attacks.
- **Real-World Human Code**: Authorship models trained on this dataset should not be assumed to work on arbitrary production code without retraining or domain adaptation.
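The limitations above note that baselines are compared using macro-averaged F1 and that imbalanced or rare-model scenarios may behave differently. As a rough, dependency-free sketch of why, the function below computes macro F1 as the unweighted mean of per-class F1 scores. The function name and the toy labels are illustrative assumptions, not results from this dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, imbalanced example: the single gpt-4o sample is misattributed.
y_true = ["human", "human", "human", "human", "gpt-4o", "codellama"]
y_pred = ["human", "human", "human", "human", "human", "codellama"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))
print(round(macro_f1(y_true, y_pred), 3))
```

Here accuracy is 5/6 ≈ 0.833 while macro F1 is 17/27 ≈ 0.630, because the missed rare class contributes a per-class F1 of 0 with the same weight as the majority class.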
## Personal Information & Ethical Considerations

### Responsible Use Guidelines

- Citation of this dataset should include discussion of its constraints (synthetic origin, single-function scope)

## Licensing

This dataset is released under the **MIT License**. The MIT License permits free use, modification, and distribution with minimal restrictions. For the full license text, see [LICENSE](LICENSE) or https://opensource.org/licenses/MIT.

**Attribution Requirement**: While not legally required by MIT, we request that users cite the dataset and acknowledge the CodeDNA team (see Citation section below).

## Citation

If you use CDNA-CoDET-M4 in published research, please cite:

```bibtex
@dataset{codedna_codetm4,
  title  = {{CDNA-CoDET-M4}: Code Authorship Attribution via Code Property Graphs},
  author = {Gusta, Avisenna and Yinqi, Gu and Sia, Sim Kim and Mohamed, Dhameem and Shenghua, Ye},
  year   = {2025},
  school = {Singapore Management University, School of Computing and Information Systems},
  url    = {https://huggingface.co/datasets/mohameddhameem/CDNA-CoDET-M4}
}
```

**Suggested Bibtex (alternative format)**:

```bibtex
@inproceedings{codedna2025structural,
  title  = {Structural Fingerprints of Large Language Models: Code Authorship Attribution via Code Property Graph Neural Networks},
  author = {Gusta, Avisenna and Yinqi, Gu and Sia, Sim Kim and Mohamed, Dhameem and Shenghua, Ye},
  year   = {2025},
  school = {Singapore Management University},
  note   = {Dataset: CDNA-CoDET-M4},
  url    = {https://huggingface.co/datasets/mohameddhameem/CDNA-CoDET-M4}
}
```

**Inline Reference**:

> The CDNA-CoDET-M4 benchmark (Gusta et al., 2025) provides 77K Python and Java samples from five LLMs and a human baseline, structured as Code Property Graphs for research on automated code authorship attribution.

## References

1. **Original CoDET-M4 Dataset**: [DaniilOr/CoDET-M4](https://huggingface.co/datasets/DaniilOr/CoDET-M4). Source dataset (500K+ samples).
2. **Joern Static Analysis**: https://joern.io/
3. **Code Property Graphs**: Yamaguchi, F., et al. (2014). "Modeling and Discovering Vulnerabilities with Code Property Graphs." *IEEE S&P*.
4. **Model Cards**: Mitchell, M., et al. (2018). "Model Cards for Model Reporting." *arXiv:1810.03993*.
5. **Dataset Cards**: https://huggingface.co/docs/datasets/dataset_card

## Quick Start

### Installation

```bash
pip install datasets
```

### Loading the Dataset

```python
from datasets import load_dataset

# Load heterogeneous Code Property Graphs
dataset_hetero = load_dataset("mohameddhameem/CDNA-CoDET-M4", "hetero")
print(dataset_hetero)
# Without a `split` argument this prints a DatasetDict keyed by split, with
#   features: ['idx', 'hash', 'target', 'model', 'language', 'split', 'source', 'code', 'graphml']
#   33,342 rows in total

# Load scalar structural features
dataset_scalar = load_dataset("mohameddhameem/CDNA-CoDET-M4", "scalar")
print(dataset_scalar)
# A DatasetDict keyed by split, with
#   features: ['idx', 'hash', 'target', 'model', 'language', 'split', 'source', 'code', ...]
#   77,332 rows in total

# Access the training split
train_hetero = dataset_hetero['train']
print(train_hetero[0])
# Abridged example record:
# {
#   'code': 'def quicksort(arr):\n ...',
#   'target': 'gpt-4o',
#   'language': 'python',
#   'graphml': '<graphml>...</graphml>'
# }
```

### Filtering & Exploration

```python
from collections import Counter

# Filter by language
python_samples = dataset_hetero['train'].filter(lambda x: x['language'] == 'python')

# Filter by model
gpt_samples = dataset_hetero['train'].filter(lambda x: x['target'] == 'gpt-4o')

# Count samples per model
model_counts = Counter(dataset_hetero['train']['target'])
print(model_counts)
```

## Acknowledgments

This dataset was created as part of the CodeDNA research project at Singapore Management University's School of Computing and Information Systems.

**Contributors**: Avisenna Gusta, Gu Yinqi, Sim Kim Sia, Mohamed Dhameem, Ye Shenghua

We acknowledge the CoDET-M4 benchmark maintainers and the Joern project for enabling graph-based code analysis.

---

*Last Updated: April 2025*
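As a follow-up to the Quick Start, the `graphml` field of a `hetero` record is an XML string, so it can be inspected with the Python standard library alone. The sketch below counts edges per edge type in a toy GraphML document. The attribute key `edge_type` and the toy document are assumptions made for illustration: the actual key names in this dataset's GraphML are not documented here and should be checked against a real record first.

```python
import xml.etree.ElementTree as ET
from collections import Counter

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Toy document standing in for one record's `graphml` string (hypothetical keys).
TOY_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="edge_type" for="edge" attr.name="edge_type" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <edge source="n0" target="n1"><data key="edge_type">controlFlow</data></edge>
    <edge source="n1" target="n2"><data key="edge_type">controlFlow</data></edge>
    <edge source="n0" target="n2"><data key="edge_type">dataFlow</data></edge>
  </graph>
</graphml>
"""

def count_edge_types(graphml_text, type_key="edge_type"):
    """Count edges per type label; `type_key` is the assumed <data key=...> id."""
    root = ET.fromstring(graphml_text)
    counts = Counter()
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        for data in edge.iterfind("g:data", GRAPHML_NS):
            if data.get("key") == type_key:
                counts[data.text] += 1
    return counts

toy_counts = count_edge_types(TOY_GRAPHML)
print(toy_counts)
```

For a real record, one would pass `train_hetero[0]['graphml']` in place of `TOY_GRAPHML` and adjust `type_key` to whatever key the file actually declares.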

Provider:

mohameddhameem
