
felixwangg/AIDev-CWE-Classification

Hugging Face · Updated 2025-12-01 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/felixwangg/AIDev-CWE-Classification
Description:
---
configs:
- config_name: human_pr_cwes
  data_files:
  - split: train
    path: "human_pr_cwes.parquet"
- config_name: agentic_pr_cwes
  data_files:
  - split: train
    path: "agentic_pr_cwes.parquet"
- config_name: cwe_data
  data_files:
  - split: train
    path: "replication_package/cwe_children_1000.parquet"
- config_name: english_count
  data_files:
  - split: human_pr_is_eng
    path: "replication_package/english_count/human_pr_is_eng_2025-11-15-22:55:58.parquet"
  - split: all_pr_is_eng
    path: "replication_package/english_count/all_pr_is_eng_2025-11-15-22:40:02.parquet"
- config_name: cwe_count_by_agent
  data_files:
  - split: Claude_Code
    path: "Claude_Code_cwe_counts.parquet"
  - split: Copilot
    path: "Copilot_cwe_counts.parquet"
  - split: Cursor
    path: "Cursor_cwe_counts.parquet"
  - split: Devin
    path: Devin_cwe_counts.parquet
  - split: Codex
    path: OpenAI_Codex_cwe_counts.parquet
---

# human_pr_cwes.parquet

human_pull_request: 6618 rows <br>
Number of records skipped because the `body` column is null: 995 <br>
Number of non-English rows skipped: 377 <br>
Number of records skipped because they exceed the token limit: 56 <br>
Number of records skipped because they have no code_diff information: 2 <br>
Percentage of skipped records for the whole dataset: <span style="color:red">21.61%</span> <br>
Number of non-skipped records in total: 5188 <br>
Number of skipped records in total: 1430 <br>

keyword_extraction_result: 617 rows <br>
Extracted percentage: 617 / (6618 - 995 - 377) = 11.76% <br>

Number of total security patches: 169 <br>
Percentage of security patches: <span style="color:red">3.258%</span> <br>
3 records have a response of {true, []}, indicating a security patch whose CWE type is outside our considered scope.

One record can be related to one or more CWEs. We remove the ancestors from the CWE list and only count the deepest CWE. <br>
We normalize the count by 1/k: if one record has k CWEs after removing ancestors, we count each CWE as 1/k. <br>
Un-normalized ancestor-cleaned CWEs: 236.
<br> On average, 1 security patch is related to 236 / 169 = ~1.396 CWEs. <br>

Since CWE is a Directed Acyclic Graph (DAG), there is no single canonical path from the top layer to the leaf nodes. All parent relationships are semantic groupings that are equally valid classifications. A leaf CWE can be reached by multiple valid topological paths, and it belongs to all of its parent nodes. It is therefore our design choice not to normalize each node by a factor of 1/number_of_its_parents. By our design, each relationship from a top-layer node to a leaf node is counted once, and the overall percentage may exceed 100% due to overlapping.

1000 - Research Concepts
- CWE-664 Improper Control of a Resource Through its Lifetime: 78 - 33.05084745762712%
- CWE-707 Improper Neutralization: 60 - 25.423728813559322%
- CWE-284 Improper Access Control: 51 - 21.610169491525422%
- CWE-710 Improper Adherence to Coding Standards: 26 - 11.016949152542374%
- CWE-693 Protection Mechanism Failure: 19 - 8.05084745762712%
- CWE-691 Insufficient Control Flow Management: 7 - 2.9661016949152543%
- CWE-697 Incorrect Comparison: 4 - 1.694915254237288%
- CWE-703 Improper Check or Handling of Exceptional Conditions: 4 - 1.694915254237288%
- CWE-682 Incorrect Calculation: 4 - 1.694915254237288%
- CWE-435 Improper Interaction Between Multiple Correctly-Behaving Entities: 3 - 1.271186440677966%

Occurrences (top 10):
- CWE-20: 42 / 236 = 17.80%
- CWE-400: 17 / 236 = 7.20%
- CWE-1104: 10 / 236 = 4.24%
- CWE-1395: 10 / 236 = 4.24%
- CWE-22: 9 / 236 = 3.81%
- CWE-284: 9 / 236 = 3.81%
- CWE-287: 7 / 236 = 2.97%
- CWE-269: 7 / 236 = 2.97%
- CWE-306: 5 / 236 = 2.12%
- CWE-79: 5 / 236 = 2.12%

# agentic_pr_cwes.parquet

all_pull_request: 932791 rows <br>
Number of records skipped because the `body` column is null: 8773 <br>
Number of non-English rows skipped: 44179 <br>
Number of records skipped because they exceed the token limit: 4524 <br>
Number of records skipped because they have no code_diff information: 28153 <br>
Percentage of skipped records for the whole dataset: <span style="color:red">9.18%</span> <br>
Number of non-skipped records in total: 847162 <br>
Number of skipped records in total: 85629 <br>

keyword_extraction_result: 91694 rows <br>
Extracted percentage: 91684 / (932791 - 8773 - 44179) = 10.42% <br>

Number of total security patches: 10001 <br>
Percentage of security patches: <span style="color:red">1.181%</span> <br>
7 records have a response of {true, []}, indicating a security patch whose CWE type is outside our considered scope.

One record can be related to one or more CWEs. We remove the ancestors from the CWE list and only count the deepest CWE. <br>
We normalize the count by 1/k: if one record has k CWEs after removing ancestors, we count each CWE as 1/k. <br>
Un-normalized ancestor-cleaned CWEs: 14158. <br>
On average, 1 security patch is related to 14158 / 10001 = ~1.416 CWEs. <br>

Since CWE is a Directed Acyclic Graph (DAG), there is no single canonical path from the top layer to the leaf nodes. All parent relationships are semantic groupings that are equally valid classifications. A leaf CWE can be reached by multiple valid topological paths, and it belongs to all of its parent nodes. It is therefore our design choice not to normalize each node by a factor of 1/number_of_its_parents. By our design, each relationship from a top-layer node to a leaf node is counted once, and the overall percentage may exceed 100% due to overlapping.
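
The counting rules described above (drop ancestors, weight each remaining CWE by 1/k, then count every leaf-to-pillar relationship once) can be sketched as follows. This is a minimal illustration of one reading of those rules, not the authors' code: the node names, the `PARENTS` mapping, and every function name are hypothetical, and the real parent relationships would come from `cwe_children_1000.parquet`, whose schema is not shown here.

```python
from collections import defaultdict

# Hypothetical toy hierarchy (child -> set of parents); not real CWE data.
PARENTS = {
    "leaf_a": {"mid_a"},
    "mid_a": {"pillar_1"},
    "leaf_b": {"pillar_1", "pillar_2"},  # reachable from two pillars
    "leaf_c": {"pillar_2"},
}
PILLARS = {"pillar_1", "pillar_2"}


def ancestors(node, parents):
    """Return every node reachable by following parent edges upward."""
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def deepest_only(labels, parents):
    """Drop each label that is an ancestor of another label in the record."""
    labels = set(labels)
    covered = set()
    for label in labels:
        covered |= ancestors(label, parents)
    return labels - covered


def weighted_counts(records, parents):
    """1/k normalization: a record with k kept labels adds 1/k to each."""
    counts = defaultdict(float)
    for labels in records:
        kept = deepest_only(labels, parents)
        for label in kept:
            counts[label] += 1 / len(kept)
    return dict(counts)


def pillar_counts(records, parents, pillars):
    """Count each kept-label-to-pillar relationship once, with no division
    by the number of parents, so pillar shares can sum to more than 100%."""
    counts = defaultdict(int)
    for labels in records:
        for label in deepest_only(labels, parents):
            for pillar in (ancestors(label, parents) | {label}) & pillars:
                counts[pillar] += 1
    return dict(counts)


# Three hypothetical records, each a list of CWE labels for one patch.
records = [["mid_a", "leaf_a"], ["leaf_a", "leaf_b"], ["leaf_c"]]
per_cwe = weighted_counts(records, PARENTS)
per_pillar = pillar_counts(records, PARENTS, PILLARS)
```

In this toy input, 4 labels survive ancestor removal, while the pillar counts add up to 5 because `leaf_b` sits under both pillars. Dividing each pillar count by 4 therefore gives shares that total 125%, which is the overlap effect the paragraph above describes.
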
1000 - Research Concepts
- CWE-284 Improper Access Control: 5158 - 36.43169939256957%
- CWE-707 Improper Neutralization: 4754 - 33.57818900974714%
- CWE-664 Improper Control of a Resource Through its Lifetime: 3314 - 23.40726091255827%
- CWE-693 Protection Mechanism Failure: 1780 - 12.57239723124735%
- CWE-710 Improper Adherence to Coding Standards: 787 - 5.558694730894194%
- CWE-703 Improper Check or Handling of Exceptional Conditions: 325 - 2.295521966379432%
- CWE-691 Insufficient Control Flow Management: 170 - 1.2007345670292415%
- CWE-682 Incorrect Calculation: 64 - 0.4520412487639497%
- CWE-697 Incorrect Comparison: 51 - 0.36022037010877245%
- CWE-435 Improper Interaction Between Multiple Correctly-Behaving Entities: 7 - 0.049442011583557%

Occurrences (top 10):
- CWE-20: 3494 / 14158 = 24.68%
- CWE-306: 1123 / 14158 = 7.93%
- CWE-862: 848 / 14158 = 5.88%
- CWE-79: 617 / 14158 = 4.36%
- CWE-287: 608 / 14158 = 4.29%
- CWE-285: 446 / 14158 = 3.15%
- CWE-352: 394 / 14158 = 2.78%
- CWE-798: 340 / 14158 = 2.40%
- CWE-200: 331 / 14158 = 2.34%
- CWE-22: 330 / 14158 = 2.33%

---

**Security Patch Ratio By Agents**
- Claude Code: 119 / 5137 = 2.32%
- Copilot: 740 / 50447 = 1.47%
- Devin: 398 / 29744 = 1.34%
- Cursor: 416 / 32941 = 1.26%
- Codex: 8328 / 814522 = 1.02%

**CWE Distribution By Agents**
- Codex: 193 types of CWEs
- Copilot: 126 types of CWEs
- Cursor: 88 types of CWEs
- Devin: 87 types of CWEs
- Claude Code: 63 types of CWEs

---

# cwe_children_1000.parquet

Part of the replication package. It contains all CWE information that we crawled starting from https://cwe.mitre.org/data/definitions/1000.html
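
A minimal sketch of loading one of the configs declared in the front matter, assuming the Hugging Face `datasets` library is installed and the repository is reachable. `CONFIG_SPLITS` and `load_split` are illustrative names that are not part of the dataset, and the column layout of each parquet file is not documented here, so inspect the returned object before relying on any field.

```python
# Repository id taken from the dataset page.
REPO_ID = "felixwangg/AIDev-CWE-Classification"

# Config name -> split names, copied from the YAML front matter.
CONFIG_SPLITS = {
    "human_pr_cwes": ["train"],
    "agentic_pr_cwes": ["train"],
    "cwe_data": ["train"],
    "english_count": ["human_pr_is_eng", "all_pr_is_eng"],
    "cwe_count_by_agent": ["Claude_Code", "Copilot", "Cursor", "Devin", "Codex"],
}


def load_split(config, split):
    """Load one split of one config; hypothetical helper for illustration."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config}/{split}")
    # Imported lazily so the mapping above is usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_split("cwe_count_by_agent", "Devin")` would request the split backed by `Devin_cwe_counts.parquet`, while an unlisted combination such as `("human_pr_cwes", "test")` raises `ValueError` before any download is attempted.
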
Provider:
felixwangg