├── .gitignore
├── taxonomy
│   ├── taxonomy.png
│   ├── taxonomy.xlsx
│   └── README.md
├── analyses
│   ├── taxonomy.xlsx
│   ├── datasets_labeling_summary.csv
│   └── study_taxonomy_analysis.ipynb
├── LICENSE
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
--------------------------------------------------------------------------------
/taxonomy/taxonomy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/taxonomy/taxonomy.png
--------------------------------------------------------------------------------
/analyses/taxonomy.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/analyses/taxonomy.xlsx
--------------------------------------------------------------------------------
/taxonomy/taxonomy.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/taxonomy/taxonomy.xlsx
--------------------------------------------------------------------------------
/taxonomy/README.md:
--------------------------------------------------------------------------------
1 | # LLM4SVD TAXONOMY 🗂️
2 |
3 | We categorize existing LLM4SVD approaches by detection task, input representation, system architecture, and technique. This taxonomy enables meaningful comparison and benchmarking across studies.
4 |
5 |
6 |
7 |
8 | ![Taxonomy](taxonomy.png)
--------------------------------------------------------------------------------
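The sketch below is purely illustrative and not part of the repository: it encodes one hypothetical study along the four taxonomy dimensions named above (detection task, input representation, system architecture, technique). All field values are invented; the authoritative per-study categorization lives in taxonomy/taxonomy.xlsx.

```python
# Illustrative only: a hypothetical study encoded along the four taxonomy
# dimensions from taxonomy/README.md. Values are placeholders, not taken
# from any surveyed paper; see taxonomy/taxonomy.xlsx for the real data.
example_study = {
    "title": "Hypothetical LLM4SVD study",
    "detection_task": "function-level binary classification",
    "input_representation": "raw source code",
    "system_architecture": "fine-tuned encoder-only model",
    "technique": "supervised fine-tuning",
}
print(example_study)
```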
/analyses/datasets_labeling_summary.csv:
--------------------------------------------------------------------------------
1 | Dataset,Labeling,Type
2 | SARD,Synthetic,Mixed
3 | Juliet C/C++,Synthetic,Synthetic
4 | Juliet Java,Synthetic,Synthetic
5 | VulDeePecker,Security Vendor,Mixed
6 | Draper VDISC,Tool,Mixed
7 | Devign,Developer,Real (Balanced)
8 | Big-Vul,Security Vendor,Real (Imbalanced)
9 | D2A,Tool,Real (Imbalanced)
10 | ReVeal,Developer,Real (Imbalanced)
11 | CVEfixes,Security Vendor,Real (Imbalanced)
12 | CrossVul,Security Vendor,Real (Balanced)
13 | SecurityEval,Synthetic,Mixed
14 | SVEN,Developer,Real (Balanced)
15 | DiverseVul,Developer,Real (Imbalanced)
16 | FormAI,Tool,Synthetic
17 | ReposVul,Tool,Real (Imbalanced)
18 | PrimeVul,Security Vendor,Real (Imbalanced)
19 | MegaVul,Security Vendor,Real (Imbalanced)
20 | CleanVul,Developer,Real (Balanced)
21 | PairVul,Security Vendor,Real (Balanced)
22 |
--------------------------------------------------------------------------------
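As a minimal, non-authoritative sketch (not part of the repository), the summary above can be tallied by labeling source and dataset type with Python's standard csv module; the file path assumes the repository layout shown in the tree at the top.

```python
# Minimal sketch for summarizing analyses/datasets_labeling_summary.csv.
# Assumes the script is run from the repository root.
import csv
from collections import Counter

with open("analyses/datasets_labeling_summary.csv", newline="") as f:
    rows = [row for row in csv.DictReader(f) if row.get("Dataset")]

print("By labeling source:", Counter(row["Labeling"] for row in rows))
print("By dataset type:   ", Counter(row["Type"] for row in rows))
```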
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 hs-esslingen-it-security
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-LLM4SVD 🌟🧠👩‍💻🔍
2 |
3 | This repository contains the artifacts from the systematic literature review (SLR) on LLM-based software vulnerability detection ("A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models").
4 | The SLR analyzes 263 studies published between January 2020 and November 2025 and provides a structured taxonomy of detection approaches, input representations, system architectures, techniques, and dataset usage.
5 |
6 |
7 | ## Table of Contents
8 |
9 | To support open science and reproducibility, we publicly release:
10 | - 📝 [Surveyed Papers](#papers): A curated list of surveyed papers. This list will be continuously updated to track the latest papers.
11 | - 🗂️ [Taxonomy](https://github.com/hs-esslingen-it-security/Awesome-LLM4SVD/tree/main/taxonomy): Taxonomy of LLM-based vulnerability detection studies along with the categorization of each surveyed paper.
12 | - 📝 [Selected Datasets](#datasets): A list of the most commonly used datasets in the surveyed studies with their download sources.
13 |
14 |
15 |
16 |
17 |
18 | For details, see our [preprint here](https://arxiv.org/abs/2507.22659):
19 |
20 | 📚 S. Kaniewski, F. Schmidt, M. Enzweiler, M. Menth, and T. Heer. 2025. *A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models*. arXiv:2507.22659.
21 | ```bibtex
22 | @misc{kaniewskiLLM4SVD2025,
23 | title={{A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models}},
24 | author={Kaniewski, Sabrina and Schmidt, Fabian and Enzweiler, Markus and Menth, Michael and Heer, Tobias},
25 | year={2025},
26 | eprint={2507.22659},
27 | archivePrefix={arXiv},
28 | primaryClass={cs.SE},
29 | url={https://arxiv.org/abs/2507.22659},
30 | }
31 | ```
32 |
33 |
34 |
35 |
36 | - 🤝 [Contribute to this repository](#contribution)
37 | - ⚖️ [License](#license)
38 |
39 |
40 |
41 |
42 | ----------------
43 | ----------------
44 |
45 | ## Papers
46 |
47 | > **Note:** Entries marked with ✨ are recent papers that are not yet covered in the SLR preprint. The latest preprint version covers all studies published up to November 2025.
48 |
49 |
50 | ### 2025
51 | - (11/2025) Leveraging Self-Paced Learning for Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2511.09212)] [[Code](https://figshare.com/s/bef3211194fc18fe375e)]
52 | - (11/2025) Specification-Guided Vulnerability Detection with Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2511.04014)] [[Code](https://github.com/zhuhaopku/VulInstruct-temp)]
53 | - (11/2025) Compressing Large Language Models for SQL Injection Detection: A Case Study on Deep Seek-Coder and Meta-llama-3-70b-instruct. **`FRUCT 2025`** [[Paper](https://ieeexplore.ieee.org/document/11239157)]
54 | - (11/2025) VulTrLM: LLM-assisted Vulnerability Detection via AST Decomposition and Comment Enhancement. **`EMSE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10664-025-10738-7)]
55 | - (11/2025) Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open and Industry Data. **`PROFES 2025`** [[Paper](https://arxiv.org/abs/2509.09313)] [[Code](https://github.com/CybersecurityLab-unibz/cross_domain_evaluation)]
56 | - (11/2025) Learning-based Models for Vulnerability Detection: An Extensive Study. **`EMSE 2025`** [[Paper](https://arxiv.org/abs/2408.07526)] [[Code](https://figshare.com/s/bde8e41890e8179fbe5f?file=41894784)]
57 | - (11/2025) A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making. **`EMNLP 2025`** [[Paper](https://aclanthology.org/2025.emnlp-main.1071/)]
58 | - (10/2025) Leveraging Intra- and Inter-References in Vulnerability Detection using Multi-Agent Collaboration Based on LLMs. **`Cluster Computing 2025`** [[Paper](https://link.springer.com/article/10.1007/s10586-025-05721-2)]
59 | - (10/2025) iCodeReviewer: Improving Secure Code Review with Mixture of Prompts. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.12186)]
60 | - (10/2025) Bridging Semantics & Structure for Software Vulnerability Detection using Hybrid Network Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.10321)] [[Code](https://zenodo.org/records/17259519)]
61 | - (10/2025) FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk. **`ESORICS 2025`** [[Paper](https://arxiv.org/abs/2506.19453)] [[Code](https://github.com/sajalhalder/FuncVul)]
62 | - (10/2025) On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.27675)]
63 | - (10/2025) Towards Explainable Vulnerability Detection With Large Language Models. **`TSE 2025`** [[Paper](https://arxiv.org/abs/2406.09701)]
64 | - (10/2025) MulVuln: Enhancing Pre-trained LMs with Shared and Language-Specific Knowledge for Multilingual Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.04397)]
65 | - (10/2025) Llama-Based Source Code Vulnerability Detection: Prompt Engineering vs Fine Tuning. **`ESORICS 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-032-07884-1_15)] [[Code](https://github.com/DynaSoumhaneOuchebara/Llama-based-vulnerability-detection)]
66 | - (10/2025) Real-VulLLM: An LLM Based Assessment Framework in the Wild. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.04056)]
67 | - (10/2025) Distilling Lightweight Language Models for C/C++ Vulnerabilities. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.06645)] [[Code](https://github.com/yangxiaoxuan123/FineSec_detect)]
68 | - (10/2025) A Zero-Shot Framework for Cross-Project Vulnerability Detection in Source Code. **`EMSE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10664-025-10749-4)] [[Code](https://github.com/Radowan98/ZSVulD)]
69 | - (10/2025) Sparse-MoE: Syntax-Aware Multi-view Mixture of Experts for Long-Sequence Software Vulnerability Detection. **`ADMA 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-981-95-3456-2_24)]
70 | - (09/2025) DeepVulHunter: Enhancing the Code Vulnerability Detection Capability of LLMs through Multi-Round Analysis. **`JIIS 2025`** [[Paper](https://link.springer.com/article/10.1007/s10844-025-00982-0)]
71 | - (09/2025) Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.12039)]
72 | - (09/2025) GPTVD: vulnerability detection and analysis method based on LLM’s chain of thoughts. **`ASE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10515-025-00550-4)] [[Code](https://github.com/chenyn273/GPTVD)]
73 | - (09/2025) An Advanced Detection Framework for Embedded System Vulnerabilities. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11153853)]
74 | - (09/2025) Utilizing Large Programming Language Models on Software Vulnerability Detection. **`ASYU 2025`** [[Paper](https://ieeexplore.ieee.org/document/11208282)]
75 | - (09/2025) MAVUL: Multi-Agent Vulnerability Detection via Contextual Reasoning and Interactive Refinement. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.00317)] [[Code](https://github.com/youpengl/MAVUL)]
76 | - (09/2025) Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2509.12629)] [[Code](https://github.com/sssszh/ELVul4LLM)]
77 | - (09/2025) VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2509.11523)]
78 | - (09/2025) PIONEER: Improving the Robustness of Student Models when Compressing Pre-Trained Models of Code. **`ASE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10515-025-00560-2)] [[Code](https://github.com/illsui1on/PIONEER)]
79 | - (08/2025) VulPr: A Prompt Learning-based Method for Vulnerability Detection. **`EIT 2025`** [[Paper](https://ieeexplore.ieee.org/document/11231886)]
80 | - (08/2025) MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning. **`IRI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11153184)]
81 | - (08/2025) Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/pdf/2508.04448)] [[Code](https://github.com/Damian0401/ProjectAnalyzer)]
82 | - (08/2025) Enhancing Fine-Grained Vulnerability Detection With Reinforcement Learning. **`TSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11145224)] [[Code](https://github.com/YuanJiangGit/RLFD)]
83 | - (08/2025) CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.11599)]
84 | - (08/2025) Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.21817)] [[Code](https://github.com/yikun-li/TitanVul-BenchVul)]
85 | - (08/2025) LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.16419)] [[Code](https://github.com/NoujoudNader/LLM-Bugs-Detection)]
86 | - (08/2025) Multimodal Fusion for Vulnerability Detection: Integrating Sequence and Graph-Based Analysis with LLM Augmentation. **`MAPR 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11133833)]
87 | - (08/2025) SAFE: A Novel Approach For Software Vulnerability Detection from Enhancing The Capability of Large Language Models. **`ASIACCS 2025`** [[Paper](https://arxiv.org/abs/2409.00882)]
88 | - (08/2025) Software Vulnerability Detection using Large Language Models. **`SecureComm 2025`** [[Paper](https://arxiv.org/abs/2410.00249)]
89 | - (08/2025) Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.16625)]
90 | - (08/2025) Think Broad, Act Narrow: CWE Identification with Multi-Agent Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.01451)] [[Code](https://zenodo.org/records/15871507)]
91 | - (08/2025) Improving Software Security Through a LLM-Based Vulnerability Detection Model. **`DEXA 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-032-02049-9_9)]
92 | - (07/2025) An Automatic Classification Model for Long Code Vulnerabilities Based on the Teacher-Student Framework. **`QRS 2025`** [[Paper](https://ieeexplore.ieee.org/document/11216609)]
93 | - (07/2025) LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. **`USENIX Security 2025`** [[Paper](https://arxiv.org/abs/2507.16585)] [[Code](https://github.com/qcri/llmxcpg)] [[Code](https://zenodo.org/records/15614095)]
94 | - (07/2025) CLeVeR: Multi-modal Contrastive Learning for Vulnerability Code Representation. **`ACL 2025`** [[Paper](https://aclanthology.org/2025.findings-acl.414/)] [[Code](https://github.com/yoimiya-nlp/CLeVeR)]
95 | - (07/2025) Revisiting Pre-trained Language Models for Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.16887)]
96 | - (07/2025) Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.03051)]
97 | - (07/2025) HgtJIT: Just-in-Time Vulnerability Detection Based on Heterogeneous Graph Transformer. **`TDSC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11072308)]
98 | - (07/2025) AI-Powered Vulnerability Detection in Code Using BERT-Based LLM with Transparency Measures. **`ITC-Egypt 2025`** [[Paper](https://ieeexplore.ieee.org/document/11186618)]
99 | - (07/2025) Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories. **`Unknown 2025`** [[Paper](https://arxiv.org/abs/2503.03586)]
100 | - (06/2025) VulnTeam: A Team Collaboration Framework for LLM-based Vulnerability Detection. **`IJCNN 2025`** [[Paper](https://ieeexplore.ieee.org/document/11229292)]
101 | - (06/2025) One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE). **`PACMSE 2025`** [[Paper](https://arxiv.org/abs/2501.16454)]
102 | - (06/2025) Improving Vulnerability Type Prediction and Line-Level Detection via Adversarial Training-based Data Augmentation and Multi-Task Learning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.23534)] [[Code](https://github.com/Karelye/EDAT-MLT)]
103 | - (06/2025) Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2406.11147)] [[Code](https://github.com/knowledgerag4llmvuld/knowledgerag4llmvuld)]
104 | - (06/2025) Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.10104)]
105 | - (06/2025) Evaluating LLaMA 3.2 for Software Vulnerability Detection. **`EICC 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-94855-8_3)]
106 | - (06/2025) How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2408.10495)] [[Code](https://github.com/jianian0318/LLMSecureCode)]
107 | - (06/2025) Detecting Code Vulnerabilities using LLMs. **`DSN 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11068842)] [[Code](https://github.com/a24167566/LLMs-Code-Vulnerability-Detection)]
108 | - (06/2025) LPASS: Linear Probes as Stepping Stones for Vulnerability Detection using Compressed LLMs. **`JISA 2025`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212625001620)]
109 | - (06/2025) Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.20444)]
110 | - (06/2025) CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2411.17274)] [[Code](https://github.com/yikun-li/CleanVul)]
111 | - (06/2025) Large Language Models for Multilingual Vulnerability Detection: How Far Are We?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.07503)] [[Code](https://github.com/SpanShu96/Large-Language-Model-for-Multilingual-Vulnerability-Detection/tree/main)]
112 | - (06/2025) Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End". **`PACMSE 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3715758)] [[Code](https://zenodo.org/records/14840519)]
113 | - (06/2025) LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2401.16185)] [[Code](https://anonymous.4open.science/r/LLM4Vuln/README.md)]
114 | - (06/2025) ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2408.16028)] [[Code](https://anonymous.4open.science/r/anvil)]
115 | - (06/2025) Line-level Semantic Structure Learning for Code Vulnerability Detection. **`Internetware 2025`** [[Paper](https://arxiv.org/abs/2407.18877)] [[Code](https://figshare.com/articles/dataset/CSLS_model_code_and_data/26391658)]
116 | - (06/2025) SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair. **`ISMM 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3735950.3735954)] [[Code](https://github.com/HuantWang/SecureMind)]
117 | - (06/2025) VuL-MCBERT: A Vulnerability Detection Method Based on Self-Supervised Contrastive Learning. **`CAIBDA 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11183103)]
118 | - (06/2025) Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.07390)] [[Code](https://github.com/Xin-Cheng-Wen/PO4Vul)]
119 | - (06/2025) Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs. **`PACMSE 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3728875)]
120 | - (06/2025) An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2401.16310)] [[Code](https://zenodo.org/records/15572151)]
121 | - (05/2025) SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.19828)] [[Code](https://github.com/basimbd/SecVulEval)]
122 | - (05/2025) AutoAdapt: On the Application of AutoML for Parameter-Efficient Fine-Tuning of Pre-Trained Code Models. **`TOSEM 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3734867)] [[Code](https://github.com/serval-uni-lu/AutoAdapt)]
123 | - (05/2025) Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11028308)]
124 | - (05/2025) LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code. **`ICSC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11140501)]
125 | - (05/2025) A Comparative Study of Machine Learning and Large Language Models for SQL and NoSQL Injection Vulnerability Detection. **`SIST 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11139190)]
126 | - (05/2025) Are Sparse Autoencoders Useful for Java Function Bug Detection?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.10375)]
127 | - (05/2025) ♪ With a Little Help from My (LLM) Friends: Enhancing Static Analysis with LLMs to Detect Software Vulnerabilities. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11028575)]
128 | - (05/2025) GraphCodeBERT-Augmented Graph Attention Networks for Code Vulnerability Detection. **`CAI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11050748)]
129 | - (05/2025) Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.15088)]
130 | - (05/2025) Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.10961)] [[Code](https://figshare.com/s/1514bc9a7aa64b46d94e)]
131 | - (05/2025) Adversarial Training for Robustness Enhancement in LLM-Based Code Vulnerability Detection. **`CISCE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11065803)]
132 | - (05/2025) Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.17460)]
133 | - (05/2025) An Automated Code Review Framework Based on BERT and Qianwen Large Model. **`CCAI 2025`** [[Paper](https://ieeexplore.ieee.org/document/11189422)]
134 | - (04/2025) A Software Vulnerability Detection Model Combined with Graph Simplification. **`AIBDF 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3718491.3718525)]
135 | - (04/2025) Human-Understandable Explanation for Software Vulnerability Prediction. **`JSS 2025`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0164121225001232)] [[Code](https://github.com/quy-ng/human-xai-software-vulnerability-prediction)]
136 | - (04/2025) Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.16584)] [[Code](https://huggingface.co/floxihunter/codegen-mono-CWEdetect)] [[Code](https://huggingface.co/datasets/floxihunter/synthetic_python_cwe)]
137 | - (04/2025) Vulnerability Detection with Code Language Models: How Far are We?. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11029911)] [[Code](https://github.com/DLVulDet/PrimeVul)]
138 | - (04/2025) Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.13474)] [[Code](https://anonymous.4open.science/r/CORRECT/README.md)]
139 | - (04/2025) IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. **`ICLR 2025`** [[Paper](https://arxiv.org/abs/2405.17238)] [[Code](https://github.com/iris-sast/iris)]
140 | - (04/2025) Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.13676)]
141 | - (04/2025) An Ensemble Transformer Approach with Cross-Attention for Automated Code Security Vulnerability Detection and Documentation. **`ISDFS 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11012039)]
142 | - (04/2025) Metamorphic-Based Many-Objective Distillation of LLMs for Code-Related Tasks. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/document/11029766)] [[Code](https://zenodo.org/records/14857610)]
143 | - (04/2025) XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection. **`The Journal of Supercomputing 2025`** [[Paper](https://link.springer.com/article/10.1007/s11227-025-07198-7)]
144 | - (04/2025) Leveraging Multi-Task Learning to Improve the Detection of SATD and Vulnerability. **`ICPC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11025930)] [[Code](https://github.com/moritzmock/multitask-vulberability-detection)]
145 | - (04/2025) Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11029760)] [[Code](https://figshare.com/articles/dataset/Closing_the_Gap_A_User_Study_on_the_Real-world_Usefulness_of_AI-powered_Vulnerability_Detection_Repair_in_the_IDE/26367139?file=52478936)]
146 | - (04/2025) R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.04699)] [[Code](https://github.com/martin-wey/R2Vul)]
147 | - (04/2025) Context-Enhanced Vulnerability Detection Based on Large Language Models. **`TOSEM 2025`** [[Paper](https://arxiv.org/abs/2504.16877)] [[Code](https://github.com/DoeSEResearch/PacVD)]
148 | - (04/2025) SSRFSeek: An LLM-based Static Analysis Framework for Detecting SSRF Vulnerabilities in PHP Applications. **`AINIT 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11035424)]
149 | - (03/2025) CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection. **`TASE 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-98208-8_15)] [[Code](https://github.com/CASTLE-Benchmark)]
150 | - (03/2025) SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection With LLMs?. **`TSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10910240)]
151 | - (03/2025) Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. **`ICST 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10988968)] [[Code](https://github.com/seal-research/secvul-llm-study/)]
152 | - (03/2025) Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis. **`ADIoT 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-85593-1_9)]
153 | - (03/2025) Steering Large Language Models for Vulnerability Detection. **`ICASSP 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10887736)]
154 | - (03/2025) HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.10793)]
155 | - (03/2025) You Only Train Once: A Flexible Training Framework for Code Vulnerability Detection Driven by Vul-Vector. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.10988)]
156 | - (03/2025) Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.01449)] [[Code](https://github.com/soarsmu/SVD-Bench)]
157 | - (03/2025) Reasoning with LLMs for Zero-Shot Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.17885)] [[Code](https://github.com/Erroristotle/VulnSage)]
158 | - (02/2025) EFVD: A Framework of Source Code Vulnerability Detection via Fusion of Enhanced Graph Representation Learning and Pre-trained Transformer-Based Model. **`CNSSE 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3732365.3732421)]
159 | - (02/2025) Fine-Tuning Transformer LLMs for Detecting SQL Injection and XSS Vulnerabilities. **`ICAIIC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10920868)]
160 | - (02/2025) Finetuning Large Language Models for Vulnerability Detection. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10908394)] [[Code](https://github.com/rmusab/vul-llm-finetune)]
161 | - (02/2025) Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10879492)]
162 | - (02/2025) Manual Prompt Engineering is Not Dead: A Case Study on Large Language Models for Code Vulnerability Detection with DSPy. **`CDMA 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10908746)]
163 | - (02/2025) AIDetectVul: Software Vulnerability Detection Method Based on Feature Fusion of Pre-trained Models. **`ICCECE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10985370)]
164 | - (01/2025) DMVL4AVD: A Deep Multi-View Learning Model for Automated Vulnerability Detection. **`Neural Comput. Appl. 2025`** [[Paper](https://link.springer.com/article/10.1007/s00521-024-10892-x)] [[Code](https://drive.google.com/file/d/1-qWqmRuBi8kRAAE2yiG6JNiY8vLYxXlz/view)]
165 | - (01/2025) Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.14841)]
166 | - (01/2025) CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2501.04510)]
167 | - (01/2025) Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.18260)] [[Code](https://github.com/SakiRinn/LLM4CVD)] [[Code](https://huggingface.co/datasets/xuefen/VulResource)]
168 | - (01/2025) To Err is Machine: Vulnerability Detection Challenges LLM Reasoning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2403.17218)] [[Code](https://figshare.com/articles/dataset/Data_Package_for_LLM_Vulnerability_Detection_Study/27368025)]
169 | - (01/2025) Streamlining Security Vulnerability Triage with Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2501.18908)] [[Code](https://zenodo.org/records/14776104)]
170 | - (01/2025) Sink Vulnerability Type Prediction Using Small Language Model (SLM). **`IC3ECSBHI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10991300)]
171 | - (01/2025) A Vulnerability Detection Framework Based on Graph Decomposition Fusion and Augmented Abstract Syntax Tree. **`BDICN 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3727353.3727471)]
172 |
173 | ### 2024
174 | - (12/2024) Vulnerability Detection in Popular Programming Languages with Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.15905)] [[Code](https://github.com/syafiq/llm_vd)]
175 | - (12/2024) On the Compression of Language Models for Code: An Empirical Study on CodeBERT. **`SANER 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10992473)] [[Code](https://zenodo.org/records/14357478)]
176 | - (12/2024) LLM-Based Approach for Buffer Overflow Detection in Source Code. **`CIT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11021816)]
177 | - (12/2024) A Source Code Vulnerability Detection Method Based on Positive-Unlabeled Learning. **`RICAI 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10911761)]
178 | - (12/2024) Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows. **`ICMLA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10903489)]
179 | - (12/2024) EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code. **`BigData 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10825609)]
180 | - (12/2024) Software Vulnerability Detection Using LLM: Does Additional Information Help?. **`ACSAC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10917361)] [[Code](https://github.com/research7485/vulnerability_detection)]
181 | - (12/2024) Enhancing Source Code Vulnerability Detection Using Flattened Code Graph Structures. **`ICFTIC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10913325)]
182 | - (12/2024) SQL Injection Vulnerability Detection Based on Pissa-Tuned Llama 3 Large Language Model. **`ICFTIC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10912886)]
183 | - (12/2024) A Method of SQL Injection Attack Detection Based on Large Language Models. **`CNTEIE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10987904)]
184 | - (12/2024) MVD: A Multi-Lingual Software Vulnerability Detection Framework. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.06166)] [[Code](https://figshare.com/s/10ec70108294a225f391)]
185 | - (12/2024) Python Source Code Vulnerability Detection Based on CodeBERT Language Model. **`ACAI 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10899694)]
186 | - (11/2024) RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?. **`EMNLP 2024`** [[Paper](https://arxiv.org/abs/2410.07573)]
187 | - (11/2024) StagedVulBERT: Multigranular Vulnerability Detection With a Novel Pretrained Code Model. **`TSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10746847)] [[Code](https://github.com/YuanJiangGit/StagedVulBERT)]
188 | - (11/2024) Applying Contrastive Learning to Code Vulnerability Type Classification. **`EMNLP 2024`** [[Paper](https://aclanthology.org/2024.emnlp-main.666/)]
189 | - (11/2024) Boosting Cybersecurity Vulnerability Scanning based on LLM-supported Static Application Security Testing. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.15735)]
190 | - (11/2024) Enhancing Vulnerability Detection Efficiency: An Exploration of Light-weight LLMs with Hybrid Code Features. **`JISA 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212624002278)] [[Code](https://github.com/JNL-28/Enhancing-Vulnerability-Detection-Efficiency)]
191 | - (11/2024) Research on the LLM-Driven Vulnerability Detection System Using LProtector. **`ICDSCA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10859408)]
192 | - (11/2024) Enhanced LLM-Based Framework for Predicting Null Pointer Dereference in Source Code. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.00216)]
193 | - (10/2024) Vulnerability Prediction using Pre-trained Models: An Empirical Evaluation. **`MASCOTS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10786510)] [[Code](https://sites.google.com/view/vpllm/)]
194 | - (10/2024) Fine-Tuning Pre-trained Model with Optimizable Prompt Learning for Code Vulnerability Detection. **`ISSRE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10771498)] [[Code](https://github.com/Exclusisve-V/PromptVulnerabilityDetection)]
195 | - (10/2024) Improving Long-Tail Vulnerability Detection Through Data Augmentation Based on Large Language Models. **`ICSME 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10795073)] [[Code](https://github.com/LuckyDengXiao/LERT)]
196 | - (10/2024) Exploring AI for Vulnerability Detection and Repair. **`CARS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10778769)]
197 | - (10/2024) DetectBERT: Code Vulnerability Detection. **`GCCIT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10862235)]
198 | - (10/2024) VULREM: Fine-Tuned BERT-Based Source-Code Potential Vulnerability Scanning System to Mitigate Attacks in Web Applications. **`Applied Sciences 2024`** [[Paper](https://www.mdpi.com/2076-3417/14/21/9697)]
199 | - (10/2024) A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality. **`TPS-ISA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10835695)] [[Code](https://figshare.com/articles/dataset/Reproduction_package_for_paper_A_Qualitative_Study_on_Using_ChatGPT_for_Software_Security_Perception_vs_Practicality_/24452365?file=48008890)]
200 | - (10/2024) Vul-LMGNNs: Fusing Language Models and Online-distilled Graph Neural Networks for Code Vulnerability Detection. **`Information Fusion 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S1566253524005268)] [[Code](https://github.com/Vul-LMGNN/vul-LMGGNN)]
201 | - (10/2024) SecureQwen: Leveraging LLMs for Vulnerability Detection in Python Codebases. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824004565)]
202 | - (10/2024) VulnerAI: GPT Based Web Application Vulnerability Detection. **`ICAMAC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10828788)]
203 | - (10/2024) DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection. **`JSS 2024`** [[Code](https://github.com/Yang-Yanjing/DLAP)]
204 | - (10/2024) Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability. **`TSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10706805)] [[Code](https://github.com/vinci-grape/VulEmpirical)]
205 | - (10/2024) Detecting Source Code Vulnerabilities Using Fine-Tuned Pre-Trained LLMs. **`ICSP 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10846595)]
206 | - (10/2024) A Source Code Vulnerability Detection Method Based on Adaptive Graph Neural Networks. **`ASE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10765114)]
207 | - (09/2024) Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection. **`ESORICS 2024`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-70879-4_14)]
208 | - (09/2024) Navigating (In)Security of AI-Generated Code. **`CSR 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10679468)]
209 | - (09/2024) Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code. **`ISSTA 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3650212.3652127)] [[Code](https://anonymous.4open.science/r/EXPO/README.md)]
210 | - (09/2024) Can a Llama Be a Watchdog? Exploring Llama 3 and Code Llama for Static Application Security Testing. **`CSR 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10679444)]
211 | - (09/2024) May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding. **`Information 2024`** [[Paper](https://www.mdpi.com/2078-2489/15/9/572)]
212 | - (09/2024) Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.00571)]
213 | - (09/2024) Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.10490)]
214 | - (09/2024) SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection. **`ISSTA 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3650212.3652124)] [[Code](https://github.com/Xin-Cheng-Wen/Comment4Vul)]
215 | - (09/2024) Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.01001)] [[Code](https://figshare.com/s/5da14b0776750c6fa787)]
216 | - (09/2024) VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.10756)]
217 | - (08/2024) VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2406.07595)] [[Code](https://github.com/Sweetaroo/VulDetectBench)]
218 | - (08/2024) Defect-Scanner: A Comparative Empirical Study on Language Model and Deep Learning Approach for Software Vulnerability Detection. **`IJIS 2024`** [[Paper](https://link.springer.com/article/10.1007/s10207-024-00901-4)]
219 | - (08/2024) From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.02329)]
220 | - (08/2024) Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.06428)]
221 | - (08/2024) Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. **`ACL 2024`** [[Paper](https://arxiv.org/abs/2406.03718)] [[Code](https://github.com/CGCL-codes/VulLLM)]
222 | - (08/2024) Unintentional Security Flaws in Code: Automated Defense via Root Cause Analysis. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.00199)] [[Code](https://anonymous.4open.science/r/Threat_Detection_Modeling-BB7B/README.md)]
223 | - (08/2024) Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection. **`USENIX Security 2024`** [[Paper](https://www.usenix.org/conference/usenixsecurity24/presentation/risse)] [[Code](https://github.com/niklasrisse/USENIX_2024)] [[Code](https://github.com/niklasrisse/VPP)]
224 | - (08/2024) VulSim: Leveraging Similarity of Multi-Dimensional Neighbor Embeddings for Vulnerability Detection. **`USENIX Security 2024`** [[Paper](https://www.usenix.org/conference/usenixsecurity24/presentation/shimmi)] [[Code](https://github.com/SamihaShimmi/VulSim)]
225 | - (07/2024) Enhancing Software Code Vulnerability Detection Using GPT-4o and Claude-3.5 Sonnet: A Study on Prompt Engineering Techniques. **`Electronics 2024`** [[Paper](https://www.mdpi.com/2079-9292/13/13/2657)]
226 | - (07/2024) MultiVD: A Transformer-based Multitask Approach for Software Vulnerability Detection. **`SECRYPT 2024`** [[Paper](https://www.scitepress.org/Papers/2024/127194/127194.pdf)]
227 | - (07/2024) DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection. **`Internetware 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3671016.3671388)] [[Code](https://github.com/GCVulnerability/DFEPT)]
228 | - (07/2024) Vulnerability Classification on Source Code Using Text Mining and Deep Learning Techniques. **`QRS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10727022)] [[Code](https://sites.google.com/view/vulnerabilityclassification/)]
229 | - (07/2024) Exploration On Prompting LLM With Code-Specific Information For Vulnerability Detection. **`SSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10664399)]
230 | - (07/2024) Effectiveness of ChatGPT for Static Analysis: How Far Are We?. **`AIware 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3664646.3664777)] [[Code](https://zenodo.org/records/10828316)]
231 | - (07/2024) Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.00197)]
232 | - (07/2024) M2CVD: Enhancing Vulnerability Understanding through Multi-Model Collaboration for Code Vulnerability Detection. **`TOSEM 2024`** [[Paper](https://arxiv.org/abs/2406.05940)] [[Code](https://github.com/HotFrom/M2CVD)]
233 | - (07/2024) SCL-CVD: Supervised Contrastive Learning for Code Vulnerability Detection via GraphCodeBERT. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824002992)]
234 | - (07/2024) Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2407.16235)]
235 | - (06/2024) Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT. **`EASE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3661167.3661281)] [[Code](https://github.com/lhmtriet/LLM4Vul)]
236 | - (06/2024) Greening Large Language Models of Code. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3639475.3640097)] [[Code](https://github.com/soarsmu/Avatar)]
237 | - (06/2024) Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2406.05892)] [[Code](https://zenodo.org/records/11403208)]
238 | - (06/2024) Evaluating the Impact of Conventional Code Analysis Against Large Language Models in API Vulnerability Detection. **`EICC 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3655693.3655701)]
239 | - (06/2024) SVulDetector: Vulnerability Detection based on Similarity using Tree-based Attention and Weighted Graph Embedding Mechanisms. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824002335)] [[Code](https://figshare.com/s/426156a96a83da1d38d0)]
240 | - (05/2024) DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection. **`IEEE Access 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10517582)]
241 | - (05/2024) LLM-CloudSec: Large Language Model Empowered Automatic and Deep Vulnerability Analysis for Intelligent Clouds. **`INFOCOM 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10620804)] [[Code](https://github.com/DPCa0/LLM-CloudSec)]
242 | - (05/2024) LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. **`SP 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10646663/)] [[Code](https://github.com/ai4cloudops/SecLLMHolmes)]
243 | - (05/2024) VulD-CodeBERT: CodeBERT-Based Vulnerability Detection Model for C/C++ Code. **`CISCE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10653337)]
244 | - (05/2024) Large Language Model for Vulnerability Detection: Emerging Results and Future Directions. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3639476.3639762)] [[Code](https://github.com/soarsmu/ChatGPT-VulDetection)]
245 | - (04/2024) VulnGPT: Enhancing Source Code Vulnerability Detection Using AutoGPT and Adaptive Supervision Strategies. **`DCOSS-IoT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10621527)]
246 | - (04/2024) BiT5: A Bidirectional NLP Approach for Advanced Vulnerability Detection in Codebase. **`Procedia Computer Science 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S1877050924006306)]
247 | - (04/2024) Software Vulnerability and Functionality Assessment using Large Language Models. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3643787.3648036)]
248 | - (04/2024) Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks. **`ICSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10548173)] [[Code](https://zenodo.org/records/10140638)]
249 | - (04/2024) Towards Causal Deep Learning for Vulnerability Detection. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3597503.3639170)] [[Code](https://figshare.com/s/0ffda320dcb96c249ef2?file=41801019)]
250 | - (04/2024) ProRLearn: Boosting Prompt Tuning-based Vulnerability Detection by Reinforcement Learning. **`ASE 2024`** [[Paper](https://link.springer.com/article/10.1007/s10515-024-00438-9)] [[Code](https://github.com/ProRLearn/ProRLearn001)]
251 | - (04/2024) VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2404.15596)]
252 | - (03/2024) Python Source Code Vulnerability Detection with Named Entity Recognition. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824001032)] [[Code](https://github.com/mmeberg/PyVulDet-NER)]
253 | - (03/2024) GRACE: Empowering LLM-based Software Vulnerability Detection with Graph Structure and In-Context Learning. **`JSS 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0164121224000748)] [[Code](https://github.com/P-E-Vul/GRACE)]
254 | - (03/2024) Learning Defect Prediction from Unrealistic Data. **`SANER 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10589866)] [[Code](https://zenodo.org/records/10514652)]
255 | - (03/2024) Making Vulnerability Prediction more Practical: Prediction, Categorization, and Localization. **`IST 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0950584924000636)] [[Code](https://github.com/liucyy/VulPCL)]
256 | - (02/2024) A Preliminary Study on Using Large Language Models in Software Pentesting. **`NDSS 2024`** [[Paper](https://arxiv.org/abs/2401.17459)]
257 | - (02/2024) TRACED: Execution-aware Pre-training for Source Code. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3597503.3608140)] [[Code](https://github.com/ARiSE-Lab/TRACED_ICSE_24)]
258 | - (02/2024) LLbezpeky: Leveraging Large Language Models for Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2401.01269)]
259 | - (02/2024) Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2402.17230)]
260 | - (01/2024) Your Instructions Are Not Always Helpful: Assessing the Efficacy of Instruction Fine-tuning for Software Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2401.07466)]
261 |
262 | ### 2023
263 | - (12/2023) Joint Geometrical and Statistical Domain Adaptation for Cross-domain Code Vulnerability Detection. **`EMNLP 2023`** [[Paper](https://aclanthology.org/2023.emnlp-main.788/)]
264 | - (12/2023) ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We?. **`APSEC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10479409)] [[Code](https://github.com/awsm-research/ChatGPT4Vul)]
265 | - (12/2023) Code Defect Detection Method Based on BERT and Ensemble. **`ICCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10507306)]
266 | - (12/2023) Assessing the Effectiveness of Vulnerability Detection via Prompt Tuning: An Empirical Study. **`APSEC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10479384)] [[Code](https://github.com/P-E-Vul/prompt-empircial-vulnerability)]
267 | - (12/2023) Enhancing Code Security Through Open-source Large Language Models: A Comparative Study. **`FPS 2023`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-57537-2_15)]
268 | - (12/2023) Optimizing Pre-trained Language Models for Efficient Vulnerability Detection in Code Snippets. **`ICCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10507456)]
269 | - (12/2023) Exploring the Limits of ChatGPT in Software Security Applications. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2312.05275)]
270 | - (11/2023) How To Get Better Embeddings with Code Pre-trained Models? An Empirical Study. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2311.08066)]
271 | - (11/2023) AIBugHunter: A Practical Tool for Predicting, Classifying and Repairing Software Vulnerabilities. **`EMSE 2023`** [[Paper](https://link.springer.com/article/10.1007/s10664-023-10346-3)] [[Code](https://github.com/awsm-research/AIBugHunter)]
272 | - (11/2023) The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification. **`ESEC/FSE 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3611643.3616304)] [[Code](https://zenodo.org/records/10499843)]
273 | - (11/2023) Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation. **`ESEC/FSE 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3611643.3616358)] [[Code](https://github.com/jacknichao/SVulD)]
274 | - (11/2023) Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2311.04109)] [[Code](https://figshare.com/s/4a16a528d6874aad51a0)]
275 | - (11/2023) Software Vulnerabilities Detection Based on a Pre-trained Language Model. **`TrustCom 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10538979)]
276 | - (10/2023) DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. **`RAID 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3607199.3607242)] [[Code](https://github.com/wagner-group/diversevul)]
277 | - (10/2023) PTLVD: Program Slicing and Transformer-based Line-level Vulnerability Detection System. **`SCAM 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10356694)] [[Code](https://github.com/chenshixu/PTLVD)]
278 | - (10/2023) Software Vulnerability Detection using Large Language Models. **`ISSRE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10301302)]
279 | - (10/2023) Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2310.16263)]
280 | - (09/2023) Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge. **`ASE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10298584)] [[Code](https://github.com/jacknichao/MVulD)]
281 | - (09/2023) DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2309.15324)] [[Code](https://github.com/WJ-8/DefectHunter)]
282 | - (09/2023) When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection. **`ASE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10298363)] [[Code](https://github.com/PILOT-VD-2023/PILOT)]
283 | - (08/2023) Using ChatGPT as a Static Application Security Testing Tool. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2308.14434)] [[Code](https://github.com/abakhshandeh/ChatGPTasSAST)]
284 | - (08/2023) VulExplainer: A Transformer-Based Hierarchical Distillation for Explaining Vulnerability Types. **`TSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10220166)] [[Code](https://github.com/awsm-research/VulExplainer)]
285 | - (08/2023) Software Vulnerability Detection with GPT and In-Context Learning. **`DSC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10381286)]
286 | - (08/2023) Can Large Language Models Find And Fix Vulnerable Software?. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2308.10345)]
287 | - (07/2023) Leveraging Deep Learning Models for Cross-function Null Pointer Risks Detection. **`AITest 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10229470)]
288 | - (07/2023) An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph. **`EuroS&P 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10190505)] [[Code](https://github.com/pial08/SemVulDet)]
289 | - (07/2023) VulDetect: A novel technique for detecting software vulnerabilities using Language Models. **`CSR 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10224924)]
290 | - (07/2023) An Enhanced Vulnerability Detection in Software Using a Heterogeneous Encoding Ensemble. **`ISCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10217978)]
291 | - (06/2023) New Tricks to Old Codes: Can AI Chatbots Replace Static Code Analysis Tools?. **`EICC 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3590777.3590780)] [[Code](https://github.com/New-Tricks-to-Old-Codes/Replace-Static-Analysis-Tools)]
292 | - (06/2023) Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code. **`TSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10153647)] [[Code](https://zenodo.org/records/7123322)]
293 | - (05/2023) An Empirical Study of Deep Learning Models for Vulnerability Detection. **`ICSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10172583)] [[Code](https://figshare.com/articles/dataset/An_Empirical_Study_of_Deep_Learning_Models_for_Vulnerability_Detection/20791240?file=39183863)]
294 | - (05/2023) Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2306.01754)]
295 | - (05/2023) Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models. **`ICSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10172346)] [[Code](https://github.com/ReliableCoding/REPEAT)]
296 | - (05/2023) Detecting Vulnerabilities in IoT Software: New Hybrid Model and Comprehensive Data Analysis. **`JISA 2023`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212623000510)]
297 | - (05/2023) VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software Vulnerabilities Using Language Models. **`JEEIT 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10185860)]
298 | - (04/2023) Evaluation of ChatGPT Model for Vulnerability Detection. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2304.07232)]
299 |
300 | ### 2022
301 | - (12/2022) BBVD: A BERT-based Method for Vulnerability Detection. **`IJACSA 2022`** [[Paper](https://www.proquest.com/docview/2770373789?pq-origsite=gscholar&fromopenview=true&sourcetype=Scholarly%20Journals)]
302 | - (12/2022) Exploring Transformers for Multi-Label Classification of Java Vulnerabilities. **`QRS 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10062434)] [[Code](https://github.com/TQRG/VDET-for-Java)]
303 | - (12/2022) Transformer-Based Language Models for Software Vulnerability Detection. **`ACSAC 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3564625.3567985)] [[Code](https://bitbucket.csiro.au/users/jan087/repos/acsac-2022-submission/browse)]
304 | - (12/2022) PATVD: Vulnerability Detection Based on Pre-training Techniques and Adversarial Training. **`SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10189687/)]
305 | - (11/2022) Multi-view Pre-trained Model for Code Vulnerability Identification. **`WASA 2022`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-19211-1_11)]
306 | - (11/2022) Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection. **`Mathematics 2022`** [[Paper](https://www.mdpi.com/2227-7390/10/23/4482)]
307 | - (11/2022) BERT-Based Vulnerability Type Identification with Effective Program Representation. **`WASA 2022`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-19208-1_23#citeas)]
308 | - (10/2022) VulDeBERT: A Vulnerability Detection System Using BERT. **`ISSRE 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9985089)] [[Code](https://github.com/SKKU-SecLab/VulDeBERT)]
309 | - (07/2022) VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection. **`IJCNN 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9892280)] [[Code](https://github.com/ICL-ml4csec/VulBERTa)]
310 | - (06/2022) Cyber Security Vulnerability Detection Using Natural Language Processing. **`AIIoT 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9817336)]
311 | - (05/2022) LineVul: A Transformer-based Line-level Vulnerability Prediction. **`MSR 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3524842.3528452)] [[Code](https://github.com/awsm-research/LineVul)]
312 | - (05/2022) LineVD: Statement-level Vulnerability Detection using Graph Neural Networks. **`MSR 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3524842.3527949)] [[Code](https://github.com/davidhin/linevd)]
313 | - (03/2022) Intelligent Detection of Vulnerable Functions in Software through Neural Embedding-based Code Analysis. **`IJNM 2022`** [[Paper](https://onlinelibrary.wiley.com/doi/full/10.1002/nem.2198)] [[Code](https://cybercodeintelligence.github.io/CyberCI/)]
314 | - (01/2022) Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization. **`Security and Communication Networks 2022`** [[Paper](https://onlinelibrary.wiley.com/doi/full/10.1155/2022/5203217)] [[Code](https://cybercodeintelligence.github.io/CyberCI/)]
315 |
316 | ### 2021
317 | - (12/2021) Automated Software Vulnerability Detection via Pre-trained Context Encoder and Self Attention. **`ICDF2C 2021`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-06365-7_15)]
318 | - (11/2021) Detecting Integer Overflow Errors in Java Source Code via Machine Learning. **`ICTAI 2021`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9643278)]
319 | - (06/2021) Unified Pre-training for Program Understanding and Generation. **`NAACL 2021`** [[Paper](https://par.nsf.gov/servlets/purl/10336701)] [[Code](https://github.com/wasiahmad/PLBART)]
320 | - (05/2021) Security Vulnerability Detection Using Deep Learning Natural Language Processing. **`INFOCOM 2021`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9484500)]
321 |
322 | ### 2020
323 | - (06/2020) Exploring Software Naturalness through Neural Language Models. **`arXiv 2020`** [[Paper](https://arxiv.org/abs/2006.12641)]
324 |
325 |
326 | ## Datasets
327 |
328 | - SARD. [[Repo](https://samate.nist.gov/SARD)]
329 | - Juliet C/C++. [[Repo](https://samate.nist.gov/SARD/test-suites/112)]
330 | - Juliet Java. [[Repo](https://samate.nist.gov/SARD/test-suites/111)]
331 | - VulDeePecker. **`NDSS`** [[Paper](https://www.ndss-symposium.org/wp-content/uploads/2018/02/ndss2018_03A-2_Li_paper.pdf)] [[Repo](https://github.com/CGCL-codes/VulDeePecker)]
332 | - Draper. **`ICMLA`** [[Paper](https://ieeexplore.ieee.org/document/8614145)] [[Repo](https://osf.io/d45bw/)]
333 | - Devign. **`NeurIPS`** [[Paper](https://proceedings.neurips.cc/paper_files/paper/2019/hash/49265d2447bc3bbfe9e76306ce40a31f-Abstract.html)] [[Repo](https://github.com/epicosy/devign)]
334 | - Big-Vul. **`MSR`** [[Paper](https://dl.acm.org/doi/10.1145/3379597.3387501)] [[Repo](https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset)]
335 | - D2A. **`ICSE-SEIP`** [[Paper](https://ieeexplore.ieee.org/document/9402126)] [[Repo](https://github.com/IBM/D2A)]
336 | - ReVeal. **`TSE`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9448435)] [[Repo](https://github.com/VulDetProject/ReVeal)]
337 | - CVEfixes. **`PROMISE`** [[Paper](https://dl.acm.org/doi/10.1145/3475960.3475985)] [[Repo](https://zenodo.org/records/13118970)]
338 | - CrossVul. **`ESEC/FSE`** [[Paper](https://dl.acm.org/doi/10.1145/3468264.3473122)] [[Repo](https://zenodo.org/records/4734050)]
339 | - SecurityEval. **`MSR4P&S`** [[Paper](https://dl.acm.org/doi/10.1145/3549035.3561184)] [[Repo](https://github.com/s2e-lab/SecurityEval)]
340 | - DiverseVul. **`RAID`** [[Paper](https://dl.acm.org/doi/10.1145/3607199.3607242)] [[Repo](https://github.com/wagner-group/diversevul)]
341 | - SVEN. **`CCS`** [[Paper](https://dl.acm.org/doi/10.1145/3576915.3623175)] [[Repo](https://github.com/eth-sri/sven)]
342 | - FormAI. **`PROMISE`** [[Paper](https://dl.acm.org/doi/10.1145/3617555.3617874)] [[Repo](https://github.com/FormAI-Dataset/FormAI-dataset)]
343 | - ReposVul. **`ICSE-Companion`** [[Paper](https://dl.acm.org/doi/10.1145/3639478.3647634)] [[Repo](https://github.com/Eshe0922/ReposVul)]
344 | - PrimeVul. **`arXiv`** [[Paper](https://arxiv.org/abs/2403.18624)] [[Repo](https://github.com/DLVulDet/PrimeVul)]
345 | - PairVul. **`arXiv`** [[Paper](https://arxiv.org/abs/2406.11147)] [[Repo](https://github.com/KnowledgeRAG4LLMVulD/KnowledgeRAG4LLMVulD/tree/main/dataset)]
346 | - MegaVul. **`MSR`** [[Paper](https://dl.acm.org/doi/10.1145/3643991.3644886)] [[Repo](https://github.com/Icyrockton/MegaVul)]
347 | - CleanVul. **`arXiv`** [[Paper](https://arxiv.org/abs/2411.17274)] [[Repo](https://github.com/yikun-li/CleanVul)]
348 |
349 |
350 |
351 | ## Contribution
352 |
353 | If you want to suggest additions to the list of studies or datasets, please open a pull request or submit an issue. New entries should follow the existing list format; a template is shown below.
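
A new study entry should match the pattern already used in the lists above. The snippet below is only a template with placeholder values, not a real publication:

```markdown
- (MM/YYYY) Paper Title. **`VENUE YEAR`** [[Paper](https://link-to-paper)] [[Code](https://link-to-code)]
```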
354 |
355 |
356 | ## License
357 |
358 | - 🧠 Code & scripts (`*.py`, `*.ipynb`, etc.): Licensed under the [MIT License](LICENSE).
359 | - 📚 Taxonomy, markdown outputs and lists: Licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
360 |
361 | Please cite our paper if you use this resource.
362 |
--------------------------------------------------------------------------------
/analyses/study_taxonomy_analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "85d7fa6a",
6 | "metadata": {},
7 | "source": [
8 | "## Insights into Taxonomy"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "5a5bf3ec",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import pandas as pd\n",
19 | "import seaborn as sns\n",
20 | "import numpy as np\n",
21 | "import matplotlib\n",
22 | "import matplotlib.pyplot as plt\n",
23 | "import matplotlib.gridspec as gridspec\n",
24 | "import matplotlib.cm as cm\n",
25 | "import plotly.colors as pc\n",
26 | "import plotly.express as px\n",
27 | "import plotly.graph_objects as go\n",
28 | "import plotly.colors as pc\n",
29 | "import plotly.io as pio\n",
30 | "from plotly.subplots import make_subplots\n",
31 | "import json\n",
32 | "import os\n",
33 | "import re\n",
34 | "import h5py\n",
35 | "pio.renderers.default = \"vscode\"\n",
36 | "\n",
37 | "from matplotlib.colors import LinearSegmentedColormap\n",
38 | "from matplotlib.patches import Rectangle\n",
39 | "from matplotlib.patches import Patch\n",
40 | "from matplotlib.lines import Line2D\n",
41 | "from matplotlib.ticker import MultipleLocator\n",
42 | "from matplotlib.ticker import AutoMinorLocator\n",
43 | "from mpl_toolkits.axes_grid1.inset_locator import inset_axes, mark_inset\n",
44 | "from collections import defaultdict\n",
45 | "from collections import Counter\n",
46 | "from pathlib import Path\n",
47 | "\n",
48 | "plt.rcParams[\"font.family\"] = \"serif\"\n",
49 | "plt.rcParams[\"font.serif\"] = [\"Times New Roman\"]\n",
50 | "plt.rcParams[\"mathtext.fontset\"] = \"dejavuserif\" \n",
51 | "\n",
52 | "sns.set_theme(style=\"white\")\n",
53 | "pd.set_option('display.max_rows', None)\n",
54 | "pd.set_option('display.max_columns', None)\n",
55 | "pd.set_option('display.width', None) # Prevents wrapping\n",
56 | "pd.set_option('display.max_colwidth', None) # Shows full content in each cell"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 2,
62 | "id": "daf22d40",
63 | "metadata": {},
64 | "outputs": [
    [Plotly Sankey figure (display_data output) omitted: task-formulation flows from Classification (F1) nodes — Binary (F1.1) {195}, Multi-Class (F1.2) {61}, Multi-Label (F1.3) {23}, Vulnerability-Specific (F1.1.1) {20} — to Generation (F2) nodes — Description (F2.1) {21}, Reasoning (F2.2) {40}, Report (F2.3) {23}, Classification Only {190}; rendered at 1000x500 px.]
960 | ],
961 | "source": [
962 | "# sankey task formulation\n",
963 | "# ==========================================\n",
964 | "taxonomy_task_df = pd.read_excel(\"./taxonomy.xlsx\", sheet_name=\"STUDY_TASK\")\n",
965 | "df = taxonomy_task_df[['CitationKey', 'Classification', 'Generation']].copy()\n",
966 | "\n",
967 | "df['Classification'] = df['Classification'].fillna('No Classification')\n",
968 | "df['Generation'] = df['Generation'].fillna('Classification Only') \n",
969 | "\n",
970 | "df = df.replace('None', 'No Classification')\n",
971 | "df['Generation'] = df['Generation'].replace('No Classification', 'Classification Only')\n",
972 | "df = df.replace('nan', 'No Classification')\n",
973 | "\n",
974 | "# Explode lists\n",
975 | "for col in ['Classification', 'Generation']:\n",
976 | " df[col] = df[col].astype(str).str.split(',')\n",
977 | " df = df.explode(col)\n",
978 | " df[col] = df[col].str.strip()\n",
979 | "\n",
980 | "df = df[df['Classification'] != '']\n",
981 | "df = df[df['Generation'] != '']\n",
982 | "\n",
983 | "# Calculate Weights\n",
984 | "df['Class_Count'] = df.groupby('CitationKey')['Classification'].transform('count')\n",
985 | "df['Gen_Count'] = df.groupby('CitationKey')['Generation'].transform('count')\n",
986 | "df['Weight'] = 1 / (df['Class_Count'] * df['Gen_Count'])\n",
987 | "\n",
988 | "# Create Edges\n",
989 | "edges = df.groupby(['Classification', 'Generation'])['Weight'].sum().reset_index(name='Value')\n",
990 | "edges = edges.rename(columns={'Classification': 'Source', 'Generation': 'Target'})\n",
991 | "\n",
992 | "# Define Node Properties\n",
993 | "all_labels = pd.unique(edges[['Source', 'Target']].values.ravel())\n",
994 | "nodes_df = pd.DataFrame({'Label': all_labels})\n",
995 | "nodes_df['ID'] = nodes_df.index\n",
996 | "label_to_id = dict(zip(nodes_df['Label'], nodes_df['ID']))\n",
997 | "\n",
998 | "edges['SourceID'] = edges['Source'].map(label_to_id)\n",
999 | "edges['TargetID'] = edges['Target'].map(label_to_id)\n",
1000 | "\n",
1001 | "# ==========================================\n",
1002 | "residual_labels = ['Classification Only', 'Generation Only', 'No Classification']\n",
1003 | "palette = pc.qualitative.Pastel \n",
1004 | "\n",
1005 | "node_colors = []\n",
1006 | "link_colors = []\n",
1007 | "\n",
1008 | "# Assign Node Colors\n",
1009 | "for idx, row in nodes_df.iterrows():\n",
1010 | " if row['Label'] in residual_labels:\n",
1011 | " # Keep residuals gray\n",
1012 | " node_colors.append('rgba(200, 200, 200, 0.5)') \n",
1013 | " else:\n",
1014 | " # Assign color from Pastel palette\n",
1015 | " color_idx = idx % len(palette)\n",
1016 | " node_colors.append(palette[color_idx])\n",
1017 | "\n",
1018 | "# Assign Link Colors\n",
1019 | "for idx, row in edges.iterrows():\n",
1020 | " source_label = row['Source']\n",
1021 | " target_label = row['Target']\n",
1022 | " \n",
1023 | " if source_label in residual_labels or target_label in residual_labels:\n",
1024 | " link_colors.append('rgba(220, 220, 220, 0.5)')\n",
1025 | " else:\n",
1026 | " source_id = label_to_id[source_label]\n",
1027 | " base_color = node_colors[source_id]\n",
1028 | " if base_color.startswith('#'):\n",
1029 | " h = base_color.lstrip('#')\n",
1030 | " rgb = tuple(int(h[i:i+2], 16) for i in (0, 2, 4))\n",
1031 | " link_colors.append(f'rgba({rgb[0]}, {rgb[1]}, {rgb[2]}, 0.6)')\n",
1032 | " else:\n",
1033 | " link_colors.append(base_color)\n",
1034 | "\n",
1035 | "# ==========================================\n",
1036 | "# counts\n",
1037 | "label_to_studies = defaultdict(set)\n",
1038 | "for idx, row in taxonomy_task_df.iterrows():\n",
1039 | " val_c = str(row['Classification'])\n",
1040 | " if val_c != 'None' and val_c != 'nan':\n",
1041 | " for tag in val_c.split(','):\n",
1042 | " label_to_studies[tag.strip()].add(row['CitationKey'])\n",
1043 | " else:\n",
1044 | " label_to_studies['No Classification'].add(row['CitationKey'])\n",
1045 | "\n",
1046 | " val_g = str(row['Generation'])\n",
1047 | " if val_g != 'None' and val_g != 'nan':\n",
1048 | " for tag in val_g.split(','):\n",
1049 | " label_to_studies[tag.strip()].add(row['CitationKey'])\n",
1050 | " else:\n",
1051 | " label_to_studies['Classification Only'].add(row['CitationKey'])\n",
1052 | "\n",
1053 | "nodes_df['StudyCount'] = nodes_df['Label'].map(lambda x: len(label_to_studies.get(x, set())))\n",
1054 | "\n",
1055 | "# --- Taxonomy IDs ---\n",
1056 | "taxonomy_ids = {\n",
1057 | " \"Binary\": \"F1.1\",\n",
1058 | " \"Multi-Class\": \"F1.2\",\n",
1059 | " \"Multi-Label\": \"F1.3\",\n",
1060 | " \"Vulnerability-Specific\": \"F1.1.1\",\n",
1061 | " \"Description\": \"F2.1\",\n",
1062 | " \"Reasoning\": \"F2.2\",\n",
1063 | " \"Report\": \"F2.3\"\n",
1064 | "}\n",
1065 | "\n",
1066 | "# Label Formatter with () and {}\n",
1067 | "def format_label(row):\n",
1068 | " label = row['Label']\n",
1069 | " count = row['StudyCount']\n",
1070 | " \n",
1071 | " if label == 'No Classification':\n",
1072 | " return \"\"\n",
1073 | " \n",
1074 | " tax_id = taxonomy_ids.get(label, \"\")\n",
1075 | " if tax_id:\n",
1076 | " return f\"{label} ({tax_id}) {{{count}}}\"\n",
1077 | " else:\n",
1078 | " return f\"{label} {{{count}}}\"\n",
1079 | "\n",
1080 | "nodes_df['LabelDisplay'] = nodes_df.apply(format_label, axis=1)\n",
1081 | "\n",
1082 | "# Plot\n",
1083 | "fig = go.Figure(data=[go.Sankey(\n",
1084 | " arrangement=\"snap\",\n",
1085 | " node=dict(\n",
1086 | " pad=20,\n",
1087 | " thickness=20,\n",
1088 | " line=dict(color=\"black\", width=0.5),\n",
1089 | " label=nodes_df['LabelDisplay'],\n",
1090 | " color=node_colors,\n",
1091 | " hovertemplate='%{label}
Volume: %{value:.2f}',\n",
1092 | " ),\n",
1093 | " link=dict(\n",
1094 | " source=edges['SourceID'],\n",
1095 | " target=edges['TargetID'],\n",
1096 | " value=edges['Value'],\n",
1097 | " color=link_colors\n",
1098 | " )\n",
1099 | ")])\n",
1100 | "\n",
1101 | "# ==========================================\n",
1102 | "fig.update_layout(\n",
1103 | " font=dict(\n",
1104 | " family=\"Times New Roman, serif\", \n",
1105 | " size=20, \n",
1106 | " color=\"black\"\n",
1107 | " ),\n",
1108 | " width=1000,\n",
1109 | " height=500,\n",
1110 | " margin=dict(b=60, t=40),\n",
1111 | " \n",
1112 | " annotations=[\n",
1113 | " # Left Column Label\n",
1114 | " dict(\n",
1115 | " x=0,\n",
1116 | " y=-0.1,\n",
1117 | " xref=\"paper\",\n",
1118 | " yref=\"paper\",\n",
1119 | " text=\"Classification (F1)\", # Taxonomy in ()\n",
1120 | " showarrow=False,\n",
1121 | " font=dict(size=20, color=\"black\"), \n",
1122 | " align=\"center\"\n",
1123 | " ),\n",
1124 | " # Right Column Label\n",
1125 | " dict(\n",
1126 | " x=1,\n",
1127 | " y=-0.1,\n",
1128 | " xref=\"paper\",\n",
1129 | " yref=\"paper\",\n",
1130 | " text=\"Generation (F2)\", # Taxonomy in ()\n",
1131 | " showarrow=False,\n",
1132 | " font=dict(size=20, color=\"black\"),\n",
1133 | " align=\"center\"\n",
1134 | " )\n",
1135 | " ]\n",
1136 | ")\n",
1137 | "\n",
1138 | "fig.show()"
1139 | ]
1140 | },
1141 | {
1142 | "cell_type": "code",
1143 | "execution_count": 3,
1144 | "id": "e4405f7a",
1145 | "metadata": {},
1146 | "outputs": [
    [Plotly Sankey figure (display_data output) omitted: flows from Model Scale (S1.3) nodes — Tiny (S1.3.1), Small (S1.3.2), Medium (S1.3.3), Large (S1.3.4) — via adaptation categories — Full Fine-Tuning (T2.2.2.1), Parameter-Efficient Fine-Tuning (T2.2.2.2), Prompt Engineering (T2.1), Pre-Training (T2.2.1), Feature Extraction (T1) — to specific techniques under Adaptation Technique (T2): Full-Parameter {117}, Zero-Shot {56}, CoT {38}, Feature Extraction {32}, In-Context {30}, Few-Shot {28}, Low-Rank Decomposition {27}, RAG {25}, Pre-Training {14}, Instruction-Tuning {13}, LoRA Derivates {8}, Prompt-Tuning {5}, Adapter-Tuning {2}, Selective {2}, Additive-Other {1}; rendered at 1000x600 px.]
2184 | ],
2185 | "source": [
2186 | "# sankey model & adaptation techniques\n",
2187 | "# ==========================================\n",
2188 | "df_models = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"MODELS_ESTIMATED\")\n",
2189 | "df_study_model = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"STUDY_MODEL\")\n",
2190 | "df_techniques = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"STUDY_TECHNIQUE\")\n",
2191 | "\n",
2192 | "\n",
2193 | "df_study_model['Adaptation'] = df_study_model['Adaptation'].astype(str).str.split(',')\n",
2194 | "df_study_model = df_study_model.explode('Adaptation')\n",
2195 | "df_study_model['Adaptation'] = df_study_model['Adaptation'].str.strip()\n",
2196 | "\n",
2197 | "merged_models = pd.merge(\n",
2198 | " df_study_model[['CitationKey', 'ModelKey', 'Adaptation']],\n",
2199 | " df_models[['ModelKey', 'Scale']],\n",
2200 | " on='ModelKey',\n",
2201 | " how='left'\n",
2202 | ")\n",
2203 | "\n",
2204 | "full_df = pd.merge(\n",
2205 | " merged_models,\n",
2206 | " df_techniques[['CitationKey', 'Prompt-Engineering', 'Training']],\n",
2207 | " on='CitationKey',\n",
2208 | " how='left'\n",
2209 | ")\n",
2210 | "\n",
2211 | "# ==========================================\n",
2212 | "peft_keywords = ['Low-Rank Decomposition', 'LoRA Derivates', 'Adapter-Tuning', 'Selective', 'Additive-Other', 'Prompt-Tuning', 'Instruction-Tuning']\n",
2213 | "full_keywords = ['Full-Parameter Fine-Tuning', 'Instruction-Tuning']\n",
2214 | "prompt_keywords = ['CoT', 'Few-Shot', 'RAG', 'In-Context', 'Zero-Shot']\n",
2215 | "pre_keywords = ['Pre-Training']\n",
2216 | "\n",
2217 | "def resolve_technique(row):\n",
2218 | " adaptation = str(row['Adaptation']).upper().strip()\n",
2219 | " \n",
2220 | " if adaptation == 'PROMPT':\n",
2221 | " val = str(row['Prompt-Engineering'])\n",
2222 | " if val in ['nan', 'None', '']: return [\"Unspecified Prompting\"]\n",
2223 | " tags = [x.strip() for x in val.split(',')]\n",
2224 | " valid_tags = [t for t in tags if any(k.lower() in t.lower() for k in prompt_keywords)]\n",
2225 | " return valid_tags if valid_tags else tags \n",
2226 | "\n",
2227 | " train_val = str(row['Training'])\n",
2228 | " if train_val in ['nan', 'None', '']: return [\"Unspecified Training\"]\n",
2229 | " tags = [x.strip() for x in train_val.split(',')]\n",
2230 | " relevant_techniques = []\n",
2231 | "\n",
2232 | " if adaptation == 'PEFT':\n",
2233 | " for tag in tags:\n",
2234 | " if any(k.lower() in tag.lower() for k in peft_keywords):\n",
2235 | " relevant_techniques.append(tag)\n",
2236 | " if not relevant_techniques: relevant_techniques.append(\"Other PEFT\")\n",
2237 | "\n",
2238 | " elif adaptation == 'FULL':\n",
2239 | " for tag in tags:\n",
2240 | " if any(k.lower() in tag.lower() for k in full_keywords):\n",
2241 | " relevant_techniques.append(tag)\n",
2242 | " if not relevant_techniques: relevant_techniques.append(\"Other Fine-Tuning\")\n",
2243 | " \n",
2244 | " elif adaptation == 'PRE':\n",
2245 | " for tag in tags:\n",
2246 | " if any(k.lower() in tag.lower() for k in pre_keywords):\n",
2247 | " relevant_techniques.append(tag)\n",
2248 | " if not relevant_techniques: relevant_techniques.append(\"Pre-Training\")\n",
2249 | "\n",
2250 | " elif adaptation == 'FEATURE':\n",
2251 | " return [\"Feature Extraction\"]\n",
2252 | "\n",
2253 | " return relevant_techniques\n",
2254 | "\n",
2255 | "full_df['Specific_Techniques'] = full_df.apply(resolve_technique, axis=1)\n",
2256 | "sankey_df = full_df.explode('Specific_Techniques')\n",
2257 | "sankey_df = sankey_df.dropna(subset=['Specific_Techniques']) \n",
2258 | "sankey_df = sankey_df[sankey_df['Specific_Techniques'] != \"\"] \n",
2259 | "\n",
2260 | "\n",
2261 | "# ==========================================\n",
2262 | "def get_method_category(code):\n",
2263 | " code = str(code).upper()\n",
2264 | " if code == 'PROMPT': return \"Prompt Engineering\"\n",
2265 | " if code == 'FULL': return \"Fine-Tuning\" \n",
2266 | " if code == 'PEFT': return \"Parameter-Efficient Fine-Tuning\"\n",
2267 | " if code == 'PRE': return \"Pre-Training\"\n",
2268 | " if code == 'FEATURE': return \"Feature Extraction\"\n",
2269 | " return \"Other\"\n",
2270 | "\n",
2271 | "sankey_df['Method_Category'] = sankey_df['Adaptation'].apply(get_method_category)\n",
2272 | "replace_map = {'Full-Parameter Fine-Tuning': 'Full-Parameter'}\n",
2273 | "sankey_df['Specific_Techniques'] = sankey_df['Specific_Techniques'].replace(replace_map)\n",
2274 | "sankey_df['Scale'] = sankey_df['Scale'].astype(str).str.strip().str.title()\n",
2275 | "\n",
2276 | "# Weights\n",
2277 | "sankey_df['Study_Row_Count'] = sankey_df.groupby('CitationKey')['CitationKey'].transform('count')\n",
2278 | "sankey_df['Weight'] = 1 / sankey_df['Study_Row_Count']\n",
2279 | "\n",
2280 | "# Unique Counts (Only needed for Level 2 now based on requirements)\n",
2281 | "unique_counts_lvl2 = sankey_df.groupby('Specific_Techniques')['CitationKey'].nunique()\n",
2282 | "\n",
2283 | "\n",
2284 | "# ==========================================\n",
2285 | "raw_to_display = {} \n",
2286 | "scale_ids = {\n",
2287 | " \"Tiny\": \"S1.3.1\",\n",
2288 | " \"Small\": \"S1.3.2\",\n",
2289 | " \"Medium\": \"S1.3.3\",\n",
2290 | " \"Large\": \"S1.3.4\"\n",
2291 | "}\n",
2292 | "raw_lvl0 = [\"Tiny\", \"Small\", \"Medium\", \"Large\"]\n",
2293 | "lvl0_labels = []\n",
2294 | "\n",
2295 | "for raw in raw_lvl0:\n",
2296 | " if raw in sankey_df['Scale'].unique():\n",
2297 | " tax_id = scale_ids.get(raw, \"\")\n",
2298 | " # Format: \"Tiny (S1.3.1)\"\n",
2299 | " final_label = f\"{raw} ({tax_id})\" if tax_id else raw\n",
2300 | " lvl0_labels.append(final_label)\n",
2301 | " raw_to_display[raw] = final_label\n",
2302 | "\n",
2303 | "cat_ids = {\n",
2304 | " \"Feature Extraction\": \"T1\",\n",
2305 | " \"Pre-Training\": \"T2.2.1\",\n",
2306 | " \"Prompt Engineering\": \"T2.1\",\n",
2307 | " \"Fine-Tuning\": \"T2.2.2.1\",\n",
2308 | " \"Parameter-Efficient Fine-Tuning\": \"T2.2.2.2\"\n",
2309 | "}\n",
2310 | "cat_display_names = {\n",
2311 | " \"Fine-Tuning\": \"Full Fine-Tuning\"\n",
2312 | "}\n",
2313 | "\n",
2314 | "raw_lvl1 = [\"Fine-Tuning\", \"Parameter-Efficient Fine-Tuning\", \"Prompt Engineering\", \"Pre-Training\", \"Feature Extraction\"]\n",
2315 | "lvl1_labels = []\n",
2316 | "existing_cats = sankey_df['Method_Category'].unique()\n",
2317 | "\n",
2318 | "for raw in raw_lvl1:\n",
2319 | " if raw in existing_cats:\n",
2320 | " tax_id = cat_ids.get(raw, \"\")\n",
2321 | " disp_name = cat_display_names.get(raw, raw)\n",
2322 | " # Format: \"Pre-Training (T2.2.1)\"\n",
2323 | " final_label = f\"{disp_name} ({tax_id})\" if tax_id else disp_name\n",
2324 | " lvl1_labels.append(final_label)\n",
2325 | " raw_to_display[raw] = final_label\n",
2326 | "\n",
2327 | "# Specific Techniques\n",
2328 | "# Format: \"LoRA {25}\" \n",
2329 | "raw_lvl2 = sorted(sankey_df['Specific_Techniques'].unique().tolist())\n",
2330 | "lvl2_labels = []\n",
2331 | "for raw in raw_lvl2:\n",
2332 | " count = unique_counts_lvl2.get(raw, 0)\n",
2333 | " # Using triple braces {{{ }}} to print literal braces in f-string\n",
2334 | " final_label = f\"{raw} {{{count}}}\"\n",
2335 | " lvl2_labels.append(final_label)\n",
2336 | " raw_to_display[raw] = final_label\n",
2337 | "\n",
2338 | "# Combine all\n",
2339 | "all_labels = lvl0_labels + lvl1_labels + lvl2_labels\n",
2340 | "label_map = {label: i for i, label in enumerate(all_labels)}\n",
2341 | "\n",
2342 | "\n",
2343 | "# ==========================================\n",
2344 | "palette = pc.qualitative.Pastel\n",
2345 | "grey_color = 'lightgrey'\n",
2346 | "grey_link = 'rgba(200, 200, 200, 0.4)'\n",
2347 | "grey_cats = ['Pre-Training', 'Feature Extraction', 'Other']\n",
2348 | "\n",
2349 | "color_map = {}\n",
2350 | "palette_idx = 0\n",
2351 | "\n",
2352 | "# A. Scales\n",
2353 | "for raw_name in raw_lvl0:\n",
2354 | " if raw_name in raw_to_display:\n",
2355 | " color_map[raw_name] = palette[palette_idx % len(palette)]\n",
2356 | " palette_idx += 1\n",
2357 | "\n",
2358 | "# B. Categories\n",
2359 | "for raw_name in raw_lvl1:\n",
2360 | " if raw_name in raw_to_display:\n",
2361 | " if raw_name in grey_cats:\n",
2362 | " color_map[raw_name] = grey_color\n",
2363 | " else:\n",
2364 | " color_map[raw_name] = palette[palette_idx % len(palette)]\n",
2365 | " palette_idx += 1\n",
2366 | "\n",
2367 | "def hex_to_rgba(hex_code, opacity=0.4):\n",
2368 | " if hex_code == 'lightgrey': return grey_link\n",
2369 | " if hex_code.startswith('rgb'): return hex_code.replace(')', f', {opacity})').replace('rgb', 'rgba')\n",
2370 | " h = hex_code.lstrip('#')\n",
2371 | " rgb = tuple(int(h[i:i+2], 16) for i in (0, 2, 4))\n",
2372 | " return f\"rgba({rgb[0]}, {rgb[1]}, {rgb[2]}, {opacity})\"\n",
2373 | "\n",
2374 | "\n",
2375 | "# ==========================================\n",
2376 | "source = []\n",
2377 | "target = []\n",
2378 | "value = []\n",
2379 | "colors = []\n",
2380 | "\n",
2381 | "# --- Flow 1: Scale -> Category ---\n",
2382 | "flow1 = sankey_df.groupby(['Scale', 'Method_Category'])['Weight'].sum().reset_index()\n",
2383 | "\n",
2384 | "for _, row in flow1.iterrows():\n",
2385 | " scale_raw = row['Scale']\n",
2386 | " cat_raw = row['Method_Category']\n",
2387 | " \n",
2388 | " src = raw_to_display.get(scale_raw)\n",
2389 | " tgt = raw_to_display.get(cat_raw)\n",
2390 | " \n",
2391 | " if src in label_map and tgt in label_map:\n",
2392 | " source.append(label_map[src])\n",
2393 | " target.append(label_map[tgt])\n",
2394 | " value.append(row['Weight'])\n",
2395 | " \n",
2396 | " # Color based on Scale raw name\n",
2397 | " base_color = color_map.get(scale_raw, grey_color)\n",
2398 | " colors.append(hex_to_rgba(base_color))\n",
2399 | "\n",
2400 | "# --- Flow 2: Category -> Specific ---\n",
2401 | "flow2 = sankey_df.groupby(['Method_Category', 'Specific_Techniques'])['Weight'].sum().reset_index()\n",
2402 | "\n",
2403 | "for _, row in flow2.iterrows():\n",
2404 | " cat_raw = row['Method_Category']\n",
2405 | " tech_raw = row['Specific_Techniques']\n",
2406 | " \n",
2407 | " src = raw_to_display.get(cat_raw)\n",
2408 | " tgt = raw_to_display.get(tech_raw)\n",
2409 | " \n",
2410 | " if src in label_map and tgt in label_map:\n",
2411 | " source.append(label_map[src])\n",
2412 | " target.append(label_map[tgt])\n",
2413 | " value.append(row['Weight'])\n",
2414 | " \n",
2415 | " # Color based on Category raw name\n",
2416 | " base_color = color_map.get(cat_raw, grey_color)\n",
2417 | " colors.append(hex_to_rgba(base_color))\n",
2418 | "\n",
2419 | "\n",
2420 | "# ==========================================\n",
2421 | "node_colors = []\n",
2422 | "# Map specific technique to its parent category raw name\n",
2423 | "tech_to_cat = pd.Series(sankey_df.Method_Category.values, index=sankey_df.Specific_Techniques).to_dict()\n",
2424 | "\n",
2425 | "for l in all_labels:\n",
2426 | " final_color = grey_color\n",
2427 | " \n",
2428 | " # Reverse lookup from raw_to_display\n",
2429 | " raw_key = None\n",
2430 | " for k, v in raw_to_display.items():\n",
2431 | " if v == l:\n",
2432 | " raw_key = k\n",
2433 | " break\n",
2434 | " \n",
2435 | " if raw_key:\n",
2436 | " # Case A: Scale or Category\n",
2437 | " if raw_key in color_map:\n",
2438 | " final_color = color_map[raw_key]\n",
2439 | " # Case B: Specific Technique (Inherit)\n",
2440 | " elif raw_key in tech_to_cat:\n",
2441 | " parent_raw = tech_to_cat[raw_key]\n",
2442 | " final_color = color_map.get(parent_raw, grey_color)\n",
2443 | " \n",
2444 | " node_colors.append(final_color)\n",
2445 | "\n",
2446 | "# ==========================================\n",
2447 | "fig = go.Figure(data=[go.Sankey(\n",
2448 | " arrangement=\"snap\",\n",
2449 | " node=dict(\n",
2450 | " pad=15, thickness=20,\n",
2451 | " line=dict(color=\"black\", width=0.5),\n",
2452 | " label=all_labels,\n",
2453 | " color=node_colors,\n",
2454 | " hovertemplate='%{label}
Weighted Volume: %{value:.2f}'\n",
2455 | " ),\n",
2456 | " link=dict(\n",
2457 | " source=source, target=target, value=value, color=colors\n",
2458 | " )\n",
2459 | ")])\n",
2460 | "\n",
2461 | "fig.update_layout(\n",
2462 | " # Global Font Settings (mimics plt.rcParams[\"font.family\"] = \"serif\")\n",
2463 | " font=dict(\n",
2464 | " family=\"Times New Roman, serif\", \n",
2465 | " size=17, \n",
2466 | " color=\"black\"\n",
2467 | " ),\n",
2468 | " width=1000,\n",
2469 | " height=600,\n",
2470 | " margin=dict(b=60, t=40),\n",
2471 | " \n",
2472 | " annotations=[\n",
2473 | " # Left Column Label\n",
2474 | " dict(\n",
2475 | " x=0,\n",
2476 | " y=-0.1,\n",
2477 | " xref=\"paper\",\n",
2478 | " yref=\"paper\",\n",
2479 | " text=\"Model Scale (S1.3)\",\n",
2480 | " showarrow=False,\n",
2481 | " font=dict(size=20, color=\"black\"), \n",
2482 | " align=\"center\"\n",
2483 | " ),\n",
2484 | " # Right Column Label\n",
2485 | " dict(\n",
2486 | " x=1,\n",
2487 | " y=-0.1,\n",
2488 | " xref=\"paper\",\n",
2489 | " yref=\"paper\",\n",
2490 | " text=\"Adaptation Technique (T2)\",\n",
2491 | " showarrow=False,\n",
2492 | " font=dict(size=20, color=\"black\"),\n",
2493 | " align=\"center\"\n",
2494 | " )\n",
2495 | " ]\n",
2496 | ")\n",
2497 | "\n",
2498 | "fig.show()"
2499 | ]
2500 | }
2501 | ],
2502 | "metadata": {
2503 | "kernelspec": {
2504 | "display_name": "vpn_seg",
2505 | "language": "python",
2506 | "name": "python3"
2507 | },
2508 | "language_info": {
2509 | "codemirror_mode": {
2510 | "name": "ipython",
2511 | "version": 3
2512 | },
2513 | "file_extension": ".py",
2514 | "mimetype": "text/x-python",
2515 | "name": "python",
2516 | "nbconvert_exporter": "python",
2517 | "pygments_lexer": "ipython3",
2518 | "version": "3.10.16"
2519 | }
2520 | },
2521 | "nbformat": 4,
2522 | "nbformat_minor": 5
2523 | }
2524 |
--------------------------------------------------------------------------------