├── .gitignore
├── taxonomy
│   ├── taxonomy.png
│   ├── taxonomy.xlsx
│   └── README.md
├── analyses
│   ├── taxonomy.xlsx
│   ├── datasets_labeling_summary.csv
│   └── study_taxonomy_analysis.ipynb
├── LICENSE
└── README.md

/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
--------------------------------------------------------------------------------
/taxonomy/taxonomy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/taxonomy/taxonomy.png
--------------------------------------------------------------------------------
/analyses/taxonomy.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/analyses/taxonomy.xlsx
--------------------------------------------------------------------------------
/taxonomy/taxonomy.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hs-esslingen-it-security/Awesome-LLM4SVD/HEAD/taxonomy/taxonomy.xlsx
--------------------------------------------------------------------------------
/taxonomy/README.md:
--------------------------------------------------------------------------------
1 | # LLM4SVD TAXONOMY 🗂️
2 | 
3 | We categorize existing LLM4SVD approaches according to detection task, input representation, system architecture, and technique. The presented taxonomy allows for meaningful comparison and benchmarking of studies.
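
To make the four dimensions concrete, a categorized study can be viewed as one record with a value (or set of values) per dimension. Below is a minimal Python sketch of such a record; the field values shown are hypothetical placeholders for illustration and are not taken from `taxonomy.xlsx`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StudyCategorization:
    """A surveyed study categorized along the four taxonomy dimensions."""
    title: str
    detection_task: str        # e.g., "function-level binary classification" (placeholder)
    input_representation: str  # e.g., "raw source code" (placeholder)
    system_architecture: str   # e.g., "single LLM" (placeholder)
    techniques: List[str] = field(default_factory=list)  # e.g., ["fine-tuning"] (placeholder)


# Hypothetical example entry; the real categorizations are provided in taxonomy.xlsx.
example = StudyCategorization(
    title="Some LLM4SVD study",
    detection_task="function-level binary classification",
    input_representation="raw source code",
    system_architecture="single LLM",
    techniques=["fine-tuning", "prompt engineering"],
)
print(example)
```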

4 | 
5 | 
6 | 
7 | 
8 | ![Taxonomy of LLM-based vulnerability detection studies.](taxonomy.png)
--------------------------------------------------------------------------------
/analyses/datasets_labeling_summary.csv:
--------------------------------------------------------------------------------
1 | Dataset,Labeling,Type
2 | SARD,Synthetic,Mixed
3 | Juliet C/C++,Synthetic,Synthetic
4 | Juliet Java,Synthetic,Synthetic
5 | VulDeePecker,Security Vendor,Mixed
6 | Draper VDISC,Tool,Mixed
7 | Devign,Developer,Real (Balanced)
8 | Big-Vul,Security Vendor,Real (Imbalanced)
9 | D2A,Tool,Real (Imbalanced)
10 | ReVeal,Developer,Real (Imbalanced)
11 | CVEfixes,Security Vendor,Real (Imbalanced)
12 | CrossVul,Security Vendor,Real (Balanced)
13 | SecurityEval,Synthetic,Mixed
14 | SVEN,Developer,Real (Balanced)
15 | DiverseVul,Developer,Real (Imbalanced)
16 | FormAI,Tool,Synthetic
17 | ReposVul,Tool,Real (Imbalanced)
18 | PrimeVul,Security Vendor,Real (Imbalanced)
19 | MegaVul,Security Vendor,Real (Imbalanced)
20 | CleanVul,Developer,Real (Balanced)
21 | PairVul,Security Vendor,Real (Balanced)
22 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2025 hs-esslingen-it-security
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Awesome-LLM4SVD 🌟-🧠👩‍💻🔍
2 | 
3 | This repository contains the artifacts from the systematic literature review (SLR) on LLM-based software vulnerability detection ("A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models").
4 | The SLR analyzes 263 studies published between January 2020 and November 2025 and provides a structured taxonomy of detection approaches, input representations, system architectures, techniques, and dataset usage.
5 | 
6 | 
7 | ## Table of Contents
8 | 
9 | To support open science and reproducibility, we publicly release:
10 | - 📝 [Surveyed Papers](#papers): A curated list of surveyed papers. This list will be continuously updated to track the latest papers.
11 | - 🗂️ [Taxonomy](https://github.com/hs-esslingen-it-security/Awesome-LLM4SVD/tree/main/taxonomy): Taxonomy of LLM-based vulnerability detection studies along with the categorization of each surveyed paper.
12 | - 📝 [Selected Datasets](#datasets): A list of the most commonly used datasets in the surveyed studies with their download sources.
13 | 
14 | 
15 | 
16 | 
17 | 
18 | For details, see our [preprint here](https://arxiv.org/abs/2507.22659):
19 | 
20 | 📚 S. Kaniewski, F. Schmidt, M. Enzweiler, M. Menth, and T. Heer. 2025. *A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models*. arXiv:2507.22659.
21 | ```bibtex
22 | @preprint{kaniewskiLLM4SVD2025,
23 | title={{A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models}},
24 | author={Kaniewski, Sabrina and Schmidt, Fabian and Enzweiler, Markus and Menth, Michael and Heer, Tobias},
25 | year={2025},
26 | eprint={2507.22659},
27 | archivePrefix={arXiv},
28 | primaryClass={cs.SE},
29 | url={https://arxiv.org/abs/2507.22659},
30 | }
31 | ```
32 | 
33 | 
35 | 36 | - 🤝 [Contribute to this repository](#contribution) 37 | - ⚖️ [License](#license) 38 | 39 | 40 |
41 | 42 | ---------------- 43 | ---------------- 44 | 45 | ## Papers 46 | 47 | > **Note:** Entries marked with ✨ indicate the latest papers that are not discussed in the preprint of the SLR. The latest preprint version covers all studies up to November 2025. 48 | 49 | 50 | ### 2025 51 | - (11/2025) Leveraging Self-Paced Learning for Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2511.09212)] [[Code](https://figshare.com/s/bef3211194fc18fe375e)] 52 | - (11/2025) Specification-Guided Vulnerability Detection with Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2511.04014)] [[Code](https://github.com/zhuhaopku/VulInstruct-temp)] 53 | - (11/2025) Compressing Large Language Models for SQL Injection Detection: A Case Study on Deep Seek-Coder and Meta-llama-3-70b-instruct. **`FRUCT 2025`** [[Paper](https://ieeexplore.ieee.org/document/11239157)] 54 | - (11/2025) VulTrLM: LLM-assisted Vulnerability Detection via AST Decomposition and Comment Enhancement. **`EMSE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10664-025-10738-7)] 55 | - (11/2025) Cross-Domain Evaluation of Transformer-Based Vulnerability Detection on Open and Industry Data. **`PROFES 2025`** [[Paper](https://arxiv.org/abs/2509.09313)] [[Code](https://github.com/CybersecurityLab-unibz/cross_domain_evaluation)] 56 | - (11/2025) Learning-based Models for Vulnerability Detection: An Extensive Study. **`EMSE 2025`** [[Paper](https://arxiv.org/abs/2408.07526)] [[Code](https://figshare.com/s/bde8e41890e8179fbe5f?file=41894784)] 57 | - (11/2025) A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making. **`EMNLP 2025`** [[Paper](https://aclanthology.org/2025.emnlp-main.1071/)] 58 | - (10/2025) Leveraging Intra-and Inter-References in Vulnerability Detection using Multi-Agent Collaboration Based on LLMs. **`Cluster Computing 2025`** [[Paper](https://link.springer.com/article/10.1007/s10586-025-05721-2)] 59 | - (10/2025) iCodeReviewer: Improving Secure Code Review with Mixture of Prompts. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.12186)] 60 | - (10/2025) Bridging Semantics \& Structure for Software Vulnerability Detection using Hybrid Network Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.10321)] [[Code](https://zenodo.org/records/17259519)] 61 | - (10/2025) FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk. **`ESORICS 2025`** [[Paper](https://arxiv.org/abs/2506.19453)] [[Code](https://github.com/sajalhalder/FuncVul)] 62 | - (10/2025) On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.27675)] 63 | - (10/2025) Towards Explainable Vulnerability Detection With Large Language Models. **`TSE 2025`** [[Paper](https://arxiv.org/abs/2406.09701)] 64 | - (10/2025) MulVuln: Enhancing Pre-trained LMs with Shared and Language-Specific Knowledge for Multilingual Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.04397)] 65 | - (10/2025) Llama-Based Source Code Vulnerability Detection: Prompt Engineering vs Fine Tuning. **`ESORICS 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-032-07884-1_15)] [[Code](https://github.com/DynaSoumhaneOuchebara/Llama-based-vulnerability-detection)] 66 | - (10/2025) Real-VulLLM: An LLM Based Assessment Framework in the Wild. 
**`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.04056)] 67 | - (10/2025) Distilling Lightweight Language Models for C/C++ Vulnerabilities. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.06645)] [[Code](https://github.com/yangxiaoxuan123/FineSec_detect)] 68 | - (10/2025) A Zero-Shot Framework for Cross-Project Vulnerability Detection in Source Code. **`EMSE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10664-025-10749-4)] [[Code](https://github.com/Radowan98/ZSVulD)] 69 | - (10/2025) Sparse-MoE: Syntax-Aware Multi-view Mixture of Experts for Long-Sequence Software Vulnerability Detection. **`ADMA 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-981-95-3456-2_24)] 70 | - (09/2025) DeepVulHunter: Enhancing the Code Vulnerability Detection Capability of LLMs through Multi-Round Analysis. **`JIIS 2025`** [[Paper](https://link.springer.com/article/10.1007/s10844-025-00982-0)] 71 | - (09/2025) Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.12039)] 72 | - (09/2025) GPTVD: vulnerability detection and analysis method based on LLM’s chain of thoughts. **`ASE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10515-025-00550-4)] [[Code](https://github.com/chenyn273/GPTVD)] 73 | - (09/2025) An Advanced Detection Framework for Embedded System Vulnerabilities. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11153853)] 74 | - (09/2025) Utilizing Large Programming Language Models on Software Vulnerability Detection. **`ASYU 2025`** [[Paper](https://ieeexplore.ieee.org/document/11208282)] 75 | - (09/2025) MAVUL: Multi-Agent Vulnerability Detection via Contextual Reasoning and Interactive Refinement. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2510.00317)] [[Code](https://github.com/youpengl/MAVUL)] 76 | - (09/2025) Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2509.12629)] [[Code](https://github.com/sssszh/ELVul4LLM)] 77 | - (09/2025) VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2509.11523)] 78 | - (09/2025) PIONEER: Improving the Robustness of Student Models when Compressing Pre-Trained Models of Code. **`ASE 2025`** [[Paper](https://link.springer.com/article/10.1007/s10515-025-00560-2)] [[Code](https://github.com/illsui1on/PIONEER)] 79 | - (08/2025) VulPr: A Prompt Learning-based Method for Vulnerability Detection. **`EIT 2025`** [[Paper](https://ieeexplore.ieee.org/document/11231886)] 80 | - (08/2025) MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning. **`IRI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11153184)] 81 | - (08/2025) Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/pdf/2508.04448)] [[Code](https://github.com/Damian0401/ProjectAnalyzer)] 82 | - (08/2025) Enhancing Fine-Grained Vulnerability Detection With Reinforcement Learning. **`TSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11145224)] [[Code](https://github.com/YuanJiangGit/RLFD)] 83 | - (08/2025) CryptoScope: Utilizing Large Language Models for Automated Cryptographic Logic Vulnerability Detection.
**`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.11599)] 84 | - (08/2025) Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.21817)] [[Code](https://github.com/yikun-li/TitanVul-BenchVul)] 85 | - (08/2025) LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.16419)] [[Code](https://github.com/NoujoudNader/LLM-Bugs-Detection)] 86 | - (08/2025) Multimodal Fusion for Vulnerability Detection: Integrating Sequence and Graph-Based Analysis with LLM Augmentation. **`MAPR 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11133833)] 87 | - (08/2025) SAFE: A Novel Approach For Software Vulnerability Detection from Enhancing The Capability of Large Language Models. **`ASIACCS 2025`** [[Paper](https://arxiv.org/abs/2409.00882)] 88 | - (08/2025) Software Vulnerability Detection using Large Language Models. **`SecureComm 2025`** [[Paper](https://arxiv.org/abs/2410.00249)] 89 | - (08/2025) Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.16625)] 90 | - (08/2025) Think Broad, Act Narrow: CWE Identification with Multi-Agent Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2508.01451)] [[Code](https://zenodo.org/records/15871507)] 91 | - (08/2025) Improving Software Security Through a LLM-Based Vulnerability Detection Model. **`DEXA 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-032-02049-9_9)] 92 | - (07/2025) An Automatic Classification Model for Long Code Vulnerabilities Based on the Teacher-Student Framework. **`QRS 2025`** [[Paper](https://ieeexplore.ieee.org/document/11216609)] 93 | - (07/2025) LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. **`USENIX Security 2025`** [[Paper](https://arxiv.org/abs/2507.16585)] [[Code](https://github.com/qcri/llmxcpg)] [[Code](https://zenodo.org/records/15614095)] 94 | - (07/2025) CLeVeR: Multi-modal Contrastive Learning for Vulnerability Code Representation. **`ACL 2025`** [[Paper](https://aclanthology.org/2025.findings-acl.414/)] [[Code](https://github.com/yoimiya-nlp/CLeVeR)] 95 | - (07/2025) Revisiting Pre-trained Language Models for Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.16887)] 96 | - (07/2025) Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2507.03051)] 97 | - (07/2025) HgtJIT: Just-in-Time Vulnerability Detection Based on Heterogeneous Graph Transformer. **`TDSC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11072308)] 98 | - (07/2025) AI-Powered Vulnerability Detection in Code Using BERT-Based LLM with Transparency Measures. **`ITC-Egypt 2025`** [[Paper](https://ieeexplore.ieee.org/document/11186618)] 99 | - (07/2025) Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories. **`Unknown 2025`** [[Paper](https://arxiv.org/abs/2503.03586)] 100 | - (06/2025) VulnTeam: A Team Collaboration Framework for LLM-based Vulnerability Detection. **`IJCNN 2025`** [[Paper](https://ieeexplore.ieee.org/document/11229292)] 101 | - (06/2025) One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE). 
**`PACMSE 2025`** [[Paper](https://arxiv.org/abs/2501.16454)] 102 | - (06/2025) Improving Vulnerability Type Prediction and Line-Level Detection via Adversarial Training-based Data Augmentation and Multi-Task Learning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.23534)] [[Code](https://github.com/Karelye/EDAT-MLT)] 103 | - (06/2025) Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2406.11147)] [[Code](https://github.com/knowledgerag4llmvuld/knowledgerag4llmvuld)] 104 | - (06/2025) Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.10104)] 105 | - (06/2025) Evaluating LLaMA 3.2 for Software Vulnerability Detection. **`EICC 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-94855-8_3)] 106 | - (06/2025) How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2408.10495)] [[Code](https://github.com/jianian0318/LLMSecureCode)] 107 | - (06/2025) Detecting Code Vulnerabilities using LLMs. **`DSN 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11068842)] [[Code](https://github.com/a24167566/LLMs-Code-Vulnerability-Detection)] 108 | - (06/2025) LPASS: Linear Probes as Stepping Stones for Vulnerability Detection using Compressed LLMs. **`JISA 2025`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212625001620)] 109 | - (06/2025) Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.20444)] 110 | - (06/2025) CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2411.17274)] [[Code](https://github.com/yikun-li/CleanVul)] 111 | - (06/2025) Large Language Models for Multilingual Vulnerability Detection: How Far Are We?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.07503)] [[Code](https://github.com/SpanShu96/Large-Language-Model-for-Multilingual-Vulnerability-Detection/tree/main)] 112 | - (06/2025) Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End". **`PACMSE 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3715758)] [[Code](https://zenodo.org/records/14840519)] 113 | - (06/2025) LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2401.16185)] [[Code](https://anonymous.4open.science/r/LLM4Vuln/README.md)] 114 | - (06/2025) ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2408.16028)] [[Code](https://anonymous.4open.science/r/anvil)] 115 | - (06/2025) Line-level Semantic Structure Learning for Code Vulnerability Detection. **`Internetware 2025`** [[Paper](https://arxiv.org/abs/2407.18877)] [[Code](https://figshare.com/articles/dataset/CSLS_model_code_and_data/26391658)] 116 | - (06/2025) SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair. **`ISMM 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3735950.3735954)] [[Code](https://github.com/HuantWang/SecureMind)] 117 | - (06/2025) VuL-MCBERT: A Vulnerability Detection Method Based on Self-Supervised Contrastive Learning.
**`CAIBDA 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11183103)] 118 | - (06/2025) Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.07390)] [[Code](https://github.com/Xin-Cheng-Wen/PO4Vul)] 119 | - (06/2025) Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs. **`PACMSE 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3728875)] 120 | - (06/2025) An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2401.16310)] [[Code](https://zenodo.org/records/15572151)] 121 | - (05/2025) SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.19828)] [[Code](https://github.com/basimbd/SecVulEval)] 122 | - (05/2025) AutoAdapt: On the Application of AutoML for Parameter-Efficient Fine-Tuning of Pre-Trained Code Models. **`TOSEM 2025`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3734867)] [[Code](https://github.com/serval-uni-lu/AutoAdapt)] 123 | - (05/2025) Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11028308)] 124 | - (05/2025) LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code. **`ICSC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11140501)] 125 | - (05/2025) A Comparative Study of Machine Learning and Large Language Models for SQL and NoSQL Injection Vulnerability Detection. **`SIST 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11139190)] 126 | - (05/2025) Are Sparse Autoencoders Useful for Java Function Bug Detection?. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.10375)] 127 | - (05/2025) ♪ With a Little Help from My (LLM) Friends: Enhancing Static Analysis with LLMs to Detect Software Vulnerabilities. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11028575)] 128 | - (05/2025) GraphCodeBERT-Augmented Graph Attention Networks for Code Vulnerability Detection. **`CAI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11050748)] 129 | - (05/2025) Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.15088)] 130 | - (05/2025) Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.10961)] [[Code](https://figshare.com/s/1514bc9a7aa64b46d94e)] 131 | - (05/2025) Adversarial Training for Robustness Enhancement in LLM-Based Code Vulnerability Detection. **`CISCE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11065803)] 132 | - (05/2025) Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2505.17460)] 133 | - (05/2025) An Automated Code Review Framework Based on BERT and Qianwen Large Model. **`CCAI 2025`** [[Paper](https://ieeexplore.ieee.org/document/11189422)] 134 | - (04/2025) A Software Vulnerability Detection Model Combined with Graph Simplification. 
**`AIBDF 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3718491.3718525)] 135 | - (04/2025) Human-Understandable Explanation for Software Vulnerability Prediction. **`JSS 2025`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0164121225001232)] [[Code](https://github.com/quy-ng/human-xai-software-vulnerability-prediction)] 136 | - (04/2025) Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.16584)] [[Code](https://huggingface.co/floxihunter/codegen-mono-CWEdetect)] [[Code](https://huggingface.co/datasets/floxihunter/synthetic_python_cwe)] 137 | - (04/2025) Vulnerability Detection with Code Language Models: How Far are We?. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11029911)] [[Code](https://github.com/DLVulDet/PrimeVul)] 138 | - (04/2025) Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.13474)] [[Code](https://anonymous.4open.science/r/CORRECT/README.md)] 139 | - (04/2025) IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. **`ICLR 2025`** [[Paper](https://arxiv.org/abs/2405.17238)] [[Code](https://github.com/iris-sast/iris)] 140 | - (04/2025) Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.13676)] 141 | - (04/2025) An Ensemble Transformer Approach with Cross-Attention for Automated Code Security Vulnerability Detection and Documentation. **`ISDFS 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11012039)] 142 | - (04/2025) Metamorphic-Based Many-Objective Distillation of LLMs for Code-Related Tasks. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/document/11029766)] [[Code](https://zenodo.org/records/14857610)] 143 | - (04/2025) XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection. **`The Journal of Supercomputing 2025`** [[Paper](https://link.springer.com/article/10.1007/s11227-025-07198-7)] 144 | - (04/2025) Leveraging Multi-Task Learning to Improve the Detection of SATD and Vulnerability. **`ICPC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11025930)] [[Code](https://github.com/moritzmock/multitask-vulberability-detection)] 145 | - (04/2025) Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection \& Repair in the IDE. **`ICSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11029760)] [[Code](https://figshare.com/articles/dataset/Closing_the_Gap_A_User_Study_on_the_Real-world_Usefulness_of_AI-powered_Vulnerability_Detection_Repair_in_the_IDE/26367139?file=52478936)] 146 | - (04/2025) R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2504.04699)] [[Code](https://github.com/martin-wey/R2Vul)] 147 | - (04/2025) Context-Enhanced Vulnerability Detection Based on Large Language Models. **`TOSEM 2025`** [[Paper](https://arxiv.org/abs/2504.16877)] [[Code](https://github.com/DoeSEResearch/PacVD)] 148 | - (04/2025) SSRFSeek: An LLM-based Static Analysis Framework for Detecting SSRF Vulnerabilities in PHP Applications. 
**`AINIT 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11035424)] 149 | - (03/2025) CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection. **`TASE 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-98208-8_15)] [[Code](https://github.com/CASTLE-Benchmark)] 150 | - (03/2025) SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection With LLMs?. **`TSE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10910240)] 151 | - (03/2025) Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. **`ICST 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10988968)] [[Code](https://github.com/seal-research/secvul-llm-study/)] 152 | - (03/2025) Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis. **`ADIoT 2025`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-85593-1_9)] 153 | - (03/2025) Steering Large Language Models for Vulnerability Detection. **`ICASSP 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10887736)] 154 | - (03/2025) HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.10793)] 155 | - (03/2025) You Only Train Once: A Flexible Training Framework for Code Vulnerability Detection Driven by Vul-Vector. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2506.10988)] 156 | - (03/2025) Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.01449)] [[Code](https://github.com/soarsmu/SVD-Bench)] 157 | - (03/2025) Reasoning with LLMs for Zero-Shot Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2503.17885)] [[Code](https://github.com/Erroristotle/VulnSage)] 158 | - (02/2025) EFVD: A Framework of Source Code Vulnerability Detection via Fusion of Enhanced Graph Representation Learning and Pre-trained Transformer-Based Model. **`CNSSE 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3732365.3732421)] 159 | - (02/2025) Fine-Tuning Transformer LLMs for Detecting SQL Injection and XSS Vulnerabilities. **`ICAIIC 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10920868)] 160 | - (02/2025) Finetuning Large Language Models for Vulnerability Detection. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10908394)] [[Code](https://github.com/rmusab/vul-llm-finetune)] 161 | - (02/2025) Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. **`IEEE Access 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10879492)] 162 | - (02/2025) Manual Prompt Engineering is Not Dead: A Case Study on Large Language Models for Code Vulnerability Detection with DSPy. **`CDMA 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10908746)] 163 | - (02/2025) AIDetectVul: Software Vulnerability Detection Method Based on Feature Fusion of Pre-trained Models. **`ICCECE 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10985370)] 164 | - (01/2025) DMVL4AVD: A Deep Multi-View Learning Model for Automated Vulnerability Detection. **`Neural Comput. Appl. 
2025`** [[Paper](https://link.springer.com/article/10.1007/s00521-024-10892-x)] [[Code](https://drive.google.com/file/d/1-qWqmRuBi8kRAAE2yiG6JNiY8vLYxXlz/view)] 165 | - (01/2025) Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.14841)] 166 | - (01/2025) CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2501.04510)] 167 | - (01/2025) Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2412.18260)] [[Code](https://github.com/SakiRinn/LLM4CVD)] [[Code](https://huggingface.co/datasets/xuefen/VulResource)] 168 | - (01/2025) To Err is Machine: Vulnerability Detection Challenges LLM Reasoning. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2403.17218)] [[Code](https://figshare.com/articles/dataset/Data_Package_for_LLM_Vulnerability_Detection_Study/27368025)] 169 | - (01/2025) Streamlining Security Vulnerability Triage with Large Language Models. **`arXiv 2025`** [[Paper](https://arxiv.org/abs/2501.18908)] [[Code](https://zenodo.org/records/14776104)] 170 | - (01/2025) Sink Vulnerability Type Prediction Using Small Language Model (SLM). **`IC3ECSBHI 2025`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10991300)] 171 | - (01/2025) A Vulnerability Detection Framework Based on Graph Decomposition Fusion and Augmented Abstract Syntax Tree. **`BDICN 2025`** [[Paper](https://dl.acm.org/doi/full/10.1145/3727353.3727471)] 172 | 173 | ### 2024 174 | - (12/2024) Vulnerability Detection in Popular Programming Languages with Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.15905)] [[Code](https://github.com/syafiq/llm_vd)] 175 | - (12/2024) On the Compression of Language Models for Code: An Empirical Study on CodeBERT. **`SANER 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10992473)] [[Code](https://zenodo.org/records/14357478)] 176 | - (12/2024) LLM-Based Approach for Buffer Overflow Detection in Source Code. **`CIT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/11021816)] 177 | - (12/2024) A Source Code Vulnerability Detection Method Based on Positive-Unlabeled Learning. **`RICAI 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10911761)] 178 | - (12/2024) Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows. **`ICMLA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10903489)] 179 | - (12/2024) EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code. **`BigData 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10825609)] 180 | - (12/2024) Software Vulnerability Detection Using LLM: Does Additional Information Help?. **`ACSAC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10917361)] [[Code](https://github.com/research7485/vulnerability_detection)] 181 | - (12/2024) Enhancing Source Code Vulnerability Detection Using Flattened Code Graph Structures. **`ICFTIC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10913325)] 182 | - (12/2024) SQL Injection Vulnerability Detection Based on Pissa-Tuned Llama 3 Large Language Model. **`ICFTIC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10912886)] 183 | - (12/2024) A Method of SQL Injection Attack Detection Based on Large Language Models. 
**`CNTEIE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10987904)] 184 | - (12/2024) MVD: A Multi-Lingual Software Vulnerability Detection Framework. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.06166)] [[Code](https://figshare.com/s/10ec70108294a225f391)] 185 | - (12/2024) Python Source Code Vulnerability Detection Based on CodeBERT Language Model. **`ACAI 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10899694)] 186 | - (11/2024) RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?. **`EMNLP 2024`** [[Paper](https://arxiv.org/abs/2410.07573)] 187 | - (11/2024) StagedVulBERT: Multigranular Vulnerability Detection With a Novel Pretrained Code Model. **`TSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10746847)] [[Code](https://github.com/YuanJiangGit/StagedVulBERT)] 188 | - (11/2024) Applying Contrastive Learning to Code Vulnerability Type Classification. **`EMNLP 2024`** [[Paper](https://aclanthology.org/2024.emnlp-main.666/)] 189 | - (11/2024) Boosting Cybersecurity Vulnerability Scanning based on LLM-supported Static Application Security Testing. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.15735)] 190 | - (11/2024) Enhancing Vulnerability Detection Efficiency: An Exploration of Light-weight LLMs with Hybrid Code Features. **`JISA 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212624002278)] [[Code](https://github.com/JNL-28/Enhancing-Vulnerability-Detection-Efficiency)] 191 | - (11/2024) Research on the LLM-Driven Vulnerability Detection System Using LProtector. **`ICDSCA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10859408)] 192 | - (11/2024) Enhanced LLM-Based Framework for Predicting Null Pointer Dereference in Source Code. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2412.00216)] 193 | - (10/2024) Vulnerability Prediction using Pre-trained Models: An Empirical Evaluation. **`MASCOTS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10786510)] [[Code](https://sites.google.com/view/vpllm/)] 194 | - (10/2024) Fine-Tuning Pre-trained Model with Optimizable Prompt Learning for Code Vulnerability Detection. **`ISSRE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10771498)] [[Code](https://github.com/Exclusisve-V/PromptVulnerabilityDetection)] 195 | - (10/2024) Improving Long-Tail Vulnerability Detection Through Data Augmentation Based on Large Language Models. **`ICSME 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10795073)] [[Code](https://github.com/LuckyDengXiao/LERT)] 196 | - (10/2024) Exploring AI for Vulnerability Detection and Repair. **`CARS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10778769)] 197 | - (10/2024) DetectBERT: Code Vulnerability Detection. **`GCCIT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10862235)] 198 | - (10/2024) VULREM: Fine-Tuned BERT-Based Source-Code Potential Vulnerability Scanning System to Mitigate Attacks in Web Applications. **`Applied Sciences 2024`** [[Paper](https://www.mdpi.com/2076-3417/14/21/9697)] 199 | - (10/2024) A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality. 
**`TPS-ISA 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10835695)] [[Code](https://figshare.com/articles/dataset/Reproduction_package_for_paper_A_Qualitative_Study_on_Using_ChatGPT_for_Software_Security_Perception_vs_Practicality_/24452365?file=48008890)] 200 | - (10/2024) Vul-LMGNNs: Fusing Language Models and Online-distilled Graph Neural Networks for Code Vulnerability Detection. **`Information Fusion 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S1566253524005268)] [[Code](https://github.com/Vul-LMGNN/vul-LMGGNN)] 201 | - (10/2024) SecureQwen: Leveraging LLMs for Vulnerability Detection in Python Codebases. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824004565)] 202 | - (10/2024) VulnerAI: GPT Based Web Application Vulnerability Detection. **`ICAMAC 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10828788)] 203 | - (10/2024) DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection. **`JSS 2024`** [[Code](https://github.com/Yang-Yanjing/DLAP)] 204 | - (10/2024) Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability. **`TSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10706805)] [[Code](https://github.com/vinci-grape/VulEmpirical)] 205 | - (10/2024) Detecting Source Code Vulnerabilities Using Fine-Tuned Pre-Trained LLMs. **`ICSP 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10846595)] 206 | - (10/2024) A Source Code Vulnerability Detection Method Based on Adaptive Graph Neural Networks. **`ASE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10765114)] 207 | - (09/2024) Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection. **`ESORICS 2024`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-70879-4_14)] 208 | - (09/2024) Navigating (In)Security of AI-Generated Code. **`CSR 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10679468)] 209 | - (09/2024) Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code. **`ISSTA 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3650212.3652127)] [[Code](https://anonymous.4open.science/r/EXPO/README.md)] 210 | - (09/2024) Can a Llama Be a Watchdog? Exploring Llama 3 and Code Llama for Static Application Security Testing. **`CSR 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10679444)] 211 | - (09/2024) May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding. **`Information 2024`** [[Paper](https://www.mdpi.com/2078-2489/15/9/572)] 212 | - (09/2024) Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.00571)] 213 | - (09/2024) Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.10490)] 214 | - (09/2024) SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection. **`ISSTA 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3650212.3652124)] [[Code](https://github.com/Xin-Cheng-Wen/Comment4Vul)] 215 | - (09/2024) Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques.
**`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.01001)] [[Code](https://figshare.com/s/5da14b0776750c6fa787)] 216 | - (09/2024) VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.10756)] 217 | - (08/2024) VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2406.07595)] [[Code](https://github.com/Sweetaroo/VulDetectBench)] 218 | - (08/2024) Defect-Scanner: A Comparative Empirical Study on Language Model and Deep Learning Approach for Software Vulnerability Detection. **`IJIS 2024`** [[Paper](https://link.springer.com/article/10.1007/s10207-024-00901-4)] 219 | - (08/2024) From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.02329)] 220 | - (08/2024) Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.06428)] 221 | - (08/2024) Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. **`ACL 2024`** [[Paper](https://arxiv.org/abs/2406.03718)] [[Code](https://github.com/CGCL-codes/VulLLM)] 222 | - (08/2024) Unintentional Security Flaws in Code: Automated Defense via Root Cause Analysis. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2409.00199)] [[Code](https://anonymous.4open.science/r/Threat_Detection_Modeling-BB7B/README.md)] 223 | - (08/2024) Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection. **`USENIX Security 2024`** [[Paper](https://www.usenix.org/conference/usenixsecurity24/presentation/risse)] [[Code](https://github.com/niklasrisse/USENIX_2024)] [[Code](https://github.com/niklasrisse/VPP)] 224 | - (08/2024) VulSim: Leveraging Similarity of Multi-Dimensional Neighbor Embeddings for Vulnerability Detection. **`USENIX Security 2024`** [[Paper](https://www.usenix.org/conference/usenixsecurity24/presentation/shimmi)] [[Code](https://github.com/SamihaShimmi/VulSim)] 225 | - (07/2024) Enhancing Software Code Vulnerability Detection Using GPT-4o and Claude-3.5 Sonnet: A Study on Prompt Engineering Techniques. **`Electronics 2024`** [[Paper](https://www.mdpi.com/2079-9292/13/13/2657)] 226 | - (07/2024) MultiVD: A Transformer-based Multitask Approach for Software Vulnerability Detection. **`SECRYPT 2024`** [[Paper](https://www.scitepress.org/Papers/2024/127194/127194.pdf)] 227 | - (07/2024) DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection. **`Internetware 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3671016.3671388)] [[Code](https://github.com/GCVulnerability/DFEPT)] 228 | - (07/2024) Vulnerability Classification on Source Code Using Text Mining and Deep Learning Techniques. **`QRS 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10727022)] [[Code](https://sites.google.com/view/vulnerabilityclassification/)] 229 | - (07/2024) Exploration On Prompting LLM With Code-Specific Information For Vulnerability Detection. **`SSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10664399)] 230 | - (07/2024) Effectiveness of ChatGPT for Static Analysis: How Far Are We?. **`AIware 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3664646.3664777)] [[Code](https://zenodo.org/records/10828316)] 231 | - (07/2024) Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models.
**`arXiv 2024`** [[Paper](https://arxiv.org/abs/2408.00197)] 232 | - (07/2024) M2CVD: Enhancing Vulnerability Understanding through Multi-Model Collaboration for Code Vulnerability Detection. **`TOSEM 2024`** [[Paper](https://arxiv.org/abs/2406.05940)] [[Code](https://github.com/HotFrom/M2CVD)] 233 | - (07/2024) SCL-CVD: Supervised Contrastive Learning for Code Vulnerability Detection via GraphCodeBERT. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824002992)] 234 | - (07/2024) Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2407.16235)] 235 | - (06/2024) Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT. **`EASE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3661167.3661281)] [[Code](https://github.com/lhmtriet/LLM4Vul)] 236 | - (06/2024) Greening Large Language Models of Code. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3639475.3640097)] [[Code](https://github.com/soarsmu/Avatar)] 237 | - (06/2024) Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2406.05892)] [[Code](https://zenodo.org/records/11403208)] 238 | - (06/2024) Evaluating the Impact of Conventional Code Analysis Against Large Language Models in API Vulnerability Detection. **`EICC 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3655693.3655701)] 239 | - (06/2024) SVulDetector: Vulnerability Detection based on Similarity using Tree-based Attention and Weighted Graph Embedding Mechanisms. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824002335)] [[Code](https://figshare.com/s/426156a96a83da1d38d0)] 240 | - (05/2024) DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection. **`IEEE Access 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10517582)] 241 | - (05/2024) LLM-CloudSec: Large Language Model Empowered Automatic and Deep Vulnerability Analysis for Intelligent Clouds. **`INFOCOM 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10620804)] [[Code](https://github.com/DPCa0/LLM-CloudSec)] 242 | - (05/2024) LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. **`SP 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10646663/)] [[Code](https://github.com/ai4cloudops/SecLLMHolmes)] 243 | - (05/2024) VulD-CodeBERT: CodeBERT-Based Vulnerability Detection Model for C/C++ Code. **`CISCE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10653337)] 244 | - (05/2024) Large Language Model for Vulnerability Detection: Emerging Results and Future Directions. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3639476.3639762)] [[Code](https://github.com/soarsmu/ChatGPT-VulDetection)] 245 | - (04/2024) VulnGPT: Enhancing Source Code Vulnerability Detection Using AutoGPT and Adaptive Supervision Strategies. **`DCOSS-IoT 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10621527)] 246 | - (04/2024) BiT5: A Bidirectional NLP Approach for Advanced Vulnerability Detection in Codebase. 
**`Procedia Computer Science 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S1877050924006306)] 247 | - (04/2024) Software Vulnerability and Functionality Assessment using Large Language Models. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3643787.3648036)] 248 | - (04/2024) Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks. **`ICSE 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10548173)] [[Code](https://zenodo.org/records/10140638)] 249 | - (04/2024) Towards Causal Deep Learning for Vulnerability Detection. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3597503.3639170)] [[Code](https://figshare.com/s/0ffda320dcb96c249ef2?file=41801019)] 250 | - (04/2024) ProRLearn: Boosting Prompt Tuning-based Vulnerability Detection by Reinforcement Learning. **`ASE 2024`** [[Paper](https://link.springer.com/article/10.1007/s10515-024-00438-9)] [[Code](https://github.com/ProRLearn/ProRLearn001)] 251 | - (04/2024) VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2404.15596)] 252 | - (03/2024) Python Source Code Vulnerability Detection with Named Entity Recognition. **`COSE 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0167404824001032)] [[Code](https://github.com/mmeberg/PyVulDet-NER)] 253 | - (03/2024) GRACE: Empowering LLM-based Software Vulnerability Detection with Graph Structure and In-Context Learning. **`JSS 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0164121224000748)] [[Code](https://github.com/P-E-Vul/GRACE)] 254 | - (03/2024) Learning Defect Prediction from Unrealistic Data. **`SANER 2024`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10589866)] [[Code](https://zenodo.org/records/10514652)] 255 | - (03/2024) Making Vulnerability Prediction more Practical: Prediction, Categorization, and Localization. **`IST 2024`** [[Paper](https://www.sciencedirect.com/science/article/pii/S0950584924000636)] [[Code](https://github.com/liucyy/VulPCL)] 256 | - (02/2024) A Preliminary Study on Using Large Language Models in Software Pentesting. **`NDSS 2024`** [[Paper](https://arxiv.org/abs/2401.17459)] 257 | - (02/2024) TRACED: Execution-aware Pre-training for Source Code. **`ICSE 2024`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3597503.3608140)] [[Code](https://github.com/ARiSE-Lab/TRACED_ICSE_24)] 258 | - (02/2024) LLbezpeky: Leveraging Large Language Models for Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2401.01269)] 259 | - (02/2024) Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2402.17230)] 260 | - (01/2024) Your Instructions Are Not Always Helpful: Assessing the Efficacy of Instruction Fine-tuning for Software Vulnerability Detection. **`arXiv 2024`** [[Paper](https://arxiv.org/abs/2401.07466)] 261 | 262 | ### 2023 263 | - (12/2023) Joint Geometrical and Statistical Domain Adaptation for Cross-domain Code Vulnerability Detection. **`EMNLP 2023`** [[Paper](https://aclanthology.org/2023.emnlp-main.788/)] 264 | - (12/2023) ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We?. **`APSEC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10479409)] [[Code](https://github.com/awsm-research/ChatGPT4Vul)] 265 | - (12/2023) Code Defect Detection Method Based on BERT and Ensemble. 
**`ICCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10507306)] 266 | - (12/2023) Assessing the Effectiveness of Vulnerability Detection via Prompt Tuning: An Empirical Study. **`APSEC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10479384)] [[Code](https://github.com/P-E-Vul/prompt-empircial-vulnerability)] 267 | - (12/2023) Enhancing Code Security Through Open-source Large Language Models: A Comparative Study. **`FPS 2023`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-57537-2_15)] 268 | - (12/2023) Optimizing Pre-trained Language Models for Efficient Vulnerability Detection in Code Snippets. **`ICCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10507456)] 269 | - (12/2023) Exploring the Limits of ChatGPT in Software Security Applications. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2312.05275)] 270 | - (11/2023) How To Get Better Embeddings with Code Pre-trained Models? An Empirical Study. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2311.08066)] 271 | - (11/2023) AIBugHunter: A Practical Tool for Predicting, Classifying and Repairing Software Vulnerabilities. **`EMSE 2023`** [[Paper](https://link.springer.com/article/10.1007/s10664-023-10346-3)] [[Code](https://github.com/awsm-research/AIBugHunter)] 272 | - (11/2023) The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification. **`ESEC/FSE 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3611643.3616304)] [[Code](https://zenodo.org/records/10499843)] 273 | - (11/2023) Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation. **`ESEC/FSE 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3611643.3616358)] [[Code](https://github.com/jacknichao/SVulD)] 274 | - (11/2023) Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2311.04109)] [[Code](https://figshare.com/s/4a16a528d6874aad51a0)] 275 | - (11/2023) Software Vulnerabilities Detection Based on a Pre-trained Language Model. **`TrustCom 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10538979)] 276 | - (10/2023) DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. **`RAID 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3607199.3607242)] [[Code](https://github.com/wagner-group/diversevul)] 277 | - (10/2023) PTLVD:Program Slicing and Transformer-based Line-level Vulnerability Detection System. **`SCAM 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10356694)] [[Code](https://github.com/chenshixu/PTLVD)] 278 | - (10/2023) Software Vulnerability Detection using Large Language Models. **`ISSRE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10301302)] 279 | - (10/2023) Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2310.16263)] 280 | - (09/2023) Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge. **`ASE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10298584)] [[Code](https://github.com/jacknichao/MVulD)] 281 | - (09/2023) DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism. 
**`arXiv 2023`** [[Paper](https://arxiv.org/abs/2309.15324)] [[Code](https://github.com/WJ-8/DefectHunter)] 282 | - (09/2023) When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection. **`ASE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10298363)] [[Code](https://github.com/PILOT-VD-2023/PILOT)] 283 | - (08/2023) Using ChatGPT as a Static Application Security Testing Tool. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2308.14434)] [[Code](https://github.com/abakhshandeh/ChatGPTasSAST)] 284 | - (08/2023) VulExplainer: A Transformer-Based Hierarchical Distillation for Explaining Vulnerability Types. **`TSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10220166)] [[Code](https://github.com/awsm-research/VulExplainer)] 285 | - (08/2023) Software Vulnerability Detection with GPT and In-Context Learning. **`DSC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10381286)] 286 | - (08/2023) Can Large Language Models Find And Fix Vulnerable Software?. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2308.10345)] 287 | - (07/2023) Leveraging Deep Learning Models for Cross-function Null Pointer Risks Detection. **`AITest 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10229470)] 288 | - (07/2023) An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph. **`EuroS&P 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10190505)] [[Code](https://github.com/pial08/SemVulDet)] 289 | - (07/2023) VulDetect: A novel technique for detecting software vulnerabilities using Language Models. **`CSR 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10224924)] 290 | - (07/2023) An Enhanced Vulnerability Detection in Software Using a Heterogeneous Encoding Ensemble. **`ISCC 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10217978)] 291 | - (06/2023) New Tricks to Old Codes: Can AI Chatbots Replace Static Code Analysis Tools?. **`EICC 2023`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3590777.3590780)] [[Code](https://github.com/New-Tricks-to-Old-Codes/Replace-Static-Analysis-Tools)] 292 | - (06/2023) Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code. **`TSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10153647)] [[Code](https://zenodo.org/records/7123322)] 293 | - (05/2023) An Empirical Study of Deep Learning Models for Vulnerability Detection. **`ICSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10172583)] [[Code](https://figshare.com/articles/dataset/An_Empirical_Study_of_Deep_Learning_Models_for_Vulnerability_Detection/20791240?file=39183863)] 294 | - (05/2023) Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2306.01754)] 295 | - (05/2023) Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models. **`ICSE 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10172346)] [[Code](https://github.com/ReliableCoding/REPEAT)] 296 | - (05/2023) Detecting Vulnerabilities in IoT Software: New Hybrid Model and Comprehensive Data Analysis. **`JISA 2023`** [[Paper](https://www.sciencedirect.com/science/article/pii/S2214212623000510)] 297 | - (05/2023) VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software Vulnerabilities Using Language Models. 
**`JEEIT 2023`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10185860)] 298 | - (04/2023) Evaluation of ChatGPT Model for Vulnerability Detection. **`arXiv 2023`** [[Paper](https://arxiv.org/abs/2304.07232)] 299 | 300 | ### 2022 301 | - (12/2022) BBVD: A BERT-based Method for Vulnerability Detection. **`IJACSA 2022`** [[Paper](https://www.proquest.com/docview/2770373789?pq-origsite=gscholar&fromopenview=true&sourcetype=Scholarly%20Journals)] 302 | - (12/2022) Exploring Transformers for Multi-Label Classification of Java Vulnerabilities. **`QRS 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10062434)] [[Code](https://github.com/TQRG/VDET-for-Java)] 303 | - (12/2022) Transformer-Based Language Models for Software Vulnerability Detection. **`ACSAC 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3564625.3567985)] [[Code](https://bitbucket.csiro.au/users/jan087/repos/acsac-2022-submission/browse)] 304 | - (12/2022) PATVD: Vulnerability Detection Based on Pre-training Techniques and Adversarial Training. **`SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/10189687/)] 305 | - (11/2022) Multi-view Pre-trained Model for Code Vulnerability Identification. **`WASA 2022`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-19211-1_11)] 306 | - (11/2022) Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection. **`Mathematics 2022`** [[Paper](https://www.mdpi.com/2227-7390/10/23/4482)] 307 | - (11/2022) BERT-Based Vulnerability Type Identification with Effective Program Representation. **`WASA 2022`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-19208-1_23#citeas)] 308 | - (10/2022) VulDeBERT: A Vulnerability Detection System Using BERT. **`ISSRE 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9985089)] [[Code](https://github.com/SKKU-SecLab/VulDeBERT)] 309 | - (07/2022) VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection. **`IJCNN 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9892280)] [[Code](https://github.com/ICL-ml4csec/VulBERTa)] 310 | - (06/2022) Cyber Security Vulnerability Detection Using Natural Language Processing. **`AIIoT 2022`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9817336)] 311 | - (05/2022) LineVul: A Transformer-based Line-level Vulnerability Prediction. **`MSR 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3524842.3528452)] [[Code](https://github.com/awsm-research/LineVul)] 312 | - (05/2022) LineVD: Statement-level Vulnerability Detection using Graph Neural Networks. **`MSR 2022`** [[Paper](https://dl.acm.org/doi/abs/10.1145/3524842.3527949)] [[Code](https://github.com/davidhin/linevd)] 313 | - (03/2022) Intelligent Detection of Vulnerable Functions in Software through Neural Embedding-based Code Analysis. **`IJNM 2022`** [[Paper](https://onlinelibrary.wiley.com/doi/full/10.1002/nem.2198)] [[Code](https://cybercodeintelligence.github.io/CyberCI/)] 314 | - (01/2022) Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization. **`Security and Communication Networks 2022`** [[Paper](https://onlinelibrary.wiley.com/doi/full/10.1155/2022/5203217)] [[Code](https://cybercodeintelligence.github.io/CyberCI/)] 315 | 316 | ### 2021 317 | - (12/2021) Automated Software Vulnerability Detection via Pre-trained Context Encoder and Self Attention. 
**`ICDF2C 2021`** [[Paper](https://link.springer.com/chapter/10.1007/978-3-031-06365-7_15)] 318 | - (11/2021) Detecting Integer Overflow Errors in Java Source Code via Machine Learning. **`ICTAI 2021`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9643278)] 319 | - (06/2021) Unified Pre-training for Program Understanding and Generation. **`NAACL 2021`** [[Paper](https://par.nsf.gov/servlets/purl/10336701)] [[Code](https://github.com/wasiahmad/PLBART)] 320 | - (05/2021) Security Vulnerability Detection Using Deep Learning Natural Language Processing. **`INFOCOM 2021`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9484500)] 321 | 322 | ### 2020 323 | - (06/2020) Exploring Software Naturalness through Neural Language Models. **`arXiv 2020`** [[Paper](https://arxiv.org/abs/2006.12641)] 324 | 325 | 326 | ## Datasets 327 | 328 | - SARD. [[Repo](https://samate.nist.gov/SARD)] 329 | - Juliet C/C++. [[Repo](https://samate.nist.gov/SARD/test-suites/112)] 330 | - Juliet Java. [[Repo](https://samate.nist.gov/SARD/test-suites/111)] 331 | - VulDeePecker. **`NDSS`** [[Paper](https://www.ndss-symposium.org/wp-content/uploads/2018/02/ndss2018_03A-2_Li_paper.pdf)] [[Repo](https://github.com/CGCL-codes/VulDeePecker)] 332 | - Draper. **`ICMLA`** [[Paper](https://ieeexplore.ieee.org/document/8614145)] [[Repo](https://osf.io/d45bw/)] 333 | - Devign. **`NeurIPS`** [[Paper](https://proceedings.neurips.cc/paper_files/paper/2019/hash/49265d2447bc3bbfe9e76306ce40a31f-Abstract.html)] [[Repo](https://github.com/epicosy/devign)] 334 | - Big-Vul. **`MSR`** [[Paper](https://dl.acm.org/doi/10.1145/3379597.3387501)] [[Repo](https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset)] 335 | - D2A. **`ICSE-SEIP`** [[Paper](https://ieeexplore.ieee.org/document/9402126)] [[Repo](https://github.com/IBM/D2A)] 336 | - Reveal. **`TSE`** [[Paper](https://ieeexplore.ieee.org/abstract/document/9448435)] [[Repo](https://github.com/VulDetProject/ReVeal)] 337 | - CVEfixes. **`PROMISE`** [[Paper](https://dl.acm.org/doi/10.1145/3475960.3475985)] [[Repo](https://zenodo.org/records/13118970)] 338 | - CrossVul. **`ESEC/FSE`** [[Paper](https://dl.acm.org/doi/10.1145/3468264.3473122)] [[Repo](https://zenodo.org/records/4734050)] 339 | - SecurityEval. **`MSR4P&S`** [[Paper](https://dl.acm.org/doi/10.1145/3549035.3561184)] [[Repo](https://github.com/s2e-lab/SecurityEval)] 340 | - DiverseVul. **`RAID`** [[Paper](https://dl.acm.org/doi/10.1145/3607199.3607242)] [[Repo](https://github.com/wagner-group/diversevul)] 341 | - SVEN. **`CCS`** [[Paper](https://dl.acm.org/doi/10.1145/3576915.3623175)] [[Repo](https://github.com/eth-sri/sven)] 342 | - FormAI. **`PROMISE`** [[Paper](https://dl.acm.org/doi/10.1145/3617555.3617874)] [[Repo](https://github.com/FormAI-Dataset/FormAI-dataset)] 343 | - ReposVul. **`ICSE-Companion`** [[Paper](https://dl.acm.org/doi/10.1145/3639478.3647634)] [[Repo](https://github.com/Eshe0922/ReposVul)] 344 | - PrimeVul. **`arXiv`** [[Paper](https://arxiv.org/abs/2403.18624)] [[Repo](https://github.com/DLVulDet/PrimeVul)] 345 | - PairVul. **`arXiv`** [[Paper](https://arxiv.org/abs/2406.11147)] [[Repo](https://github.com/KnowledgeRAG4LLMVulD/KnowledgeRAG4LLMVulD/tree/main/dataset)] 346 | - MegaVul. **`MSR`** [[Paper](https://dl.acm.org/doi/10.1145/3643991.3644886)] [[Repo](https://github.com/Icyrockton/MegaVul)] 347 | - CleanVul. 
**`arXiv`** [[Paper](https://arxiv.org/abs/2411.17274)] [[Repo](https://github.com/yikun-li/CleanVul)] 348 | 349 | 350 | 351 | ## Contribution 352 | 353 | If you want to suggest additions to the list of studies or datasets, please open a pull request or submit an issue. 354 | 355 | 356 | ## License 357 | 358 | - 🧠 Code & scripts (`*.py`, `*.ipynb`, etc.): Licensed under the [MIT License](LICENSE). 359 | - 📚 Taxonomy, markdown outputs and lists: Licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). 360 | 361 | Please cite our paper if you use this resource. 362 | -------------------------------------------------------------------------------- /analyses/study_taxonomy_analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "85d7fa6a", 6 | "metadata": {}, 7 | "source": [ 8 | "## Insights into Taxonomy" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "5a5bf3ec", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pandas as pd\n", 19 | "import seaborn as sns\n", 20 | "import numpy as np\n", 21 | "import matplotlib\n", 22 | "import matplotlib.pyplot as plt\n", 23 | "import matplotlib.gridspec as gridspec\n", 24 | "import matplotlib.cm as cm\n", 25 | "import plotly.colors as pc\n", 26 | "import plotly.express as px\n", 27 | "import plotly.graph_objects as go\n", 28 | "import plotly.colors as pc\n", 29 | "import plotly.io as pio\n", 30 | "from plotly.subplots import make_subplots\n", 31 | "import json\n", 32 | "import os\n", 33 | "import re\n", 34 | "import h5py\n", 35 | "pio.renderers.default = \"vscode\"\n", 36 | "\n", 37 | "from matplotlib.colors import LinearSegmentedColormap\n", 38 | "from matplotlib.patches import Rectangle\n", 39 | "from matplotlib.patches import Patch\n", 40 | "from matplotlib.lines import Line2D\n", 41 | "from matplotlib.ticker import MultipleLocator\n", 42 | "from matplotlib.ticker import AutoMinorLocator\n", 43 | "from mpl_toolkits.axes_grid1.inset_locator import inset_axes, mark_inset\n", 44 | "from collections import defaultdict\n", 45 | "from collections import Counter\n", 46 | "from pathlib import Path\n", 47 | "\n", 48 | "plt.rcParams[\"font.family\"] = \"serif\"\n", 49 | "plt.rcParams[\"font.serif\"] = [\"Times New Roman\"]\n", 50 | "plt.rcParams[\"mathtext.fontset\"] = \"dejavuserif\" \n", 51 | "\n", 52 | "sns.set_theme(style=\"white\")\n", 53 | "pd.set_option('display.max_rows', None)\n", 54 | "pd.set_option('display.max_columns', None)\n", 55 | "pd.set_option('display.width', None) # Prevents wrapping\n", 56 | "pd.set_option('display.max_colwidth', None) # Shows full content in each cell" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 2, 62 | "id": "daf22d40", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "application/vnd.plotly.v1+json": { 68 | "config": { 69 | "plotlyServerURL": "https://plot.ly" 70 | }, 71 | "data": [ 72 | { 73 | "arrangement": "snap", 74 | "link": { 75 | "color": [ 76 | "rgba(220, 220, 220, 0.5)", 77 | "rgb(102, 197, 204)", 78 | "rgb(102, 197, 204)", 79 | "rgb(102, 197, 204)", 80 | "rgba(220, 220, 220, 0.5)", 81 | "rgb(158, 185, 243)", 82 | "rgb(158, 185, 243)", 83 | "rgb(158, 185, 243)", 84 | "rgba(220, 220, 220, 0.5)", 85 | "rgb(254, 136, 177)", 86 | "rgb(254, 136, 177)", 87 | "rgb(254, 136, 177)", 88 | "rgba(220, 220, 220, 0.5)", 89 | "rgb(201, 219, 116)", 90 | "rgb(201, 219, 116)", 91 | "rgb(201, 219, 116)" 92 | 
], [Plotly Sankey figure output omitted — sankey nodes: "Binary (F1.1) {195}", "Classification Only {190}", "Description (F2.1) {21}", "Reasoning (F2.2) {40}", "Report (F2.3) {23}", "Multi-Class (F1.2) {61}", "Multi-Label (F1.3) {23}", "Vulnerability-Specific (F1.1.1) {20}"; column annotations: "Classification (F1)" → "Generation (F2)"; remaining figure JSON (encoded link/node data, colors, hover templates, default Plotly template and colorscales, fonts, sizing) not shown]
 957 | "metadata": {}, 958 | "output_type": "display_data" 959 | } 960 | ], 961 | "source": [ 962 | "# sankey task formulation\n", 963 | "# ==========================================\n", 964 | "taxonomy_task_df = pd.read_excel(\"./taxonomy.xlsx\", sheet_name=\"STUDY_TASK\")\n", 965 | "df = taxonomy_task_df[['CitationKey', 'Classification', 'Generation']].copy()\n", 966 | "\n", 967 | "df['Classification'] = df['Classification'].fillna('No Classification')\n", 968 | "df['Generation'] = df['Generation'].fillna('Classification Only') \n", 969 | "\n", 970 | "df = df.replace('None', 'No Classification')\n", 971 | 
"df['Generation'] = df['Generation'].replace('No Classification', 'Classification Only')\n", 972 | "df = df.replace('nan', 'No Classification')\n", 973 | "\n", 974 | "# Explode lists\n", 975 | "for col in ['Classification', 'Generation']:\n", 976 | " df[col] = df[col].astype(str).str.split(',')\n", 977 | " df = df.explode(col)\n", 978 | " df[col] = df[col].str.strip()\n", 979 | "\n", 980 | "df = df[df['Classification'] != '']\n", 981 | "df = df[df['Generation'] != '']\n", 982 | "\n", 983 | "# Calculate Weights\n", 984 | "df['Class_Count'] = df.groupby('CitationKey')['Classification'].transform('count')\n", 985 | "df['Gen_Count'] = df.groupby('CitationKey')['Generation'].transform('count')\n", 986 | "df['Weight'] = 1 / (df['Class_Count'] * df['Gen_Count'])\n", 987 | "\n", 988 | "# Create Edges\n", 989 | "edges = df.groupby(['Classification', 'Generation'])['Weight'].sum().reset_index(name='Value')\n", 990 | "edges = edges.rename(columns={'Classification': 'Source', 'Generation': 'Target'})\n", 991 | "\n", 992 | "# Define Node Properties\n", 993 | "all_labels = pd.unique(edges[['Source', 'Target']].values.ravel())\n", 994 | "nodes_df = pd.DataFrame({'Label': all_labels})\n", 995 | "nodes_df['ID'] = nodes_df.index\n", 996 | "label_to_id = dict(zip(nodes_df['Label'], nodes_df['ID']))\n", 997 | "\n", 998 | "edges['SourceID'] = edges['Source'].map(label_to_id)\n", 999 | "edges['TargetID'] = edges['Target'].map(label_to_id)\n", 1000 | "\n", 1001 | "# ==========================================\n", 1002 | "residual_labels = ['Classification Only', 'Generation Only', 'No Classification']\n", 1003 | "palette = pc.qualitative.Pastel \n", 1004 | "\n", 1005 | "node_colors = []\n", 1006 | "link_colors = []\n", 1007 | "\n", 1008 | "# Assign Node Colors\n", 1009 | "for idx, row in nodes_df.iterrows():\n", 1010 | " if row['Label'] in residual_labels:\n", 1011 | " # Keep residuals gray\n", 1012 | " node_colors.append('rgba(200, 200, 200, 0.5)') \n", 1013 | " else:\n", 1014 | " # Assign color from Pastel palette\n", 1015 | " color_idx = idx % len(palette)\n", 1016 | " node_colors.append(palette[color_idx])\n", 1017 | "\n", 1018 | "# Assign Link Colors\n", 1019 | "for idx, row in edges.iterrows():\n", 1020 | " source_label = row['Source']\n", 1021 | " target_label = row['Target']\n", 1022 | " \n", 1023 | " if source_label in residual_labels or target_label in residual_labels:\n", 1024 | " link_colors.append('rgba(220, 220, 220, 0.5)')\n", 1025 | " else:\n", 1026 | " source_id = label_to_id[source_label]\n", 1027 | " base_color = node_colors[source_id]\n", 1028 | " if base_color.startswith('#'):\n", 1029 | " h = base_color.lstrip('#')\n", 1030 | " rgb = tuple(int(h[i:i+2], 16) for i in (0, 2, 4))\n", 1031 | " link_colors.append(f'rgba({rgb[0]}, {rgb[1]}, {rgb[2]}, 0.6)')\n", 1032 | " else:\n", 1033 | " link_colors.append(base_color)\n", 1034 | "\n", 1035 | "# ==========================================\n", 1036 | "# counts\n", 1037 | "label_to_studies = defaultdict(set)\n", 1038 | "for idx, row in taxonomy_task_df.iterrows():\n", 1039 | " val_c = str(row['Classification'])\n", 1040 | " if val_c != 'None' and val_c != 'nan':\n", 1041 | " for tag in val_c.split(','):\n", 1042 | " label_to_studies[tag.strip()].add(row['CitationKey'])\n", 1043 | " else:\n", 1044 | " label_to_studies['No Classification'].add(row['CitationKey'])\n", 1045 | "\n", 1046 | " val_g = str(row['Generation'])\n", 1047 | " if val_g != 'None' and val_g != 'nan':\n", 1048 | " for tag in val_g.split(','):\n", 1049 | " 
label_to_studies[tag.strip()].add(row['CitationKey'])\n", 1050 | " else:\n", 1051 | " label_to_studies['Classification Only'].add(row['CitationKey'])\n", 1052 | "\n", 1053 | "nodes_df['StudyCount'] = nodes_df['Label'].map(lambda x: len(label_to_studies.get(x, set())))\n", 1054 | "\n", 1055 | "# --- Taxonomy IDs ---\n", 1056 | "taxonomy_ids = {\n", 1057 | " \"Binary\": \"F1.1\",\n", 1058 | " \"Multi-Class\": \"F1.2\",\n", 1059 | " \"Multi-Label\": \"F1.3\",\n", 1060 | " \"Vulnerability-Specific\": \"F1.1.1\",\n", 1061 | " \"Description\": \"F2.1\",\n", 1062 | " \"Reasoning\": \"F2.2\",\n", 1063 | " \"Report\": \"F2.3\"\n", 1064 | "}\n", 1065 | "\n", 1066 | "# Label Formatter with () and {}\n", 1067 | "def format_label(row):\n", 1068 | " label = row['Label']\n", 1069 | " count = row['StudyCount']\n", 1070 | " \n", 1071 | " if label == 'No Classification':\n", 1072 | " return \"\"\n", 1073 | " \n", 1074 | " tax_id = taxonomy_ids.get(label, \"\")\n", 1075 | " if tax_id:\n", 1076 | " return f\"{label} ({tax_id}) {{{count}}}\"\n", 1077 | " else:\n", 1078 | " return f\"{label} {{{count}}}\"\n", 1079 | "\n", 1080 | "nodes_df['LabelDisplay'] = nodes_df.apply(format_label, axis=1)\n", 1081 | "\n", 1082 | "# Plot\n", 1083 | "fig = go.Figure(data=[go.Sankey(\n", 1084 | " arrangement=\"snap\",\n", 1085 | " node=dict(\n", 1086 | " pad=20,\n", 1087 | " thickness=20,\n", 1088 | " line=dict(color=\"black\", width=0.5),\n", 1089 | " label=nodes_df['LabelDisplay'],\n", 1090 | " color=node_colors,\n", 1091 | " hovertemplate='%{label}
Volume: %{value:.2f}',\n", 1092 | " ),\n", 1093 | " link=dict(\n", 1094 | " source=edges['SourceID'],\n", 1095 | " target=edges['TargetID'],\n", 1096 | " value=edges['Value'],\n", 1097 | " color=link_colors\n", 1098 | " )\n", 1099 | ")])\n", 1100 | "\n", 1101 | "# ==========================================\n", 1102 | "fig.update_layout(\n", 1103 | " font=dict(\n", 1104 | " family=\"Times New Roman, serif\", \n", 1105 | " size=20, \n", 1106 | " color=\"black\"\n", 1107 | " ),\n", 1108 | " width=1000,\n", 1109 | " height=500,\n", 1110 | " margin=dict(b=60, t=40),\n", 1111 | " \n", 1112 | " annotations=[\n", 1113 | " # Left Column Label\n", 1114 | " dict(\n", 1115 | " x=0,\n", 1116 | " y=-0.1,\n", 1117 | " xref=\"paper\",\n", 1118 | " yref=\"paper\",\n", 1119 | " text=\"Classification (F1)\", # Taxonomy in ()\n", 1120 | " showarrow=False,\n", 1121 | " font=dict(size=20, color=\"black\"), \n", 1122 | " align=\"center\"\n", 1123 | " ),\n", 1124 | " # Right Column Label\n", 1125 | " dict(\n", 1126 | " x=1,\n", 1127 | " y=-0.1,\n", 1128 | " xref=\"paper\",\n", 1129 | " yref=\"paper\",\n", 1130 | " text=\"Generation (F2)\", # Taxonomy in ()\n", 1131 | " showarrow=False,\n", 1132 | " font=dict(size=20, color=\"black\"),\n", 1133 | " align=\"center\"\n", 1134 | " )\n", 1135 | " ]\n", 1136 | ")\n", 1137 | "\n", 1138 | "fig.show()" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "code", 1143 | "execution_count": 3, 1144 | "id": "e4405f7a", 1145 | "metadata": {}, 1146 | "outputs": [ 1147 | { 1148 | "data": { 1149 | "application/vnd.plotly.v1+json": { 1150 | "config": { 1151 | "plotlyServerURL": "https://plot.ly" 1152 | }, 1153 | "data": [ 1154 | { 1155 | "arrangement": "snap", 1156 | "link": { 1157 | "color": [ 1158 | "rgba(220, 176, 242, 0.4)", 1159 | "rgba(220, 176, 242, 0.4)", 1160 | "rgba(220, 176, 242, 0.4)", 1161 | "rgba(248, 156, 116, 0.4)", 1162 | "rgba(248, 156, 116, 0.4)", 1163 | "rgba(248, 156, 116, 0.4)", 1164 | "rgba(248, 156, 116, 0.4)", 1165 | "rgba(246, 207, 113, 0.4)", 1166 | "rgba(246, 207, 113, 0.4)", 1167 | "rgba(246, 207, 113, 0.4)", 1168 | "rgba(246, 207, 113, 0.4)", 1169 | "rgba(246, 207, 113, 0.4)", 1170 | "rgba(102, 197, 204, 0.4)", 1171 | "rgba(102, 197, 204, 0.4)", 1172 | "rgba(102, 197, 204, 0.4)", 1173 | "rgba(102, 197, 204, 0.4)", 1174 | "rgba(102, 197, 204, 0.4)", 1175 | "rgba(200, 200, 200, 0.4)", 1176 | "rgba(135, 197, 95, 0.4)", 1177 | "rgba(135, 197, 95, 0.4)", 1178 | "rgba(158, 185, 243, 0.4)", 1179 | "rgba(158, 185, 243, 0.4)", 1180 | "rgba(158, 185, 243, 0.4)", 1181 | "rgba(158, 185, 243, 0.4)", 1182 | "rgba(158, 185, 243, 0.4)", 1183 | "rgba(158, 185, 243, 0.4)", 1184 | "rgba(158, 185, 243, 0.4)", 1185 | "rgba(200, 200, 200, 0.4)", 1186 | "rgba(254, 136, 177, 0.4)", 1187 | "rgba(254, 136, 177, 0.4)", 1188 | "rgba(254, 136, 177, 0.4)", 1189 | "rgba(254, 136, 177, 0.4)", 1190 | "rgba(254, 136, 177, 0.4)" 1191 | ], 1192 | "source": [ 1193 | 3, 1194 | 3, 1195 | 3, 1196 | 2, 1197 | 2, 1198 | 2, 1199 | 2, 1200 | 1, 1201 | 1, 1202 | 1, 1203 | 1, 1204 | 1, 1205 | 0, 1206 | 0, 1207 | 0, 1208 | 0, 1209 | 0, 1210 | 12, 1211 | 4, 1212 | 4, 1213 | 5, 1214 | 5, 1215 | 5, 1216 | 5, 1217 | 5, 1218 | 5, 1219 | 5, 1220 | 19, 1221 | 6, 1222 | 6, 1223 | 6, 1224 | 6, 1225 | 6 1226 | ], 1227 | "target": [ 1228 | 4, 1229 | 5, 1230 | 6, 1231 | 12, 1232 | 4, 1233 | 5, 1234 | 6, 1235 | 12, 1236 | 4, 1237 | 5, 1238 | 19, 1239 | 6, 1240 | 12, 1241 | 4, 1242 | 5, 1243 | 19, 1244 | 6, 1245 | 12, 1246 | 14, 1247 | 16, 1248 | 9, 1249 | 10, 1250 | 16, 1251 | 17, 1252 | 18, 1253 | 20, 1254 | 22, 
], [Plotly Sankey figure output omitted — model-scale nodes: "Tiny (S1.3.1)", "Small (S1.3.2)", "Medium (S1.3.3)", "Large (S1.3.4)"; technique-category nodes: "Full Fine-Tuning (T2.2.2.1)", "Parameter-Efficient Fine-Tuning (T2.2.2.2)", "Prompt Engineering (T2.1)", "Pre-Training (T2.2.1)", "Feature Extraction (T1)"; specific-technique nodes: "Full-Parameter {117}", "Zero-Shot {56}", "CoT {38}", "Feature Extraction {32}", "In-Context {30}", "Few-Shot {28}", "Low-Rank Decomposition {27}", "RAG {25}", "Pre-Training {14}", "Instruction-Tuning {13}", "LoRA Derivates {8}", "Prompt-Tuning {5}", "Adapter-Tuning {2}", "Selective {2}", "Additive-Other {1}"; column annotations: "Model Scale (S1.3)" → "Adaptation Technique (T2)"; remaining figure JSON (encoded link/node data, colors, hover templates, default Plotly template and colorscales, fonts, sizing) not shown]
 2181 | "metadata": {}, 2182 | "output_type": "display_data" 2183 | } 2184 | ], 2185 | "source": [ 2186 | "# sankey model & adaptation techniques\n", 2187 | "# ==========================================\n", 2188 | "df_models = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"MODELS_ESTIMATED\")\n", 2189 | "df_study_model = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"STUDY_MODEL\")\n", 2190 | "df_techniques = pd.read_excel(\"taxonomy.xlsx\", sheet_name=\"STUDY_TECHNIQUE\")\n", 2191 | "\n", 2192 | "\n", 2193 | "df_study_model['Adaptation'] = df_study_model['Adaptation'].astype(str).str.split(',')\n", 2194 | "df_study_model = df_study_model.explode('Adaptation')\n", 2195 | "df_study_model['Adaptation'] = df_study_model['Adaptation'].str.strip()\n", 2196 | "\n", 2197 | "merged_models = pd.merge(\n", 2198 | " df_study_model[['CitationKey', 'ModelKey', 'Adaptation']],\n", 2199 | " df_models[['ModelKey', 'Scale']],\n", 2200 | " on='ModelKey',\n", 2201 | " how='left'\n", 2202 | ")\n", 2203 | "\n", 2204 | "full_df = pd.merge(\n", 2205 | " merged_models,\n", 2206 | " df_techniques[['CitationKey', 'Prompt-Engineering', 'Training']],\n", 2207 | " on='CitationKey',\n", 2208 | " how='left'\n", 2209 | ")\n", 2210 | "\n", 2211 | "# ==========================================\n", 2212 | "peft_keywords = ['Low-Rank Decomposition', 'LoRA Derivates', 'Adapter-Tuning', 'Selective', 'Additive-Other', 'Prompt-Tuning', 'Instruction-Tuning']\n", 2213 | "full_keywords = ['Full-Parameter Fine-Tuning', 'Instruction-Tuning']\n", 2214 | "prompt_keywords = ['CoT', 'Few-Shot', 'RAG', 'In-Context', 'Zero-Shot']\n", 2215 | "pre_keywords = ['Pre-Training']\n", 2216 | "\n", 2217 | "def resolve_technique(row):\n", 2218 | " adaptation = str(row['Adaptation']).upper().strip()\n", 2219 | " \n", 2220 | " if adaptation == 'PROMPT':\n", 2221 | " val = str(row['Prompt-Engineering'])\n", 2222 | " if val in ['nan', 'None', '']: return [\"Unspecified Prompting\"]\n", 2223 | " tags = [x.strip() for x in val.split(',')]\n", 2224 | " valid_tags = [t for t in tags if any(k.lower() in t.lower() for k in prompt_keywords)]\n", 2225 | " return valid_tags if valid_tags else tags \n", 2226 | "\n", 2227 | " train_val = str(row['Training'])\n", 2228 | " if train_val in ['nan', 'None', '']: return [\"Unspecified Training\"]\n", 2229 | " tags = [x.strip() for x in train_val.split(',')]\n", 2230 | " relevant_techniques = []\n", 2231 | "\n", 2232 | " if adaptation == 'PEFT':\n", 2233 | " for tag in tags:\n", 2234 | " if any(k.lower() in tag.lower() for k in peft_keywords):\n", 2235 | " relevant_techniques.append(tag)\n", 2236 | " if not relevant_techniques: relevant_techniques.append(\"Other PEFT\")\n", 2237 | "\n", 2238 | " elif adaptation == 'FULL':\n", 2239 | " for tag in tags:\n", 2240 | " if any(k.lower() in tag.lower() for k in full_keywords):\n", 2241 | " relevant_techniques.append(tag)\n", 2242 | " if not 
relevant_techniques: relevant_techniques.append(\"Other Fine-Tuning\")\n", 2243 | " \n", 2244 | " elif adaptation == 'PRE':\n", 2245 | " for tag in tags:\n", 2246 | " if any(k.lower() in tag.lower() for k in pre_keywords):\n", 2247 | " relevant_techniques.append(tag)\n", 2248 | " if not relevant_techniques: relevant_techniques.append(\"Pre-Training\")\n", 2249 | "\n", 2250 | " elif adaptation == 'FEATURE':\n", 2251 | " return [\"Feature Extraction\"]\n", 2252 | "\n", 2253 | " return relevant_techniques\n", 2254 | "\n", 2255 | "full_df['Specific_Techniques'] = full_df.apply(resolve_technique, axis=1)\n", 2256 | "sankey_df = full_df.explode('Specific_Techniques')\n", 2257 | "sankey_df = sankey_df.dropna(subset=['Specific_Techniques']) \n", 2258 | "sankey_df = sankey_df[sankey_df['Specific_Techniques'] != \"\"] \n", 2259 | "\n", 2260 | "\n", 2261 | "# ==========================================\n", 2262 | "def get_method_category(code):\n", 2263 | " code = str(code).upper()\n", 2264 | " if code == 'PROMPT': return \"Prompt Engineering\"\n", 2265 | " if code == 'FULL': return \"Fine-Tuning\" \n", 2266 | " if code == 'PEFT': return \"Parameter-Efficient Fine-Tuning\"\n", 2267 | " if code == 'PRE': return \"Pre-Training\"\n", 2268 | " if code == 'FEATURE': return \"Feature Extraction\"\n", 2269 | " return \"Other\"\n", 2270 | "\n", 2271 | "sankey_df['Method_Category'] = sankey_df['Adaptation'].apply(get_method_category)\n", 2272 | "replace_map = {'Full-Parameter Fine-Tuning': 'Full-Parameter'}\n", 2273 | "sankey_df['Specific_Techniques'] = sankey_df['Specific_Techniques'].replace(replace_map)\n", 2274 | "sankey_df['Scale'] = sankey_df['Scale'].astype(str).str.strip().str.title()\n", 2275 | "\n", 2276 | "# Weights\n", 2277 | "sankey_df['Study_Row_Count'] = sankey_df.groupby('CitationKey')['CitationKey'].transform('count')\n", 2278 | "sankey_df['Weight'] = 1 / sankey_df['Study_Row_Count']\n", 2279 | "\n", 2280 | "# Unique Counts (Only needed for Level 2 now based on requirements)\n", 2281 | "unique_counts_lvl2 = sankey_df.groupby('Specific_Techniques')['CitationKey'].nunique()\n", 2282 | "\n", 2283 | "\n", 2284 | "# ==========================================\n", 2285 | "raw_to_display = {} \n", 2286 | "scale_ids = {\n", 2287 | " \"Tiny\": \"S1.3.1\",\n", 2288 | " \"Small\": \"S1.3.2\",\n", 2289 | " \"Medium\": \"S1.3.3\",\n", 2290 | " \"Large\": \"S1.3.4\"\n", 2291 | "}\n", 2292 | "raw_lvl0 = [\"Tiny\", \"Small\", \"Medium\", \"Large\"]\n", 2293 | "lvl0_labels = []\n", 2294 | "\n", 2295 | "for raw in raw_lvl0:\n", 2296 | " if raw in sankey_df['Scale'].unique():\n", 2297 | " tax_id = scale_ids.get(raw, \"\")\n", 2298 | " # Format: \"Tiny (S1.3.1)\"\n", 2299 | " final_label = f\"{raw} ({tax_id})\" if tax_id else raw\n", 2300 | " lvl0_labels.append(final_label)\n", 2301 | " raw_to_display[raw] = final_label\n", 2302 | "\n", 2303 | "cat_ids = {\n", 2304 | " \"Feature Extraction\": \"T1\",\n", 2305 | " \"Pre-Training\": \"T2.2.1\",\n", 2306 | " \"Prompt Engineering\": \"T2.1\",\n", 2307 | " \"Fine-Tuning\": \"T2.2.2.1\",\n", 2308 | " \"Parameter-Efficient Fine-Tuning\": \"T2.2.2.2\"\n", 2309 | "}\n", 2310 | "cat_display_names = {\n", 2311 | " \"Fine-Tuning\": \"Full Fine-Tuning\"\n", 2312 | "}\n", 2313 | "\n", 2314 | "raw_lvl1 = [\"Fine-Tuning\", \"Parameter-Efficient Fine-Tuning\", \"Prompt Engineering\", \"Pre-Training\", \"Feature Extraction\"]\n", 2315 | "lvl1_labels = []\n", 2316 | "existing_cats = sankey_df['Method_Category'].unique()\n", 2317 | "\n", 2318 | "for raw in raw_lvl1:\n", 2319 | " if 
raw in existing_cats:\n", 2320 | " tax_id = cat_ids.get(raw, \"\")\n", 2321 | " disp_name = cat_display_names.get(raw, raw)\n", 2322 | " # Format: \"Pre-Training (T2.2.1)\"\n", 2323 | " final_label = f\"{disp_name} ({tax_id})\" if tax_id else disp_name\n", 2324 | " lvl1_labels.append(final_label)\n", 2325 | " raw_to_display[raw] = final_label\n", 2326 | "\n", 2327 | "# Specific Techniques\n", 2328 | "# Format: \"LoRA {25}\" \n", 2329 | "raw_lvl2 = sorted(sankey_df['Specific_Techniques'].unique().tolist())\n", 2330 | "lvl2_labels = []\n", 2331 | "for raw in raw_lvl2:\n", 2332 | " count = unique_counts_lvl2.get(raw, 0)\n", 2333 | " # Using triple braces {{{ }}} to print literal braces in f-string\n", 2334 | " final_label = f\"{raw} {{{count}}}\"\n", 2335 | " lvl2_labels.append(final_label)\n", 2336 | " raw_to_display[raw] = final_label\n", 2337 | "\n", 2338 | "# Combine all\n", 2339 | "all_labels = lvl0_labels + lvl1_labels + lvl2_labels\n", 2340 | "label_map = {label: i for i, label in enumerate(all_labels)}\n", 2341 | "\n", 2342 | "\n", 2343 | "# ==========================================\n", 2344 | "palette = pc.qualitative.Pastel\n", 2345 | "grey_color = 'lightgrey'\n", 2346 | "grey_link = 'rgba(200, 200, 200, 0.4)'\n", 2347 | "grey_cats = ['Pre-Training', 'Feature Extraction', 'Other']\n", 2348 | "\n", 2349 | "color_map = {}\n", 2350 | "palette_idx = 0\n", 2351 | "\n", 2352 | "# A. Scales\n", 2353 | "for raw_name in raw_lvl0:\n", 2354 | " if raw_name in raw_to_display:\n", 2355 | " color_map[raw_name] = palette[palette_idx % len(palette)]\n", 2356 | " palette_idx += 1\n", 2357 | "\n", 2358 | "# B. Categories\n", 2359 | "for raw_name in raw_lvl1:\n", 2360 | " if raw_name in raw_to_display:\n", 2361 | " if raw_name in grey_cats:\n", 2362 | " color_map[raw_name] = grey_color\n", 2363 | " else:\n", 2364 | " color_map[raw_name] = palette[palette_idx % len(palette)]\n", 2365 | " palette_idx += 1\n", 2366 | "\n", 2367 | "def hex_to_rgba(hex_code, opacity=0.4):\n", 2368 | " if hex_code == 'lightgrey': return grey_link\n", 2369 | " if hex_code.startswith('rgb'): return hex_code.replace(')', f', {opacity})').replace('rgb', 'rgba')\n", 2370 | " h = hex_code.lstrip('#')\n", 2371 | " rgb = tuple(int(h[i:i+2], 16) for i in (0, 2, 4))\n", 2372 | " return f\"rgba({rgb[0]}, {rgb[1]}, {rgb[2]}, {opacity})\"\n", 2373 | "\n", 2374 | "\n", 2375 | "# ==========================================\n", 2376 | "source = []\n", 2377 | "target = []\n", 2378 | "value = []\n", 2379 | "colors = []\n", 2380 | "\n", 2381 | "# --- Flow 1: Scale -> Category ---\n", 2382 | "flow1 = sankey_df.groupby(['Scale', 'Method_Category'])['Weight'].sum().reset_index()\n", 2383 | "\n", 2384 | "for _, row in flow1.iterrows():\n", 2385 | " scale_raw = row['Scale']\n", 2386 | " cat_raw = row['Method_Category']\n", 2387 | " \n", 2388 | " src = raw_to_display.get(scale_raw)\n", 2389 | " tgt = raw_to_display.get(cat_raw)\n", 2390 | " \n", 2391 | " if src in label_map and tgt in label_map:\n", 2392 | " source.append(label_map[src])\n", 2393 | " target.append(label_map[tgt])\n", 2394 | " value.append(row['Weight'])\n", 2395 | " \n", 2396 | " # Color based on Scale raw name\n", 2397 | " base_color = color_map.get(scale_raw, grey_color)\n", 2398 | " colors.append(hex_to_rgba(base_color))\n", 2399 | "\n", 2400 | "# --- Flow 2: Category -> Specific ---\n", 2401 | "flow2 = sankey_df.groupby(['Method_Category', 'Specific_Techniques'])['Weight'].sum().reset_index()\n", 2402 | "\n", 2403 | "for _, row in flow2.iterrows():\n", 2404 | " cat_raw = 
row['Method_Category']\n", 2405 | " tech_raw = row['Specific_Techniques']\n", 2406 | " \n", 2407 | " src = raw_to_display.get(cat_raw)\n", 2408 | " tgt = raw_to_display.get(tech_raw)\n", 2409 | " \n", 2410 | " if src in label_map and tgt in label_map:\n", 2411 | " source.append(label_map[src])\n", 2412 | " target.append(label_map[tgt])\n", 2413 | " value.append(row['Weight'])\n", 2414 | " \n", 2415 | " # Color based on Category raw name\n", 2416 | " base_color = color_map.get(cat_raw, grey_color)\n", 2417 | " colors.append(hex_to_rgba(base_color))\n", 2418 | "\n", 2419 | "\n", 2420 | "# ==========================================\n", 2421 | "node_colors = []\n", 2422 | "# Map specific technique to its parent category raw name\n", 2423 | "tech_to_cat = pd.Series(sankey_df.Method_Category.values, index=sankey_df.Specific_Techniques).to_dict()\n", 2424 | "\n", 2425 | "for l in all_labels:\n", 2426 | " final_color = grey_color\n", 2427 | " \n", 2428 | " # Reverse lookup from raw_to_display\n", 2429 | " raw_key = None\n", 2430 | " for k, v in raw_to_display.items():\n", 2431 | " if v == l:\n", 2432 | " raw_key = k\n", 2433 | " break\n", 2434 | " \n", 2435 | " if raw_key:\n", 2436 | " # Case A: Scale or Category\n", 2437 | " if raw_key in color_map:\n", 2438 | " final_color = color_map[raw_key]\n", 2439 | " # Case B: Specific Technique (Inherit)\n", 2440 | " elif raw_key in tech_to_cat:\n", 2441 | " parent_raw = tech_to_cat[raw_key]\n", 2442 | " final_color = color_map.get(parent_raw, grey_color)\n", 2443 | " \n", 2444 | " node_colors.append(final_color)\n", 2445 | "\n", 2446 | "# ==========================================\n", 2447 | "fig = go.Figure(data=[go.Sankey(\n", 2448 | " arrangement=\"snap\",\n", 2449 | " node=dict(\n", 2450 | " pad=15, thickness=20,\n", 2451 | " line=dict(color=\"black\", width=0.5),\n", 2452 | " label=all_labels,\n", 2453 | " color=node_colors,\n", 2454 | " hovertemplate='%{label}
Weighted Volume: %{value:.2f}'\n", 2455 | " ),\n", 2456 | " link=dict(\n", 2457 | " source=source, target=target, value=value, color=colors\n", 2458 | " )\n", 2459 | ")])\n", 2460 | "\n", 2461 | "fig.update_layout(\n", 2462 | " # Global Font Settings (mimics plt.rcParams[\"font.family\"] = \"serif\")\n", 2463 | " font=dict(\n", 2464 | " family=\"Times New Roman, serif\", \n", 2465 | " size=17, \n", 2466 | " color=\"black\"\n", 2467 | " ),\n", 2468 | " width=1000,\n", 2469 | " height=600,\n", 2470 | " margin=dict(b=60, t=40),\n", 2471 | " \n", 2472 | " annotations=[\n", 2473 | " # Left Column Label\n", 2474 | " dict(\n", 2475 | " x=0,\n", 2476 | " y=-0.1,\n", 2477 | " xref=\"paper\",\n", 2478 | " yref=\"paper\",\n", 2479 | " text=\"Model Scale (S1.3)\",\n", 2480 | " showarrow=False,\n", 2481 | " font=dict(size=20, color=\"black\"), \n", 2482 | " align=\"center\"\n", 2483 | " ),\n", 2484 | " # Right Column Label\n", 2485 | " dict(\n", 2486 | " x=1,\n", 2487 | " y=-0.1,\n", 2488 | " xref=\"paper\",\n", 2489 | " yref=\"paper\",\n", 2490 | " text=\"Adaptation Technique (T2)\",\n", 2491 | " showarrow=False,\n", 2492 | " font=dict(size=20, color=\"black\"),\n", 2493 | " align=\"center\"\n", 2494 | " )\n", 2495 | " ]\n", 2496 | ")\n", 2497 | "\n", 2498 | "fig.show()" 2499 | ] 2500 | } 2501 | ], 2502 | "metadata": { 2503 | "kernelspec": { 2504 | "display_name": "vpn_seg", 2505 | "language": "python", 2506 | "name": "python3" 2507 | }, 2508 | "language_info": { 2509 | "codemirror_mode": { 2510 | "name": "ipython", 2511 | "version": 3 2512 | }, 2513 | "file_extension": ".py", 2514 | "mimetype": "text/x-python", 2515 | "name": "python", 2516 | "nbconvert_exporter": "python", 2517 | "pygments_lexer": "ipython3", 2518 | "version": "3.10.16" 2519 | } 2520 | }, 2521 | "nbformat": 4, 2522 | "nbformat_minor": 5 2523 | } 2524 | --------------------------------------------------------------------------------
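
The notebook cell above boils down to a study-weighted Sankey diagram: every surveyed study contributes a total flow of one, split evenly across its rows, and the flows run from model scale to adaptation category to specific technique. The following is a minimal sketch of that pattern under stated assumptions: the toy `DataFrame`, its citation keys, and the simplified column names (`Category`, `Technique`) are illustrative stand-ins, not data or code from the survey.

```python
# Minimal sketch of the study-weighted Sankey pattern used in
# study_taxonomy_analysis.ipynb. Each study (CitationKey) contributes a total
# weight of 1, split evenly across its rows, so multi-technique studies are
# not over-counted. All values below are toy data for illustration only.
import pandas as pd
import plotly.graph_objects as go

df = pd.DataFrame({
    "CitationKey": ["s1", "s1", "s2", "s3"],
    "Scale":       ["Small", "Small", "Large", "Medium"],
    "Category":    ["Parameter-Efficient Fine-Tuning", "Prompt Engineering",
                    "Prompt Engineering", "Fine-Tuning"],
    "Technique":   ["LoRA", "Few-Shot", "Chain-of-Thought", "Full-Parameter"],
})

# Per-study weight: 1 / number of rows belonging to that study.
df["Weight"] = 1 / df.groupby("CitationKey")["CitationKey"].transform("count")

# Two flows: Scale -> Category and Category -> Technique.
flow1 = df.groupby(["Scale", "Category"])["Weight"].sum().reset_index()
flow2 = df.groupby(["Category", "Technique"])["Weight"].sum().reset_index()

# One shared node index for all three levels.
labels = sorted(set(df["Scale"]) | set(df["Category"]) | set(df["Technique"]))
idx = {label: i for i, label in enumerate(labels)}

source, target, value = [], [], []
for a, b, w in list(flow1.itertuples(index=False)) + list(flow2.itertuples(index=False)):
    source.append(idx[a])
    target.append(idx[b])
    value.append(w)

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=15, thickness=20),
    link=dict(source=source, target=target, value=value),
))
fig.show()
```

The `1 / row-count` weighting mirrors the notebook's `Weight = 1 / Study_Row_Count` normalization, so a study that reports several adaptation techniques still sums to a total flow of one rather than inflating the diagram relative to single-technique studies.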