├── .DS_Store ├── dataset_with_link.md ├── TODO.md ├── README.md ├── script.py ├── graph_based_models.md ├── sequence_based_models.md ├── sequence_based_models ├── years.md ├── tasks.md ├── models.md └── datasets.md └── graph_based_models ├── years.md ├── tasks.md └── datasets.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/codingClaire/Structural-Code-Understanding/HEAD/.DS_Store -------------------------------------------------------------------------------- /dataset_with_link.md: -------------------------------------------------------------------------------- 1 | name: ROMISE 2 | language: 3 | link: http://openscience.us/repo/defect 4 | 5 | name: OJ 6 | language: 7 | link: http://programming.grids.cn/ 8 | 9 | name: notebookcdg 10 | language: 11 | link: https://paperswithcode.com/dataset/notebookcdg 12 | 13 | name: C# 14 | language:C# 15 | link: https://archive.org/details/stackexchange 16 | 17 | -------------------------------------------------------------------------------- /TODO.md: -------------------------------------------------------------------------------- 1 | # Contribute to the repository 2 | 0. Requirements: 3 | `pip install tabulate` 4 | 1. git clone / git pull 5 | 2. Add the new paper's information to [sequence_based_models.md](sequence_based_models.md) in the following format: 6 | 7 | > title: 8 | > year: 9 | > venue: 10 | > task: Code Generation/Code Summarization/Safety Analysis/Program Repair/Program Classification/Others 11 | > model: CNN/Transformer/GNN/ (separate multiple models with commas) 12 | > dataset: check that it does not duplicate an existing dataset under a different spelling 13 | > pdf: 14 | > code: 15 | 16 | 3. Run script.py and check the generated markdown files 17 | 4. git add / commit / push 18 | 19 | 20 | code-change-data:https://drive.google.com/ file/d/1wSl SN17tbATqlhNMO0O7sEkH9gqJ9Vr. 
21 | 22 | 23 | code-comment pairs:A parallel corpus of python functions and documentation strings for automated code documentation and code generation 24 | 25 | 26 | 27 | [Python]https://paperswithcode.com/dataset/100doh 28 | 29 | [python]https://github.com/LiuFang816/SALSTM_py_data 30 | 31 | [C#](https://archive.org/details/stackexchange) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Code Understanding Literatures in Deep Learning 2 | Check Our Survey on [arxiv](https://arxiv.org/abs/2205.01293)! 3 | ## Sequence-based Models 4 | Total: 44 papers 5 | ### [By years](sequence_based_models/years.md) 6 | - 2021:4 paper(s) 7 | - 2020:12 paper(s) 8 | - 2019:7 paper(s) 9 | - 2018:8 paper(s) 10 | - 2017:3 paper(s) 11 | - 2016:6 paper(s) 12 | - 2014:3 paper(s) 13 | - 2012:1 paper(s) 14 | ### [By tasks](sequence_based_models/tasks.md) 15 | - Program Classification :2 paper(s) 16 | - Code Search:3 paper(s) 17 | - Code Generation:19 paper(s) 18 | - Pretrain:4 paper(s) 19 | - Code representation:1 paper(s) 20 | - Safety Analysis:4 paper(s) 21 | - Program Repair :2 paper(s) 22 | - Clone Detection :2 paper(s) 23 | - Code Summarization:7 paper(s) 24 | ### [By datasets](sequence_based_models/datasets.md) 25 | - Java:6 paper(s) 26 | - DeepFix:1 paper(s) 27 | - TFix's Code Patches Data:1 paper(s) 28 | - C:6 paper(s) 29 | - 9714 Java projects from GitHub:1 paper(s) 30 | - Python:2 paper(s) 31 | - JavaScript:1 paper(s) 32 | - JS150:2 paper(s) 33 | - python:1 paper(s) 34 | - Uncategorized:44 paper(s) 35 | - PY150:2 paper(s) 36 | - C#:4 paper(s) 37 | - 10072 Java GitHub repositories:1 paper(s) 38 | - C#(dataset of CodeNN):0 paper(s) 39 | ### [By models](sequence_based_models/models.md) 40 | - N-gram:3 paper(s) 41 | - TreeBERT:1 paper(s) 42 | - Others:1 paper(s) 43 | - word2vec:2 paper(s) 44 | - Multinomial Naive Bayes (MNB) :0 paper(s) 45 | - GRU:3 
paper(s) 46 | - Bi-LSTM:4 paper(s) 47 | - word embedding:1 paper(s) 48 | - Transformer:8 paper(s) 49 | - CRF:1 paper(s) 50 | - DNN:1 paper(s) 51 | - pointer network:1 paper(s) 52 | - DBN:2 paper(s) 53 | - CAN:1 paper(s) 54 | - LSTM:15 paper(s) 55 | - RNN:5 paper(s) 56 | ## Graph-based Models 57 | Total: 36 papers 58 | ### [By years](graph_based_models/years.md) 59 | - 2021:9 paper(s) 60 | - 2020:10 paper(s) 61 | - 2019:9 paper(s) 62 | - 2018:3 paper(s) 63 | - 2017:1 paper(s) 64 | - 2016:1 paper(s) 65 | - 2015:1 paper(s) 66 | - 2014:2 paper(s) 67 | ### [By tasks](graph_based_models/tasks.md) 68 | - Defect Prediction:3 paper(s) 69 | - Code Search:2 paper(s) 70 | - Program Repair:6 paper(s) 71 | - Code Generation:5 paper(s) 72 | - Program Verification:1 paper(s) 73 | - Program Classification:4 paper(s) 74 | - Vulnerability Detection:3 paper(s) 75 | - Clone Detection:8 paper(s) 76 | - Code Summarization:10 paper(s) 77 | ### [By datasets](graph_based_models/datasets.md) 78 | - Java repos collected in this work:1 paper(s) 79 | - Code-Change-Data:1 paper(s) 80 | - Hybrid-DeepCom Dataset:1 paper(s) 81 | - JAVA method naming datasets:1 paper(s) 82 | - ARM binary dataset:1 paper(s) 83 | - Genius Dataset:1 paper(s) 84 | - Google Code Jam (GCJ):0 paper(s) 85 | - notebookcdg:1 paper(s) 86 | - program variables dataset produced in this work:1 paper(s) 87 | - gcc dataset:1 paper(s) 88 | - Python method documentation dataset:1 paper(s) 89 | - JS150:2 paper(s) 90 | - Findutils:1 paper(s) 91 | - Validation dataset:1 paper(s) 92 | - Devign Dataset:1 paper(s) 93 | - C Dataset:1 paper(s) 94 | - Diffutils:1 paper(s) 95 | - OpenCL Dataset:1 paper(s) 96 | - code-comment pairs:1 paper(s) 97 | - BCB:1 paper(s) 98 | - collected in this work:4 paper(s) 99 | - Coreutils:1 paper(s) 100 | - DeepFix dataset:1 paper(s) 101 | - Syntax similar dataset:1 paper(s) 102 | - SPoC:1 paper(s) 103 | - IJDataset2.0:1 paper(s) 104 | - CodeForces dataset:1 paper(s) 105 | - Linux kernel's code collected in this 
work:1 paper(s) 106 | - OJClone:5 paper(s) 107 | - YANCFG Dataset:1 paper(s) 108 | - Defects4J:1 paper(s) 109 | - PY150:3 paper(s) 110 | - iclr18-prog-graphs-dataset:1 paper(s) 111 | - TL-CodeSum:1 paper(s) 112 | - CoCoNet:1 paper(s) 113 | - MSKCFG Dataset:1 paper(s) 114 | - C Program Dataset:1 paper(s) 115 | - Java method-comment pairs:1 paper(s) 116 | - BigCloneBench:2 paper(s) 117 | - Firmware image dataset:1 paper(s) 118 | - C# dataset:2 paper(s) 119 | - CodeSearchNet:2 paper(s) 120 | - Java Dataset collected in this work:1 paper(s) 121 | - C dataset:1 paper(s) 122 | ### [By models](graph_based_models/models.md) 123 | - GINN:1 paper(s) 124 | - Tree-RNN:1 paper(s) 125 | - GRU:3 paper(s) 126 | - Text-associated DeepWalk:1 paper(s) 127 | - Multi-Relational Graph Neural Network:1 paper(s) 128 | - TBCNN:1 paper(s) 129 | - Flow2Vec:1 paper(s) 130 | - LSTM:5 paper(s) 131 | - DGCNN:1 paper(s) 132 | - GNN:11 paper(s) 133 | - Transformer:3 paper(s) 134 | - Structure2vec:1 paper(s) 135 | - MPNN:2 paper(s) 136 | - GAT:3 paper(s) 137 | - CNN:7 paper(s) 138 | - GCN:2 paper(s) 139 | - RNN:3 paper(s) 140 | - Tree-LSTM:2 paper(s) 141 | - tree-based LSTM:1 paper(s) 142 | - Decision tree:1 paper(s) 143 | - HAConvGNN:1 paper(s) 144 | - ConvGNN:2 paper(s) 145 | - GTN:1 paper(s) 146 | - GGNN:8 paper(s) 147 | - CharCNN:1 paper(s) 148 | - Feed-forward neural network:1 paper(s) 149 | - bidirectional RNN:1 paper(s) 150 | - code property graphs:1 paper(s) 151 | - attention:1 paper(s) 152 | - Attention mechanism:1 paper(s) 153 | -------------------------------------------------------------------------------- /script.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import numpy as np 4 | import pandas as pd 5 | from pandas.core.frame import DataFrame 6 | 7 | 8 | def write_data(fname, dic, del_col): 9 | with open(fname, "w", encoding="utf-8") as f: 10 | for k in dic.keys(): 11 | df_k = dic[k].reset_index(drop=True) 12 | 
#del df_k[del_col] 13 | if k == "\n": 14 | k = "Uncategorized\n" 15 | f.writelines("# " + k) 16 | df_k.to_markdown(f) 17 | f.writelines("\n") 18 | 19 | 20 | if __name__ == "__main__": 21 | source_list = ["title", "year", "venue", 22 | "task", "model", "dataset", "pdf", "code"] 23 | names = ['sequence_based_models', 'graph_based_models'] 24 | #names = ["graph_based_models"] 25 | print("#" * 10, "Begin", "#" * 10) 26 | 27 | statistic_dic = { 28 | names[0]: { 29 | "years": {}, 30 | "tasks": {}, 31 | "datasets": {}, 32 | "models": {}, 33 | }, 34 | names[1]: { 35 | "years": {}, 36 | "tasks": {}, 37 | "datasets": {}, 38 | "models": {}, 39 | }, 40 | } 41 | 42 | for name in names: 43 | print("generating " + name + " markdown files!") 44 | 45 | with open(name + ".md", "r", encoding="UTF-8") as f: 46 | content = f.readlines() 47 | lineno = 0 48 | papers = [] 49 | # parse paper from source.md 50 | while lineno < len(content): 51 | if content[lineno][0:5] == "title": # target title 52 | paper = [] 53 | for offset in range(0, len(source_list)): # removeprefix (Python 3.9+) drops only the leading "key:"; str.strip would remove any of those characters from both ends 54 | content[lineno + offset] = content[lineno + offset].removeprefix( 55 | source_list[offset] + ":" 56 | ) 57 | while content[lineno + offset][0] == " ": 58 | content[lineno + offset] = content[lineno + 59 | offset].strip(" ") 60 | paper.append(content[lineno + offset]) 61 | papers.append(paper) 62 | lineno += 1 63 | 64 | # generate dataframe of total paper 65 | papers_df = pd.DataFrame(papers, columns=source_list) 66 | papers_df = papers_df.drop_duplicates(subset=["title"]) 67 | # add pdf and code icon 68 | 69 | papers_df["pdf"] = papers_df["pdf"].map( 70 | lambda x: "[📑](" + x[0:-1] + ")" if x != "\n" else x 71 | ) 72 | papers_df["code"] = papers_df["code"].map( 73 | lambda x: "[:octocat:](" + x[0:-1] + ")" if x != "\n" else x 74 | ) 75 | 76 | # sort by year 77 | year_dic = dict(list(papers_df.groupby(papers_df["year"]))) 78 | year_dic = dict( 79 | sorted(year_dic.items(), key=lambda item: item[0], reverse=True)) 80 | 
write_data(name+"/years.md", year_dic, "year") 81 | for k in year_dic.keys(): 82 | statistic_dic[name]["years"][k] = year_dic[k].shape[0] 83 | print("Finish sort by years!") 84 | 85 | ############ sort by task ############ 86 | total_tasks = [] 87 | task_dic = {} 88 | for task_str in list(papers_df["task"]): 89 | sublist = task_str.split(",") 90 | for element in range(len(sublist)): 91 | while(sublist[element][0] == ' '): 92 | sublist[element] = sublist[element][1:] 93 | if(sublist[element][-1] != '\n'): 94 | sublist[element] = sublist[element]+'\n' 95 | total_tasks.extend(sublist) 96 | 97 | total_tasks = list(set(total_tasks)) 98 | for task in total_tasks: 99 | mask = papers_df["task"].str.contains(task[0:-1], regex=False) # literal match: labels may contain regex metacharacters such as "(" 100 | tasks_df = papers_df[mask] 101 | task_dic[task] = tasks_df 102 | write_data(name + "/tasks.md", task_dic, "task") 103 | for k in task_dic.keys(): 104 | statistic_dic[name]["tasks"][k] = task_dic[k].shape[0] 105 | print("Finish sort by tasks!") 106 | 107 | ############ sort by dataset ############ 108 | total_datasets = [] 109 | dataset_dic = {} 110 | for dataset_str in list(papers_df["dataset"]): 111 | sublist = dataset_str.split(",") 112 | for element in range(len(sublist)): 113 | while(sublist[element][0] == ' '): 114 | sublist[element] = sublist[element][1:] 115 | if(sublist[element][-1] != '\n'): 116 | sublist[element] = sublist[element]+'\n' 117 | total_datasets.extend(sublist) 118 | total_datasets = list(set(total_datasets)) 119 | #print(total_datasets) 120 | for dataset in total_datasets: 121 | #print(dataset[0:-1]) 122 | mask = papers_df["dataset"].str.contains(dataset[0:-1], regex=False) # literal match, e.g. for "Google Code Jam (GCJ)" 123 | datasets_df = papers_df[mask] 124 | dataset_dic[dataset] = datasets_df 125 | 126 | write_data(name + "/datasets.md", dataset_dic, "dataset") 127 | for k in dataset_dic.keys(): 128 | statistic_dic[name]["datasets"][k] = dataset_dic[k].shape[0] 129 | print("Finish sort by datasets!") 130 | 131 | ############ sort by model ############ 132 | total_models = [] 133 | 
model_dic = {} 134 | for model_str in list(papers_df["model"]): 135 | sublist = model_str.split(",") 136 | for element in range(len(sublist)): 137 | while(sublist[element][0] == ' '): 138 | sublist[element] = sublist[element][1:] 139 | if(sublist[element][-1] != '\n'): 140 | sublist[element] = sublist[element]+'\n' 141 | total_models.extend(sublist) 142 | total_models = list(set(total_models)) 143 | for model in total_models: 144 | mask = papers_df["model"].str.contains(model[0:-1], regex=False) # literal match, e.g. for "Multinomial Naive Bayes (MNB)" 145 | models_df = papers_df[mask] 146 | model_dic[model] = models_df 147 | write_data(name + "/models.md", model_dic, "model") 148 | for k in model_dic.keys(): 149 | statistic_dic[name]["models"][k] = model_dic[k].shape[0] 150 | print("Finish sort by models!") 151 | 152 | # Write markdown files 153 | with open("README.md", "w", encoding="utf-8") as f: 154 | categories = ["years", "tasks", "datasets", "models"] 155 | f.writelines("# Code Understanding Literatures in Deep Learning\n") 156 | f.writelines("## Sequence-based Models\n") 157 | total_number=0 158 | for k in statistic_dic[names[0]]["years"].keys(): 159 | total_number+=statistic_dic[names[0]]["years"][k] 160 | f.writelines("Total: "+ str(total_number)+" papers\n") 161 | for category in categories: 162 | f.writelines( 163 | "### [By "+category+"](sequence_based_models/"+category+".md)\n") 164 | for k in statistic_dic[names[0]][category].keys(): 165 | num = str(statistic_dic[names[0]][category][k]) 166 | k = "Uncategorized\n" if k == "\n" else k 167 | f.writelines( 168 | "- " + k[0:-1] + ":" + num + " paper(s)\n" 169 | ) 170 | f.writelines("## Graph-based Models\n") 171 | total_number = 0 172 | for k in statistic_dic[names[1]]["years"].keys(): 173 | total_number += statistic_dic[names[1]]["years"][k] 174 | f.writelines("Total: " + str(total_number)+" papers\n") 175 | for category in categories: 176 | f.writelines( 177 | "### [By "+category+"](graph_based_models/"+category+".md)\n") 178 | for k in 
statistic_dic[names[1]][category].keys(): 179 | num = str(statistic_dic[names[1]][category][k]) 180 | k = "Uncategorized\n" if k == "\n" else k 181 | f.writelines( 182 | "- " + k[0:-1] + ":" + num + " paper(s)\n" 183 | ) 184 | -------------------------------------------------------------------------------- /graph_based_models.md: -------------------------------------------------------------------------------- 1 | title:HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks 2 | year: 2021 3 | venue: EMNLP 4 | task: Code Summarization 5 | model: HAConvGNN 6 | dataset: notebookcdg 7 | pdf: https://arxiv.org/abs/2104.01002 8 | code: https://github.com/xuyeliu/HAConvGNN 9 | 10 | title:Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs 11 | year: 2021 12 | venue: AAAI 13 | task: Code Generation 14 | model: GAT 15 | dataset: JS150, PY150 16 | pdf: http://arxiv.org/abs/2103.09499 17 | code: 18 | 19 | title: CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees 20 | year: 2021 21 | venue: EMNLP 22 | task: Code Summarization 23 | model: RNN,attention 24 | dataset: TL-CodeSum 25 | pdf: http://arxiv.org/abs/2108.12987 26 | code: https://anonymous.4open.science/r/CAST/ 27 | 28 | title: Learning to represent programs with graphs 29 | year: 2018 30 | venue: ICLR 31 | task: Defect Prediction,Code Generation 32 | model: GGNN 33 | dataset: iclr18-prog-graphs-dataset 34 | pdf: https://arxiv.org/abs/1711.00740 35 | code: https://github.com/Microsoft/gated-graph-neural-network-samples 36 | 37 | title: A Novel Neural Source Code Representation Based on Abstract Syntax Tree 38 | year: 2019 39 | venue: ICSE 40 | task: Program Classification, Clone Detection 41 | model: bidirectional RNN 42 | dataset: OJClone,BCB 43 | pdf: https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92 44 | code: 
https://github.com/zhangj1994/astnn 45 | 46 | title: TBCNN: A tree-based convolutional neural network for programming language processing 47 | year: 2014 48 | venue: arxiv 49 | task: Program Classification 50 | model: TBCNN 51 | dataset: OJClone 52 | pdf: https://arxiv.org/abs/1409.5718v1 53 | code: https://sites.google.com/site/treebasedcnn/ 54 | 55 | title: Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST 56 | year: 2019 57 | venue: ACM International Conference on Computing Frontiers 58 | task: Clone Detection,Code Search, Code Summarization 59 | model: tree-based LSTM 60 | dataset:OJClone,BigCloneBench 61 | pdf: https://doi.org/10.1145/3310273.3321560 62 | code: https://github.com/milkfan/TBCAA 63 | 64 | title: Improving automatic source code summarization via deep reinforcement learning 65 | year: 2018 66 | venue: ASE 67 | task: Code Summarization 68 | model: RNN,Tree-RNN 69 | dataset: code-comment pairs 70 | pdf: https://arxiv.org/abs/1811.07234v1 71 | code: 72 | 73 | title: CODIT: Code Editing with Tree-Based Neural Models 74 | year: 2020 75 | venue: IEEE Transactions on Software Engineering 76 | task: Program Repair 77 | model: LSTM 78 | dataset: Defects4J,Code-Change-Data 79 | pdf: http://arxiv.org/abs/1810.00314 80 | code: https://git.io/JJGwU 81 | 82 | title: Gated graph sequence neural networks 83 | year: 2015 84 | venue: ICLR 85 | task: Program Verification 86 | model: GGNN 87 | dataset: program variables dataset produced in this work 88 | pdf: http://arxiv.org/abs/1511.05493 89 | code: https://github.com/yujiali/ggnn 90 | 91 | title: Structured neural summarization 92 | year: 2019 93 | venue: ICLR 94 | task: Code Summarization 95 | model: GGNN 96 | dataset: C# dataset,JAVA method naming datasets, Python method documentation dataset 97 | pdf: https://arxiv.org/abs/1811.01824v4 98 | code: https://github.com/CoderPat/structured-neural-summarization 99 | 100 | title: GGF: A graph-based method for programming language syntax 
error correction 101 | year: 2020 102 | venue: ICPC 103 | task: Program Repair 104 | model: GGNN 105 | dataset: DeepFix dataset,CodeForces dataset 106 | pdf: https://dl.acm.org/doi/10.1145/3387904.3389252 107 | code: 108 | 109 | title: Improved code summarization via a graph neural network 110 | year: 2020 111 | venue: ICPC 112 | task: Code Summarization 113 | model: ConvGNN 114 | dataset: Java method-comment pairs 115 | pdf: https://arxiv.org/abs/2004.02843v2 116 | code: 117 | 118 | title: Code Clone Detection with Hierarchical Attentive Graph Embedding 119 | year: 2021 120 | venue: International Journal of Software Engineering and Knowledge Engineering 121 | task: Clone Detection 122 | model: GCN 123 | dataset: IJDataset2.0 124 | pdf: https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X 125 | code: 126 | 127 | title: Graph-based, Self-Supervised Program Repair from Diagnostic Feedback 128 | year: 2020 129 | venue: ICML 130 | task: Program Repair 131 | model: GAT, LSTM 132 | dataset: SPoC 133 | pdf: http://arxiv.org/abs/2005.10636 134 | code: https://github.com/michiyasunaga/DrRepair 135 | 136 | title: Generative code modeling with graphs 137 | year: 2019 138 | venue: ICLR 139 | task: Program Repair,Code Generation 140 | model: GRU,GGNN 141 | dataset: C# dataset 142 | pdf: https://arxiv.org/abs/1805.08490v2 143 | code: https://github.com/Microsoft/graph-based-code-modelling 144 | 145 | title: Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting 146 | year: 2021 147 | venue: ICPC 148 | task: Code Summarization 149 | model: Tree-LSTM 150 | dataset: CodeSearchNet, Hybrid-DeepCom Dataset 151 | pdf: https://arxiv.org/abs/2103.07845v2 152 | code: https://github.com/XMUDM/BASTS 153 | 154 | title: Neural network-based graph embedding for cross-platform binary code similarity detection 155 | year: 2017 156 | venue: Proceedings of the ACM Conference on Computer and Communications Security 157 | task: Program Repair 158 | model: 
Structure2vec 159 | dataset: collected in this work, Genius Dataset 160 | pdf: http://arxiv.org/abs/1708.06525 161 | code: https://github.com/xiaojunxu/dnn-binary-code-similarity 162 | 163 | title: BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network 164 | year: 2021 165 | venue: ASIA CCS 166 | task: Vulnerability Detection 167 | model: GTN 168 | dataset: Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset 169 | pdf: https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf 170 | code: 171 | 172 | title: DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing 173 | year: 2020 174 | venue: NDSS 175 | task: Clone Detection 176 | model: Text-associated DeepWalk 177 | dataset: Coreutils, Diffutils, Findutils 178 | pdf: https://dx.doi.org/10.14722/ndss.2020.24311 179 | code: https://github.com/deepbindiff/DeepBinDiff 180 | 181 | title: Order matters: Semantic-aware neural networks for binary code similarity detection 182 | year: 2020 183 | venue: AAAI 184 | task: Clone Detection 185 | model: MPNN,CNN 186 | dataset: gcc dataset 187 | pdf: https://ojs.aaai.org/index.php/AAAI/article/view/5466 188 | code: 189 | 190 | title: Learning semantic program embeddings with graph interval neural network 191 | year: 2020 192 | venue: Proceedings of the ACM on Programming Languages 193 | task: Program Repair 194 | model: GINN 195 | dataset: PY150 196 | pdf: https://arxiv.org/abs/2005.09997v2 197 | code: 198 | 199 | title: Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network 200 | year: 2019 201 | venue:Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 202 | task: Defect Prediction 203 | model: DGCNN 204 | dataset: MSKCFG Dataset, YANCFG Dataset 205 | pdf: https://ieeexplore.ieee.org/document/8809504 206 | code: 207 | 208 | title: Semantic Code Clone Detection Via Event Embedding Tree and GAT Network 209 | 
year: 2020 210 | venue: QRS 211 | task: Clone Detection 212 | model: Transformer, GAT, CNN 213 | dataset: OJClone 214 | pdf: https://ieeexplore.ieee.org/document/9282778/ 215 | code: https://github.com/lbzwoaini/CSEM.git 216 | 217 | title: How could Neural Networks understand Programs? 218 | year: 2021 219 | venue: ICML 220 | task: Clone Detection 221 | model: Transformer 222 | dataset: OJClone 223 | pdf: http://arxiv.org/abs/2105.04297 224 | code: https://github.com/pdlan/OSCAR 225 | 226 | title: Multi-modal attention network learning for semantic source code retrieval 227 | year: 2019 228 | venue: ASE 229 | task: Code Search 230 | model: GGNN, Tree-LSTM 231 | dataset: C dataset 232 | pdf: https://arxiv.org/abs/1909.13516v1 233 | code: 234 | 235 | title: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks 236 | year: 2019 237 | venue: NIPS 238 | task: Vulnerability Detection 239 | model: GGNN, GRU, CNN 240 | dataset: Devign Dataset 241 | pdf: https://arxiv.org/abs/1909.03496v1 242 | code: https://sites.google.com/view/devign 243 | 244 | title:Flow2Vec: value-flow-based precise code embedding 245 | year: 2020 246 | venue: Proceedings of the ACM on Programming Languages 247 | task: Code Summarization, Program Classification 248 | model: Flow2Vec 249 | dataset: C Dataset 250 | pdf: https://dl.acm.org/doi/abs/10.1145/3428301 251 | code: 252 | 253 | title:Compiler-based graph representations for deep learning models of code 254 | year: 2020 255 | venue: Proceedings of the 29th International Conference on Compiler Construction 256 | task: Program Classification 257 | model: GGNN 258 | dataset: OpenCL Dataset 259 | pdf: https://doi.org/10.1145/3377555.3377894 260 | code: https://github.com/tud-ccc/learning-compiler-graphs 261 | 262 | title: DeepSim: Deep Learning Code Functional Similarity 263 | year: 2018 264 | venue: ESEC/FSE 265 | task: Clone Detection 266 | model: Feed-forward neural network 267 | dataset: 
Google Code Jam (GCJ), BigCloneBench 268 | pdf: https://doi.org/10.1145/3236024.3236068 269 | code: https://github.com/parasol-aser/deepsim 270 | 271 | title: CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network 272 | year: 2021 273 | venue: arxiv 274 | task: Code Summarization 275 | model: Transformer,Multi-Relational Graph Neural Network 276 | dataset: CodeSearchNet, CoCoNet 277 | pdf: https://arxiv.org/abs/2107.01933v1 278 | code: 279 | 280 | title: Improving bug detection via context-based code representation learning and attention-based neural networks 281 | year: 2019 282 | venue: Proceedings of the ACM on Programming Languages 283 | task: Defect Prediction 284 | model: GRU, CNN, Attention mechanism 285 | dataset: Java Dataset collected in this work 286 | pdf: https://dl.acm.org/doi/abs/10.1145/3360588 287 | code: 288 | 289 | title: Modeling and discovering vulnerabilities with code property graphs 290 | year: 2014 291 | venue: Proceedings IEEE Symposium on Security and Privacy 292 | task: Vulnerability Detection 293 | model: code property graphs 294 | dataset: Linux kernel's code collected in this work 295 | pdf: http://ieeexplore.ieee.org/document/6956589/ 296 | code: 297 | 298 | title: Retrieval-Augmented Generation for Code Summarization via Hybrid GNN 299 | year: 2021 300 | venue: ICLR 301 | task: Code Summarization 302 | model: GNN 303 | dataset: C Program Dataset 304 | pdf: https://arxiv.org/abs/2006.05405v5 305 | code: https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization 306 | 307 | title: Probabilistic model for code with decision trees 308 | year: 2016 309 | venue: SIGPLAN 310 | task: Code Generation 311 | model: Decision tree 312 | dataset: PY150, JS150 313 | pdf: https://dl.acm.org/doi/10.1145/2983990.2984041 314 | code: 315 | 316 | title: Open vocabulary learning on source code with a graph-structured cache 317 | year: 2019 318 | venue: ICML 319 | task: Code Generation 320 | model: MPNN,CharCNN 321 
| dataset: Java repos collected in this work 322 | pdf: https://arxiv.org/abs/1810.08305v2 323 | code: https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary 324 | -------------------------------------------------------------------------------- /sequence_based_models.md: -------------------------------------------------------------------------------- 1 | title: On the naturalness of software 2 | year: 2012 3 | venue: None 4 | task: Code Generation 5 | model: N-gram 6 | dataset: 7 | pdf: https://dl.acm.org/doi/pdf/10.1145/2902362 8 | code: 9 | 10 | title: On the localness of software 11 | year: 2014 12 | venue: FSE/ESEC 13 | task: Code Generation 14 | model: N-gram 15 | dataset: 16 | pdf: https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 17 | code: 18 | 19 | title: Phrase-Based Statistical Translation of Programming Languages 20 | year: 2014 21 | venue:OOPSLA 22 | task: Code Generation 23 | model: N-gram 24 | dataset: 25 | pdf:https://files.sri.inf.ethz.ch/website/papers/onward14.pdf 26 | code: 27 | 28 | title: A convolutional attention network for extreme summarization of source code 29 | year: 2016 30 | venue: ICML 31 | task: Code Summarization 32 | model: CAN 33 | dataset: Java 34 | pdf: http://proceedings.mlr.press/v48/allamanis16.html 35 | code: https://github.com/mast-group/convolutional-attention 36 | 37 | title: Code completion with statistical language models 38 | year: 2014 39 | venue:PLDI 40 | task: Code Generation 41 | model: RNN 42 | dataset: 43 | pdf:https://dl.acm.org/doi/pdf/10.1145/2594291.2594321 44 | code: 45 | 46 | title: Neural Code Comprehension: A Learnable Representation of Code Semantics 47 | year: 2018 48 | venue:NeurIPS 49 | task: Code representation 50 | model: RNN 51 | dataset: 52 | pdf: https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html 53 | code: 54 | 55 | title: A deep language model for software code 56 | year: 2016 57 | venue: None 58 | task: Code Generation 59 | 
model: LSTM 60 | dataset: 61 | pdf: https://arxiv.org/pdf/1608.02715 62 | code: 63 | 64 | title: Summarizing Source Code using a Neural Attention Model 65 | year: 2016 66 | venue: ACL 67 | task: Code Summarization 68 | model: LSTM 69 | dataset: C# 70 | pdf: https://aclanthology.org/P16-1195.pdf 71 | code: https://github.com/sriniiyer/codenn 72 | 73 | title: Latent Attention For If-Then Program Synthesis 74 | year: 2016 75 | venue:NeurIPS 76 | task: Code Generation 77 | model: Bi-LSTM 78 | dataset: 79 | pdf: https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf 80 | code: 81 | 82 | title: Abstract Syntax Networks for Code Generation and Semantic Parsing 83 | year: 2016 84 | venue:ACL 85 | task: Code Generation 86 | model: LSTM 87 | dataset: 88 | pdf: https://arxiv.org/pdf/1704.07535 89 | code: 90 | 91 | title: CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling 92 | year: 2020 93 | venue: IST 94 | task: Code Generation 95 | model: GRU 96 | dataset: 97 | pdf: https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE 98 | code: 99 | 100 | title: A transformer-based approach for source code summarization 101 | year: 2020 102 | venue: ACL 103 | task: Code Summarization 104 | model: Transformer 105 | dataset: 106 | pdf: https://arxiv.org/abs/2005.00653 107 | code:https://github.com/wasiahmad/NeuralCodeSum 108 | 109 | title: CodeBERT: A Pre-Trained Model for Programming and Natural Languages 110 | year: 2020 111 | venue:EMNLP 112 | task: Pretrain 113 | model: Transformer 114 | dataset: 115 | pdf: https://arxiv.org/pdf/2002.08155.pdf 116 | code: https://github.com/microsoft/CodeBERT 117 | 118 | title: Learning and Evaluating Contextual Embedding of Source Code 119 | year: 2020 120 | venue:ICML 121 | task: Pretrain 122 | model: Transformer 123 | dataset: 124 | pdf: 
https://proceedings.mlr.press/v119/kanade20a.html 125 | code: 126 | 127 | title: Learning and Evaluating Contextual Embedding of Source Code 128 | year: 2020 129 | venue:FSE/ESEC 130 | task: Pretrain 131 | model: Transformer 132 | dataset: 133 | pdf: https://dl.acm.org/doi/pdf/10.1145/3368089.3417058 134 | code: 135 | 136 | title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation 137 | year: 2021 138 | venue:EMNLP 139 | task: Pretrain 140 | model: Transformer 141 | dataset: 142 | pdf: https://arxiv.org/pdf/2109.00859 143 | code: 144 | 145 | title: A general path-based representation for predicting program properties 146 | year: 2018 147 | venue: PLDI 148 | task: Code Generation 149 | model: word2vec,CRF 150 | dataset: JavaScript, Java, Python, C# 151 | pdf: https://dl.acm.org/doi/pdf/10.1145/3296979.3192412 152 | code: 153 | 154 | title: Exploring API embedding for API usages and applications 155 | year: 2017 156 | venue: ICSE 157 | task: Code Generation 158 | model: word2vec 159 | dataset: Java, C# 160 | pdf: https://ieeexplore.ieee.org/abstract/document/7985683 161 | code: 162 | 163 | title: Automatically learning semantic features for defect prediction 164 | year: 2016 165 | venue: ICSE 166 | task: Safety Analysis 167 | model: DBN 168 | dataset: 169 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912 170 | code: 171 | 172 | title: Deep Semantic Feature Learning for Software Defect Prediction 173 | year: 2020 174 | venue: TSE 175 | task: Safety Analysis 176 | model: DBN 177 | dataset: 178 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853 179 | code: 180 | 181 | title: Neural Code Completion 182 | year: 2018 183 | venue: ICPC 184 | task: Code Generation 185 | model: LSTM 186 | dataset: JS150,PY150 187 | pdf: https://openreview.net/pdf?id=rJbPBt9lg 188 | code: 189 | 190 | title: Code Completion with Neural Attention and Pointer Networks 191 | year: 2018 192 | venue: 
IJCAI 193 | task: Code Generation 194 | model: LSTM,pointer network 195 | dataset: JS150,PY150 196 | pdf: https://ieeexplore.ieee.org/abstract/document/7985683 197 | code:https://github.com/jack57lee/neuralCodeCompletion 198 | 199 | title: Deep code comment generation 200 | year: 2018 201 | venue: ICPC 202 | task: Code Summarization 203 | model: LSTM 204 | dataset: 205 | pdf: https://ieeexplore.ieee.org/abstract/document/8973050 206 | code:https://github.com/LRNavin/AutoComments 207 | 208 | title: Code2vec: learning distributed representations of code 209 | year: 2019 210 | venue: POPL 211 | task: Code Generation 212 | model: LSTM 213 | dataset: 10072 Java GitHub repositories 214 | pdf: https://arxiv.org/pdf/1803.09473 215 | code: https://github.com/tech-srl/code2vec 216 | 217 | title: Seml: A semantic lstm model for software defect prediction 218 | year: 2019 219 | venue: None 220 | task: Safety Analysis 221 | model: LSTM 222 | dataset: 223 | pdf: https://ieeexplore.ieee.org/abstract/document/8747001 224 | code: 225 | 226 | title: Modeling programs hierarchically with stack-augmented LSTM 227 | year: 2020 228 | venue: JSS 229 | task: Code Generation 230 | model: LSTM 231 | dataset: C, python 232 | pdf: https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ 233 | code: 234 | 235 | title: Code2seq: Generating Sequences from Structured Representations of Code 236 | year: 2019 237 | venue: ICLR 238 | task: Code Generation 239 | model: Bi-LSTM 240 | dataset: Java, C#(dataset of CodeNN) 241 | pdf: https://arxiv.org/pdf/1808.01400 242 | code: https://github.com/tech-srl/code2seq 243 | 244 | title: DeepCPDP: Deep Learning Based Cross-Project Defect Prediction 245 | year: 2019 246 | venue: 247 | task: Safety Analysis 248 | model: Bi-LSTM 249 | dataset: 250 | pdf: https://ieeexplore.ieee.org/abstract/document/8937501/ 251 | code: 252 | 253 | title: Pythia: 
AI-assisted Code Completion System 254 | year: 2019 255 | venue: SIGKDD 256 | task: Code Generation 257 | model: Bi-LSTM 258 | dataset: Python 259 | pdf: https://dl.acm.org/doi/pdf/10.1145/3292500.3330699 260 | code: https://github.com/Microsoft/PTVS 261 | 262 | title: A neural model for generating natural language summaries of program subroutines(astted-gru) 263 | year: 2019 264 | venue: ICSE 265 | task: Code Summarization 266 | model: GRU 267 | dataset: 268 | pdf: https://arxiv.org/pdf/1902.01954v1.pdf 269 | code: https://github.com/mcmillco/funcom 270 | 271 | title: Deep code comment generation with hybrid lexical and syntactical information 272 | year: 2020 273 | venue: FSE/EFEC 274 | task: Code Summarization 275 | model: GRU 276 | dataset: 9714 Java projects from GitHub 277 | pdf: https://link.springer.com/article/10.1007/s10664-019-09730-9 278 | code: https://github.com/Rick-Feng-u/Deep-code-comment-generation 279 | 280 | title:TreeBERT: A Tree-Based Pre-Trained Model for Programming Language 281 | year:2021 282 | venue:UAI 283 | task: Pretrain 284 | model: TreeBERT 285 | dataset: 286 | pdf: https://arxiv.org/abs/2105.12485 287 | code: https://github.com/17385/TreeBERT 288 | 289 | title: Structural language models of code 290 | year: 2020 291 | venue: ICML 292 | task: Code Generation 293 | model: Transformer 294 | dataset: 295 | pdf: https://proceedings.mlr.press/v119/alon20a.html 296 | code: 297 | 298 | title: Code prediction by Feeding Trees to Transformers 299 | year: 2021 300 | venue: ICSE 301 | task: Code Generation 302 | model: Transformer 303 | dataset: 304 | pdf: https://dl.acm.org/doi/pdf/10.1145/3387904.3389261 305 | 306 | title: A self-attentional neural architecture for code completion with multi-task learning 307 | year: 2020 308 | venue: ICPC 309 | task: Code Generation 310 | model: Transformer 311 | dataset: 312 | pdf: https://ieeexplore.ieee.org/abstract/document/9402114 313 | code: 314 | 315 | title: Retrieval-based Neural Source Code 
Summarization 316 | year: 2020 317 | venue: ICSE 318 | task: Code Summarization 319 | model: Others 320 | dataset: 321 | pdf: https://ieeexplore.ieee.org/abstract/document/9284039 322 | code: 323 | 324 | title: Retrieval on Source Code: A Neural Code Search 325 | year: 2018 326 | venue: PLDI 327 | task: Code Search 328 | model: word embedding 329 | dataset: 330 | pdf: https://ieeexplore.ieee.org/abstract/document/9284039 331 | code: 332 | 333 | title: Deep code search 334 | year: 2018 335 | venue: ICSE 336 | task: Code Search 337 | model: RNN 338 | dataset: 339 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172 340 | code: 341 | 342 | title: Improving Code Search with Co-Attentive Representation Learning 343 | year: 2020 344 | venue: ICPC 345 | task: Code Search 346 | model: RNN 347 | dataset: 348 | pdf: https://dl.acm.org/doi/pdf/10.1145/3387904.3389269 349 | code: 350 | 351 | title: Cclearner: A deep learning-based clone detection approach 352 | year: 2017 353 | venue: ICSME 354 | task: Clone Detection 355 | model: DNN 356 | dataset: 357 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426 358 | code: 359 | 360 | title: Deep learning code fragments for code clone detection 361 | year: 2017 362 | venue: ASE 363 | task: Clone Detection 364 | model: RNN 365 | dataset: 366 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1 367 | code: 368 | 369 | title: Neural Program Repair by Jointly Learning to Localize and Repair 370 | year: 2019 371 | venue: ICLR 372 | task: Program Repair 373 | model: LSTM 374 | dataset: DeepFix 375 | pdf: https://arxiv.org/pdf/1904.01720 376 | code:https://github.com/mdrafiqulrabin/SIVAND 377 | 378 | title: TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer 379 | year: 2021 380 | venue: ICML 381 | task: Program Repair 382 | model: Transformer 383 | dataset: TFix's Code Patches Data 384 | pdf: https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf 385 | 
code:https://github.com/eth-sri/TFix 386 | 387 | title: Embedding Java Classes with code2vec: Improvements from Variable Obfuscation 388 | year: 2020 389 | venue: 390 | task: Program Classification 391 | model: LSTM 392 | dataset: 393 | pdf: https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf 394 | code:https://github.com/eth-sri/TFix 395 | 396 | title: SCC: Automatic Classification of Code Snippets 397 | year: 2018 398 | venue: 399 | task: Program Classification 400 | model: Multinomial Naive Bayes (MNB) 401 | dataset: 402 | pdf: https://arxiv.org/pdf/1809.07945v1.pdf 403 | code:https://github.com/mindscan-de/FluentGenesis-Classifier -------------------------------------------------------------------------------- /sequence_based_models/years.md: -------------------------------------------------------------------------------- 1 | # 2021 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:----------------|:------------|:-------------------------|:-------------------------------------------------------------------|:-----------------------------------------------| 4 | | 0 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation | 2021 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2109.00859) | | 5 | | 1 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language | 2021 | UAI | Pretrain | TreeBERT | | [📑](https://arxiv.org/abs/2105.12485) | [:octocat:](https://github.com/17385/TreeBERT) | 6 | | 2 | Code prediction by Feeding Trees to Transformers | 2021 | ICSE | Code Generation | Transformer | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261) | | 7 | | 3 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | 
[📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 8 | # 2020 9 | | | title | year | venue | task | model | dataset | pdf | code | 10 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:-----------------------|:------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------| 11 | | 0 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling | 2020 | IST | Code Generation | GRU | | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) | | 12 | | 1 | A transformer-based approach for source code summarization | 2020 | ACL | Code Summarization | Transformer | | [📑](https://arxiv.org/abs/2005.00653 ) | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum) | 13 | | 2 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 2020 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2002.08155.pdf ) | [:octocat:](https://github.com/microsoft/CodeBERT) | 14 | | 3 | Learning and Evaluating Contextual Embedding of Source Code | 2020 | ICML | Pretrain | Transformer | | [📑](https://proceedings.mlr.press/v119/kanade20a.html ) | | 15 | | 4 | Deep Semantic Feature Learning for Software Defect Prediction | 2020 | TSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) | | 16 | | 5 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | 
[📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 17 | | 6 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 18 | | 7 | Structural language models of code | 2020 | ICML | Code Generation | Transformer | | [📑](https://proceedings.mlr.press/v119/alon20a.html ) | | 19 | | 8 | A self-attentional neural architecture for code completion with multi-task learning | 2020 | ICPC | Code Generation | Transformer | | [📑](https://ieeexplore.ieee.org/abstract/document/9402114) | | 20 | | 9 | Retrieval-based Neural Source Code Summarization | 2020 | ICSE | Code Summarization | Others | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 21 | | 10 | Improving Code Search with Co-Attentive Representation Learning | 2020 | ICPC | Code Search | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269) | | 22 | | 11 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation | 2020 | | Program Classification | LSTM | | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 23 | # 2019 24 | | | title | year | venue | task | model | dataset | pdf | code | 25 | |---:|:--------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:-------------------------------|:-------------------------------------------------------------|:------------------------------------------------------| 26 | | 0 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub repositories | 
[📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 27 | | 1 | Seml: A semantic lstm model for software defect prediction | 2019 | None | Safety Analysis | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8747001) | | 28 | | 2 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 29 | | 3 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction | 2019 | | Safety Analysis | Bi-LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) | | 30 | | 4 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 31 | | 5 | A neural model for generating natural language summaries of program subroutines(astted-gru) | 2019 | ICSE | Code Summarization | GRU | | [📑](https://arxiv.org/pdf/1902.01954v1.pdf) | [:octocat:](https://github.com/mcmillco/funcom) | 32 | | 6 | Neural Program Repair by Jointly Learning to Localize and Repair | 2019 | ICLR | Program Repair | LSTM | DeepFix | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) | 33 | # 2018 34 | | | title | year | venue | task | model | dataset | pdf | code | 35 | |---:|:------------------------------------------------------------------------|-------:|:--------|:-----------------------|:------------------------------|:-----------------------------|:-----------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------| 36 | | 0 | Neural Code Comprehension: A Learnable Representation of Code Semantics | 2018 | NeurIPS | Code representation | RNN | | 
[📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html ) | | 37 | | 1 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 38 | | 2 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 39 | | 3 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 40 | | 4 | Deep code comment generation | 2018 | ICPC | Code Summarization | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8973050) | [:octocat:](https://github.com/LRNavin/AutoComments) | 41 | | 5 | Retrieval on Source Code: A Neural Code Search | 2018 | PLDI | Code Search | word embedding | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 42 | | 6 | Deep code search | 2018 | ICSE | Code Search | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172) | | 43 | | 7 | SCC: Automatic Classification of Code Snippets | 2018 | | Program Classification | Multinomial Naive Bayes (MNB) | | [📑](https://arxiv.org/pdf/1809.07945v1.pdf) | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifier) | 44 | # 2017 45 | | | title | year | venue | task | model | dataset | pdf | code | 46 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:---------|:----------|:-----------------------------------------------------------------------------|:-------| 47 | | 0 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 48 | | 1 | Cclearner: A 
deep learning-based clone detection approach | 2017 | ICSME | Clone Detection | DNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426) | | 49 | | 2 | Deep learning code fragments for code clone detection | 2017 | ASE | Clone Detection | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) | | 50 | # 2016 51 | | | title | year | venue | task | model | dataset | pdf | code | 52 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------| 53 | | 0 | A convolutional attention network for extreme summarization of source code | 2016 | ICML | Code Summarization | CAN | Java | [📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) | 54 | | 1 | A deep language model for software code | 2016 | None | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1608.02715 ) | | 55 | | 2 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 56 | | 3 | Latent Attention For If-Then Program Synthesis | 2016 | NeurIPS | Code Generation | Bi-LSTM | | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf ) | | 57 | | 4 | Abstract Syntax Networks for Code Generation and Semantic Parsing | 2016 | ACL | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1704.07535 ) | | 58 | | 5 | Automatically learning semantic features for defect prediction | 2016 | ICSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) | | 59 | # 2014 60 | | | title | year | venue | task | model | dataset | pdf | code | 
61 | |---:|:--------------------------------------------------------------|-------:|:---------|:----------------|:--------|:----------|:------------------------------------------------------------------|:-------| 62 | | 0 | On the localness of software | 2014 | FSE/ESEC | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 ) | | 63 | | 1 | Phrase-Based Statistical Translation of Programming Languages | 2014 | OOPSLA | Code Generation | N-gram | | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf ) | | 64 | | 2 | Code completion with statistical language models | 2014 | PLDI | Code Generation | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321 ) | | 65 | # 2012 66 | | | title | year | venue | task | model | dataset | pdf | code | 67 | |---:|:-------------------------------|-------:|:--------|:----------------|:--------|:----------|:--------------------------------------------------|:-------| 68 | | 0 | On the naturalness of software | 2012 | None | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 ) | | 69 | -------------------------------------------------------------------------------- /sequence_based_models/tasks.md: -------------------------------------------------------------------------------- 1 | # Program Classification 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:-----------------------------------------------------------------------------|-------:|:--------|:-----------------------|:------------------------------|:----------|:-------------------------------------------------------------------|:--------------------------------------------------------------------| 4 | | 0 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation | 2020 | | Program Classification | LSTM | | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 5 | | 1 | SCC: Automatic 
Classification of Code Snippets | 2018 | | Program Classification | Multinomial Naive Bayes (MNB) | | [📑](https://arxiv.org/pdf/1809.07945v1.pdf) | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifier) | 6 | # Code Search 7 | | | title | year | venue | task | model | dataset | pdf | code | 8 | |---:|:----------------------------------------------------------------|-------:|:--------|:------------|:---------------|:----------|:-----------------------------------------------------------------------|:-------| 9 | | 0 | Retrieval on Source Code: A Neural Code Search | 2018 | PLDI | Code Search | word embedding | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 10 | | 1 | Deep code search | 2018 | ICSE | Code Search | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172) | | 11 | | 2 | Improving Code Search with Co-Attentive Representation Learning | 2020 | ICPC | Code Search | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269) | | 12 | # Code Generation 13 | | | title | year | venue | task | model | dataset | pdf | code | 14 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:----------------|:---------------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------| 15 | | 0 | On the naturalness of software | 2012 | None | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 ) | | 16 | | 1 | On the localness of software | 2014 | FSE/ESEC | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 ) | | 17 | | 2 | Phrase-Based Statistical Translation of Programming Languages | 2014 | OOPSLA | Code Generation | N-gram | | 
[📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf ) | | 18 | | 3 | Code completion with statistical language models | 2014 | PLDI | Code Generation | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321 ) | | 19 | | 4 | A deep language model for software code | 2016 | None | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1608.02715 ) | | 20 | | 5 | Latent Attention For If-Then Program Synthesis | 2016 | NeurIPS | Code Generation | Bi-LSTM | | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf ) | | 21 | | 6 | Abstract Syntax Networks for Code Generation and Semantic Parsing | 2016 | ACL | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1704.07535 ) | | 22 | | 7 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling | 2020 | IST | Code Generation | GRU | | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) | | 23 | | 8 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 24 | | 9 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 25 | | 10 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 26 | | 11 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 27 | | 12 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub 
repositories | [📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 28 | | 13 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 29 | | 14 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 30 | | 15 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 31 | | 16 | Structural language models of code | 2020 | ICML | Code Generation | Transformer | | [📑](https://proceedings.mlr.press/v119/alon20a.html ) | | 32 | | 17 | Code prediction by Feeding Trees to Transformers | 2021 | ICSE | Code Generation | Transformer | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261) | | 33 | | 18 | A self-attentional neural architecture for code completion with multi-task learning | 2020 | ICPC | Code Generation | Transformer | | [📑](https://ieeexplore.ieee.org/abstract/document/9402114) | | 34 | # Pretrain 35 | | | title | year | venue | task | model | dataset | pdf | code | 36 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:---------|:------------|:----------|:----------------------------------------------------------|:---------------------------------------------------| 37 | | 0 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 2020 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2002.08155.pdf ) | 
[:octocat:](https://github.com/microsoft/CodeBERT) | 38 | | 1 | Learning and Evaluating Contextual Embedding of Source Code | 2020 | ICML | Pretrain | Transformer | | [📑](https://proceedings.mlr.press/v119/kanade20a.html ) | | 39 | | 2 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation | 2021 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2109.00859) | | 40 | | 3 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language | 2021 | UAI | Pretrain | TreeBERT | | [📑](https://arxiv.org/abs/2105.12485) | [:octocat:](https://github.com/17385/TreeBERT) | 41 | # Code representation 42 | | | title | year | venue | task | model | dataset | pdf | code | 43 | |---:|:------------------------------------------------------------------------|-------:|:--------|:--------------------|:--------|:----------|:-----------------------------------------------------------------------------------------------------|:-------| 44 | | 0 | Neural Code Comprehension: A Learnable Representation of Code Semantics | 2018 | NeurIPS | Code representation | RNN | | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html ) | | 45 | # Safety Analysis 46 | | | title | year | venue | task | model | dataset | pdf | code | 47 | |---:|:---------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------| 48 | | 0 | Automatically learning semantic features for defect prediction | 2016 | ICSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) | | 49 | | 1 | Deep Semantic Feature Learning for Software Defect Prediction | 2020 | TSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) | | 50 | | 2 | Seml: A semantic lstm model for software defect prediction | 2019 | 
None | Safety Analysis | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8747001) | | 51 | | 3 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction | 2019 | | Safety Analysis | Bi-LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) | | 52 | # Program Repair 53 | | | title | year | venue | task | model | dataset | pdf | code | 54 | |---:|:--------------------------------------------------------------------|-------:|:--------|:---------------|:------------|:-------------------------|:-------------------------------------------------------------------|:------------------------------------------------------| 55 | | 0 | Neural Program Repair by Jointly Learning to Localize and Repair | 2019 | ICLR | Program Repair | LSTM | DeepFix | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) | 56 | | 1 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 57 | # Clone Detection 58 | | | title | year | venue | task | model | dataset | pdf | code | 59 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------------|:-------| 60 | | 0 | Cclearner: A deep learning-based clone detection approach | 2017 | ICSME | Clone Detection | DNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426) | | 61 | | 1 | Deep learning code fragments for code clone detection | 2017 | ASE | Clone Detection | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) | | 62 | # Code Summarization 63 | | | title | year | venue | task | model | dataset | pdf | code | 64 | 
|---:|:--------------------------------------------------------------------------------------------|-------:|:---------|:-------------------|:------------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------| 65 | | 0 | A convolutional attention network for extreme summarization of source code | 2016 | ICML | Code Summarization | CAN | Java | [📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) | 66 | | 1 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 67 | | 2 | A transformer-based approach for source code summarization | 2020 | ACL | Code Summarization | Transformer | | [📑](https://arxiv.org/abs/2005.00653 ) | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum) | 68 | | 3 | Deep code comment generation | 2018 | ICPC | Code Summarization | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8973050) | [:octocat:](https://github.com/LRNavin/AutoComments) | 69 | | 4 | A neural model for generating natural language summaries of program subroutines(astted-gru) | 2019 | ICSE | Code Summarization | GRU | | [📑](https://arxiv.org/pdf/1902.01954v1.pdf) | [:octocat:](https://github.com/mcmillco/funcom) | 70 | | 5 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 71 | | 6 | Retrieval-based Neural Source Code Summarization | 2020 | ICSE | Code Summarization | Others | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 72 | 
-------------------------------------------------------------------------------- /graph_based_models/years.md: -------------------------------------------------------------------------------- 1 | # 2021 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:------------------------|:--------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:------------------------------------------------------------------------------------| 4 | | 0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks | 2021 | EMNLP | Code Summarization | HAConvGNN | notebookcdg | [📑](https://arxiv.org/abs/2104.01002) | [:octocat:](https://github.com/xuyeliu/HAConvGNN) | 5 | | 1 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs | 2021 | AAAI | Code Generation | GAT | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499) | | 6 | | 2 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees | 2021 | EMNLP | Code Summarization | RNN,attention | TL-CodeSum | [📑](http://arxiv.org/abs/2108.12987) | [:octocat:](https://anonymous.4open.science/r/CAST/) | 7 | | 3 | Code Clone Detection with Hierarchical Attentive Graph Embedding | 2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection | GCN | IJDataset2.0 | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X) | | 8 | | 4 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting | 2021 | ICPC | Code Summarization | Tree-LSTM | CodeSearchNet, 
Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) | 9 | | 5 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 10 | | 6 | How could Neural Networks understand Programs? | 2021 | ICML | Clone Detection | Transformer | OJClone | [📑](http://arxiv.org/abs/2105.04297) | [:octocat:](https://github.com/pdlan/OSCAR) | 11 | | 7 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network | 2021 | arxiv | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet | [📑](https://arxiv.org/abs/2107.01933v1) | | 12 | | 8 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN | 2021 | ICLR | Code Summarization | GNN | C Program Dataset | [📑](https://arxiv.org/abs/2006.05405v5) | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) | 13 | # 2020 14 | | | title | year | venue | task | model | dataset | pdf | code | 15 | |---:|:-----------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------------------|:-------------------------|:-----------------------------------|:------------------------------------------------------------|:--------------------------------------------------------------| 16 | | 0 | CODIT: Code Editing with Tree-Based Neural Models | 2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) | 17 | | 1 | GGF: A graph-based method for programming language syntax error correction | 2020 | ICPC | 
Program Repair | GGNN | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) | | 18 | | 2 | Improved code summarization via a graph neural network | 2020 | ICPC | Code Summarization | ConvGNN | Java method-comment pairs | [📑](https://arxiv.org/abs/2004.02843v2) | | 19 | | 3 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback | 2020 | ICML | Program Repair | GAT, LSTM | SPoC | [📑](http://arxiv.org/abs/2005.10636) | [:octocat:](https://github.com/michiyasunaga/DrRepair) | 20 | | 4 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing | 2020 | NDSS | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) | 21 | | 5 | Order matters: Semantic-aware neural networks for binary code similarity detection | 2020 | AAAI | Clone Detection | MPNN,CNN | gcc dataset | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466) | | 22 | | 6 | Learning semantic program embeddings with graph interval neural network | 2020 | Proceedings of the ACM on Programming Languages | Program Repair | GINN | PY150 | [📑](https://arxiv.org/abs/2005.09997v2) | | 23 | | 7 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network | 2020 | QRS | Clone Detection | Transformer, GAT, CNN | OJClone | [📑](https://ieeexplore.ieee.org/document/9282778/) | [:octocat:](https://github.com/lbzwoaini/CSEM.git) | 24 | | 8 | Flow2Vec: value-flow-based precise code embedding | 2020 | Proceedings of the ACM on Programming Languages | Code Summarization, Program Classification | Flow2Vec | C Dataset | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) | | 25 | | 9 | Compiler-based graph representations for deep learning models of code | 2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification | GGNN | OpenCL Dataset | 
[📑](https://doi.org/10.1145/3377555.3377894) | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) | 26 | # 2019 27 | | | title | year | venue | task | model | dataset | pdf | code | 28 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------------------------------------|:------------------------------|:----------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------| 29 | | 0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree | 2019 | ICSE | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) | 30 | | 1 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 31 | | 2 | Structured neural summarization | 2019 | ICLR | Code Summarization | GGNN | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) | 32 | | 3 | Generative code modeling with graphs | 2019 | ICLR | Program Repair,Code Generation | GRU,GGNN | C# dataset | [📑](https://arxiv.org/abs/1805.08490v2) | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling) | 33 | | 4 | Classifying Malware Represented as Control Flow Graphs using 
Deep Graph Convolutional Neural Network | 2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | DGCNN | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) | | 34 | | 5 | Multi-modal attention network learning for semantic source code retrieval | 2019 | ASE | Code Search | GGNN, Tree-LSTM | C dataset | [📑](https://arxiv.org/abs/1909.13516v1) | | 35 | | 6 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks | 2019 | NIPS | Vulnerability Detection | GGNN, GRU, CNN | Devign Dataset | [📑](https://arxiv.org/abs/1909.03496v1) | [:octocat:](https://sites.google.com/view/devign) | 36 | | 7 | Improving bug detection via context-based code representation learning and attention-based neural networks | 2019 | Proceedings of the ACM on Programming Languages | Defect Prediction | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588) | | 37 | | 8 | Open vocabulary learning on source code with a graph-structured caches | 2019 | ICML | Code Generation | MPNN,CharCNN | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2) | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) | 38 | # 2018 39 | | | title | year | venue | task | model | dataset | pdf | code | 40 | |---:|:------------------------------------------------------------------------------|-------:|:---------|:----------------------------------|:----------------------------|:-------------------------------------|:----------------------------------------------|:-----------------------------------------------------------------------------| 41 | | 0 | Learning to represent programs with graphs | 2018 | ICLR | Defect Prediction,Code Generation | GGNN | iclr18-prog-graphs-dataset | [📑](https://arxiv.org/abs/1711.00740) | 
[:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) | 42 | | 1 | Improving automatic source code summarization via deep reinforcement learning | 2018 | ASE | Code Summarization | RNN,Tree-RNN | code-comment pairs | [📑](https://arxiv.org/abs/1811.07234v1) | | 43 | | 2 | DeepSim: Deep Learning Code Functional Similarity | 2018 | ESEC/FSE | Clone Detection | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068) | [:octocat:](https://github.com/parasol-aser/deepsim) | 44 | # 2017 45 | | | title | year | venue | task | model | dataset | pdf | code | 46 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:---------------|:--------------|:---------------------------------------|:--------------------------------------|:---------------------------------------------------------------------| 47 | | 0 | Neural network-based graph embedding for cross-platform binary code similarity detection | 2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) | 48 | # 2016 49 | | | title | year | venue | task | model | dataset | pdf | code | 50 | |---:|:-------------------------------------------------|-------:|:--------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------| 51 | | 0 | Probabilistic model for code with decision trees | 2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) | | 52 | # 2015 53 | | | title | year | venue | task | model | dataset | pdf | code | 54 | 
|---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------| 55 | | 0 | Gated graph sequence neural networks | 2015 | ICLR | Program Verification | GGNN | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) | 56 | # 2014 57 | | | title | year | venue | task | model | dataset | pdf | code | 58 | |---:|:-------------------------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:-------------------------------------------|:---------------------------------------------------|:---------------------------------------------------------| 59 | | 0 | TBCNN: A tree-based convolutional neural network for programming language processing | 2014 | arxiv | Program Classification | TBCNN | OJClone | [📑](https://arxiv.org/abs/1409.5718v1) | [:octocat:](https://sites.google.com/site/treebasedcnn/) | 60 | | 1 | Modeling and discovering vulnerabilities with code property graphs | 2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) | | 61 | -------------------------------------------------------------------------------- /sequence_based_models/models.md: -------------------------------------------------------------------------------- 1 | # N-gram 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:--------------------------------------------------------------|-------:|:---------|:----------------|:--------|:----------|:------------------------------------------------------------------|:-------| 4 | | 0 | On the naturalness of software | 
2012 | None | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362) | | 5 | | 1 | On the localness of software | 2014 | FSE/ESEC | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875) | | 6 | | 2 | Phrase-Based Statistical Translation of Programming Languages | 2014 | OOPSLA | Code Generation | N-gram | | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf) | | 7 | # TreeBERT 8 | | | title | year | venue | task | model | dataset | pdf | code | 9 | |---:|:------------------------------------------------------------------|-------:|:--------|:---------|:---------|:----------|:---------------------------------------|:-----------------------------------------------| 10 | | 0 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language | 2021 | UAI | Pretrain | TreeBERT | | [📑](https://arxiv.org/abs/2105.12485) | [:octocat:](https://github.com/17385/TreeBERT) | 11 | # Others 12 | | | title | year | venue | task | model | dataset | pdf | code | 13 | |---:|:-------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:------------------------------------------------------------|:-------| 14 | | 0 | Retrieval-based Neural Source Code Summarization | 2020 | ICSE | Code Summarization | Others | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 15 | # word2vec 16 | | | title | year | venue | task | model | dataset | pdf | code | 17 | |---:|:---------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:------------------------------------------------------------|:-------| 18 | | 0 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 19 | | 1 | Exploring API embedding for API usages and 
applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 20 | # Multinomial Naive Bayes (MNB) 21 | | title | year | venue | task | model | dataset | pdf | code | 22 | |---------|--------|---------|--------|---------|-----------|-------|--------| 23 | # GRU 24 | | | title | year | venue | task | model | dataset | pdf | code | 25 | |---:|:--------------------------------------------------------------------------------------------|-------:|:---------|:-------------------|:--------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------| 26 | | 0 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling | 2020 | IST | Code Generation | GRU | | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) | | 27 | | 1 | A neural model for generating natural language summaries of program subroutines(astted-gru) | 2019 | ICSE | Code Summarization | GRU | | [📑](https://arxiv.org/pdf/1902.01954v1.pdf) | [:octocat:](https://github.com/mcmillco/funcom) | 28 | | 2 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 29 | # Bi-LSTM 30 | | | title | year | venue | task | model | dataset | pdf | code | 31 | 
|---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------------------------|:--------------------------------------------------------------------------------------------------|:--------------------------------------------------| 32 | | 0 | Latent Attention For If-Then Program Synthesis | 2016 | NeurIPS | Code Generation | Bi-LSTM | | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf) | | 33 | | 1 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 34 | | 2 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction | 2019 | | Safety Analysis | Bi-LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) | | 35 | | 3 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 36 | # word embedding 37 | | | title | year | venue | task | model | dataset | pdf | code | 38 | |---:|:-----------------------------------------------|-------:|:--------|:------------|:---------------|:----------|:------------------------------------------------------------|:-------| 39 | | 0 | Retrieval on Source Code: A Neural Code Search | 2018 | PLDI | Code Search | word embedding | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 40 | # Transformer 41 | | | title | year | venue | task | model | dataset | pdf | code | 42 | 
|---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:------------|:-------------------------|:-------------------------------------------------------------------|:--------------------------------------------------------| 43 | | 0 | A transformer-based approach for source code summarization | 2020 | ACL | Code Summarization | Transformer | | [📑](https://arxiv.org/abs/2005.00653) | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum) | 44 | | 1 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 2020 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2002.08155.pdf) | [:octocat:](https://github.com/microsoft/CodeBERT) | 45 | | 2 | Learning and Evaluating Contextual Embedding of Source Code | 2020 | ICML | Pretrain | Transformer | | [📑](https://proceedings.mlr.press/v119/kanade20a.html) | | 46 | | 3 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation | 2021 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2109.00859) | | 47 | | 4 | Structural language models of code | 2020 | ICML | Code Generation | Transformer | | [📑](https://proceedings.mlr.press/v119/alon20a.html) | | 48 | | 5 | Code prediction by Feeding Trees to Transformers | 2021 | ICSE | Code Generation | Transformer | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261) | | 49 | | 6 | A self-attentional neural architecture for code completion with multi-task learning | 2020 | ICPC | Code Generation | Transformer | | [📑](https://ieeexplore.ieee.org/abstract/document/9402114) | | 50 | | 7 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 51 | # CRF 52 | | | title | year | venue | task | model | 
dataset | pdf | code | 53 | |---:|:---------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-------| 54 | | 0 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 55 | # DNN 56 | | | title | year | venue | task | model | dataset | pdf | code | 57 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------| 58 | | 0 | Cclearner: A deep learning-based clone detection approach | 2017 | ICSME | Clone Detection | DNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426) | | 59 | # pointer network 60 | | | title | year | venue | task | model | dataset | pdf | code | 61 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------| 62 | | 0 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 63 | # DBN 64 | | | title | year | venue | task | model | dataset | pdf | code | 65 | |---:|:---------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------| 66 | | 0 | Automatically learning semantic features for defect prediction | 2016 | ICSE | Safety Analysis | 
DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) | | 67 | | 1 | Deep Semantic Feature Learning for Software Defect Prediction | 2020 | TSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) | | 68 | # CAN 69 | | | title | year | venue | task | model | dataset | pdf | code | 70 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:--------------------------------------------------------|:-------------------------------------------------------------------| 71 | | 0 | A convolutional attention network for extreme summarization of source code | 2016 | ICML | Code Summarization | CAN | Java | [📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) | 72 | # LSTM 73 | | | title | year | venue | task | model | dataset | pdf | code | 74 | |---:|:-----------------------------------------------------------------------------|-------:|:--------|:-----------------------|:---------------------|:-------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------| 75 | | 0 | A deep language model for software code | 2016 | None | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1608.02715) | | 76 | | 1 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 77 | | 2 | Latent Attention For If-Then Program Synthesis | 2016 | NeurIPS | Code Generation | Bi-LSTM | | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf) | | 78 | | 3 | Abstract Syntax Networks 
for Code Generation and Semantic Parsing | 2016 | ACL | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1704.07535 ) | | 79 | | 4 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 80 | | 5 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 81 | | 6 | Deep code comment generation | 2018 | ICPC | Code Summarization | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8973050) | [:octocat:](https://github.com/LRNavin/AutoComments) | 82 | | 7 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 83 | | 8 | Seml: A semantic lstm model for software defect prediction | 2019 | None | Safety Analysis | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8747001) | | 84 | | 9 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 85 | | 10 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 86 | | 11 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction | 2019 | | Safety Analysis | Bi-LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) | | 87 | | 12 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | 
[📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 88 | | 13 | Neural Program Repair by Jointly Learning to Localize and Repair | 2019 | ICLR | Program Repair | LSTM | DeepFix | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) | 89 | | 14 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation | 2020 | | Program Classification | LSTM | | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 90 | # RNN 91 | | | title | year | venue | task | model | dataset | pdf | code | 92 | |---:|:------------------------------------------------------------------------|-------:|:--------|:--------------------|:--------|:----------|:-----------------------------------------------------------------------------------------------------|:-------| 93 | | 0 | Code completion with statistical language models | 2014 | PLDI | Code Generation | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321) | | 94 | | 1 | Neural Code Comprehension: A Learnable Representation of Code Semantics | 2018 | NeurIPS | Code representation | RNN | | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html) | | 95 | | 2 | Deep code search | 2018 | ICSE | Code Search | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172) | | 96 | | 3 | Improving Code Search with Co-Attentive Representation Learning | 2020 | ICPC | Code Search | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269) | | 97 | | 4 | Deep learning code fragments for code clone detection | 2017 | ASE | Clone Detection | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) | | 98 | -------------------------------------------------------------------------------- /graph_based_models/tasks.md: 
-------------------------------------------------------------------------------- 1 | # Defect Prediction 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:----------------------------------|:------------------------------|:------------------------------------|:---------------------------------------------------|:-----------------------------------------------------------------------------| 4 | | 0 | Learning to represent programs with graphs | 2018 | ICLR | Defect Prediction,Code Generation | GGNN | iclr18-prog-graphs-dataset | [📑](https://arxiv.org/abs/1711.00740) | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) | 5 | | 1 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network | 2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | DGCNN | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) | | 6 | | 2 | Improving bug detection via context-based code representation learning and attention-based neural networks | 2019 | Proceedings of the ACM on Programming Languages | Defect Prediction | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588) | | 7 | # Code Search 8 | | | title | year | venue | task | model | dataset | pdf | code | 9 | |---:|:---------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------|:----------------------|:----------------------------------------------|:----------------------------------------------| 10 | | 0 | Capturing Source Code Semantics 
via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 11 | | 1 | Multi-modal attention network learning for semantic source code retrieval | 2019 | ASE | Code Search | GGNN, Tree-LSTM | C dataset | [📑](https://arxiv.org/abs/1909.13516v1) | | 12 | # Program Repair 13 | | | title | year | venue | task | model | dataset | pdf | code | 14 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------|:--------------|:---------------------------------------|:-----------------------------------------------------|:---------------------------------------------------------------------| 15 | | 0 | CODIT: Code Editing with Tree-Based Neural Models | 2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) | 16 | | 1 | GGF: A graph-based method for programming language syntax error correction | 2020 | ICPC | Program Repair | GGNN | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) | | 17 | | 2 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback | 2020 | ICML | Program Repair | GAT, LSTM | SPoC | [📑](http://arxiv.org/abs/2005.10636) | [:octocat:](https://github.com/michiyasunaga/DrRepair) | 18 | | 3 | Generative code modeling with graphs | 2019 | ICLR | Program Repair,Code Generation | GRU,GGNN | C# dataset | [📑](https://arxiv.org/abs/1805.08490v2) | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling) | 19 | | 4 | Neural network-based graph embedding for cross-platform binary code similarity 
detection | 2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) | 20 | | 5 | Learning semantic program embeddings with graph interval neural network | 2020 | Proceedings of the ACM on Programming Languages | Program Repair | GINN | PY150 | [📑](https://arxiv.org/abs/2005.09997v2) | | 21 | # Code Generation 22 | | | title | year | venue | task | model | dataset | pdf | code | 23 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------------------------|:--------------|:----------------------------------|:-----------------------------------------------------|:------------------------------------------------------------------------------------------| 24 | | 0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs | 2021 | AAAI | Code Generation | GAT | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499) | | 25 | | 1 | Learning to represent programs with graphs | 2018 | ICLR | Defect Prediction,Code Generation | GGNN | iclr18-prog-graphs-dataset | [📑](https://arxiv.org/abs/1711.00740) | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) | 26 | | 2 | Generative code modeling with graphs | 2019 | ICLR | Program Repair,Code Generation | GRU,GGNN | C# dataset | [📑](https://arxiv.org/abs/1805.08490v2) | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling) | 27 | | 3 | Probabilistic model for code with decision trees | 2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) | | 28 | | 4 | Open vocabulary learning on source code with a graph-structured caches | 2019 | ICML | Code Generation | MPNN,CharCNN | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2) | 
[:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) | 29 | # Program Verification 30 | | | title | year | venue | task | model | dataset | pdf | code | 31 | |---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------| 32 | | 0 | Gated graph sequence neural networks | 2015 | ICLR | Program Verification | GGNN | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) | 33 | # Program Classification 34 | | | title | year | venue | task | model | dataset | pdf | code | 35 | |---:|:-------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------------------|:------------------|:---------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------| 36 | | 0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree | 2019 | ICSE | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) | 37 | | 1 | TBCNN: A tree-based convolutional neural network for programming language processing | 2014 | arxiv | Program Classification | TBCNN | OJClone | [📑](https://arxiv.org/abs/1409.5718v1) | [:octocat:](https://sites.google.com/site/treebasedcnn/) | 38 | | 2 | Flow2Vec: value-flow-based precise code embedding | 2020 | Proceedings of the ACM on Programming Languages | Code Summarization, Program Classification | Flow2Vec | C Dataset | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) | | 39 | | 3 | 
Compiler-based graph representations for deep learning models of code | 2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification | GGNN | OpenCL Dataset | [📑](https://doi.org/10.1145/3377555.3377894) | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) | 40 | # Vulnerability Detection 41 | | | title | year | venue | task | model | dataset | pdf | code | 42 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:--------------------------------------------------| 43 | | 0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 44 | | 1 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks | 2019 | NIPS | Vulnerability Detection | GGNN, GRU, CNN | Devign Dataset | [📑](https://arxiv.org/abs/1909.03496v1) | [:octocat:](https://sites.google.com/view/devign) | 45 | | 2 | Modeling and discovering vulnerabilities with code property graphs | 2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) | | 46 | # Clone Detection 47 | | | title | year | venue | task | model | dataset | pdf | code | 48 |
|---:|:-----------------------------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:------------------------------------------------|:----------------------------|:-------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------| 49 | | 0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree | 2019 | ICSE | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) | 50 | | 1 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 51 | | 2 | Code Clone Detection with Hierarchical Attentive Graph Embedding | 2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection | GCN | IJDataset2.0 | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X) | | 52 | | 3 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing | 2020 | NDSS | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) | 53 | | 4 | Order matters: Semantic-aware neural networks for binary code similarity detection | 2020 | AAAI | Clone Detection | MPNN,CNN | gcc dataset | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466) | | 54 | | 5 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network | 2020 | QRS | Clone Detection | Transformer, 
GAT, CNN | OJClone | [📑](https://ieeexplore.ieee.org/document/9282778/) | [:octocat:](https://github.com/lbzwoaini/CSEM.git) | 55 | | 6 | How could Neural Networks understand Programs? | 2021 | ICML | Clone Detection | Transformer | OJClone | [📑](http://arxiv.org/abs/2105.04297) | [:octocat:](https://github.com/pdlan/OSCAR) | 56 | | 7 | DeepSim: Deep Learning Code Functional Similarity | 2018 | ESEC/FSE | Clone Detection | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068) | [:octocat:](https://github.com/parasol-aser/deepsim) | 57 | # Code Summarization 58 | | | title | year | venue | task | model | dataset | pdf | code | 59 | |---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:--------------------------------------------------|:----------------------------------------------------------------------------|:-------------------------------------------------|:------------------------------------------------------------------------------------| 60 | | 0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks | 2021 | EMNLP | Code Summarization | HAConvGNN | notebookcdg | [📑](https://arxiv.org/abs/2104.01002) | [:octocat:](https://github.com/xuyeliu/HAConvGNN) | 61 | | 1 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees | 2021 | EMNLP | Code Summarization | RNN,attention | TL-CodeSum | [📑](http://arxiv.org/abs/2108.12987) | [:octocat:](https://anonymous.4open.science/r/CAST/) | 62 | | 2 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | 
tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 63 | | 3 | Improving automatic source code summarization via deep reinforcement learning | 2018 | ASE | Code Summarization | RNN,Tree-RNN | code-comment pairs | [📑](https://arxiv.org/abs/1811.07234v1) | | 64 | | 4 | Structured neural summarization | 2019 | ICLR | Code Summarization | GGNN | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) | 65 | | 5 | Improved code summarization via a graph neural network | 2020 | ICPC | Code Summarization | ConvGNN | Java method-comment pairs | [📑](https://arxiv.org/abs/2004.02843v2) | | 66 | | 6 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting | 2021 | ICPC | Code Summarization | Tree-LSTM | CodeSearchNet, Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) | 67 | | 7 | Flow2Vec: value-flow-based precise code embedding | 2020 | Proceedings of the ACM on Programming Languages | Code Summarization, Program Classification | Flow2Vec | C Dataset | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) | | 68 | | 8 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network | 2021 | arXiv | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet | [📑](https://arxiv.org/abs/2107.01933v1) | | 69 | | 9 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN | 2021 | ICLR | Code Summarization | GNN | C Program Dataset | [📑](https://arxiv.org/abs/2006.05405v5) | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) | 70 | -------------------------------------------------------------------------------- /sequence_based_models/datasets.md:
-------------------------------------------------------------------------------- 1 | # Java 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:-----------------------------------------------------------------------------|-------:|:---------|:-------------------|:-------------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------| 4 | | 0 | A convolutional attention network for extreme summarization of source code | 2016 | ICML | Code Summarization | CAN | Java | [📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) | 5 | | 1 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 6 | | 2 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 7 | | 3 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 8 | | 4 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 9 | | 5 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 10 | # DeepFix 11 | | | title | year | venue | task | model |
dataset | pdf | code | 12 | |---:|:-----------------------------------------------------------------|-------:|:--------|:---------------|:--------|:----------|:---------------------------------------|:------------------------------------------------------| 13 | | 0 | Neural Program Repair by Jointly Learning to Localize and Repair | 2019 | ICLR | Program Repair | LSTM | DeepFix | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) | 14 | # TFix's Code Patches Data 15 | | | title | year | venue | task | model | dataset | pdf | code | 16 | |---:|:--------------------------------------------------------------------|-------:|:--------|:---------------|:------------|:-------------------------|:-------------------------------------------------------------------|:---------------------------------------------| 17 | | 0 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 18 | # C 19 | | | title | year | venue | task | model | dataset | pdf | code | 20 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-----------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------| 21 | | 0 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 22 | | 1 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# |
[📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 23 | | 2 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 24 | | 3 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 25 | | 4 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 26 | | 5 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 27 | # 9714 Java projects from GitHub 28 | | | title | year | venue | task | model | dataset | pdf | code | 29 | |---:|:-----------------------------------------------------------------------------|-------:|:---------|:-------------------|:--------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------| 30 | | 0 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 31 | # Python 32 | | | title | year | venue | task | model | dataset | pdf | code | 33 | 
|---:|:---------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-----------------------------------------------| 34 | | 0 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 35 | | 1 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 36 | # JavaScript 37 | | | title | year | venue | task | model | dataset | pdf | code | 38 | |---:|:---------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-------| 39 | | 0 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 40 | # JS150 41 | | | title | year | venue | task | model | dataset | pdf | code | 42 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------| 43 | | 0 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 44 | | 1 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) |
[:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 45 | # python 46 | | | title | year | venue | task | model | dataset | pdf | code | 47 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------| 48 | | 0 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 49 | # Uncategorized 50 | | | title | year | venue | task | model | dataset | pdf | code | 51 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:---------|:-----------------------|:------------------------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------| 52 | | 0 | On the naturalness of software | 2012 | None | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 ) | | 53 | | 1 | On the localness of software | 2014 | FSE/ESEC | Code Generation | N-gram | | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 ) | | 54 | | 2 | Phrase-Based Statistical Translation of Programming Languages | 2014 | OOPSLA | Code Generation | N-gram | | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf ) | | 55 | | 3 | A convolutional attention network for extreme summarization of source code | 2016 | ICML | Code Summarization | CAN | Java | 
[📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) | 56 | | 4 | Code completion with statistical language models | 2014 | PLDI | Code Generation | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321 ) | | 57 | | 5 | Neural Code Comprehension: A Learnable Representation of Code Semantics | 2018 | NeurIPS | Code representation | RNN | | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html ) | | 58 | | 6 | A deep language model for software code | 2016 | None | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1608.02715 ) | | 59 | | 7 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 60 | | 8 | Latent Attention For If-Then Program Synthesis | 2016 | NeurIPS | Code Generation | Bi-LSTM | | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf ) | | 61 | | 9 | Abstract Syntax Networks for Code Generation and Semantic Parsing | 2016 | ACL | Code Generation | LSTM | | [📑](https://arxiv.org/pdf/1704.07535 ) | | 62 | | 10 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling | 2020 | IST | Code Generation | GRU | | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) | | 63 | | 11 | A transformer-based approach for source code summarization | 2020 | ACL | Code Summarization | Transformer | | [📑](https://arxiv.org/abs/2005.00653 ) | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum) | 64 | | 12 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages | 2020 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2002.08155.pdf ) |
[:octocat:](https://github.com/microsoft/CodeBERT) | 65 | | 13 | Learning and Evaluating Contextual Embedding of Source Code | 2020 | ICML | Pretrain | Transformer | | [📑](https://proceedings.mlr.press/v119/kanade20a.html ) | | 66 | | 14 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation | 2021 | EMNLP | Pretrain | Transformer | | [📑](https://arxiv.org/pdf/2109.00859) | | 67 | | 15 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 68 | | 16 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 69 | | 17 | Automatically learning semantic features for defect prediction | 2016 | ICSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) | | 70 | | 18 | Deep Semantic Feature Learning for Software Defect Prediction | 2020 | TSE | Safety Analysis | DBN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) | | 71 | | 19 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 72 | | 20 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 73 | | 21 | Deep code comment generation | 2018 | ICPC | Code Summarization | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8973050) | [:octocat:](https://github.com/LRNavin/AutoComments) | 74 | | 22 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub repositories |
[📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 75 | | 23 | Seml: A semantic lstm model for software defect prediction | 2019 | None | Safety Analysis | LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8747001) | | 76 | | 24 | Modeling programs hierarchically with stack-augmented LSTM | 2020 | JSS | Code Generation | LSTM | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) | | 77 | | 25 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 78 | | 26 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction | 2019 | | Safety Analysis | Bi-LSTM | | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) | | 79 | | 27 | Pythia: AI-assisted Code Completion System | 2019 | SIGKDD | Code Generation | Bi-LSTM | Python | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) | 80 | | 28 | A neural model for generating natural language summaries of program subroutines(astted-gru) | 2019 | ICSE | Code Summarization | GRU | | [📑](https://arxiv.org/pdf/1902.01954v1.pdf) | [:octocat:](https://github.com/mcmillco/funcom) | 81 | | 29 | Deep code comment generation with hybrid lexical and syntactical information | 2020 | FSE/EFEC | Code Summarization | GRU | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) | 82 | | 30 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language | 2021 | UAI | Pretrain | TreeBERT | | [📑](https://arxiv.org/abs/2105.12485) | [:octocat:](https://github.com/17385/TreeBERT) | 83 | | 31 | 
Structural language models of code | 2020 | ICML | Code Generation | Transformer | | [📑](https://proceedings.mlr.press/v119/alon20a.html ) | | 84 | | 32 | Code prediction by Feeding Trees to Transformers | 2021 | ICSE | Code Generation | Transformer | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261) | | 85 | | 33 | A self-attentional neural architecture for code completion with multi-task learning | 2020 | ICPC | Code Generation | Transformer | | [📑](https://ieeexplore.ieee.org/abstract/document/9402114) | | 86 | | 34 | Retrieval-based Neural Source Code Summarization | 2020 | ICSE | Code Summarization | Others | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 87 | | 35 | Retrieval on Source Code: A Neural Code Search | 2018 | PLDI | Code Search | word embedding | | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) | | 88 | | 36 | Deep code search | 2018 | ICSE | Code Search | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172) | | 89 | | 37 | Improving Code Search with Co-Attentive Representation Learning | 2020 | ICPC | Code Search | RNN | | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269) | | 90 | | 38 | Cclearner: A deep learning-based clone detection approach | 2017 | ICSME | Clone Detection | DNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426) | | 91 | | 39 | Deep learning code fragments for code clone detection | 2017 | ASE | Clone Detection | RNN | | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) | | 92 | | 40 | Neural Program Repair by Jointly Learning to Localize and Repair | 2019 | ICLR | Program Repair | LSTM | DeepFix | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) | 93 | | 41 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer | 2021 | ICML | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) |
[:octocat:](https://github.com/eth-sri/TFix) | 94 | | 42 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation | 2020 | | Program Classification | LSTM | | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) | 95 | | 43 | SCC: Automatic Classification of Code Snippets | 2018 | | Program Classification | Multinomial Naive Bayes (MNB) | | [📑](https://arxiv.org/pdf/1809.07945v1.pdf) | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifie) | 96 | # PY150 97 | | | title | year | venue | task | model | dataset | pdf | code | 98 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------| 99 | | 0 | Neural Code Completion | 2018 | ICPC | Code Generation | LSTM | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg) | | 100 | | 1 | Code Completion with Neural Attention and Pointer Networks | 2018 | IJCAI | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) | 101 | # C# 102 | | | title | year | venue | task | model | dataset | pdf | code | 103 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-----------------------------|:------------------------------------------------------------|:--------------------------------------------------| 104 | | 0 | Summarizing Source Code using a Neural Attention Model | 2016 | ACL | Code Summarization | LSTM | C# | [📑](https://aclanthology.org/P16-1195.pdf) | [:octocat:](https://github.com/sriniiyer/codenn) | 105 | | 1 | A general path-based representation for predicting program properties | 2018 | PLDI | Code Generation |
word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) | | 106 | | 2 | Exploring API embedding for API usages and applications | 2017 | ICSE | Code Generation | word2vec | Java, C# | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | | 107 | | 3 | Code2seq: Generating Sequences from Structured Representations of Code | 2019 | ICLR | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400) | [:octocat:](https://github.com/tech-srl/code2seq) | 108 | # 10072 Java GitHub repositories 109 | | | title | year | venue | task | model | dataset | pdf | code | 110 | |---:|:-------------------------------------------------------|-------:|:--------|:----------------|:--------|:-------------------------------|:---------------------------------------|:--------------------------------------------------| 111 | | 0 | Code2vec: learning distributed representations of code | 2019 | POPL | Code Generation | LSTM | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) | 112 | # C#(dataset of CodeNN) 113 | | title | year | venue | task | model | dataset | pdf | code | 114 | |---------|--------|---------|--------|---------|-----------|-------|--------| 115 | -------------------------------------------------------------------------------- /graph_based_models/datasets.md: -------------------------------------------------------------------------------- 1 | # Java repos collected in this work 2 | | | title | year | venue | task | model | dataset | pdf | code | 3 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:----------------------------------|:-----------------------------------------|:------------------------------------------------------------------------------------------| 4 | | 0 | Open vocabulary learning on source code with a 
graph-structured caches | 2019 | ICML | Code Generation | MPNN,CharCNN | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2) | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) | 5 | # Code-Change-Data 6 | | | title | year | venue | task | model | dataset | pdf | code | 7 | |---:|:--------------------------------------------------|-------:|:------------------------------------------|:---------------|:--------|:---------------------------|:--------------------------------------|:----------------------------------| 8 | | 0 | CODIT: Code Editing with Tree-Based Neural Models | 2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) | 9 | # Hybrid-DeepCom Dataset 10 | | | title | year | venue | task | model | dataset | pdf | code | 11 | |---:|:----------------------------------------------------------------------------|-------:|:--------|:-------------------|:----------|:--------------------------------------|:-----------------------------------------|:--------------------------------------------| 12 | | 0 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting | 2021 | ICPC | Code Summarization | Tree-LSTM | CodeSearchNet, Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) | 13 | # JAVA method naming datasets 14 | | | title | year | venue | task | model | dataset | pdf | code | 15 | |---:|:--------------------------------|-------:|:--------|:-------------------|:--------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------| 16 | | 0 | Structured neural summarization | 2019 | ICLR | Code Summarization | GGNN | C# dataset,JAVA method naming datasets, Python method 
documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) | 17 | # ARM binary dataset 18 | | | title | year | venue | task | model | dataset | pdf | code | 19 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------| 20 | | 0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 21 | # Genius Dataset 22 | | | title | year | venue | task | model | dataset | pdf | code | 23 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:---------------|:--------------|:---------------------------------------|:--------------------------------------|:---------------------------------------------------------------------| 24 | | 0 | Neural network-based graph embedding for cross-platform binary code similarity detection | 2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) | 25 | # Google Code Jam (GCJ) 26 | | title | year | venue | task | model | dataset | pdf | code | 27 | |---------|--------|---------|--------|---------|-----------|-------|--------| 28 | # notebookcdg 29 | | | title | year | venue | task | model | dataset | pdf | code | 30 | 
|---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:----------|:------------|:---------------------------------------|:--------------------------------------------------| 31 | | 0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks | 2021 | EMNLP | Code Summarization | HAConvGNN | notebookcdg | [📑](https://arxiv.org/abs/2104.01002) | [:octocat:](https://github.com/xuyeliu/HAConvGNN) | 32 | # program variables dataset produced in this work 33 | | | title | year | venue | task | model | dataset | pdf | code | 34 | |---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------| 35 | | 0 | Gated graph sequence neural networks | 2015 | ICLR | Program Verification | GGNN | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) | 36 | # gcc dataset 37 | | | title | year | venue | task | model | dataset | pdf | code | 38 | |---:|:-----------------------------------------------------------------------------------|-------:|:--------|:----------------|:---------|:------------|:------------------------------------------------------------|:-------| 39 | | 0 | Order matters: Semantic-aware neural networks for binary code similarity detection | 2020 | AAAI | Clone Detection | MPNN,CNN | gcc dataset | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466) | | 40 | # Python method documentation dataset 41 | | | title | year | venue | task | model | dataset | pdf | code | 42 | 
|---:|:--------------------------------|-------:|:--------|:-------------------|:--------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------| 43 | | 0 | Structured neural summarization | 2019 | ICLR | Code Summarization | GGNN | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) | 44 | # JS150 45 | | | title | year | venue | task | model | dataset | pdf | code | 46 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------| 47 | | 0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs | 2021 | AAAI | Code Generation | GAT | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499) | | 48 | | 1 | Probabilistic model for code with decision trees | 2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) | | 49 | # Findutils 50 | | | title | year | venue | task | model | dataset | pdf | code | 51 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------| 52 | | 0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing | 2020 | NDSS | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) | 53 | # Validation dataset 54 | | | title | year | venue | task | model | dataset | pdf | code | 
55 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------| 56 | | 0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 57 | # Devign Dataset 58 | | | title | year | venue | task | model | dataset | pdf | code | 59 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:--------|:------------------------|:---------------|:---------------|:-----------------------------------------|:--------------------------------------------------| 60 | | 0 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks | 2019 | NIPS | Vulnerability Detection | GGNN, GRU, CNN | Devign Dataset | [📑](https://arxiv.org/abs/1909.03496v1) | [:octocat:](https://sites.google.com/view/devign) | 61 | # C Dataset 62 | | | title | year | venue | task | model | dataset | pdf | code | 63 | |---:|:-------------------------------------------------|-------:|:-------------------------------------------------|:-------------------------------------------|:---------|:----------|:-------------------------------------------------|:-------| 64 | | 0 | Flow2Vec: value-flow-based precise code embedding | 2020 | Proceedings of the ACM on Programming Languages | Code Summarization, Program Classification | Flow2Vec | C Dataset | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) | | 65 | # Diffutils 66 | | | title | year | venue | task | model | dataset | 
pdf | code | 67 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------| 68 | | 0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing | 2020 | NDSS | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) | 69 | # OpenCL Dataset 70 | | | title | year | venue | task | model | dataset | pdf | code | 71 | |---:|:----------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-----------------------|:--------|:---------------|:----------------------------------------------|:--------------------------------------------------------------| 72 | | 0 | Compiler-based graph representations for deep learning models of code | 2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification | GGNN | OpenCL Dataset | [📑](https://doi.org/10.1145/3377555.3377894) | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) | 73 | # code-comment pairs 74 | | | title | year | venue | task | model | dataset | pdf | code | 75 | |---:|:------------------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-------------------|:-----------------------------------------|:-------| 76 | | 0 | Improving automatic source code summarization via deep reinforcement learning | 2018 | ASE | Code Summarization | RNN,Tree-RNN | code-comment pairs | [📑](https://arxiv.org/abs/1811.07234v1) | | 77 | # BCB 78 | | | title | year | venue | task | model | dataset | pdf | code | 79 | 
|---:|:------------------------------------------------------------------------|-------:|:--------|:----------------------------------------|:------------------|:------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------| 80 | | 0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree | 2019 | ICSE | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) | 81 | # collected in this work 82 | | | title | year | venue | task | model | dataset | pdf | code | 83 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:------------------------|:------------------------------|:-------------------------------------------|:---------------------------------------------------|:------------------------------------------------------------------------------------------| 84 | | 0 | Neural network-based graph embedding for cross-platform binary code similarity detection | 2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) | 85 | | 1 | Improving bug detection via context-based code representation learning and attention-based neural networks | 2019 | Proceedings of the ACM on Programming Languages | Defect Prediction | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588) | | 86 | | 2 | Modeling and discovering vulnerabilities with code property graphs | 2014 | Proceedings IEEE Symposium on Security and 
Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) | | 87 | | 3 | Open vocabulary learning on source code with a graph-structured cache | 2019 | ICML | Code Generation | MPNN,CharCNN | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2) | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) | 88 | # Coreutils 89 | | | title | year | venue | task | model | dataset | pdf | code | 90 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------| 91 | | 0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing | 2020 | NDSS | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) | 92 | # DeepFix dataset 93 | | | title | year | venue | task | model | dataset | pdf | code | 94 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:---------------|:--------|:-----------------------------------|:-----------------------------------------------------|:-------| 95 | | 0 | GGF: A graph-based method for programming language syntax error correction | 2020 | ICPC | Program Repair | GGNN | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) | | 96 | # Syntax similar dataset 97 | | | title | year | venue | task | model | dataset | pdf | code | 98 | 
|---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------| 99 | | 0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 100 | # SPoC 101 | | | title | year | venue | task | model | dataset | pdf | code | 102 | |---:|:---------------------------------------------------------------------|-------:|:--------|:---------------|:----------|:----------|:--------------------------------------|:-------------------------------------------------------| 103 | | 0 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback | 2020 | ICML | Program Repair | GAT, LSTM | SPoC | [📑](http://arxiv.org/abs/2005.10636) | [:octocat:](https://github.com/michiyasunaga/DrRepair) | 104 | # IJDataset2.0 105 | | | title | year | venue | task | model | dataset | pdf | code | 106 | |---:|:-----------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:----------------|:--------|:-------------|:------------------------------------------------------------------------|:-------| 107 | | 0 | Code Clone Detection with Hierarchical Attentive Graph Embedding | 2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection | GCN | IJDataset2.0 | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X) | | 108 | # CodeForces dataset 109 | | | title | year | venue | task | model | dataset | pdf | code | 110 | 
|---:|:---------------------------------------------------------------------------|-------:|:--------|:---------------|:--------|:-----------------------------------|:-----------------------------------------------------|:-------| 111 | | 0 | GGF: A graph-based method for programming language syntax error correction | 2020 | ICPC | Program Repair | GGNN | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) | | 112 | # Linux kernel's code collected in this work 113 | | | title | year | venue | task | model | dataset | pdf | code | 114 | |---:|:-------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:-------------------------------------------|:---------------------------------------------------|:-------| 115 | | 0 | Modeling and discovering vulnerabilities with code property graphs | 2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) | | 116 | # OJClone 117 | | | title | year | venue | task | model | dataset | pdf | code | 118 | |---:|:-------------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------------|:----------------------|:-------------------------------------------------------------------------------------|:---------------------------------------------------------| 119 | | 0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree | 2019 | ICSE | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) | 120 | | 1 | TBCNN: A 
tree-based convolutional neural network for programming language processing | 2014 | arxiv | Program Classification | TBCNN | OJClone | [📑](https://arxiv.org/abs/1409.5718v1) | [:octocat:](https://sites.google.com/site/treebasedcnn/) | 121 | | 2 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 122 | | 3 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network | 2020 | QRS | Clone Detection | Transformer, GAT, CNN | OJClone | [📑](https://ieeexplore.ieee.org/document/9282778/) | [:octocat:](https://github.com/lbzwoaini/CSEM.git) | 123 | | 4 | How could Neural Networks understand Programs? | 2021 | ICML | Clone Detection | Transformer | OJClone | [📑](http://arxiv.org/abs/2105.04297) | [:octocat:](https://github.com/pdlan/OSCAR) | 124 | # YANCFG Dataset 125 | | | title | year | venue | task | model | dataset | pdf | code | 126 | |---:|:-----------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------|:--------|:-------------------------------|:---------------------------------------------------|:-------| 127 | | 0 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network | 2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | DGCNN | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) | | 128 | # Defects4J 129 | | | title | year | venue | task | model | dataset | pdf | code | 130 | 
|---:|:--------------------------------------------------|-------:|:------------------------------------------|:---------------|:--------|:---------------------------|:--------------------------------------|:----------------------------------| 131 | | 0 | CODIT: Code Editing with Tree-Based Neural Models | 2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) | 132 | # PY150 133 | | | title | year | venue | task | model | dataset | pdf | code | 134 | |---:|:------------------------------------------------------------------------|-------:|:------------------------------------------------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------| 135 | | 0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs | 2021 | AAAI | Code Generation | GAT | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499) | | 136 | | 1 | Learning semantic program embeddings with graph interval neural network | 2020 | Proceedings of the ACM on Programming Languages | Program Repair | GINN | PY150 | [📑](https://arxiv.org/abs/2005.09997v2) | | 137 | | 2 | Probabilistic model for code with decision trees | 2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) | | 138 | # iclr18-prog-graphs-dataset 139 | | | title | year | venue | task | model | dataset | pdf | code | 140 | |---:|:-------------------------------------------|-------:|:--------|:----------------------------------|:--------|:---------------------------|:---------------------------------------|:-----------------------------------------------------------------------------| 141 | | 0 | Learning to represent programs with graphs | 2018 | ICLR | Defect Prediction,Code Generation | GGNN | iclr18-prog-graphs-dataset | [📑](https://arxiv.org/abs/1711.00740) | 
[:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) | 142 | # TL-CodeSum 143 | | | title | year | venue | task | model | dataset | pdf | code | 144 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------|:-----------|:--------------------------------------|:-----------------------------------------------------| 145 | | 0 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees | 2021 | EMNLP | Code Summarization | RNN,attention | TL-CodeSum | [📑](http://arxiv.org/abs/2108.12987) | [:octocat:](https://anonymous.4open.science/r/CAST/) | 146 | # CoCoNet 147 | | | title | year | venue | task | model | dataset | pdf | code | 148 | |---:|:----------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------------------------------------------|:-----------------------|:-----------------------------------------|:-------| 149 | | 0 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network | 2021 | arxiv | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet | [📑](https://arxiv.org/abs/2107.01933v1) | | 150 | # MSKCFG Dataset 151 | | | title | year | venue | task | model | dataset | pdf | code | 152 | |---:|:-----------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------|:--------|:-------------------------------|:---------------------------------------------------|:-------| 153 | | 0 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network | 2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | 
DGCNN | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) | | 154 | # C Program Dataset 155 | | | title | year | venue | task | model | dataset | pdf | code | 156 | |---:|:---------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:------------------|:-----------------------------------------|:------------------------------------------------------------------------------------| 157 | | 0 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN | 2021 | ICLR | Code Summarization | GNN | C Program Dataset | [📑](https://arxiv.org/abs/2006.05405v5) | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) | 158 | # Java method-comment pairs 159 | | | title | year | venue | task | model | dataset | pdf | code | 160 | |---:|:-------------------------------------------------------|-------:|:--------|:-------------------|:--------|:--------------------------|:-----------------------------------------|:-------| 161 | | 0 | Improved code summarization via a graph neural network | 2020 | ICPC | Code Summarization | ConvGNN | Java method-comment pairs | [📑](https://arxiv.org/abs/2004.02843v2) | | 162 | # BigCloneBench 163 | | | title | year | venue | task | model | dataset | pdf | code | 164 | |---:|:---------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------------------|:-------------------------------------|:----------------------------------------------|:-----------------------------------------------------| 165 | | 0 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST | 2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | 
[📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) | 166 | | 1 | DeepSim: Deep Learning Code Functional Similarity | 2018 | ESEC/FSE | Clone Detection | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068) | [:octocat:](https://github.com/parasol-aser/deepsim) | 167 | # Firmware image dataset 168 | | | title | year | venue | task | model | dataset | pdf | code | 169 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------| 170 | | 0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network | 2021 | ASIA CCS | Vulnerability Detection | GTN | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) | | 171 | # C# dataset 172 | | | title | year | venue | task | model | dataset | pdf | code | 173 | |---:|:-------------------------------------|-------:|:--------|:-------------------------------|:---------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------| 174 | | 0 | Structured neural summarization | 2019 | ICLR | Code Summarization | GGNN | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) | 175 | | 1 | Generative code modeling with graphs | 2019 | ICLR | Program Repair,Code Generation | GRU,GGNN | C# dataset | [📑](https://arxiv.org/abs/1805.08490v2) | 
[:octocat:](https://github.com/Microsoft/graph-based-code-modelling) | 176 | # CodeSearchNet 177 | | | title | year | venue | task | model | dataset | pdf | code | 178 | |---:|:----------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------------------------------------------|:--------------------------------------|:-----------------------------------------|:--------------------------------------------| 179 | | 0 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting | 2021 | ICPC | Code Summarization | Tree-LSTM | CodeSearchNet, Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) | 180 | | 1 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network | 2021 | arxiv | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet | [📑](https://arxiv.org/abs/2107.01933v1) | | 181 | # Java Dataset collected in this work 182 | | | title | year | venue | task | model | dataset | pdf | code | 183 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:------------------------------------------------|:------------------|:------------------------------|:------------------------------------|:-------------------------------------------------|:-------| 184 | | 0 | Improving bug detection via context-based code representation learning and attention-based neural networks | 2019 | Proceedings of the ACM on Programming Languages | Defect Prediction | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588) | | 185 | # C dataset 186 | | | title | year | venue | task | model | dataset | pdf | code | 187 | 
|---:|:--------------------------------------------------------------------------|-------:|:--------|:------------|:----------------|:----------|:-----------------------------------------|:-------| 188 | | 0 | Multi-modal attention network learning for semantic source code retrieval | 2019 | ASE | Code Search | GGNN, Tree-LSTM | C dataset | [📑](https://arxiv.org/abs/1909.13516v1) | | 189 | --------------------------------------------------------------------------------