├── .DS_Store
├── dataset_with_link.md
├── TODO.md
├── README.md
├── script.py
├── graph_based_models.md
├── sequence_based_models.md
├── sequence_based_models
    ├── years.md
    ├── tasks.md
    ├── models.md
    └── datasets.md
└── graph_based_models
    ├── years.md
    ├── tasks.md
    └── datasets.md


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codingClaire/Structural-Code-Understanding/HEAD/.DS_Store


--------------------------------------------------------------------------------
/dataset_with_link.md:
--------------------------------------------------------------------------------
 1 | name: PROMISE
 2 | language:
 3 | link: http://openscience.us/repo/defect
 4 | 
 5 | name: OJ
 6 | language:
 7 | link: http://programming.grids.cn/
 8 | 
 9 | name: notebookcdg
10 | language:
11 | link: https://paperswithcode.com/dataset/notebookcdg
12 | 
13 | name: C#
14 | language:C#
15 | link: https://archive.org/details/stackexchange
16 | 
17 | 


--------------------------------------------------------------------------------
/TODO.md:
--------------------------------------------------------------------------------
 1 | # Contributing to the repository
 2 | 0. Install the requirements:
 3 | `pip install pandas tabulate`
 4 | 1. `git clone` the repository, or `git pull` if you already have a copy.
 5 | 2. Add the new paper's information to [sequence_based_models.md](sequence_based_models.md) or [graph_based_models.md](graph_based_models.md) in the following format:
 6 | 
 7 | > title: 
 8 | > year: 
 9 | > venue: 
10 | > task: Code Generation/Code Summarization/Safety Analysis/Program Repair/Program Classification/Others
11 | > model: CNN/Transformer/GNN/... (separate multiple models with commas)
12 | > dataset: (check that the dataset is not already listed under a different spelling)
13 | > pdf: 
14 | > code: 
15 | 
16 | 3. Run `script.py` and check the generated markdown files.
17 | 4. `git add`, `git commit`, and `git push` your changes.
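
The entry format described above can be read back with a few lines of Python. This is a minimal sketch, not the repository's `script.py`: the helper name `parse_entries` is hypothetical, and it assumes each entry is a run of `field: value` lines closed by a blank line. The sample entry is copied from `graph_based_models.md`.

```python
FIELDS = ["title", "year", "venue", "task", "model", "dataset", "pdf", "code"]


def parse_entries(text):
    """Parse blank-line-separated `field: value` blocks into dicts (hypothetical helper)."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the entry being collected, if any.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first colon only, so URLs and titles keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


sample = """title: Gated graph sequence neural networks
year: 2015
venue: ICLR
task: Program Verification
model: GGNN
dataset: program variables dataset produced in this work
pdf: http://arxiv.org/abs/1511.05493
code: https://github.com/yujiali/ggnn
"""

for entry in parse_entries(sample):
    print(entry["title"], "|", entry["year"], "|", entry["model"])
```

Splitting on the first colon only is what keeps `http://...` links and titles such as `CODIT: Code Editing ...` intact.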
18 | 
19 | 
20 | code-change-data:https://drive.google.com/ file/d/1wSl SN17tbATqlhNMO0O7sEkH9gqJ9Vr.
21 | 
22 | 
23 | code-comment pairs:A parallel corpus of python functions and documentation strings for automated code documentation and code generation
24 | 
25 | 
26 | 
27 | [Python](https://paperswithcode.com/dataset/100doh)
28 | 
29 | [Python](https://github.com/LiuFang816/SALSTM_py_data)
30 | 
31 | [C#](https://archive.org/details/stackexchange)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Code Understanding Literature in Deep Learning
  2 | Check out our survey on [arXiv](https://arxiv.org/abs/2205.01293)!
  3 | ## Sequence-based Models
  4 | Total: 44 papers
  5 | ### [By years](sequence_based_models/years.md)
  6 | - 2021:4 paper(s)
  7 | - 2020:12 paper(s)
  8 | - 2019:7 paper(s)
  9 | - 2018:8 paper(s)
 10 | - 2017:3 paper(s)
 11 | - 2016:6 paper(s)
 12 | - 2014:3 paper(s)
 13 | - 2012:1 paper(s)
 14 | ### [By tasks](sequence_based_models/tasks.md)
 15 | - Program Classification :2 paper(s)
 16 | - Code Search:3 paper(s)
 17 | - Code Generation:19 paper(s)
 18 | - Pretrain:4 paper(s)
 19 | - Code representation:1 paper(s)
 20 | - Safety Analysis:4 paper(s)
 21 | - Program Repair :2 paper(s)
 22 | - Clone Detection :2 paper(s)
 23 | - Code Summarization:7 paper(s)
 24 | ### [By datasets](sequence_based_models/datasets.md)
 25 | - Java:6 paper(s)
 26 | - DeepFix:1 paper(s)
 27 | - TFix's Code Patches Data:1 paper(s)
 28 | - C:6 paper(s)
 29 | - 9714 Java projects from GitHub:1 paper(s)
 30 | - Python:2 paper(s)
 31 | - JavaScript:1 paper(s)
 32 | - JS150:2 paper(s)
 33 | - python:1 paper(s)
 34 | - Uncategorized:44 paper(s)
 35 | - PY150:2 paper(s)
 36 | - C#:4 paper(s)
 37 | - 10072 Java GitHub repositories:1 paper(s)
 38 | - C#(dataset of CodeNN):0 paper(s)
 39 | ### [By models](sequence_based_models/models.md)
 40 | - N-gram:3 paper(s)
 41 | - TreeBERT:1 paper(s)
 42 | - Others:1 paper(s)
 43 | - word2vec:2 paper(s)
 44 | - Multinomial Naive Bayes (MNB) :0 paper(s)
 45 | - GRU:3 paper(s)
 46 | - Bi-LSTM:4 paper(s)
 47 | - word embedding:1 paper(s)
 48 | - Transformer:8 paper(s)
 49 | - CRF:1 paper(s)
 50 | - DNN:1 paper(s)
 51 | - pointer network:1 paper(s)
 52 | - DBN:2 paper(s)
 53 | - CAN:1 paper(s)
 54 | - LSTM:15 paper(s)
 55 | - RNN:5 paper(s)
 56 | ## Graph-based Models
 57 | Total: 36 papers
 58 | ### [By years](graph_based_models/years.md)
 59 | - 2021:9 paper(s)
 60 | - 2020:10 paper(s)
 61 | - 2019:9 paper(s)
 62 | - 2018:3 paper(s)
 63 | - 2017:1 paper(s)
 64 | - 2016:1 paper(s)
 65 | - 2015:1 paper(s)
 66 | - 2014:2 paper(s)
 67 | ### [By tasks](graph_based_models/tasks.md)
 68 | - Defect Prediction:3 paper(s)
 69 | - Code Search:2 paper(s)
 70 | - Program Repair:6 paper(s)
 71 | - Code Generation:5 paper(s)
 72 | - Program Verification:1 paper(s)
 73 | - Program Classification:4 paper(s)
 74 | - Vulnerability Detection:3 paper(s)
 75 | - Clone Detection:8 paper(s)
 76 | - Code Summarization:10 paper(s)
 77 | ### [By datasets](graph_based_models/datasets.md)
 78 | - Java repos collected in this work:1 paper(s)
 79 | - Code-Change-Data:1 paper(s)
 80 | - Hybrid-DeepCom Dataset:1 paper(s)
 81 | - JAVA method naming datasets:1 paper(s)
 82 | - ARM binary dataset:1 paper(s)
 83 | - Genius Dataset:1 paper(s)
 84 | - Google Code Jam (GCJ):0 paper(s)
 85 | - notebookcdg:1 paper(s)
 86 | - program variables dataset produced in this work:1 paper(s)
 87 | - gcc dataset:1 paper(s)
 88 | - Python method documentation dataset:1 paper(s)
 89 | - JS150:2 paper(s)
 90 | - Findutils:1 paper(s)
 91 | - Validation dataset:1 paper(s)
 92 | - Devign Dataset:1 paper(s)
 93 | - C Dataset:1 paper(s)
 94 | - Diffutils:1 paper(s)
 95 | - OpenCL Dataset:1 paper(s)
 96 | - code-comment pairs:1 paper(s)
 97 | - BCB:1 paper(s)
 98 | - collected in this work:4 paper(s)
 99 | - Coreutils:1 paper(s)
100 | - DeepFix dataset:1 paper(s)
101 | - Syntax similar dataset:1 paper(s)
102 | - SPoC:1 paper(s)
103 | - IJDataset2.0:1 paper(s)
104 | - CodeForces dataset:1 paper(s)
105 | - Linux kernel's code collected in this work:1 paper(s)
106 | - OJClone:5 paper(s)
107 | - YANCFG Dataset:1 paper(s)
108 | - Defects4J:1 paper(s)
109 | - PY150:3 paper(s)
110 | - iclr18-prog-graphs-dataset:1 paper(s)
111 | - TL-CodeSum:1 paper(s)
112 | - CoCoNet:1 paper(s)
113 | - MSKCFG Dataset:1 paper(s)
114 | - C Program Dataset:1 paper(s)
115 | - Java method-comment pairs:1 paper(s)
116 | - BigCloneBench:2 paper(s)
117 | - Firmware image dataset:1 paper(s)
118 | - C# dataset:2 paper(s)
119 | - CodeSearchNet:2 paper(s)
120 | - Java Dataset collected in this work:1 paper(s)
121 | - C dataset:1 paper(s)
122 | ### [By models](graph_based_models/models.md)
123 | - GINN:1 paper(s)
124 | - Tree-RNN:1 paper(s)
125 | - GRU:3 paper(s)
126 | - Text-associated DeepWalk:1 paper(s)
127 | - Multi-Relational Graph Neural Network:1 paper(s)
128 | - TBCNN:1 paper(s)
129 | - Flow2Vec:1 paper(s)
130 | - LSTM:5 paper(s)
131 | - DGCNN:1 paper(s)
132 | - GNN:11 paper(s)
133 | - Transformer:3 paper(s)
134 | - Structure2vec:1 paper(s)
135 | - MPNN:2 paper(s)
136 | - GAT:3 paper(s)
137 | - CNN:7 paper(s)
138 | - GCN:2 paper(s)
139 | - RNN:3 paper(s)
140 | - Tree-LSTM:2 paper(s)
141 | - tree-based LSTM:1 paper(s)
142 | - Decision tree:1 paper(s)
143 | - HAConvGNN:1 paper(s)
144 | - ConvGNN:2 paper(s)
145 | - GTN:1 paper(s)
146 | - GGNN:8 paper(s)
147 | - CharCNN:1 paper(s)
148 | - Feed-forward neural network:1 paper(s)
149 | - bidirectional RNN:1 paper(s)
150 | - code property graphs:1 paper(s)
151 | - attention:1 paper(s)
152 | - Attention mechanism:1 paper(s)
153 | 


--------------------------------------------------------------------------------
/script.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | import json
  3 | import numpy as np
  4 | import pandas as pd
  5 | from pandas.core.frame import DataFrame
  6 | 
  7 | 
  8 | def write_data(fname, dic, del_col):
  9 |     with open(fname, "w", encoding="utf-8") as f:
 10 |         for k in dic.keys():
 11 |             df_k = dic[k].reset_index(drop=True)
 12 |             # del_col is currently unused: the grouping column stays in the table.
 13 |             if k == "\n":
 14 |                 k = "Uncategorized\n"
 15 |             f.write("# " + k)
 16 |             df_k.to_markdown(f)
 17 |             f.write("\n")
 18 | 
 19 | 
 20 | if __name__ == "__main__":
 21 |     source_list = ["title", "year", "venue",
 22 |                    "task", "model", "dataset", "pdf", "code"]
 23 |     names = ['sequence_based_models', 'graph_based_models']
 24 |     #names = ["graph_based_models"]
 25 |     print("#" * 10, "Begin", "#" * 10)
 26 | 
 27 |     statistic_dic = {
 28 |         names[0]: {
 29 |             "years": {},
 30 |             "tasks": {},
 31 |             "datasets": {},
 32 |             "models": {},
 33 |         },
 34 |         names[1]: {
 35 |             "years": {},
 36 |             "tasks": {},
 37 |             "datasets": {},
 38 |             "models": {},
 39 |         },
 40 |     }
 41 | 
 42 |     for name in names:
 43 |         print("generating " + name + " markdown files!")
 44 | 
 45 |         with open(name + ".md", "r", encoding="UTF-8") as f:
 46 |             content = f.readlines()
 47 |         lineno = 0
 48 |         papers = []
 49 |         # parse paper from source.md
 50 |         while lineno < len(content):
 51 |             if content[lineno][0:5] == "title":  # start of a paper entry
 52 |                 paper = []
 53 |                 for offset in range(0, len(source_list)):
 54 |                     # Remove only the leading "<field>:" label. str.strip(chars)
 55 |                     # would also eat matching characters from the value itself.
 56 |                     line = content[lineno + offset]
 57 |                     prefix = source_list[offset] + ":"
 58 |                     if line.startswith(prefix):
 59 |                         line = line[len(prefix):]
 60 |                     paper.append(line.lstrip(" "))
 61 |                 papers.append(paper)
 62 |             lineno += 1
 63 | 
 64 |         # generate dataframe of total paper
 65 |         papers_df = pd.DataFrame(papers, columns=source_list)
 66 |         papers_df = papers_df.drop_duplicates(subset=["title"])
 67 |         # add pdf and code icon
 68 | 
 69 |         papers_df["pdf"] = papers_df["pdf"].map(
 70 |             lambda x: "[📑](" + x[0:-1] + ")" if x != "\n" else x
 71 |         )
 72 |         papers_df["code"] = papers_df["code"].map(
 73 |             lambda x: "[:octocat:](" + x[0:-1] + ")" if x != "\n" else x
 74 |         )
 75 | 
 76 |         # sort by year
 77 |         year_dic = dict(list(papers_df.groupby(papers_df["year"])))
 78 |         year_dic = dict(
 79 |             sorted(year_dic.items(), key=lambda item: item[0], reverse=True))
 80 |         write_data(name+"/years.md", year_dic, "year")
 81 |         for k in year_dic.keys():
 82 |             statistic_dic[name]["years"][k] = year_dic[k].shape[0]
 83 |         print("Finish sort by years!")
 84 | 
 85 |         ############ sort by task ############
 86 |         total_tasks = []
 87 |         task_dic = {}
 88 |         for task_str in list(papers_df["task"]):
 89 |             sublist = task_str.split(",")
 90 |             for element in range(len(sublist)):
 91 |                 while(sublist[element][0] == ' '):
 92 |                     sublist[element] = sublist[element][1:]
 93 |                 if(sublist[element][-1] != '\n'):
 94 |                     sublist[element] = sublist[element]+'\n'
 95 |             total_tasks.extend(sublist)
 96 | 
 97 |         total_tasks = list(set(total_tasks))
 98 |         for task in total_tasks:
 99 |             mask = papers_df["task"].str.contains(task[0:-1], regex=False)
100 |             tasks_df = papers_df[mask]
101 |             task_dic[task] = tasks_df
102 |         write_data(name + "/tasks.md", task_dic, "task")
103 |         for k in task_dic.keys():
104 |             statistic_dic[name]["tasks"][k] = task_dic[k].shape[0]
105 |         print("Finish sort by tasks!")
106 | 
107 |         ############ sort by dataset ############
108 |         total_datasets = []
109 |         dataset_dic = {}
110 |         for dataset_str in list(papers_df["dataset"]):
111 |             sublist = dataset_str.split(",")
112 |             for element in range(len(sublist)):
113 |                 while(sublist[element][0] == ' '):
114 |                     sublist[element] = sublist[element][1:]
115 |                 if(sublist[element][-1] != '\n'):
116 |                     sublist[element] = sublist[element]+'\n'
117 |             total_datasets.extend(sublist)
118 |         total_datasets = list(set(total_datasets))
119 |         #print(total_datasets)
120 |         for dataset in total_datasets:
121 |             #print(dataset[0:-1])
122 |             mask = papers_df["dataset"].str.contains(dataset[0:-1], regex=False)
123 |             datasets_df = papers_df[mask]
124 |             dataset_dic[dataset] = datasets_df
125 | 
126 |         write_data(name + "/datasets.md", dataset_dic, "dataset")
127 |         for k in dataset_dic.keys():
128 |             statistic_dic[name]["datasets"][k] = dataset_dic[k].shape[0]
129 |         print("Finish sort by datasets!")
130 | 
131 |         ############ sort by model ############
132 |         total_models = []
133 |         model_dic = {}
134 |         for model_str in list(papers_df["model"]):
135 |             sublist = model_str.split(",")
136 |             for element in range(len(sublist)):
137 |                 while(sublist[element][0] == ' '):
138 |                     sublist[element] = sublist[element][1:]
139 |                 if(sublist[element][-1] != '\n'):
140 |                     sublist[element] = sublist[element]+'\n'
141 |             total_models.extend(sublist)
142 |         total_models = list(set(total_models))
143 |         for model in total_models:
144 |             mask = papers_df["model"].str.contains(model[0:-1], regex=False)
145 |             models_df = papers_df[mask]
146 |             model_dic[model] = models_df
147 |         write_data(name + "/models.md", model_dic, "model")
148 |         for k in model_dic.keys():
149 |             statistic_dic[name]["models"][k] = model_dic[k].shape[0]
150 |         print("Finish sort by models!")
151 | 
152 |     # Write markdown files
153 |     with open("README.md", "w", encoding="utf-8") as f:
154 |         categories = ["years", "tasks", "datasets", "models"]
155 |         f.writelines("# Code Understanding Literature in Deep Learning\n")
156 |         f.writelines("## Sequence-based Models\n")
157 |         total_number=0
158 |         for k in statistic_dic[names[0]]["years"].keys():
159 |             total_number+=statistic_dic[names[0]]["years"][k]
160 |         f.writelines("Total: "+ str(total_number)+" papers\n")
161 |         for category in categories:
162 |             f.writelines(
163 |                 "### [By "+category+"](sequence_based_models/"+category+".md)\n")
164 |             for k in statistic_dic[names[0]][category].keys():
165 |                 num = str(statistic_dic[names[0]][category][k])
166 |                 k = "Uncategorized\n" if k == "\n" else k
167 |                 f.writelines(
168 |                     "- " + k[0:-1] + ":" + num + " paper(s)\n"
169 |                 )
170 |         f.writelines("## Graph-based Models\n")
171 |         total_number = 0
172 |         for k in statistic_dic[names[1]]["years"].keys():
173 |             total_number += statistic_dic[names[1]]["years"][k]
174 |         f.writelines("Total: " + str(total_number)+" papers\n")
175 |         for category in categories:
176 |             f.writelines(
177 |                 "### [By "+category+"](graph_based_models/"+category+".md)\n")
178 |             for k in statistic_dic[names[1]][category].keys():
179 |                 num = str(statistic_dic[names[1]][category][k])
180 |                 k = "Uncategorized\n" if k == "\n" else k
181 |                 f.writelines(
182 |                     "- " + k[0:-1] + ":" + num + " paper(s)\n"
183 |                 )
184 | 
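
The task, dataset, and model groupings in `script.py` select rows with `str.contains`, which is a substring test: a label such as `Java` also selects papers tagged `JavaScript`, and an empty dataset field matches every paper. One way to count exact labels instead is sketched below using only the standard library. `split_labels` and `group_by_label` are hypothetical helpers, not part of the repository, and the sample rows are taken from entries in this repository.

```python
from collections import defaultdict


def split_labels(field):
    """Split a comma-separated field into trimmed, non-empty labels."""
    return [part.strip() for part in field.split(",") if part.strip()]


def group_by_label(papers, column):
    """Map each exact label found in `column` to the titles that carry it."""
    groups = defaultdict(list)
    for paper in papers:
        # Papers with an empty field are collected under one explicit bucket.
        labels = split_labels(paper[column]) or ["Uncategorized"]
        for label in labels:
            groups[label].append(paper["title"])
    return dict(groups)


papers = [
    {"title": "A general path-based representation for predicting program properties",
     "dataset": "JavaScript, Java, Python, C#"},
    {"title": "A convolutional attention network for extreme summarization of source code",
     "dataset": "Java"},
    {"title": "DeepSim: Deep Learning Code Functional Similarity",
     "dataset": "Google Code Jam (GCJ), BigCloneBench"},
    {"title": "On the naturalness of software", "dataset": ""},
]

for label, titles in sorted(group_by_label(papers, "dataset").items()):
    print(f"{label}: {len(titles)} paper(s)")
```

Because labels are compared as whole strings, names containing parentheses or `#` need no escaping, and `Java` and `JavaScript` stay separate groups.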


--------------------------------------------------------------------------------
/graph_based_models.md:
--------------------------------------------------------------------------------
  1 | title:HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks
  2 | year: 2021
  3 | venue: EMNLP
  4 | task: Code Summarization
  5 | model: HAConvGNN
  6 | dataset: notebookcdg
  7 | pdf: https://arxiv.org/abs/2104.01002
  8 | code: https://github.com/xuyeliu/HAConvGNN
  9 | 
 10 | title:Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs
 11 | year: 2021
 12 | venue: AAAI
 13 | task: Code Generation
 14 | model: GAT
 15 | dataset: JS150, PY150
 16 | pdf: http://arxiv.org/abs/2103.09499
 17 | code:
 18 | 
 19 | title: CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees
 20 | year: 2021
 21 | venue: EMNLP
 22 | task: Code Summarization
 23 | model: RNN,attention
 24 | dataset: TL-CodeSum
 25 | pdf: http://arxiv.org/abs/2108.12987
 26 | code: https://anonymous.4open.science/r/CAST/
 27 | 
 28 | title: Learning to represent programs with graphs
 29 | year: 2018
 30 | venue: ICLR
 31 | task: Defect Prediction,Code Generation
 32 | model: GGNN
 33 | dataset: iclr18-prog-graphs-dataset
 34 | pdf: https://arxiv.org/abs/1711.00740
 35 | code: https://github.com/Microsoft/gated-graph-neural-network-samples
 36 | 
 37 | title: A Novel Neural Source Code Representation Based on Abstract Syntax Tree
 38 | year: 2019
 39 | venue: ICSE
 40 | task: Program Classification, Clone Detection
 41 | model: bidirectional RNN
 42 | dataset: OJClone,BCB
 43 | pdf: https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92
 44 | code: https://github.com/zhangj1994/astnn
 45 | 
 46 | title: TBCNN: A tree-based convolutional neural network for programming language processing
 47 | year: 2014
 48 | venue: arXiv
 49 | task: Program Classification
 50 | model: TBCNN
 51 | dataset: OJClone
 52 | pdf: https://arxiv.org/abs/1409.5718v1
 53 | code: https://sites.google.com/site/treebasedcnn/
 54 | 
 55 | title: Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST
 56 | year: 2019
 57 | venue: ACM International Conference on Computing Frontiers
 58 | task: Clone Detection,Code Search, Code Summarization
 59 | model: tree-based LSTM
 60 | dataset:OJClone,BigCloneBench
 61 | pdf: https://doi.org/10.1145/3310273.3321560
 62 | code: https://github.com/milkfan/TBCAA
 63 | 
 64 | title: Improving automatic source code summarization via deep reinforcement learning
 65 | year: 2018
 66 | venue: ASE
 67 | task: Code Summarization
 68 | model: RNN,Tree-RNN
 69 | dataset: code-comment pairs
 70 | pdf: https://arxiv.org/abs/1811.07234v1
 71 | code:
 72 | 
 73 | title: CODIT: Code Editing with Tree-Based Neural Models
 74 | year: 2020
 75 | venue: IEEE Transactions on Software Engineering
 76 | task: Program Repair
 77 | model: LSTM
 78 | dataset: Defects4J,Code-Change-Data
 79 | pdf: http://arxiv.org/abs/1810.00314
 80 | code: https://git.io/JJGwU
 81 | 
 82 | title: Gated graph sequence neural networks
 83 | year: 2015
 84 | venue: ICLR
 85 | task: Program Verification
 86 | model: GGNN
 87 | dataset: program variables dataset produced in this work
 88 | pdf: http://arxiv.org/abs/1511.05493
 89 | code: https://github.com/yujiali/ggnn
 90 | 
 91 | title: Structured neural summarization
 92 | year: 2019
 93 | venue: ICLR
 94 | task: Code Summarization
 95 | model: GGNN
 96 | dataset: C# dataset,JAVA method naming datasets, Python method documentation dataset
 97 | pdf: https://arxiv.org/abs/1811.01824v4
 98 | code: https://github.com/CoderPat/structured-neural-summarization
 99 | 
100 | title: GGF: A graph-based method for programming language syntax error correction
101 | year: 2020
102 | venue: ICPC
103 | task: Program Repair
104 | model: GGNN
105 | dataset: DeepFix dataset,CodeForces dataset
106 | pdf: https://dl.acm.org/doi/10.1145/3387904.3389252
107 | code:
108 | 
109 | title: Improved code summarization via a graph neural network
110 | year: 2020
111 | venue: ICPC
112 | task: Code Summarization
113 | model: ConvGNN
114 | dataset: Java method-comment pairs
115 | pdf: https://arxiv.org/abs/2004.02843v2
116 | code:
117 | 
118 | title: Code Clone Detection with Hierarchical Attentive Graph Embedding
119 | year: 2021
120 | venue: International Journal of Software Engineering and Knowledge Engineering
121 | task: Clone Detection
122 | model: GCN
123 | dataset: IJDataset2.0
124 | pdf: https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X
125 | code:
126 | 
127 | title: Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
128 | year: 2020
129 | venue: ICML
130 | task: Program Repair
131 | model: GAT, LSTM
132 | dataset: SPoC
133 | pdf: http://arxiv.org/abs/2005.10636
134 | code: https://github.com/michiyasunaga/DrRepair
135 | 
136 | title: Generative code modeling with graphs
137 | year: 2019
138 | venue: ICLR
139 | task: Program Repair,Code Generation
140 | model: GRU,GGNN
141 | dataset: C# dataset
142 | pdf: https://arxiv.org/abs/1805.08490v2
143 | code: https://github.com/Microsoft/graph-based-code-modelling
144 | 
145 | title: Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting
146 | year: 2021
147 | venue: ICPC
148 | task: Code Summarization
149 | model: Tree-LSTM
150 | dataset: CodeSearchNet, Hybrid-DeepCom Dataset
151 | pdf: https://arxiv.org/abs/2103.07845v2
152 | code: https://github.com/XMUDM/BASTS
153 | 
154 | title: Neural network-based graph embedding for cross-platform binary code similarity detection
155 | year: 2017
156 | venue: Proceedings of the ACM Conference on Computer and Communications Security
157 | task: Program Repair
158 | model: Structure2vec
159 | dataset: collected in this work, Genius Dataset
160 | pdf: http://arxiv.org/abs/1708.06525
161 | code: https://github.com/xiaojunxu/dnn-binary-code-similarity
162 | 
163 | title: BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network
164 | year: 2021
165 | venue: ASIA CCS
166 | task: Vulnerability Detection
167 | model: GTN
168 | dataset: Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset
169 | pdf: https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf
170 | code:
171 | 
172 | title: DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing
173 | year: 2020
174 | venue: NDSS
175 | task: Clone Detection
176 | model: Text-associated DeepWalk
177 | dataset: Coreutils, Diffutils, Findutils
178 | pdf: https://dx.doi.org/10.14722/ndss.2020.24311
179 | code: https://github.com/deepbindiff/DeepBinDiff
180 | 
181 | title: Order matters: Semantic-aware neural networks for binary code similarity detection
182 | year: 2020
183 | venue: AAAI
184 | task: Clone Detection
185 | model: MPNN,CNN
186 | dataset: gcc dataset
187 | pdf: https://ojs.aaai.org/index.php/AAAI/article/view/5466
188 | code:
189 | 
190 | title: Learning semantic program embeddings with graph interval neural network
191 | year: 2020
192 | venue: Proceedings of the ACM on Programming Languages
193 | task: Program Repair
194 | model: GINN
195 | dataset: PY150
196 | pdf: https://arxiv.org/abs/2005.09997v2
197 | code:
198 | 
199 | title: Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network
200 | year: 2019
201 | venue: Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
202 | task: Defect Prediction
203 | model: DGCNN
204 | dataset: MSKCFG Dataset, YANCFG Dataset
205 | pdf: https://ieeexplore.ieee.org/document/8809504
206 | code:
207 | 
208 | title: Semantic Code Clone Detection Via Event Embedding Tree and GAT Network
209 | year: 2020
210 | venue: QRS
211 | task: Clone Detection
212 | model: Transformer, GAT, CNN
213 | dataset: OJClone
214 | pdf: https://ieeexplore.ieee.org/document/9282778/
215 | code: https://github.com/lbzwoaini/CSEM.git
216 | 
217 | title: How could Neural Networks understand Programs?
218 | year: 2021
219 | venue: ICML
220 | task: Clone Detection
221 | model: Transformer
222 | dataset: OJClone
223 | pdf: http://arxiv.org/abs/2105.04297
224 | code: https://github.com/pdlan/OSCAR
225 | 
226 | title: Multi-modal attention network learning for semantic source code retrieval
227 | year: 2019
228 | venue: ASE
229 | task: Code Search
230 | model: GGNN, Tree-LSTM
231 | dataset: C dataset
232 | pdf: https://arxiv.org/abs/1909.13516v1
233 | code:
234 | 
235 | title: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks
236 | year: 2019
237 | venue: NeurIPS
238 | task: Vulnerability Detection
239 | model: GGNN, GRU, CNN
240 | dataset: Devign Dataset
241 | pdf: https://arxiv.org/abs/1909.03496v1
242 | code: https://sites.google.com/view/devign
243 | 
244 | title: Flow2Vec: value-flow-based precise code embedding
245 | year: 2020
246 | venue: Proceedings of the ACM on Programming Languages
247 | task: Code Summarization, Program Classification
248 | model: Flow2Vec
249 | dataset: C Dataset
250 | pdf: https://dl.acm.org/doi/abs/10.1145/3428301
251 | code:
252 | 
253 | title:Compiler-based graph representations for deep learning models of code
254 | year: 2020
255 | venue: Proceedings of the 29th International Conference on Compiler Construction
256 | task: Program Classification
257 | model: GGNN
258 | dataset: OpenCL Dataset
259 | pdf: https://doi.org/10.1145/3377555.3377894
260 | code: https://github.com/tud-ccc/learning-compiler-graphs
261 | 
262 | title: DeepSim: Deep Learning Code Functional Similarity
263 | year: 2018
264 | venue: ESEC/FSE
265 | task: Clone Detection
266 | model: Feed-forward neural network
267 | dataset: Google Code Jam (GCJ), BigCloneBench
268 | pdf: https://doi.org/10.1145/3236024.3236068
269 | code: https://github.com/parasol-aser/deepsim
270 | 
271 | title: CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network
272 | year: 2021
273 | venue: arxiv
274 | task: Code Summarization
275 | model: Transformer,Multi-Relational Graph Neural Network
276 | dataset: CodeSearchNet, CoCoNet
277 | pdf: https://arxiv.org/abs/2107.01933v1
278 | code:
279 | 
280 | title: Improving bug detection via context-based code representation learning and attention-based neural networks
281 | year: 2019
282 | venue: Proceedings of the ACM on Programming Languages
283 | task: Defect Prediction
284 | model: GRU, CNN, Attention mechanism
285 | dataset: Java Dataset collected in this work
286 | pdf: https://dl.acm.org/doi/abs/10.1145/3360588
287 | code:
288 | 
289 | title: Modeling and discovering vulnerabilities with code property graphs
290 | year: 2014
291 | venue: Proceedings IEEE Symposium on Security and Privacy
292 | task: Vulnerability Detection
293 | model: code property graphs
294 | dataset: Linux kernel's code collected in this work
295 | pdf: http://ieeexplore.ieee.org/document/6956589/
296 | code:
297 | 
298 | title: Retrieval-Augmented Generation for Code Summarization via Hybrid GNN
299 | year: 2021
300 | venue: ICLR
301 | task: Code Summarization
302 | model: GNN
303 | dataset: C Program Dataset
304 | pdf: https://arxiv.org/abs/2006.05405v5
305 | code: https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization
306 | 
307 | title: Probabilistic model for code with decision trees
308 | year: 2016
309 | venue: SIGPLAN
310 | task: Code Generation
311 | model: Decision tree
312 | dataset: PY150, JS150
313 | pdf: https://dl.acm.org/doi/10.1145/2983990.2984041
314 | code:
315 | 
316 | title: Open vocabulary learning on source code with a graph-structured cache
317 | year: 2019
318 | venue: ICML
319 | task: Code Generation
320 | model: MPNN,CharCNN
321 | dataset: Java repos collected in this work
322 | pdf: https://arxiv.org/abs/1810.08305v2
323 | code: https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary
324 | 


--------------------------------------------------------------------------------
/sequence_based_models.md:
--------------------------------------------------------------------------------
  1 | title: On the naturalness of software
  2 | year: 2012
  3 | venue: None
  4 | task: Code Generation
  5 | model: N-gram
  6 | dataset: 
  7 | pdf: https://dl.acm.org/doi/pdf/10.1145/2902362 
  8 | code: 
  9 | 
 10 | title: On the localness of software
 11 | year: 2014
 12 | venue: FSE/ESEC
 13 | task: Code Generation
 14 | model: N-gram
 15 | dataset: 
 16 | pdf: https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 
 17 | code: 
 18 | 
 19 | title: Phrase-Based Statistical Translation of Programming Languages
 20 | year: 2014
 21 | venue:OOPSLA 
 22 | task: Code Generation
 23 | model: N-gram
 24 | dataset: 
 25 | pdf:https://files.sri.inf.ethz.ch/website/papers/onward14.pdf  
 26 | code: 
 27 | 
 28 | title: A convolutional attention network for extreme summarization of source code
 29 | year: 2016
 30 | venue: ICML
 31 | task: Code Summarization
 32 | model: CAN
 33 | dataset: Java
 34 | pdf:  http://proceedings.mlr.press/v48/allamanis16.html
 35 | code: https://github.com/mast-group/convolutional-attention
 36 | 
 37 | title: Code completion with statistical language models
 38 | year: 2014
 39 | venue:PLDI 
 40 | task: Code Generation
 41 | model: RNN
 42 | dataset: 
 43 | pdf:https://dl.acm.org/doi/pdf/10.1145/2594291.2594321  
 44 | code:
 45 | 
 46 | title: Neural Code Comprehension: A Learnable Representation of Code Semantics
 47 | year: 2018
 48 | venue: NeurIPS
 49 | task: Code representation
 50 | model: RNN
 51 | dataset: 
 52 | pdf: https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html 
 53 | code:
 54 | 
 55 | title: A deep language model for software code
 56 | year: 2016
 57 | venue: None
 58 | task: Code Generation
 59 | model: LSTM
 60 | dataset: 
 61 | pdf: https://arxiv.org/pdf/1608.02715  
 62 | code:
 63 | 
 64 | title: Summarizing Source Code using a Neural Attention Model
 65 | year: 2016
 66 | venue: ACL
 67 | task: Code Summarization
 68 | model: LSTM
 69 | dataset: C#
 70 | pdf:  https://aclanthology.org/P16-1195.pdf
 71 | code: https://github.com/sriniiyer/codenn
 72 | 
 73 | title: Latent Attention For If-Then Program Synthesis
 74 | year: 2016
 75 | venue: NeurIPS
 76 | task: Code Generation
 77 | model: Bi-LSTM
 78 | dataset: 
 79 | pdf: https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  
 80 | code:
 81 | 
 82 | title: Abstract Syntax Networks for Code Generation and Semantic Parsing
 83 | year: 2016
 84 | venue:ACL 
 85 | task: Code Generation
 86 | model: LSTM
 87 | dataset: 
 88 | pdf: https://arxiv.org/pdf/1704.07535 
 89 | code:
 90 | 
 91 | title: CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling
 92 | year: 2020
 93 | venue: IST
 94 | task: Code Generation
 95 | model: GRU
 96 | dataset: 
 97 | pdf:  https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE
 98 | code:
 99 | 
100 | title: A transformer-based approach for source code summarization
101 | year: 2020
102 | venue: ACL
103 | task: Code Summarization
104 | model: Transformer
105 | dataset: 
106 | pdf: https://arxiv.org/abs/2005.00653 
107 | code:https://github.com/wasiahmad/NeuralCodeSum
108 | 
109 | title: CodeBERT: A Pre-Trained Model for Programming and Natural Languages
110 | year: 2020
111 | venue:EMNLP 
112 | task: Pretrain
113 | model: Transformer
114 | dataset: 
115 | pdf: https://arxiv.org/pdf/2002.08155.pdf  
116 | code: https://github.com/microsoft/CodeBERT
117 | 
118 | title: Learning and Evaluating Contextual Embedding of Source Code
119 | year: 2020
120 | venue: ICML
121 | task: Pretrain
122 | model: Transformer
123 | dataset: 
124 | pdf: https://proceedings.mlr.press/v119/kanade20a.html  
125 | code: 
126 | 
127 | title: Learning and Evaluating Contextual Embedding of Source Code
128 | year: 2020
129 | venue: FSE/ESEC
130 | task: Pretrain
131 | model: Transformer
132 | dataset: 
133 | pdf: https://dl.acm.org/doi/pdf/10.1145/3368089.3417058 
134 | code: 
135 | 
136 | title: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
137 | year: 2021
138 | venue: EMNLP
139 | task: Pretrain
140 | model: Transformer
141 | dataset: 
142 | pdf: https://arxiv.org/pdf/2109.00859
143 | code: 
144 | 
145 | title: A general path-based representation for predicting program properties
146 | year: 2018
147 | venue: PLDI
148 | task: Code Generation
149 | model: word2vec,CRF
150 | dataset: JavaScript, Java, Python, C#
151 | pdf:  https://dl.acm.org/doi/pdf/10.1145/3296979.3192412
152 | code:
153 | 
154 | title: Exploring API embedding for API usages and applications
155 | year: 2017
156 | venue: ICSE
157 | task: Code Generation
158 | model: word2vec
159 | dataset: Java, C#
160 | pdf:  https://ieeexplore.ieee.org/abstract/document/7985683
161 | code:
162 | 
163 | title: Automatically learning semantic features for defect prediction
164 | year: 2016
165 | venue: ICSE
166 | task: Safety Analysis
167 | model: DBN
168 | dataset: 
169 | pdf:  https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912
170 | code:
171 | 
172 | title: Deep Semantic Feature Learning for Software Defect Prediction
173 | year: 2020
174 | venue: TSE
175 | task: Safety Analysis
176 | model: DBN
177 | dataset: 
178 | pdf:  https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853
179 | code:
180 | 
181 | title: Neural Code Completion
182 | year: 2018
183 | venue: ICPC
184 | task: Code Generation
185 | model: LSTM
186 | dataset: JS150,PY150
187 | pdf:  https://openreview.net/pdf?id=rJbPBt9lg
188 | code:
189 | 
190 | title: Code Completion with Neural Attention and Pointer Networks
191 | year: 2018
192 | venue: IJCAI
193 | task: Code Generation
194 | model: LSTM,pointer network
195 | dataset: JS150,PY150
196 | pdf:  https://ieeexplore.ieee.org/abstract/document/7985683
197 | code: https://github.com/jack57lee/neuralCodeCompletion
198 | 
199 | title: Deep code comment generation
200 | year: 2018
201 | venue: ICPC
202 | task: Code Summarization
203 | model: LSTM
204 | dataset: 
205 | pdf:  https://ieeexplore.ieee.org/abstract/document/8973050
206 | code: https://github.com/LRNavin/AutoComments
207 | 
208 | title: Code2vec: learning distributed representations of code
209 | year: 2019
210 | venue: POPL
211 | task: Code Generation
212 | model: LSTM
213 | dataset: 10072 Java GitHub repositories
214 | pdf:  https://arxiv.org/pdf/1803.09473
215 | code: https://github.com/tech-srl/code2vec
216 | 
217 | title: Seml: A semantic lstm model for software defect prediction
218 | year: 2019
219 | venue: None
220 | task: Safety Analysis
221 | model: LSTM
222 | dataset: 
223 | pdf:  https://ieeexplore.ieee.org/abstract/document/8747001
224 | code: 
225 | 
226 | title: Modeling programs hierarchically with stack-augmented LSTM
227 | year: 2020
228 | venue: JSS
229 | task: Code Generation
230 | model: LSTM
231 | dataset: C, python
232 | pdf:  https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ
233 | code:
234 | 
235 | title: Code2seq: Generating Sequences from Structured Representations of Code
236 | year: 2019
237 | venue: ICLR
238 | task: Code Generation
239 | model: Bi-LSTM
240 | dataset: Java, C#(dataset of CodeNN)
241 | pdf:  https://arxiv.org/pdf/1808.01400
242 | code: https://github.com/tech-srl/code2seq
243 | 
244 | title: DeepCPDP: Deep Learning Based Cross-Project Defect Prediction
245 | year: 2019
246 | venue: 
247 | task: Safety Analysis
248 | model: Bi-LSTM
249 | dataset: 
250 | pdf:  https://ieeexplore.ieee.org/abstract/document/8937501/
251 | code: 
252 | 
253 | title: Pythia: AI-assisted Code Completion System
254 | year: 2019
255 | venue: SIGKDD
256 | task: Code Generation
257 | model: Bi-LSTM
258 | dataset: Python
259 | pdf:  https://dl.acm.org/doi/pdf/10.1145/3292500.3330699
260 | code: https://github.com/Microsoft/PTVS
261 | 
262 | title: A neural model for generating natural language summaries of program subroutines (ast-attendgru)
263 | year: 2019
264 | venue: ICSE
265 | task: Code Summarization
266 | model: GRU
267 | dataset: 
268 | pdf:  https://arxiv.org/pdf/1902.01954v1.pdf
269 | code: https://github.com/mcmillco/funcom
270 | 
271 | title: Deep code comment generation with hybrid lexical and syntactical information
272 | year: 2020
273 | venue: EMSE
274 | task: Code Summarization
275 | model: GRU
276 | dataset: 9714 Java projects from GitHub
277 | pdf:  https://link.springer.com/article/10.1007/s10664-019-09730-9
278 | code: https://github.com/Rick-Feng-u/Deep-code-comment-generation
279 | 
280 | title: TreeBERT: A Tree-Based Pre-Trained Model for Programming Language
281 | year: 2021
282 | venue: UAI
283 | task: Pretrain
284 | model: TreeBERT
285 | dataset:
286 | pdf: https://arxiv.org/abs/2105.12485
287 | code: https://github.com/17385/TreeBERT
288 | 
289 | title: Structural language models of code
290 | year: 2020
291 | venue: ICML
292 | task: Code Generation
293 | model: Transformer
294 | dataset: 
295 | pdf: https://proceedings.mlr.press/v119/alon20a.html 
296 | code:
297 | 
298 | title: Code prediction by Feeding Trees to Transformers
299 | year: 2021
300 | venue: ICSE
301 | task: Code Generation
302 | model: Transformer
303 | dataset: 
304 | pdf:  https://dl.acm.org/doi/pdf/10.1145/3387904.3389261
305 | 
306 | title: A self-attentional neural architecture for code completion with multi-task learning
307 | year: 2020
308 | venue: ICPC
309 | task: Code Generation
310 | model: Transformer
311 | dataset: 
312 | pdf:  https://ieeexplore.ieee.org/abstract/document/9402114
313 | code: 
314 | 
315 | title: Retrieval-based Neural Source Code Summarization
316 | year: 2020
317 | venue: ICSE
318 | task: Code Summarization
319 | model: Others
320 | dataset: 
321 | pdf:  https://ieeexplore.ieee.org/abstract/document/9284039
322 | code: 
323 | 
324 | title: Retrieval on Source Code: A Neural Code Search
325 | year: 2018
326 | venue: PLDI
327 | task: Code Search
328 | model: word embedding
329 | dataset: 
330 | pdf:  https://ieeexplore.ieee.org/abstract/document/9284039
331 | code: 
332 | 
333 | title: Deep code search
334 | year: 2018
335 | venue: ICSE
336 | task: Code Search
337 | model: RNN
338 | dataset: 
339 | pdf:  https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172
340 | code: 
341 | 
342 | title: Improving Code Search with Co-Attentive Representation Learning
343 | year: 2020
344 | venue: ICPC
345 | task: Code Search
346 | model: RNN
347 | dataset: 
348 | pdf: https://dl.acm.org/doi/pdf/10.1145/3387904.3389269
349 | code: 
350 | 
351 | title: Cclearner: A deep learning-based clone detection approach
352 | year: 2017
353 | venue: ICSME
354 | task: Clone Detection 
355 | model: DNN
356 | dataset: 
357 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426
358 | code:
359 | 
360 | title: Deep learning code fragments for code clone detection
361 | year: 2017
362 | venue: ASE
363 | task: Clone Detection 
364 | model: RNN
365 | dataset: 
366 | pdf: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1
367 | code:
368 | 
369 | title: Neural Program Repair by Jointly Learning to Localize and Repair
370 | year: 2019
371 | venue: ICLR
372 | task: Program Repair 
373 | model: LSTM
374 | dataset: DeepFix
375 | pdf: https://arxiv.org/pdf/1904.01720
376 | code: https://github.com/mdrafiqulrabin/SIVAND
377 | 
378 | title: TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer
379 | year: 2021
380 | venue: ICML
381 | task: Program Repair 
382 | model: Transformer
383 | dataset: TFix's Code Patches Data
384 | pdf: https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf
385 | code: https://github.com/eth-sri/TFix
386 | 
387 | title: Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
388 | year: 2020
389 | venue: 
390 | task: Program Classification 
391 | model: LSTM
392 | dataset: 
393 | pdf: https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf
394 | code: https://github.com/eth-sri/TFix
395 | 
396 | title: SCC: Automatic Classification of Code Snippets
397 | year: 2018
398 | venue: 
399 | task: Program Classification 
400 | model: Multinomial Naive Bayes (MNB) 
401 | dataset: 
402 | pdf: https://arxiv.org/pdf/1809.07945v1.pdf
403 | code: https://github.com/mindscan-de/FluentGenesis-Classifier


--------------------------------------------------------------------------------
/sequence_based_models/years.md:
--------------------------------------------------------------------------------
 1 | # 2021
 2 | |    | title                                                                                                     |   year | venue   | task            | model       | dataset                  | pdf                                                                | code                                           |
 3 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:----------------|:------------|:-------------------------|:-------------------------------------------------------------------|:-----------------------------------------------|
 4 | |  0 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation |   2021 | EMNLP   | Pretrain        | Transformer |                          | [📑](https://arxiv.org/pdf/2109.00859)                             |                                                |
 5 | |  1 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language                                         |   2021 | UAI     | Pretrain        | TreeBERT    |                          | [📑](https://arxiv.org/abs/2105.12485)                             | [:octocat:](https://github.com/17385/TreeBERT) |
 6 | |  2 | Code prediction by Feeding Trees to Transformers                                                          |   2021 | ICSE    | Code Generation | Transformer |                          | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261)           |                                                |
 7 | |  3 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer                                       |   2021 | ICML    | Program Repair  | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix)   |
 8 | # 2020
 9 | |    | title                                                                                   |   year | venue    | task                   | model       | dataset                        | pdf                                                                                                                                                                               | code                                                                     |
10 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:-----------------------|:------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------|
11 | |  0 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling |   2020 | IST      | Code Generation        | GRU         |                                | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) |                                                                          |
12 | |  1 | A transformer-based approach for source code summarization                              |   2020 | ACL      | Code Summarization     | Transformer |                                | [📑](https://arxiv.org/abs/2005.00653 )                                                                                                                                           | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum)                  |
13 | |  2 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages                     |   2020 | EMNLP    | Pretrain               | Transformer |                                | [📑](https://arxiv.org/pdf/2002.08155.pdf  )                                                                                                                                      | [:octocat:](https://github.com/microsoft/CodeBERT)                       |
14 | |  3 | Learning and Evaluating Contextual Embedding of Source Code                             |   2020 | ICML     | Pretrain               | Transformer |                                | [📑](https://proceedings.mlr.press/v119/kanade20a.html  )                                                                                                                         |                                                                          |
15 | |  4 | Deep Semantic Feature Learning for Software Defect Prediction                           |   2020 | TSE      | Safety Analysis        | DBN         |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853)                                                                                                            |                                                                          |
16 | |  5 | Modeling programs hierarchically with stack-augmented LSTM                              |   2020 | JSS      | Code Generation        | LSTM        | C, python                      | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ)  |                                                                          |
17 | |  6 | Deep code comment generation with hybrid lexical and syntactical information            |   2020 | EMSE     | Code Summarization     | GRU         | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9)                                                                                                                | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
18 | |  7 | Structural language models of code                                                      |   2020 | ICML     | Code Generation        | Transformer |                                | [📑](https://proceedings.mlr.press/v119/alon20a.html )                                                                                                                            |                                                                          |
19 | |  8 | A self-attentional neural architecture for code completion with multi-task learning     |   2020 | ICPC     | Code Generation        | Transformer |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9402114)                                                                                                                       |                                                                          |
20 | |  9 | Retrieval-based Neural Source Code Summarization                                        |   2020 | ICSE     | Code Summarization     | Others      |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)                                                                                                                       |                                                                          |
21 | | 10 | Improving Code Search with Co-Attentive Representation Learning                         |   2020 | ICPC     | Code Search            | RNN         |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269)                                                                                                                          |                                                                          |
22 | | 11 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation            |   2020 |          | Program Classification | LSTM        |                                | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf)                                                                                                                | [:octocat:](https://github.com/eth-sri/TFix)                             |
23 | # 2019
24 | |    | title                                                                                       |   year | venue   | task               | model   | dataset                        | pdf                                                          | code                                                  |
25 | |---:|:--------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:-------------------------------|:-------------------------------------------------------------|:------------------------------------------------------|
26 | |  0 | Code2vec: learning distributed representations of code                                      |   2019 | POPL    | Code Generation    | LSTM    | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473)                       | [:octocat:](https://github.com/tech-srl/code2vec)     |
27 | |  1 | Seml: A semantic lstm model for software defect prediction                                  |   2019 | None    | Safety Analysis    | LSTM    |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8747001)  |                                                       |
28 | |  2 | Code2seq: Generating Sequences from Structured Representations of Code                      |   2019 | ICLR    | Code Generation    | Bi-LSTM | Java, C#(dataset of CodeNN)    | [📑](https://arxiv.org/pdf/1808.01400)                       | [:octocat:](https://github.com/tech-srl/code2seq)     |
29 | |  3 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction                               |   2019 |         | Safety Analysis    | Bi-LSTM |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/) |                                                       |
30 | |  4 | Pythia: AI-assisted Code Completion System                                                  |   2019 | SIGKDD  | Code Generation    | Bi-LSTM | Python                         | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699)     | [:octocat:](https://github.com/Microsoft/PTVS)        |
31 | |  5 | A neural model for generating natural language summaries of program subroutines (ast-attendgru) |   2019 | ICSE    | Code Summarization | GRU     |                                | [📑](https://arxiv.org/pdf/1902.01954v1.pdf)                 | [:octocat:](https://github.com/mcmillco/funcom)       |
32 | |  6 | Neural Program Repair by Jointly Learning to Localize and Repair                            |   2019 | ICLR    | Program Repair     | LSTM    | DeepFix                        | [📑](https://arxiv.org/pdf/1904.01720)                       | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) |
33 | # 2018
34 | |    | title                                                                   |   year | venue   | task                   | model                         | dataset                      | pdf                                                                                                  | code                                                                |
35 | |---:|:------------------------------------------------------------------------|-------:|:--------|:-----------------------|:------------------------------|:-----------------------------|:-----------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------|
36 | |  0 | Neural Code Comprehension: A Learnable Representation of Code Semantics |   2018 | NeurIPS | Code representation    | RNN                           |                              | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html ) |                                                                     |
37 | |  1 | A general path-based representation for predicting program properties   |   2018 | PLDI    | Code Generation        | word2vec,CRF                  | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)                                             |                                                                     |
38 | |  2 | Neural Code Completion                                                  |   2018 | ICPC    | Code Generation        | LSTM                          | JS150,PY150                  | [📑](https://openreview.net/pdf?id=rJbPBt9lg)                                                        |                                                                     |
39 | |  3 | Code Completion with Neural Attention and Pointer Networks              |   2018 | IJCAI   | Code Generation        | LSTM,pointer network          | JS150,PY150                  | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                          | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion)      |
40 | |  4 | Deep code comment generation                                            |   2018 | ICPC    | Code Summarization     | LSTM                          |                              | [📑](https://ieeexplore.ieee.org/abstract/document/8973050)                                          | [:octocat:](https://github.com/LRNavin/AutoComments)                |
41 | |  5 | Retrieval on Source Code: A Neural Code Search                          |   2018 | PLDI    | Code Search            | word embedding                |                              | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)                                          |                                                                     |
42 | |  6 | Deep code search                                                        |   2018 | ICSE    | Code Search            | RNN                           |                              | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172)                               |                                                                     |
43 | |  7 | SCC: Automatic Classification of Code Snippets                          |   2018 |         | Program Classification | Multinomial Naive Bayes (MNB) |                              | [📑](https://arxiv.org/pdf/1809.07945v1.pdf)                                                         | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifier) |
44 | # 2017
45 | |    | title                                                     |   year | venue   | task            | model    | dataset   | pdf                                                                          | code   |
46 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:---------|:----------|:-----------------------------------------------------------------------------|:-------|
47 | |  0 | Exploring API embedding for API usages and applications   |   2017 | ICSE    | Code Generation | word2vec | Java, C#  | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                  |        |
48 | |  1 | Cclearner: A deep learning-based clone detection approach |   2017 | ICSME   | Clone Detection | DNN      |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426)       |        |
49 | |  2 | Deep learning code fragments for code clone detection     |   2017 | ASE     | Clone Detection | RNN      |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) |        |
50 | # 2016
51 | |    | title                                                                      |   year | venue   | task               | model   | dataset   | pdf                                                                                               | code                                                               |
52 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|
53 | |  0 | A convolutional attention network for extreme summarization of source code |   2016 | ICML    | Code Summarization | CAN     | Java      | [📑](http://proceedings.mlr.press/v48/allamanis16.html)                                           | [:octocat:](https://github.com/mast-group/convolutional-attention) |
54 | |  1 | A deep language model for software code                                    |   2016 | None    | Code Generation    | LSTM    |           | [📑](https://arxiv.org/pdf/1608.02715  )                                                          |                                                                    |
55 | |  2 | Summarizing Source Code using a Neural Attention Model                     |   2016 | ACL     | Code Summarization | LSTM    | C#        | [📑](https://aclanthology.org/P16-1195.pdf)                                                       | [:octocat:](https://github.com/sriniiyer/codenn)                   |
56 | |  3 | Latent Attention For If-Then Program Synthesis                             |   2016 | NeurIPS | Code Generation    | Bi-LSTM |           | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  ) |                                                                    |
57 | |  4 | Abstract Syntax Networks for Code Generation and Semantic Parsing          |   2016 | ACL     | Code Generation    | LSTM    |           | [📑](https://arxiv.org/pdf/1704.07535 )                                                           |                                                                    |
58 | |  5 | Automatically learning semantic features for defect prediction             |   2016 | ICSE    | Safety Analysis    | DBN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912)                            |                                                                    |
59 | # 2014
60 | |    | title                                                         |   year | venue    | task            | model   | dataset   | pdf                                                               | code   |
61 | |---:|:--------------------------------------------------------------|-------:|:---------|:----------------|:--------|:----------|:------------------------------------------------------------------|:-------|
62 | |  0 | On the localness of software                                  |   2014 | FSE/ESEC | Code Generation | N-gram  |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 )         |        |
63 | |  1 | Phrase-Based Statistical Translation of Programming Languages |   2014 | OOPSLA   | Code Generation | N-gram  |           | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf  ) |        |
64 | |  2 | Code completion with statistical language models              |   2014 | PLDI     | Code Generation | RNN     |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321  )        |        |
65 | # 2012
66 | |    | title                          |   year | venue   | task            | model   | dataset   | pdf                                               | code   |
67 | |---:|:-------------------------------|-------:|:--------|:----------------|:--------|:----------|:--------------------------------------------------|:-------|
68 | |  0 | On the naturalness of software |   2012 | None    | Code Generation | N-gram  |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 ) |        |
69 | 


--------------------------------------------------------------------------------
/sequence_based_models/tasks.md:
--------------------------------------------------------------------------------
 1 | # Program Classification 
 2 | |    | title                                                                        |   year | venue   | task                   | model                         | dataset   | pdf                                                                | code                                                                |
 3 | |---:|:-----------------------------------------------------------------------------|-------:|:--------|:-----------------------|:------------------------------|:----------|:-------------------------------------------------------------------|:--------------------------------------------------------------------|
 4 | |  0 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation |   2020 |         | Program Classification | LSTM                          |           | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix)                        |
 5 | |  1 | SCC: Automatic Classification of Code Snippets                               |   2018 |         | Program Classification | Multinomial Naive Bayes (MNB) |           | [📑](https://arxiv.org/pdf/1809.07945v1.pdf)                       | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifier) |
 6 | # Code Search
 7 | |    | title                                                           |   year | venue   | task        | model          | dataset   | pdf                                                                    | code   |
 8 | |---:|:----------------------------------------------------------------|-------:|:--------|:------------|:---------------|:----------|:-----------------------------------------------------------------------|:-------|
 9 | |  0 | Retrieval on Source Code: A Neural Code Search                  |   2018 | PLDI    | Code Search | word embedding |           | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)            |        |
10 | |  1 | Deep code search                                                |   2018 | ICSE    | Code Search | RNN            |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172) |        |
11 | |  2 | Improving Code Search with Co-Attentive Representation Learning |   2020 | ICPC    | Code Search | RNN            |           | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269)               |        |
12 | # Code Generation
13 | |    | title                                                                                   |   year | venue    | task            | model                | dataset                        | pdf                                                                                                                                                                               | code                                                           |
14 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:----------------|:---------------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------|
15 | |  0 | On the naturalness of software                                                          |   2012 | None     | Code Generation | N-gram               |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 )                                                                                                                                 |                                                                |
16 | |  1 | On the localness of software                                                            |   2014 | FSE/ESEC | Code Generation | N-gram               |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 )                                                                                                                         |                                                                |
17 | |  2 | Phrase-Based Statistical Translation of Programming Languages                           |   2014 | OOPSLA   | Code Generation | N-gram               |                                | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf  )                                                                                                                 |                                                                |
18 | |  3 | Code completion with statistical language models                                        |   2014 | PLDI     | Code Generation | RNN                  |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321  )                                                                                                                        |                                                                |
19 | |  4 | A deep language model for software code                                                 |   2016 | None     | Code Generation | LSTM                 |                                | [📑](https://arxiv.org/pdf/1608.02715  )                                                                                                                                          |                                                                |
20 | |  5 | Latent Attention For If-Then Program Synthesis                                          |   2016 | NeurIPS  | Code Generation | Bi-LSTM              |                                | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  )                                                                                 |                                                                |
21 | |  6 | Abstract Syntax Networks for Code Generation and Semantic Parsing                       |   2016 | ACL      | Code Generation | LSTM                 |                                | [📑](https://arxiv.org/pdf/1704.07535 )                                                                                                                                           |                                                                |
22 | |  7 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling |   2020 | IST      | Code Generation | GRU                  |                                | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) |                                                                |
23 | |  8 | A general path-based representation for predicting program properties                   |   2018 | PLDI     | Code Generation | word2vec,CRF         | JavaScript, Java, Python, C#   | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)                                                                                                                          |                                                                |
24 | |  9 | Exploring API embedding for API usages and applications                                 |   2017 | ICSE     | Code Generation | word2vec             | Java, C#                       | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                       |                                                                |
25 | | 10 | Neural Code Completion                                                                  |   2018 | ICPC     | Code Generation | LSTM                 | JS150,PY150                    | [📑](https://openreview.net/pdf?id=rJbPBt9lg)                                                                                                                                     |                                                                |
26 | | 11 | Code Completion with Neural Attention and Pointer Networks                              |   2018 | IJCAI    | Code Generation | LSTM,pointer network | JS150,PY150                    | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                       | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) |
27 | | 12 | Code2vec: learning distributed representations of code                                  |   2019 | POPL     | Code Generation | LSTM                 | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473)                                                                                                                                            | [:octocat:](https://github.com/tech-srl/code2vec)              |
28 | | 13 | Modeling programs hierarchically with stack-augmented LSTM                              |   2020 | JSS      | Code Generation | LSTM                 | C, python                      | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ)  |                                                                |
29 | | 14 | Code2seq: Generating Sequences from Structured Representations of Code                  |   2019 | ICLR     | Code Generation | Bi-LSTM              | Java, C#(dataset of CodeNN)    | [📑](https://arxiv.org/pdf/1808.01400)                                                                                                                                            | [:octocat:](https://github.com/tech-srl/code2seq)              |
30 | | 15 | Pythia: AI-assisted Code Completion System                                              |   2019 | SIGKDD   | Code Generation | Bi-LSTM              | Python                         | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699)                                                                                                                          | [:octocat:](https://github.com/Microsoft/PTVS)                 |
31 | | 16 | Structural language models of code                                                      |   2020 | ICML     | Code Generation | Transformer          |                                | [📑](https://proceedings.mlr.press/v119/alon20a.html )                                                                                                                            |                                                                |
32 | | 17 | Code prediction by Feeding Trees to Transformers                                        |   2021 | ICSE     | Code Generation | Transformer          |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261)                                                                                                                          |                                                                |
33 | | 18 | A self-attentional neural architecture for code completion with multi-task learning     |   2020 | ICPC     | Code Generation | Transformer          |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9402114)                                                                                                                       |                                                                |
34 | # Pretrain
35 | |    | title                                                                                                     |   year | venue   | task     | model       | dataset   | pdf                                                       | code                                               |
36 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:---------|:------------|:----------|:----------------------------------------------------------|:---------------------------------------------------|
37 | |  0 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages                                       |   2020 | EMNLP   | Pretrain | Transformer |           | [📑](https://arxiv.org/pdf/2002.08155.pdf  )              | [:octocat:](https://github.com/microsoft/CodeBERT) |
38 | |  1 | Learning and Evaluating Contextual Embedding of Source Code                                               |   2020 | ICML    | Pretrain | Transformer |           | [📑](https://proceedings.mlr.press/v119/kanade20a.html  ) |                                                    |
39 | |  2 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation |   2021 | EMNLP   | Pretrain | Transformer |           | [📑](https://arxiv.org/pdf/2109.00859)                    |                                                    |
40 | |  3 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language                                         |   2021 | UAI     | Pretrain | TreeBERT    |           | [📑](https://arxiv.org/abs/2105.12485)                    | [:octocat:](https://github.com/17385/TreeBERT)     |
41 | # Code representation
42 | |    | title                                                                   |   year | venue   | task                | model   | dataset   | pdf                                                                                                  | code   |
43 | |---:|:------------------------------------------------------------------------|-------:|:--------|:--------------------|:--------|:----------|:-----------------------------------------------------------------------------------------------------|:-------|
44 | |  0 | Neural Code Comprehension: A Learnable Representation of Code Semantics |   2018 | NeurIPS | Code representation | RNN     |           | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html ) |        |
45 | # Safety Analysis
46 | |    | title                                                          |   year | venue   | task            | model   | dataset   | pdf                                                                    | code   |
47 | |---:|:---------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------|
48 | |  0 | Automatically learning semantic features for defect prediction |   2016 | ICSE    | Safety Analysis | DBN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) |        |
49 | |  1 | Deep Semantic Feature Learning for Software Defect Prediction  |   2020 | TSE     | Safety Analysis | DBN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) |        |
50 | |  2 | Seml: A semantic lstm model for software defect prediction     |   2019 | None    | Safety Analysis | LSTM    |           | [📑](https://ieeexplore.ieee.org/abstract/document/8747001)            |        |
51 | |  3 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction  |   2019 |         | Safety Analysis | Bi-LSTM |           | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/)           |        |
52 | # Program Repair 
53 | |    | title                                                               |   year | venue   | task           | model       | dataset                  | pdf                                                                | code                                                  |
54 | |---:|:--------------------------------------------------------------------|-------:|:--------|:---------------|:------------|:-------------------------|:-------------------------------------------------------------------|:------------------------------------------------------|
55 | |  0 | Neural Program Repair by Jointly Learning to Localize and Repair    |   2019 | ICLR    | Program Repair | LSTM        | DeepFix                  | [📑](https://arxiv.org/pdf/1904.01720)                             | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) |
56 | |  1 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer |   2021 | ICML    | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix)          |
57 | # Clone Detection 
58 | |    | title                                                     |   year | venue   | task            | model   | dataset   | pdf                                                                          | code   |
59 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------------|:-------|
60 | |  0 | Cclearner: A deep learning-based clone detection approach |   2017 | ICSME   | Clone Detection | DNN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426)       |        |
61 | |  1 | Deep learning code fragments for code clone detection     |   2017 | ASE     | Clone Detection | RNN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1) |        |
62 | # Code Summarization
63 | |    | title                                                                                       |   year | venue    | task               | model       | dataset                        | pdf                                                                | code                                                                     |
64 | |---:|:--------------------------------------------------------------------------------------------|-------:|:---------|:-------------------|:------------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------|
65 | |  0 | A convolutional attention network for extreme summarization of source code                  |   2016 | ICML     | Code Summarization | CAN         | Java                           | [📑](http://proceedings.mlr.press/v48/allamanis16.html)            | [:octocat:](https://github.com/mast-group/convolutional-attention)       |
66 | |  1 | Summarizing Source Code using a Neural Attention Model                                      |   2016 | ACL      | Code Summarization | LSTM        | C#                             | [📑](https://aclanthology.org/P16-1195.pdf)                        | [:octocat:](https://github.com/sriniiyer/codenn)                         |
67 | |  2 | A transformer-based approach for source code summarization                                  |   2020 | ACL      | Code Summarization | Transformer |                                | [📑](https://arxiv.org/abs/2005.00653 )                            | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum)                  |
68 | |  3 | Deep code comment generation                                                                |   2018 | ICPC     | Code Summarization | LSTM        |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8973050)        | [:octocat:](https://github.com/LRNavin/AutoComments)                     |
69 | |  4 | A neural model for generating natural language summaries of program subroutines(astted-gru) |   2019 | ICSE     | Code Summarization | GRU         |                                | [📑](https://arxiv.org/pdf/1902.01954v1.pdf)                       | [:octocat:](https://github.com/mcmillco/funcom)                          |
70 | |  5 | Deep code comment generation with hybrid lexical and syntactical information                |   2020 | FSE/ESEC | Code Summarization | GRU         | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
71 | |  6 | Retrieval-based Neural Source Code Summarization                                            |   2020 | ICSE     | Code Summarization | Others      |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)        |                                                                          |
72 | 


--------------------------------------------------------------------------------
/graph_based_models/years.md:
--------------------------------------------------------------------------------
 1 | # 2021
 2 | |    | title                                                                                                                             |   year | venue                                                                   | task                    | model                                             | dataset                                                                                | pdf                                                                        | code                                                                                |
 3 | |---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:------------------------|:--------------------------------------------------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
 4 | |  0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks |   2021 | EMNLP                                                                   | Code Summarization      | HAConvGNN                                         | notebookcdg                                                                            | [📑](https://arxiv.org/abs/2104.01002)                                     | [:octocat:](https://github.com/xuyeliu/HAConvGNN)                                   |
 5 | |  1 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs                                                             |   2021 | AAAI                                                                    | Code Generation         | GAT                                               | JS150, PY150                                                                           | [📑](http://arxiv.org/abs/2103.09499)                                      |                                                                                     |
 6 | |  2 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees                        |   2021 | EMNLP                                                                   | Code Summarization      | RNN,attention                                     | TL-CodeSum                                                                             | [📑](http://arxiv.org/abs/2108.12987)                                      | [:octocat:](https://anonymous.4open.science/r/CAST/)                                |
 7 | |  3 | Code Clone Detection with Hierarchical Attentive Graph Embedding                                                                  |   2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection         | GCN                                               | IJDataset2.0                                                                           | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X)    |                                                                                     |
 8 | |  4 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting                                                       |   2021 | ICPC                                                                    | Code Summarization      | Tree-LSTM                                         | CodeSearchNet, Hybrid-DeepCom Dataset                                                  | [📑](https://arxiv.org/abs/2103.07845v2)                                   | [:octocat:](https://github.com/XMUDM/BASTS)                                         |
 9 | |  5 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network                                           |   2021 | ASIA CCS                                                                | Vulnerability Detection | GTN                                               | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |                                                                                     |
10 | |  6 | How could Neural Networks understand Programs?                                                                                    |   2021 | ICML                                                                    | Clone Detection         | Transformer                                       | OJClone                                                                                | [📑](http://arxiv.org/abs/2105.04297)                                      | [:octocat:](https://github.com/pdlan/OSCAR)                                         |
11 | |  7 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network                                                 |   2021 | arxiv                                                                   | Code Summarization      | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet                                                                 | [📑](https://arxiv.org/abs/2107.01933v1)                                   |                                                                                     |
12 | |  8 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN                                                              |   2021 | ICLR                                                                    | Code Summarization      | GNN                                               | C Program Dataset                                                                      | [📑](https://arxiv.org/abs/2006.05405v5)                                   | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) |
13 | # 2020
14 | |    | title                                                                              |   year | venue                                                                     | task                                       | model                    | dataset                            | pdf                                                         | code                                                          |
15 | |---:|:-----------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------------------|:-------------------------|:-----------------------------------|:------------------------------------------------------------|:--------------------------------------------------------------|
16 | |  0 | CODIT: Code Editing with Tree-Based Neural Models                                  |   2020 | IEEE Transactions on Software Engineering                                 | Program Repair                             | LSTM                     | Defects4J,Code-Change-Data         | [📑](http://arxiv.org/abs/1810.00314)                       | [:octocat:](https://git.io/JJGwU)                             |
17 | |  1 | GGF: A graph-based method for programming language syntax error correction         |   2020 | ICPC                                                                      | Program Repair                             | GGNN                     | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252)        |                                                               |
18 | |  2 | Improved code summarization via a graph neural network                             |   2020 | ICPC                                                                      | Code Summarization                         | ConvGNN                  | Java method-comment pairs          | [📑](https://arxiv.org/abs/2004.02843v2)                    |                                                               |
19 | |  3 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback               |   2020 | ICML                                                                      | Program Repair                             | GAT, LSTM                | SPoC                               | [📑](http://arxiv.org/abs/2005.10636)                       | [:octocat:](https://github.com/michiyasunaga/DrRepair)        |
20 | |  4 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing         |   2020 | NDSS                                                                      | Clone Detection                            | Text-associated DeepWalk | Coreutils, Diffutils, Findutils    | [📑](https://dx.doi.org/10.14722/ndss.2020.24311)           | [:octocat:](https://github.com/deepbindiff/DeepBinDiff)       |
21 | |  5 | Order matters: Semantic-aware neural networks for binary code similarity detection |   2020 | AAAI                                                                      | Clone Detection                            | MPNN,CNN                 | gcc dataset                        | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466) |                                                               |
22 | |  6 | Learning semantic program embeddings with graph interval neural network            |   2020 | Proceedings of the ACM on Programming Languages                           | Program Repair                             | GINN                     | PY150                              | [📑](https://arxiv.org/abs/2005.09997v2)                    |                                                               |
23 | |  7 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network             |   2020 | QRS                                                                       | Clone Detection                            | Transformer, GAT, CNN    | OJClone                            | [📑](https://ieeexplore.ieee.org/document/9282778/)         | [:octocat:](https://github.com/lbzwoaini/CSEM.git)            |
24 | |  8 | Flow2Vec: value-flow-based precise code embedding                                  |   2020 | Proceedings of the ACM on Programming Languages                           | Code Summarization, Program Classification | Flow2Vec                 | C Dataset                          | [📑](https://dl.acm.org/doi/abs/10.1145/3428301)            |                                                               |
25 | |  9 | Compiler-based graph representations for deep learning models of code              |   2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification                     | GGNN                     | OpenCL Dataset                     | [📑](https://doi.org/10.1145/3377555.3377894)               | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) |
26 | # 2019
27 | |    | title                                                                                                                |   year | venue                                                                             | task                                            | model                         | dataset                                                                     | pdf                                                                                  | code                                                                                      |
28 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------------------------------------|:------------------------------|:----------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------|
29 | |  0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree                                              |   2019 | ICSE                                                                              | Program Classification, Clone Detection         | bidirectional RNN             | OJClone,BCB                                                                 | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn)                                          |
30 | |  1 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST                                     |   2019 | ACM International Conference on Computing Frontiers                               | Clone Detection,Code Search, Code Summarization | tree-based LSTM               | OJClone,BigCloneBench                                                       | [📑](https://doi.org/10.1145/3310273.3321560)                                        | [:octocat:](https://github.com/milkfan/TBCAA)                                             |
31 | |  2 | Structured neural summarization                                                                                      |   2019 | ICLR                                                                              | Code Summarization                              | GGNN                          | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4)                                             | [:octocat:](https://github.com/CoderPat/structured-neural-summarization)                  |
32 | |  3 | Generative code modeling with graphs                                                                                 |   2019 | ICLR                                                                              | Program Repair,Code Generation                  | GRU,GGNN                      | C# dataset                                                                  | [📑](https://arxiv.org/abs/1805.08490v2)                                             | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling)                      |
33 | |  4 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network                 |   2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction                               | DGCNN                         | MSKCFG Dataset, YANCFG Dataset                                              | [📑](https://ieeexplore.ieee.org/document/8809504)                                   |                                                                                           |
34 | |  5 | Multi-modal attention network learning for semantic source code retrieval                                            |   2019 | ASE                                                                               | Code Search                                     | GGNN, Tree-LSTM               | C dataset                                                                   | [📑](https://arxiv.org/abs/1909.13516v1)                                             |                                                                                           |
35 | |  6 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks |   2019 | NIPS                                                                              | Vulnerability Detection                         | GGNN, GRU, CNN                | Devign Dataset                                                              | [📑](https://arxiv.org/abs/1909.03496v1)                                             | [:octocat:](https://sites.google.com/view/devign)                                         |
36 | |  7 | Improving bug detection via context-based code representation learning and attention-based neural networks           |   2019 | Proceedings of the ACM on Programming Languages                                   | Defect Prediction                               | GRU, CNN, Attention mechanism | Java Dataset collected in this work                                         | [📑](https://dl.acm.org/doi/abs/10.1145/3360588)                                     |                                                                                           |
37 | |  8 | Open vocabulary learning on source code with a graph-structured cache                                                |   2019 | ICML                                                                              | Code Generation                                 | MPNN,CharCNN                  | Java repos collected in this work                                           | [📑](https://arxiv.org/abs/1810.08305v2)                                             | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) |
38 | # 2018
39 | |    | title                                                                         |   year | venue    | task                              | model                       | dataset                              | pdf                                           | code                                                                         |
40 | |---:|:------------------------------------------------------------------------------|-------:|:---------|:----------------------------------|:----------------------------|:-------------------------------------|:----------------------------------------------|:-----------------------------------------------------------------------------|
41 | |  0 | Learning to represent programs with graphs                                    |   2018 | ICLR     | Defect Prediction,Code Generation | GGNN                        | iclr18-prog-graphs-dataset           | [📑](https://arxiv.org/abs/1711.00740)        | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) |
42 | |  1 | Improving automatic source code summarization via deep reinforcement learning |   2018 | ASE      | Code Summarization                | RNN,Tree-RNN                | code-comment pairs                   | [📑](https://arxiv.org/abs/1811.07234v1)      |                                                                              |
43 | |  2 | DeepSim: Deep Learning Code Functional Similarity                             |   2018 | ESEC/FSE | Clone Detection                   | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068) | [:octocat:](https://github.com/parasol-aser/deepsim)                         |
44 | # 2017
45 | |    | title                                                                                    |   year | venue                                                                     | task           | model         | dataset                                | pdf                                   | code                                                                 |
46 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:---------------|:--------------|:---------------------------------------|:--------------------------------------|:---------------------------------------------------------------------|
47 | |  0 | Neural network-based graph embedding for cross-platform binary code similarity detection |   2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) |
48 | # 2016
49 | |    | title                                            |   year | venue   | task            | model         | dataset      | pdf                                                  | code   |
50 | |---:|:-------------------------------------------------|-------:|:--------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------|
51 | |  0 | Probabilistic model for code with decision trees |   2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) |        |
52 | # 2015
53 | |    | title                                |   year | venue   | task                 | model   | dataset                                         | pdf                                   | code                                         |
54 | |---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------|
55 | |  0 | Gated graph sequence neural networks |   2015 | ICLR    | Program Verification | GGNN    | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) |
56 | # 2014
57 | |    | title                                                                                |   year | venue                                              | task                    | model                | dataset                                    | pdf                                                | code                                                     |
58 | |---:|:-------------------------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:-------------------------------------------|:---------------------------------------------------|:---------------------------------------------------------|
59 | |  0 | TBCNN: A tree-based convolutional neural network for programming language processing |   2014 | arXiv                                              | Program Classification  | TBCNN                | OJClone                                    | [📑](https://arxiv.org/abs/1409.5718v1)            | [:octocat:](https://sites.google.com/site/treebasedcnn/) |
60 | |  1 | Modeling and discovering vulnerabilities with code property graphs                   |   2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) |                                                          |
61 | 


--------------------------------------------------------------------------------
/sequence_based_models/models.md:
--------------------------------------------------------------------------------
 1 | # N-gram
 2 | |    | title                                                         |   year | venue    | task            | model   | dataset   | pdf                                                               | code   |
 3 | |---:|:--------------------------------------------------------------|-------:|:---------|:----------------|:--------|:----------|:------------------------------------------------------------------|:-------|
 4 | |  0 | On the naturalness of software                                |   2012 | None     | Code Generation | N-gram  |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 )                 |        |
 5 | |  1 | On the localness of software                                  |   2014 | FSE/ESEC | Code Generation | N-gram  |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 )         |        |
 6 | |  2 | Phrase-Based Statistical Translation of Programming Languages |   2014 | OOPSLA   | Code Generation | N-gram  |           | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf  ) |        |
 7 | # TreeBERT
 8 | |    | title                                                             |   year | venue   | task     | model    | dataset   | pdf                                    | code                                           |
 9 | |---:|:------------------------------------------------------------------|-------:|:--------|:---------|:---------|:----------|:---------------------------------------|:-----------------------------------------------|
10 | |  0 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language |   2021 | UAI     | Pretrain | TreeBERT |           | [📑](https://arxiv.org/abs/2105.12485) | [:octocat:](https://github.com/17385/TreeBERT) |
11 | # Others
12 | |    | title                                            |   year | venue   | task               | model   | dataset   | pdf                                                         | code   |
13 | |---:|:-------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:------------------------------------------------------------|:-------|
14 | |  0 | Retrieval-based Neural Source Code Summarization |   2020 | ICSE    | Code Summarization | Others  |           | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) |        |
15 | # word2vec
16 | |    | title                                                                 |   year | venue   | task            | model        | dataset                      | pdf                                                         | code   |
17 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:------------------------------------------------------------|:-------|
18 | |  0 | A general path-based representation for predicting program properties |   2018 | PLDI    | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)    |        |
19 | |  1 | Exploring API embedding for API usages and applications               |   2017 | ICSE    | Code Generation | word2vec     | Java, C#                     | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) |        |
20 | # Multinomial Naive Bayes (MNB) 
21 | | title   | year   | venue   | task   | model   | dataset   | pdf   | code   |
22 | |---------|--------|---------|--------|---------|-----------|-------|--------|
23 | # GRU
24 | |    | title                                                                                           |   year | venue    | task               | model   | dataset                        | pdf                                                                                                                                                                               | code                                                                     |
25 | |---:|:------------------------------------------------------------------------------------------------|-------:|:---------|:-------------------|:--------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------|
26 | |  0 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling         |   2020 | IST      | Code Generation    | GRU     |                                | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) |                                                                          |
27 | |  1 | A neural model for generating natural language summaries of program subroutines (ast-attendgru) |   2019 | ICSE     | Code Summarization | GRU     |                                | [📑](https://arxiv.org/pdf/1902.01954v1.pdf)                                                                                                                                      | [:octocat:](https://github.com/mcmillco/funcom)                          |
28 | |  2 | Deep code comment generation with hybrid lexical and syntactical information                    |   2020 | EMSE     | Code Summarization | GRU     | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9)                                                                                                                | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
29 | # Bi-LSTM
30 | |    | title                                                                  |   year | venue   | task            | model   | dataset                     | pdf                                                                                               | code                                              |
31 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------------------------|:--------------------------------------------------------------------------------------------------|:--------------------------------------------------|
32 | |  0 | Latent Attention For If-Then Program Synthesis                         |   2016 | NeurIPS | Code Generation | Bi-LSTM |                             | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  ) |                                                   |
33 | |  1 | Code2seq: Generating Sequences from Structured Representations of Code |   2019 | ICLR    | Code Generation | Bi-LSTM | Java, C#(dataset of CodeNN) | [📑](https://arxiv.org/pdf/1808.01400)                                                            | [:octocat:](https://github.com/tech-srl/code2seq) |
34 | |  2 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction          |   2019 |         | Safety Analysis | Bi-LSTM |                             | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/)                                      |                                                   |
35 | |  3 | Pythia: AI-assisted Code Completion System                             |   2019 | SIGKDD  | Code Generation | Bi-LSTM | Python                      | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699)                                          | [:octocat:](https://github.com/Microsoft/PTVS)    |
36 | # word embedding
37 | |    | title                                          |   year | venue   | task        | model          | dataset   | pdf                                                         | code   |
38 | |---:|:-----------------------------------------------|-------:|:--------|:------------|:---------------|:----------|:------------------------------------------------------------|:-------|
39 | |  0 | Retrieval on Source Code: A Neural Code Search |   2018 | PLDI    | Code Search | word embedding |           | [📑](https://ieeexplore.ieee.org/abstract/document/9284039) |        |
40 | # Transformer
41 | |    | title                                                                                                     |   year | venue   | task               | model       | dataset                  | pdf                                                                | code                                                    |
42 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:------------|:-------------------------|:-------------------------------------------------------------------|:--------------------------------------------------------|
43 | |  0 | A transformer-based approach for source code summarization                                                |   2020 | ACL     | Code Summarization | Transformer |                          | [📑](https://arxiv.org/abs/2005.00653 )                            | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum) |
44 | |  1 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages                                       |   2020 | EMNLP   | Pretrain           | Transformer |                          | [📑](https://arxiv.org/pdf/2002.08155.pdf  )                       | [:octocat:](https://github.com/microsoft/CodeBERT)      |
45 | |  2 | Learning and Evaluating Contextual Embedding of Source Code                                               |   2020 | ICML    | Pretrain           | Transformer |                          | [📑](https://proceedings.mlr.press/v119/kanade20a.html  )          |                                                         |
46 | |  3 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation |   2021 | EMNLP   | Pretrain           | Transformer |                          | [📑](https://arxiv.org/pdf/2109.00859)                             |                                                         |
47 | |  4 | Structural language models of code                                                                        |   2020 | ICML    | Code Generation    | Transformer |                          | [📑](https://proceedings.mlr.press/v119/alon20a.html )             |                                                         |
48 | |  5 | Code prediction by Feeding Trees to Transformers                                                          |   2021 | ICSE    | Code Generation    | Transformer |                          | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261)           |                                                         |
49 | |  6 | A self-attentional neural architecture for code completion with multi-task learning                       |   2020 | ICPC    | Code Generation    | Transformer |                          | [📑](https://ieeexplore.ieee.org/abstract/document/9402114)        |                                                         |
50 | |  7 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer                                       |   2021 | ICML    | Program Repair     | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix)            |
51 | # CRF
52 | |    | title                                                                 |   year | venue   | task            | model        | dataset                      | pdf                                                      | code   |
53 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-------|
54 | |  0 | A general path-based representation for predicting program properties |   2018 | PLDI    | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) |        |
55 | # DNN
56 | |    | title                                                     |   year | venue   | task            | model   | dataset   | pdf                                                                    | code   |
57 | |---:|:----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------|
58 | |  0 | CCLearner: A deep learning-based clone detection approach |   2017 | ICSME   | Clone Detection | DNN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426) |        |
59 | # pointer network
60 | |    | title                                                      |   year | venue   | task            | model                | dataset     | pdf                                                         | code                                                           |
61 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------|
62 | |  0 | Code Completion with Neural Attention and Pointer Networks |   2018 | IJCAI   | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) |
63 | # DBN
64 | |    | title                                                          |   year | venue   | task            | model   | dataset   | pdf                                                                    | code   |
65 | |---:|:---------------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:-----------------------------------------------------------------------|:-------|
66 | |  0 | Automatically learning semantic features for defect prediction |   2016 | ICSE    | Safety Analysis | DBN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912) |        |
67 | |  1 | Deep Semantic Feature Learning for Software Defect Prediction  |   2020 | TSE     | Safety Analysis | DBN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853) |        |
68 | # CAN
69 | |    | title                                                                      |   year | venue   | task               | model   | dataset   | pdf                                                     | code                                                               |
70 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:----------|:--------------------------------------------------------|:-------------------------------------------------------------------|
71 | |  0 | A convolutional attention network for extreme summarization of source code |   2016 | ICML    | Code Summarization | CAN     | Java      | [📑](http://proceedings.mlr.press/v48/allamanis16.html) | [:octocat:](https://github.com/mast-group/convolutional-attention) |
72 | # LSTM
73 | |    | title                                                                        |   year | venue   | task                   | model                | dataset                        | pdf                                                                                                                                                                              | code                                                           |
74 | |---:|:-----------------------------------------------------------------------------|-------:|:--------|:-----------------------|:---------------------|:-------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------|
75 | |  0 | A deep language model for software code                                      |   2016 | None    | Code Generation        | LSTM                 |                                | [📑](https://arxiv.org/pdf/1608.02715  )                                                                                                                                         |                                                                |
76 | |  1 | Summarizing Source Code using a Neural Attention Model                       |   2016 | ACL     | Code Summarization     | LSTM                 | C#                             | [📑](https://aclanthology.org/P16-1195.pdf)                                                                                                                                      | [:octocat:](https://github.com/sriniiyer/codenn)               |
77 | |  2 | Latent Attention For If-Then Program Synthesis                               |   2016 | NeurIPS | Code Generation        | Bi-LSTM              |                                | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  )                                                                                |                                                                |
78 | |  3 | Abstract Syntax Networks for Code Generation and Semantic Parsing            |   2017 | ACL     | Code Generation        | LSTM                 |                                | [📑](https://arxiv.org/pdf/1704.07535 )                                                                                                                                          |                                                                |
79 | |  4 | Neural Code Completion                                                       |   2018 | ICPC    | Code Generation        | LSTM                 | JS150,PY150                    | [📑](https://openreview.net/pdf?id=rJbPBt9lg)                                                                                                                                    |                                                                |
80 | |  5 | Code Completion with Neural Attention and Pointer Networks                   |   2018 | IJCAI   | Code Generation        | LSTM,pointer network | JS150,PY150                    | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                      | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) |
81 | |  6 | Deep code comment generation                                                 |   2018 | ICPC    | Code Summarization     | LSTM                 |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8973050)                                                                                                                      | [:octocat:](https://github.com/LRNavin/AutoComments)           |
82 | |  7 | Code2vec: learning distributed representations of code                       |   2019 | POPL    | Code Generation        | LSTM                 | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473)                                                                                                                                           | [:octocat:](https://github.com/tech-srl/code2vec)              |
83 | |  8 | Seml: A semantic LSTM model for software defect prediction                   |   2019 | None    | Safety Analysis        | LSTM                 |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8747001)                                                                                                                      |                                                                |
84 | |  9 | Modeling programs hierarchically with stack-augmented LSTM                   |   2020 | JSS     | Code Generation        | LSTM                 | C, Python                      | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) |                                                                |
85 | | 10 | Code2seq: Generating Sequences from Structured Representations of Code       |   2019 | ICLR    | Code Generation        | Bi-LSTM              | Java, C#(dataset of CodeNN)    | [📑](https://arxiv.org/pdf/1808.01400)                                                                                                                                           | [:octocat:](https://github.com/tech-srl/code2seq)              |
86 | | 11 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction                |   2019 |         | Safety Analysis        | Bi-LSTM              |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/)                                                                                                                     |                                                                |
87 | | 12 | Pythia: AI-assisted Code Completion System                                   |   2019 | SIGKDD  | Code Generation        | Bi-LSTM              | Python                         | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699)                                                                                                                         | [:octocat:](https://github.com/Microsoft/PTVS)                 |
88 | | 13 | Neural Program Repair by Jointly Learning to Localize and Repair             |   2019 | ICLR    | Program Repair         | LSTM                 | DeepFix                        | [📑](https://arxiv.org/pdf/1904.01720)                                                                                                                                           | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND)          |
89 | | 14 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation |   2020 |         | Program Classification | LSTM                 |                                | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf)                                                                                                               | [:octocat:](https://github.com/eth-sri/TFix)                   |
90 | # RNN
91 | |    | title                                                                   |   year | venue   | task                | model   | dataset   | pdf                                                                                                  | code   |
92 | |---:|:------------------------------------------------------------------------|-------:|:--------|:--------------------|:--------|:----------|:-----------------------------------------------------------------------------------------------------|:-------|
93 | |  0 | Code completion with statistical language models                        |   2014 | PLDI    | Code Generation     | RNN     |           | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321)                                             |        |
94 | |  1 | Neural Code Comprehension: A Learnable Representation of Code Semantics |   2018 | NeurIPS | Code Representation | RNN     |           | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html)  |        |
95 | |  2 | Deep code search                                                        |   2018 | ICSE    | Code Search         | RNN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172)                               |        |
96 | |  3 | Improving Code Search with Co-Attentive Representation Learning         |   2020 | ICPC    | Code Search         | RNN     |           | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269)                                             |        |
97 | |  4 | Deep learning code fragments for code clone detection                   |   2017 | ASE     | Clone Detection     | RNN     |           | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1)                         |        |
98 | 


--------------------------------------------------------------------------------
/graph_based_models/tasks.md:
--------------------------------------------------------------------------------
 1 | # Defect Prediction
 2 | |    | title                                                                                                      |   year | venue                                                                             | task                              | model                         | dataset                             | pdf                                                | code                                                                         |
 3 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:----------------------------------|:------------------------------|:------------------------------------|:---------------------------------------------------|:-----------------------------------------------------------------------------|
 4 | |  0 | Learning to represent programs with graphs                                                                 |   2018 | ICLR                                                                              | Defect Prediction,Code Generation | GGNN                          | iclr18-prog-graphs-dataset          | [📑](https://arxiv.org/abs/1711.00740)             | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) |
 5 | |  1 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network       |   2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction                 | DGCNN                         | MSKCFG Dataset, YANCFG Dataset      | [📑](https://ieeexplore.ieee.org/document/8809504) |                                                                              |
 6 | |  2 | Improving bug detection via context-based code representation learning and attention-based neural networks |   2019 | Proceedings of the ACM on Programming Languages                                   | Defect Prediction                 | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588)   |                                                                              |
 7 | # Code Search
 8 | |    | title                                                                            |   year | venue                                               | task                                            | model           | dataset               | pdf                                           | code                                          |
 9 | |---:|:---------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------|:----------------------|:----------------------------------------------|:----------------------------------------------|
10 | |  0 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST |   2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA) |
11 | |  1 | Multi-modal attention network learning for semantic source code retrieval        |   2019 | ASE                                                 | Code Search                                     | GGNN, Tree-LSTM | C dataset             | [📑](https://arxiv.org/abs/1909.13516v1)      |                                               |
12 | # Program Repair
13 | |    | title                                                                                    |   year | venue                                                                     | task                           | model         | dataset                                | pdf                                                  | code                                                                 |
14 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------|:--------------|:---------------------------------------|:-----------------------------------------------------|:---------------------------------------------------------------------|
15 | |  0 | CODIT: Code Editing with Tree-Based Neural Models                                        |   2020 | IEEE Transactions on Software Engineering                                 | Program Repair                 | LSTM          | Defects4J,Code-Change-Data             | [📑](http://arxiv.org/abs/1810.00314)                | [:octocat:](https://git.io/JJGwU)                                    |
16 | |  1 | GGF: A graph-based method for programming language syntax error correction               |   2020 | ICPC                                                                      | Program Repair                 | GGNN          | DeepFix dataset,CodeForces dataset     | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) |                                                                      |
17 | |  2 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback                     |   2020 | ICML                                                                      | Program Repair                 | GAT, LSTM     | SPoC                                   | [📑](http://arxiv.org/abs/2005.10636)                | [:octocat:](https://github.com/michiyasunaga/DrRepair)               |
18 | |  3 | Generative code modeling with graphs                                                     |   2019 | ICLR                                                                      | Program Repair,Code Generation | GRU,GGNN      | C# dataset                             | [📑](https://arxiv.org/abs/1805.08490v2)             | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling) |
19 | |  4 | Neural network-based graph embedding for cross-platform binary code similarity detection |   2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair                 | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525)                | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) |
20 | |  5 | Learning semantic program embeddings with graph interval neural network                  |   2020 | Proceedings of the ACM on Programming Languages                           | Program Repair                 | GINN          | PY150                                  | [📑](https://arxiv.org/abs/2005.09997v2)             |                                                                      |
21 | # Code Generation
22 | |    | title                                                                  |   year | venue   | task                              | model         | dataset                           | pdf                                                  | code                                                                                      |
23 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------------------------|:--------------|:----------------------------------|:-----------------------------------------------------|:------------------------------------------------------------------------------------------|
24 | |  0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs  |   2021 | AAAI    | Code Generation                   | GAT           | JS150, PY150                      | [📑](http://arxiv.org/abs/2103.09499)                |                                                                                           |
25 | |  1 | Learning to represent programs with graphs                             |   2018 | ICLR    | Defect Prediction,Code Generation | GGNN          | iclr18-prog-graphs-dataset        | [📑](https://arxiv.org/abs/1711.00740)               | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples)              |
26 | |  2 | Generative code modeling with graphs                                   |   2019 | ICLR    | Program Repair,Code Generation    | GRU,GGNN      | C# dataset                        | [📑](https://arxiv.org/abs/1805.08490v2)             | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling)                      |
27 | |  3 | Probabilistic model for code with decision trees                       |   2016 | SIGPLAN | Code Generation                   | Decision tree | PY150, JS150                      | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) |                                                                                           |
28 | |  4 | Open vocabulary learning on source code with a graph-structured caches |   2019 | ICML    | Code Generation                   | MPNN,CharCNN  | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2)             | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) |
29 | # Program Verification
30 | |    | title                                |   year | venue   | task                 | model   | dataset                                         | pdf                                   | code                                         |
31 | |---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------|
32 | |  0 | Gated graph sequence neural networks |   2015 | ICLR    | Program Verification | GGNN    | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) |
33 | # Program Classification
34 | |    | title                                                                                |   year | venue                                                                     | task                                       | model             | dataset        | pdf                                                                                  | code                                                          |
35 | |---:|:-------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-------------------------------------------|:------------------|:---------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------|
36 | |  0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree              |   2019 | ICSE                                                                      | Program Classification, Clone Detection    | bidirectional RNN | OJClone,BCB    | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn)              |
37 | |  1 | TBCNN: A tree-based convolutional neural network for programming language processing |   2014 | arXiv                                                                     | Program Classification                     | TBCNN             | OJClone        | [📑](https://arxiv.org/abs/1409.5718v1)                                              | [:octocat:](https://sites.google.com/site/treebasedcnn/)      |
38 | |  2 | Flow2Vec: value-flow-based precise code embedding                                    |   2020 | Proceedings of the ACM on Programming Languages                           | Code Summarization, Program Classification | Flow2Vec          | C Dataset      | [📑](https://dl.acm.org/doi/abs/10.1145/3428301)                                     |                                                               |
39 | |  3 | Compiler-based graph representations for deep learning models of code                |   2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification                     | GGNN              | OpenCL Dataset | [📑](https://doi.org/10.1145/3377555.3377894)                                        | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) |
40 | # Vulnerability Detection
41 | |    | title                                                                                                                |   year | venue                                              | task                    | model                | dataset                                                                                | pdf                                                                        | code                                              |
42 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:--------------------------------------------------|
43 | |  0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network                              |   2021 | ASIA CCS                                           | Vulnerability Detection | GTN                  | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |                                                   |
44 | |  1 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks |   2019 | NeurIPS                                            | Vulnerability Detection | GGNN, GRU, CNN       | Devign Dataset                                                                         | [📑](https://arxiv.org/abs/1909.03496v1)                                   | [:octocat:](https://sites.google.com/view/devign) |
45 | |  2 | Modeling and discovering vulnerabilities with code property graphs                                                   |   2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work                                             | [📑](http://ieeexplore.ieee.org/document/6956589/)                         |                                                   |
46 | # Clone Detection
47 | |    | title                                                                              |   year | venue                                                                   | task                                            | model                       | dataset                              | pdf                                                                                  | code                                                    |
48 | |---:|:-----------------------------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:------------------------------------------------|:----------------------------|:-------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------|
49 | |  0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree            |   2019 | ICSE                                                                    | Program Classification, Clone Detection         | bidirectional RNN           | OJClone,BCB                          | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn)        |
50 | |  1 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST   |   2019 | ACM International Conference on Computing Frontiers                     | Clone Detection,Code Search, Code Summarization | tree-based LSTM             | OJClone,BigCloneBench                | [📑](https://doi.org/10.1145/3310273.3321560)                                        | [:octocat:](https://github.com/milkfan/TBCAA)           |
51 | |  2 | Code Clone Detection with Hierarchical Attentive Graph Embedding                   |   2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection                                 | GCN                         | IJDataset2.0                         | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X)              |                                                         |
52 | |  3 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing         |   2020 | NDSS                                                                    | Clone Detection                                 | Text-associated DeepWalk    | Coreutils, Diffutils, Findutils      | [📑](https://dx.doi.org/10.14722/ndss.2020.24311)                                    | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) |
53 | |  4 | Order matters: Semantic-aware neural networks for binary code similarity detection |   2020 | AAAI                                                                    | Clone Detection                                 | MPNN,CNN                    | gcc dataset                          | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466)                          |                                                         |
54 | |  5 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network             |   2020 | QRS                                                                     | Clone Detection                                 | Transformer, GAT, CNN       | OJClone                              | [📑](https://ieeexplore.ieee.org/document/9282778/)                                  | [:octocat:](https://github.com/lbzwoaini/CSEM.git)      |
55 | |  6 | How could Neural Networks understand Programs?                                     |   2021 | ICML                                                                    | Clone Detection                                 | Transformer                 | OJClone                              | [📑](http://arxiv.org/abs/2105.04297)                                                | [:octocat:](https://github.com/pdlan/OSCAR)             |
56 | |  7 | DeepSim: Deep Learning Code Functional Similarity                                  |   2018 | ESEC/FSE                                                                | Clone Detection                                 | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068)                                        | [:octocat:](https://github.com/parasol-aser/deepsim)    |
57 | # Code Summarization
58 | |    | title                                                                                                                             |   year | venue                                               | task                                            | model                                             | dataset                                                                     | pdf                                              | code                                                                                |
59 | |---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:--------------------------------------------------|:----------------------------------------------------------------------------|:-------------------------------------------------|:------------------------------------------------------------------------------------|
60 | |  0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks |   2021 | EMNLP                                               | Code Summarization                              | HAConvGNN                                         | notebookcdg                                                                 | [📑](https://arxiv.org/abs/2104.01002)           | [:octocat:](https://github.com/xuyeliu/HAConvGNN)                                   |
61 | |  1 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees                        |   2021 | EMNLP                                               | Code Summarization                              | RNN,attention                                     | TL-CodeSum                                                                  | [📑](http://arxiv.org/abs/2108.12987)            | [:octocat:](https://anonymous.4open.science/r/CAST/)                                |
62 | |  2 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST                                                  |   2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM                                   | OJClone,BigCloneBench                                                       | [📑](https://doi.org/10.1145/3310273.3321560)    | [:octocat:](https://github.com/milkfan/TBCAA)                                       |
63 | |  3 | Improving automatic source code summarization via deep reinforcement learning                                                     |   2018 | ASE                                                 | Code Summarization                              | RNN,Tree-RNN                                      | code-comment pairs                                                          | [📑](https://arxiv.org/abs/1811.07234v1)         |                                                                                     |
64 | |  4 | Structured neural summarization                                                                                                   |   2019 | ICLR                                                | Code Summarization                              | GGNN                                              | C# dataset,Java method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4)         | [:octocat:](https://github.com/CoderPat/structured-neural-summarization)            |
65 | |  5 | Improved code summarization via a graph neural network                                                                            |   2020 | ICPC                                                | Code Summarization                              | ConvGNN                                           | Java method-comment pairs                                                   | [📑](https://arxiv.org/abs/2004.02843v2)         |                                                                                     |
66 | |  6 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting                                                       |   2021 | ICPC                                                | Code Summarization                              | Tree-LSTM                                         | CodeSearchNet, Hybrid-DeepCom Dataset                                       | [📑](https://arxiv.org/abs/2103.07845v2)         | [:octocat:](https://github.com/XMUDM/BASTS)                                         |
67 | |  7 | Flow2Vec: value-flow-based precise code embedding                                                                                 |   2020 | Proceedings of the ACM on Programming Languages     | Code Summarization, Program Classification      | Flow2Vec                                          | C Dataset                                                                   | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) |                                                                                     |
68 | |  8 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network                                                 |   2021 | arXiv                                               | Code Summarization                              | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet                                                      | [📑](https://arxiv.org/abs/2107.01933v1)         |                                                                                     |
69 | |  9 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN                                                              |   2021 | ICLR                                                | Code Summarization                              | GNN                                               | C Program Dataset                                                           | [📑](https://arxiv.org/abs/2006.05405v5)         | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) |
70 | 


--------------------------------------------------------------------------------
/sequence_based_models/datasets.md:
--------------------------------------------------------------------------------
  1 | # Java
  2 | |    | title                                                                        |   year | venue    | task               | model        | dataset                        | pdf                                                                | code                                                                     |
  3 | |---:|:-----------------------------------------------------------------------------|-------:|:---------|:-------------------|:-------------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------|
  4 | |  0 | A convolutional attention network for extreme summarization of source code   |   2016 | ICML     | Code Summarization | CAN          | Java                           | [📑](http://proceedings.mlr.press/v48/allamanis16.html)            | [:octocat:](https://github.com/mast-group/convolutional-attention)       |
  5 | |  1 | A general path-based representation for predicting program properties        |   2018 | PLDI     | Code Generation    | word2vec,CRF | JavaScript, Java, Python, C#   | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)           |                                                                          |
  6 | |  2 | Exploring API embedding for API usages and applications                      |   2017 | ICSE     | Code Generation    | word2vec     | Java, C#                       | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)        |                                                                          |
  7 | |  3 | Code2vec: learning distributed representations of code                       |   2019 | POPL     | Code Generation    | LSTM         | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473)                             | [:octocat:](https://github.com/tech-srl/code2vec)                        |
  8 | |  4 | Code2seq: Generating Sequences from Structured Representations of Code       |   2019 | ICLR     | Code Generation    | Bi-LSTM      | Java, C# (dataset of CodeNN)   | [📑](https://arxiv.org/pdf/1808.01400)                             | [:octocat:](https://github.com/tech-srl/code2seq)                        |
  9 | |  5 | Deep code comment generation with hybrid lexical and syntactical information |   2020 | EMSE     | Code Summarization | GRU          | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
 10 | # DeepFix
 11 | |    | title                                                            |   year | venue   | task           | model   | dataset   | pdf                                    | code                                                  |
 12 | |---:|:-----------------------------------------------------------------|-------:|:--------|:---------------|:--------|:----------|:---------------------------------------|:------------------------------------------------------|
 13 | |  0 | Neural Program Repair by Jointly Learning to Localize and Repair |   2019 | ICLR    | Program Repair | LSTM    | DeepFix   | [📑](https://arxiv.org/pdf/1904.01720) | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND) |
 14 | # TFix's Code Patches Data
 15 | |    | title                                                               |   year | venue   | task           | model       | dataset                  | pdf                                                                | code                                         |
 16 | |---:|:--------------------------------------------------------------------|-------:|:--------|:---------------|:------------|:-------------------------|:-------------------------------------------------------------------|:---------------------------------------------|
 17 | |  0 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer |   2021 | ICML    | Program Repair | Transformer | TFix's Code Patches Data | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf) | [:octocat:](https://github.com/eth-sri/TFix) |
 18 | # C
 19 | |    | title                                                                  |   year | venue   | task               | model        | dataset                      | pdf                                                                                                                                                                              | code                                              |
 20 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-----------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------|
 21 | |  0 | Summarizing Source Code using a Neural Attention Model                 |   2016 | ACL     | Code Summarization | LSTM         | C#                           | [📑](https://aclanthology.org/P16-1195.pdf)                                                                                                                                      | [:octocat:](https://github.com/sriniiyer/codenn)  |
 22 | |  1 | A general path-based representation for predicting program properties  |   2018 | PLDI    | Code Generation    | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)                                                                                                                         |                                                   |
 23 | |  2 | Exploring API embedding for API usages and applications                |   2017 | ICSE    | Code Generation    | word2vec     | Java, C#                     | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                      |                                                   |
 24 | |  3 | Modeling programs hierarchically with stack-augmented LSTM             |   2020 | JSS     | Code Generation    | LSTM         | C, python                    | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) |                                                   |
 25 | |  4 | Code2seq: Generating Sequences from Structured Representations of Code |   2019 | ICLR    | Code Generation    | Bi-LSTM      | Java, C#(dataset of CodeNN)  | [📑](https://arxiv.org/pdf/1808.01400)                                                                                                                                           | [:octocat:](https://github.com/tech-srl/code2seq) |
 26 | |  5 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer    |   2021 | ICML    | Program Repair     | Transformer  | TFix's Code Patches Data     | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf)                                                                                                               | [:octocat:](https://github.com/eth-sri/TFix)      |
 27 | # 9714 Java projects from GitHub
 28 | |    | title                                                                        |   year | venue    | task               | model   | dataset                        | pdf                                                                | code                                                                     |
 29 | |---:|:-----------------------------------------------------------------------------|-------:|:---------|:-------------------|:--------|:-------------------------------|:-------------------------------------------------------------------|:-------------------------------------------------------------------------|
 30 | |  0 | Deep code comment generation with hybrid lexical and syntactical information |   2020 | EMSE     | Code Summarization | GRU     | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9) | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
 31 | # Python
 32 | |    | title                                                                 |   year | venue   | task            | model        | dataset                      | pdf                                                      | code                                           |
 33 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-----------------------------------------------|
 34 | |  0 | A general path-based representation for predicting program properties |   2018 | PLDI    | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) |                                                |
 35 | |  1 | Pythia: AI-assisted Code Completion System                            |   2019 | SIGKDD  | Code Generation | Bi-LSTM      | Python                       | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699) | [:octocat:](https://github.com/Microsoft/PTVS) |
 36 | # JavaScript
 37 | |    | title                                                                 |   year | venue   | task            | model        | dataset                      | pdf                                                      | code   |
 38 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:-----------------------------|:---------------------------------------------------------|:-------|
 39 | |  0 | A general path-based representation for predicting program properties |   2018 | PLDI    | Code Generation | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412) |        |
 40 | # JS150
 41 | |    | title                                                      |   year | venue   | task            | model                | dataset     | pdf                                                         | code                                                           |
 42 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------|
 43 | |  0 | Neural Code Completion                                     |   2018 | ICPC    | Code Generation | LSTM                 | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg)               |                                                                |
 44 | |  1 | Code Completion with Neural Attention and Pointer Networks |   2018 | IJCAI   | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) |
 45 | # python
 46 | |    | title                                                      |   year | venue   | task            | model   | dataset   | pdf                                                                                                                                                                              | code   |
 47 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:--------|:----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------|
 48 | |  0 | Modeling programs hierarchically with stack-augmented LSTM |   2020 | JSS     | Code Generation | LSTM    | C, python | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ) |        |
 49 | # Uncategorized
 50 | |    | title                                                                                                     |   year | venue    | task                   | model                         | dataset                        | pdf                                                                                                                                                                               | code                                                                     |
 51 | |---:|:----------------------------------------------------------------------------------------------------------|-------:|:---------|:-----------------------|:------------------------------|:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------|
 52 | |  0 | On the naturalness of software                                                                            |   2012 | None     | Code Generation        | N-gram                        |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2902362 )                                                                                                                                 |                                                                          |
 53 | |  1 | On the localness of software                                                                              |   2014 | FSE/ESEC | Code Generation        | N-gram                        |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2635868.2635875 )                                                                                                                         |                                                                          |
 54 | |  2 | Phrase-Based Statistical Translation of Programming Languages                                             |   2014 | OOPSLA   | Code Generation        | N-gram                        |                                | [📑](https://files.sri.inf.ethz.ch/website/papers/onward14.pdf  )                                                                                                                 |                                                                          |
 55 | |  3 | A convolutional attention network for extreme summarization of source code                                |   2016 | ICML     | Code Summarization     | CAN                           | Java                           | [📑](http://proceedings.mlr.press/v48/allamanis16.html)                                                                                                                           | [:octocat:](https://github.com/mast-group/convolutional-attention)       |
 56 | |  4 | Code completion with statistical language models                                                          |   2014 | PLDI     | Code Generation        | RNN                           |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/2594291.2594321  )                                                                                                                        |                                                                          |
 57 | |  5 | Neural Code Comprehension: A Learnable Representation of Code Semantics                                   |   2018 | NeurIPS  | Code Representation    | RNN                           |                                | [📑](https://proceedings.neurips.cc/paper/2018/hash/17c3433fecc21b57000debdf7ad5c930-Abstract.html )                                                                              |                                                                          |
 58 | |  6 | A deep language model for software code                                                                   |   2016 | None     | Code Generation        | LSTM                          |                                | [📑](https://arxiv.org/pdf/1608.02715  )                                                                                                                                          |                                                                          |
 59 | |  7 | Summarizing Source Code using a Neural Attention Model                                                    |   2016 | ACL      | Code Summarization     | LSTM                          | C#                             | [📑](https://aclanthology.org/P16-1195.pdf)                                                                                                                                       | [:octocat:](https://github.com/sriniiyer/codenn)                         |
 60 | |  8 | Latent Attention For If-Then Program Synthesis                                                            |   2016 | NeurIPS  | Code Generation        | Bi-LSTM                       |                                | [📑](https://proceedings.neurips.cc/paper/2016/file/716e1b8c6cd17b771da77391355749f3-Paper.pdf  )                                                                                 |                                                                          |
 61 | |  9 | Abstract Syntax Networks for Code Generation and Semantic Parsing                                         |   2017 | ACL      | Code Generation        | LSTM                          |                                | [📑](https://arxiv.org/pdf/1704.07535 )                                                                                                                                           |                                                                          |
 62 | | 10 | CodeGRU: Context-aware deep learning with gated recurrent unit for source code modeling                   |   2020 | IST      | Code Generation        | GRU                           |                                | [📑](https://www.sciencedirect.com/science/article/pii/S0950584920300616?casa_token=mKr3XC1pMD4AAAAA:AiVTPP7wnxInR_g-PFI5Y_XXlk-KpFlnK8DtKoNULlLamBJlMNfDgtplzgYSgiYyCx0qstFjbZE) |                                                                          |
 63 | | 11 | A transformer-based approach for source code summarization                                                |   2020 | ACL      | Code Summarization     | Transformer                   |                                | [📑](https://arxiv.org/abs/2005.00653 )                                                                                                                                           | [:octocat:](https://github.com/wasiahmad/NeuralCodeSum)                  |
 64 | | 12 | CodeBERT: A Pre-Trained Model for Programming and Natural Languages                                       |   2020 | EMNLP    | Pretrain               | Transformer                   |                                | [📑](https://arxiv.org/pdf/2002.08155.pdf  )                                                                                                                                      | [:octocat:](https://github.com/microsoft/CodeBERT)                       |
 65 | | 13 | Learning and Evaluating Contextual Embedding of Source Code                                               |   2020 | ICML     | Pretrain               | Transformer                   |                                | [📑](https://proceedings.mlr.press/v119/kanade20a.html  )                                                                                                                         |                                                                          |
 66 | | 14 | CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation |   2021 | EMNLP    | Pretrain               | Transformer                   |                                | [📑](https://arxiv.org/pdf/2109.00859)                                                                                                                                            |                                                                          |
 67 | | 15 | A general path-based representation for predicting program properties                                     |   2018 | PLDI     | Code Generation        | word2vec,CRF                  | JavaScript, Java, Python, C#   | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)                                                                                                                          |                                                                          |
 68 | | 16 | Exploring API embedding for API usages and applications                                                   |   2017 | ICSE     | Code Generation        | word2vec                      | Java, C#                       | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                       |                                                                          |
 69 | | 17 | Automatically learning semantic features for defect prediction                                            |   2016 | ICSE     | Safety Analysis        | DBN                           |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7886912)                                                                                                            |                                                                          |
 70 | | 18 | Deep Semantic Feature Learning for Software Defect Prediction                                             |   2020 | TSE      | Safety Analysis        | DBN                           |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8502853)                                                                                                            |                                                                          |
 71 | | 19 | Neural Code Completion                                                                                    |   2018 | ICPC     | Code Generation        | LSTM                          | JS150,PY150                    | [📑](https://openreview.net/pdf?id=rJbPBt9lg)                                                                                                                                     |                                                                          |
 72 | | 20 | Code Completion with Neural Attention and Pointer Networks                                                |   2018 | IJCAI    | Code Generation        | LSTM,pointer network          | JS150,PY150                    | [📑](https://ieeexplore.ieee.org/abstract/document/7985683)                                                                                                                       | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion)           |
 73 | | 21 | Deep code comment generation                                                                              |   2018 | ICPC     | Code Summarization     | LSTM                          |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8973050)                                                                                                                       | [:octocat:](https://github.com/LRNavin/AutoComments)                     |
 74 | | 22 | Code2vec: learning distributed representations of code                                                    |   2019 | POPL     | Code Generation        | LSTM                          | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473)                                                                                                                                            | [:octocat:](https://github.com/tech-srl/code2vec)                        |
 75 | | 23 | Seml: A semantic lstm model for software defect prediction                                                |   2019 | None     | Safety Analysis        | LSTM                          |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8747001)                                                                                                                       |                                                                          |
 76 | | 24 | Modeling programs hierarchically with stack-augmented LSTM                                                |   2020 | JSS      | Code Generation        | LSTM                          | C, python                      | [📑](https://www.sciencedirect.com/science/article/pii/S0164121220300297?casa_token=B2mvgbpiwFUAAAAA:kpOAhKMiSEnvJPN0as8qH-_8EMDK-pF5bu_e8TT6_4c6Kae5gMhvi-00_nzSC3Y4VHNzoAFzqQ)  |                                                                          |
 77 | | 25 | Code2seq: Generating Sequences from Structured Representations of Code                                    |   2019 | ICLR     | Code Generation        | Bi-LSTM                       | Java, C#(dataset of CodeNN)    | [📑](https://arxiv.org/pdf/1808.01400)                                                                                                                                            | [:octocat:](https://github.com/tech-srl/code2seq)                        |
 78 | | 26 | DeepCPDP: Deep Learning Based Cross-Project Defect Prediction                                             |   2019 |          | Safety Analysis        | Bi-LSTM                       |                                | [📑](https://ieeexplore.ieee.org/abstract/document/8937501/)                                                                                                                      |                                                                          |
 79 | | 27 | Pythia: AI-assisted Code Completion System                                                                |   2019 | SIGKDD   | Code Generation        | Bi-LSTM                       | Python                         | [📑](https://dl.acm.org/doi/pdf/10.1145/3292500.3330699)                                                                                                                          | [:octocat:](https://github.com/Microsoft/PTVS)                           |
 80 | | 28 | A neural model for generating natural language summaries of program subroutines (ast-attendgru)           |   2019 | ICSE     | Code Summarization     | GRU                           |                                | [📑](https://arxiv.org/pdf/1902.01954v1.pdf)                                                                                                                                      | [:octocat:](https://github.com/mcmillco/funcom)                          |
 81 | | 29 | Deep code comment generation with hybrid lexical and syntactical information                              |   2020 | EMSE     | Code Summarization     | GRU                           | 9714 Java projects from GitHub | [📑](https://link.springer.com/article/10.1007/s10664-019-09730-9)                                                                                                                | [:octocat:](https://github.com/Rick-Feng-u/Deep-code-comment-generation) |
 82 | | 30 | TreeBERT: A Tree-Based Pre-Trained Model for Programming Language                                         |   2021 | UAI      | Pretrain               | TreeBERT                      |                                | [📑](https://arxiv.org/abs/2105.12485)                                                                                                                                            | [:octocat:](https://github.com/17385/TreeBERT)                           |
 83 | | 31 | Structural language models of code                                                                        |   2020 | ICML     | Code Generation        | Transformer                   |                                | [📑](https://proceedings.mlr.press/v119/alon20a.html )                                                                                                                            |                                                                          |
 84 | | 32 | Code prediction by Feeding Trees to Transformers                                                          |   2021 | ICSE     | Code Generation        | Transformer                   |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389261)                                                                                                                          |                                                                          |
 85 | | 33 | A self-attentional neural architecture for code completion with multi-task learning                       |   2020 | ICPC     | Code Generation        | Transformer                   |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9402114)                                                                                                                       |                                                                          |
 86 | | 34 | Retrieval-based Neural Source Code Summarization                                                          |   2020 | ICSE     | Code Summarization     | Others                        |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)                                                                                                                       |                                                                          |
 87 | | 35 | Retrieval on Source Code: A Neural Code Search                                                            |   2018 | PLDI     | Code Search            | word embedding                |                                | [📑](https://ieeexplore.ieee.org/abstract/document/9284039)                                                                                                                       |                                                                          |
 88 | | 36 | Deep code search                                                                                          |   2018 | ICSE     | Code Search            | RNN                           |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8453172)                                                                                                            |                                                                          |
 89 | | 37 | Improving Code Search with Co-Attentive Representation Learning                                           |   2020 | ICPC     | Code Search            | RNN                           |                                | [📑](https://dl.acm.org/doi/pdf/10.1145/3387904.3389269)                                                                                                                          |                                                                          |
 90 | | 38 | Cclearner: A deep learning-based clone detection approach                                                 |   2017 | ICSME    | Clone Detection        | DNN                           |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8094426)                                                                                                            |                                                                          |
 91 | | 39 | Deep learning code fragments for code clone detection                                                     |   2017 | ASE      | Clone Detection        | RNN                           |                                | [📑](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7582748&tag=1)                                                                                                      |                                                                          |
 92 | | 40 | Neural Program Repair by Jointly Learning to Localize and Repair                                          |   2019 | ICLR     | Program Repair         | LSTM                          | DeepFix                        | [📑](https://arxiv.org/pdf/1904.01720)                                                                                                                                            | [:octocat:](https://github.com/mdrafiqulrabin/SIVAND)                    |
 93 | | 41 | TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer                                       |   2021 | ICML     | Program Repair         | Transformer                   | TFix's Code Patches Data       | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf)                                                                                                                | [:octocat:](https://github.com/eth-sri/TFix)                             |
 94 | | 42 | Embedding Java Classes with code2vec: Improvements from Variable Obfuscation                              |   2020 |          | Program Classification | LSTM                          |                                | [📑](https://files.sri.inf.ethz.ch/website/papers/icml21-tfix.pdf)                                                                                                                | [:octocat:](https://github.com/eth-sri/TFix)                             |
 95 | | 43 | SCC: Automatic Classification of Code Snippets                                                            |   2018 |          | Program Classification | Multinomial Naive Bayes (MNB) |                                | [📑](https://arxiv.org/pdf/1809.07945v1.pdf)                                                                                                                                      | [:octocat:](https://github.com/mindscan-de/FluentGenesis-Classifie)      |
 96 | # PY150
 97 | |    | title                                                      |   year | venue   | task            | model                | dataset     | pdf                                                         | code                                                           |
 98 | |---:|:-----------------------------------------------------------|-------:|:--------|:----------------|:---------------------|:------------|:------------------------------------------------------------|:---------------------------------------------------------------|
 99 | |  0 | Neural Code Completion                                     |   2018 | ICPC    | Code Generation | LSTM                 | JS150,PY150 | [📑](https://openreview.net/pdf?id=rJbPBt9lg)               |                                                                |
100 | |  1 | Code Completion with Neural Attention and Pointer Networks |   2018 | IJCAI   | Code Generation | LSTM,pointer network | JS150,PY150 | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) | [:octocat:](https://github.com/jack57lee/neuralCodeCompletion) |
101 | # C#
102 | |    | title                                                                  |   year | venue   | task               | model        | dataset                      | pdf                                                         | code                                              |
103 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-----------------------------|:------------------------------------------------------------|:--------------------------------------------------|
104 | |  0 | Summarizing Source Code using a Neural Attention Model                 |   2016 | ACL     | Code Summarization | LSTM         | C#                           | [📑](https://aclanthology.org/P16-1195.pdf)                 | [:octocat:](https://github.com/sriniiyer/codenn)  |
105 | |  1 | A general path-based representation for predicting program properties  |   2018 | PLDI    | Code Generation    | word2vec,CRF | JavaScript, Java, Python, C# | [📑](https://dl.acm.org/doi/pdf/10.1145/3296979.3192412)    |                                                   |
106 | |  2 | Exploring API embedding for API usages and applications                |   2017 | ICSE    | Code Generation    | word2vec     | Java, C#                     | [📑](https://ieeexplore.ieee.org/abstract/document/7985683) |                                                   |
107 | |  3 | Code2seq: Generating Sequences from Structured Representations of Code |   2019 | ICLR    | Code Generation    | Bi-LSTM      | Java, C#(dataset of CodeNN)  | [📑](https://arxiv.org/pdf/1808.01400)                      | [:octocat:](https://github.com/tech-srl/code2seq) |
108 | # 10072 Java GitHub repositories
109 | |    | title                                                  |   year | venue   | task            | model   | dataset                        | pdf                                    | code                                              |
110 | |---:|:-------------------------------------------------------|-------:|:--------|:----------------|:--------|:-------------------------------|:---------------------------------------|:--------------------------------------------------|
111 | |  0 | Code2vec: learning distributed representations of code |   2019 | POPL    | Code Generation | LSTM    | 10072 Java GitHub repositories | [📑](https://arxiv.org/pdf/1803.09473) | [:octocat:](https://github.com/tech-srl/code2vec) |
112 | # C#(dataset of CodeNN)
113 | | title   | year   | venue   | task   | model   | dataset   | pdf   | code   |
114 | |---------|--------|---------|--------|---------|-----------|-------|--------|
115 | 


--------------------------------------------------------------------------------
/graph_based_models/datasets.md:
--------------------------------------------------------------------------------
  1 | # Java repos collected in this work
  2 | |    | title                                                                  |   year | venue   | task            | model        | dataset                           | pdf                                      | code                                                                                      |
  3 | |---:|:-----------------------------------------------------------------------|-------:|:--------|:----------------|:-------------|:----------------------------------|:-----------------------------------------|:------------------------------------------------------------------------------------------|
  4 | |  0 | Open vocabulary learning on source code with a graph-structured cache  |   2019 | ICML    | Code Generation | MPNN,CharCNN | Java repos collected in this work | [📑](https://arxiv.org/abs/1810.08305v2) | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) |
  5 | # Code-Change-Data
  6 | |    | title                                             |   year | venue                                     | task           | model   | dataset                    | pdf                                   | code                              |
  7 | |---:|:--------------------------------------------------|-------:|:------------------------------------------|:---------------|:--------|:---------------------------|:--------------------------------------|:----------------------------------|
  8 | |  0 | CODIT: Code Editing with Tree-Based Neural Models |   2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM    | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) |
  9 | # Hybrid-DeepCom Dataset
 10 | |    | title                                                                       |   year | venue   | task               | model     | dataset                               | pdf                                      | code                                        |
 11 | |---:|:----------------------------------------------------------------------------|-------:|:--------|:-------------------|:----------|:--------------------------------------|:-----------------------------------------|:--------------------------------------------|
 12 | |  0 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting |   2021 | ICPC    | Code Summarization | Tree-LSTM | CodeSearchNet, Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) |
 13 | # JAVA method naming datasets
 14 | |    | title                           |   year | venue   | task               | model   | dataset                                                                     | pdf                                      | code                                                                     |
 15 | |---:|:--------------------------------|-------:|:--------|:-------------------|:--------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------|
 16 | |  0 | Structured neural summarization |   2019 | ICLR    | Code Summarization | GGNN    | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) |
 17 | # ARM binary dataset
 18 | |    | title                                                                                   |   year | venue    | task                    | model   | dataset                                                                                | pdf                                                                        | code   |
 19 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------|
 20 | |  0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network |   2021 | ASIA CCS | Vulnerability Detection | GTN     | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |        |
 21 | # Genius Dataset
 22 | |    | title                                                                                    |   year | venue                                                                     | task           | model         | dataset                                | pdf                                   | code                                                                 |
 23 | |---:|:-----------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:---------------|:--------------|:---------------------------------------|:--------------------------------------|:---------------------------------------------------------------------|
 24 | |  0 | Neural network-based graph embedding for cross-platform binary code similarity detection |   2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair | Structure2vec | collected in this work, Genius Dataset | [📑](http://arxiv.org/abs/1708.06525) | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity) |
 25 | # Google Code Jam (GCJ)
 26 | | title   | year   | venue   | task   | model   | dataset   | pdf   | code   |
 27 | |---------|--------|---------|--------|---------|-----------|-------|--------|
 28 | # notebookcdg
 29 | |    | title                                                                                                                             |   year | venue   | task               | model     | dataset     | pdf                                    | code                                              |
 30 | |---:|:----------------------------------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:----------|:------------|:---------------------------------------|:--------------------------------------------------|
 31 | |  0 | HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks |   2021 | EMNLP   | Code Summarization | HAConvGNN | notebookcdg | [📑](https://arxiv.org/abs/2104.01002) | [:octocat:](https://github.com/xuyeliu/HAConvGNN) |
 32 | # program variables dataset produced in this work
 33 | |    | title                                |   year | venue   | task                 | model   | dataset                                         | pdf                                   | code                                         |
 34 | |---:|:-------------------------------------|-------:|:--------|:---------------------|:--------|:------------------------------------------------|:--------------------------------------|:---------------------------------------------|
 35 | |  0 | Gated graph sequence neural networks |   2015 | ICLR    | Program Verification | GGNN    | program variables dataset produced in this work | [📑](http://arxiv.org/abs/1511.05493) | [:octocat:](https://github.com/yujiali/ggnn) |
 36 | # gcc dataset
 37 | |    | title                                                                              |   year | venue   | task            | model    | dataset     | pdf                                                         | code   |
 38 | |---:|:-----------------------------------------------------------------------------------|-------:|:--------|:----------------|:---------|:------------|:------------------------------------------------------------|:-------|
 39 | |  0 | Order matters: Semantic-aware neural networks for binary code similarity detection |   2020 | AAAI    | Clone Detection | MPNN,CNN | gcc dataset | [📑](https://ojs.aaai.org/index.php/AAAI/article/view/5466) |        |
 40 | # Python method documentation dataset
 41 | |    | title                           |   year | venue   | task               | model   | dataset                                                                     | pdf                                      | code                                                                     |
 42 | |---:|:--------------------------------|-------:|:--------|:-------------------|:--------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------|
 43 | |  0 | Structured neural summarization |   2019 | ICLR    | Code Summarization | GGNN    | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) |
 44 | # JS150
 45 | |    | title                                                                 |   year | venue   | task            | model         | dataset      | pdf                                                  | code   |
 46 | |---:|:----------------------------------------------------------------------|-------:|:--------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------|
 47 | |  0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs |   2021 | AAAI    | Code Generation | GAT           | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499)                |        |
 48 | |  1 | Probabilistic model for code with decision trees                      |   2016 | SIGPLAN | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) |        |
 49 | # Findutils
 50 | |    | title                                                                      |   year | venue   | task            | model                    | dataset                         | pdf                                               | code                                                    |
 51 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------|
 52 | |  0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing |   2020 | NDSS    | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) |
 53 | # Validation dataset
 54 | |    | title                                                                                   |   year | venue    | task                    | model   | dataset                                                                                | pdf                                                                        | code   |
 55 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------|
 56 | |  0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network |   2021 | ASIA CCS | Vulnerability Detection | GTN     | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |        |
 57 | # Devign Dataset
 58 | |    | title                                                                                                                |   year | venue   | task                    | model          | dataset        | pdf                                      | code                                              |
 59 | |---:|:---------------------------------------------------------------------------------------------------------------------|-------:|:--------|:------------------------|:---------------|:---------------|:-----------------------------------------|:--------------------------------------------------|
 60 | |  0 | Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks |   2019 | NeurIPS | Vulnerability Detection | GGNN, GRU, CNN | Devign Dataset | [📑](https://arxiv.org/abs/1909.03496v1) | [:octocat:](https://sites.google.com/view/devign) |
 61 | # C Dataset
 62 | |    | title                                             |   year | venue                                            | task                                       | model    | dataset   | pdf                                              | code   |
 63 | |---:|:--------------------------------------------------|-------:|:-------------------------------------------------|:-------------------------------------------|:---------|:----------|:-------------------------------------------------|:-------|
 64 | |  0 | Flow2Vec: value-flow-based precise code embedding |   2020 | Proceedings of the ACM on Programming Languages  | Code Summarization, Program Classification | Flow2Vec | C Dataset | [📑](https://dl.acm.org/doi/abs/10.1145/3428301) |        |
 65 | # Diffutils
 66 | |    | title                                                                      |   year | venue   | task            | model                    | dataset                         | pdf                                               | code                                                    |
 67 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------|
 68 | |  0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing |   2020 | NDSS    | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) |
 69 | # OpenCL Dataset
 70 | |    | title                                                                 |   year | venue                                                                     | task                   | model   | dataset        | pdf                                           | code                                                             |
 71 | |---:|:----------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:-----------------------|:--------|:---------------|:----------------------------------------------|:-----------------------------------------------------------------|
 72 | |  0 | Compiler-based graph representations for deep learning models of code |   2020 | Proceedings of the 29th International Conference on Compiler Construction | Program Classification | GGNN    | OpenCL Dataset | [📑](https://doi.org/10.1145/3377555.3377894) | [:octocat:](https://github.com/tud-ccc/learning-compiler-graphs) |
 73 | # code-comment pairs
 74 | |    | title                                                                         |   year | venue   | task               | model        | dataset            | pdf                                      | code   |
 75 | |---:|:------------------------------------------------------------------------------|-------:|:--------|:-------------------|:-------------|:-------------------|:-----------------------------------------|:-------|
 76 | |  0 | Improving automatic source code summarization via deep reinforcement learning |   2018 | ASE     | Code Summarization | RNN,Tree-RNN | code-comment pairs | [📑](https://arxiv.org/abs/1811.07234v1) |        |
 77 | # BCB
 78 | |    | title                                                                   |   year | venue   | task                                    | model             | dataset     | pdf                                                                                  | code                                             |
 79 | |---:|:------------------------------------------------------------------------|-------:|:--------|:----------------------------------------|:------------------|:------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------|
 80 | |  0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree |   2019 | ICSE    | Program Classification, Clone Detection | bidirectional RNN | OJClone,BCB | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn) |
 81 | # collected in this work
 82 | |    | title                                                                                                      |   year | venue                                                                     | task                    | model                         | dataset                                    | pdf                                                | code                                                                                      |
 83 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:--------------------------------------------------------------------------|:------------------------|:------------------------------|:-------------------------------------------|:---------------------------------------------------|:------------------------------------------------------------------------------------------|
 84 | |  0 | Neural network-based graph embedding for cross-platform binary code similarity detection                   |   2017 | Proceedings of the ACM Conference on Computer and Communications Security | Program Repair          | Structure2vec                 | collected in this work, Genius Dataset     | [📑](http://arxiv.org/abs/1708.06525)              | [:octocat:](https://github.com/xiaojunxu/dnn-binary-code-similarity)                      |
 85 | |  1 | Improving bug detection via context-based code representation learning and attention-based neural networks |   2019 | Proceedings of the ACM on Programming Languages                           | Defect Prediction       | GRU, CNN, Attention mechanism | Java Dataset collected in this work        | [📑](https://dl.acm.org/doi/abs/10.1145/3360588)   |                                                                                           |
 86 | |  2 | Modeling and discovering vulnerabilities with code property graphs                                         |   2014 | Proceedings IEEE Symposium on Security and Privacy                        | Vulnerability Detection | code property graphs          | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) |                                                                                           |
 87 | |  3 | Open vocabulary learning on source code with a graph-structured cache                                      |   2019 | ICML                                                                      | Code Generation         | MPNN,CharCNN                  | Java repos collected in this work          | [📑](https://arxiv.org/abs/1810.08305v2)           | [:octocat:](https://github.com/mwcvitkovic/Deep_Learning_On_Code_With_A_Graph_Vocabulary) |
 88 | # Coreutils
 89 | |    | title                                                                      |   year | venue   | task            | model                    | dataset                         | pdf                                               | code                                                    |
 90 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:----------------|:-------------------------|:--------------------------------|:--------------------------------------------------|:--------------------------------------------------------|
 91 | |  0 | DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing |   2020 | NDSS    | Clone Detection | Text-associated DeepWalk | Coreutils, Diffutils, Findutils | [📑](https://dx.doi.org/10.14722/ndss.2020.24311) | [:octocat:](https://github.com/deepbindiff/DeepBinDiff) |
 92 | # DeepFix dataset
 93 | |    | title                                                                      |   year | venue   | task           | model   | dataset                            | pdf                                                  | code   |
 94 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:---------------|:--------|:-----------------------------------|:-----------------------------------------------------|:-------|
 95 | |  0 | GGF: A graph-based method for programming language syntax error correction |   2020 | ICPC    | Program Repair | GGNN    | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) |        |
 96 | # Syntax similar dataset
 97 | |    | title                                                                                   |   year | venue    | task                    | model   | dataset                                                                                | pdf                                                                        | code   |
 98 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------|
 99 | |  0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network |   2021 | ASIA CCS | Vulnerability Detection | GTN     | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |        |
100 | # SPoC
101 | |    | title                                                                |   year | venue   | task           | model     | dataset   | pdf                                   | code                                                   |
102 | |---:|:---------------------------------------------------------------------|-------:|:--------|:---------------|:----------|:----------|:--------------------------------------|:-------------------------------------------------------|
103 | |  0 | Graph-based, Self-Supervised Program Repair from Diagnostic Feedback |   2020 | ICML    | Program Repair | GAT, LSTM | SPoC      | [📑](http://arxiv.org/abs/2005.10636) | [:octocat:](https://github.com/michiyasunaga/DrRepair) |
104 | # IJDataset2.0
105 | |    | title                                                            |   year | venue                                                                   | task            | model   | dataset      | pdf                                                                     | code   |
106 | |---:|:-----------------------------------------------------------------|-------:|:------------------------------------------------------------------------|:----------------|:--------|:-------------|:------------------------------------------------------------------------|:-------|
107 | |  0 | Code Clone Detection with Hierarchical Attentive Graph Embedding |   2021 | International Journal of Software Engineering and Knowledge Engineering | Clone Detection | GCN     | IJDataset2.0 | [📑](https://www.worldscientific.com/doi/abs/10.1142/S021819402150025X) |        |
108 | # CodeForces dataset
109 | |    | title                                                                      |   year | venue   | task           | model   | dataset                            | pdf                                                  | code   |
110 | |---:|:---------------------------------------------------------------------------|-------:|:--------|:---------------|:--------|:-----------------------------------|:-----------------------------------------------------|:-------|
111 | |  0 | GGF: A graph-based method for programming language syntax error correction |   2020 | ICPC    | Program Repair | GGNN    | DeepFix dataset,CodeForces dataset | [📑](https://dl.acm.org/doi/10.1145/3387904.3389252) |        |
112 | # Linux kernel's code collected in this work
113 | |    | title                                                              |   year | venue                                              | task                    | model                | dataset                                    | pdf                                                | code   |
114 | |---:|:-------------------------------------------------------------------|-------:|:---------------------------------------------------|:------------------------|:---------------------|:-------------------------------------------|:---------------------------------------------------|:-------|
115 | |  0 | Modeling and discovering vulnerabilities with code property graphs |   2014 | Proceedings IEEE Symposium on Security and Privacy | Vulnerability Detection | code property graphs | Linux kernel's code collected in this work | [📑](http://ieeexplore.ieee.org/document/6956589/) |        |
116 | # OJClone
117 | |    | title                                                                                |   year | venue                                               | task                                            | model                 | dataset               | pdf                                                                                  | code                                                     |
118 | |---:|:-------------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------------|:----------------------|:-------------------------------------------------------------------------------------|:---------------------------------------------------------|
119 | |  0 | A Novel Neural Source Code Representation Based on Abstract Syntax Tree              |   2019 | ICSE                                                | Program Classification, Clone Detection         | bidirectional RNN     | OJClone,BCB           | [📑](https://www.semanticscholar.org/paper/1432c8378b1cafa3f91f09fa743082d154fdab92) | [:octocat:](https://github.com/zhangj1994/astnn)         |
120 | |  1 | TBCNN: A tree-based convolutional neural network for programming language processing |   2014 | arXiv                                               | Program Classification                          | TBCNN                 | OJClone               | [📑](https://arxiv.org/abs/1409.5718v1)                                              | [:octocat:](https://sites.google.com/site/treebasedcnn/) |
121 | |  2 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST     |   2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM       | OJClone,BigCloneBench | [📑](https://doi.org/10.1145/3310273.3321560)                                        | [:octocat:](https://github.com/milkfan/TBCAA)            |
122 | |  3 | Semantic Code Clone Detection Via Event Embedding Tree and GAT Network               |   2020 | QRS                                                 | Clone Detection                                 | Transformer, GAT, CNN | OJClone               | [📑](https://ieeexplore.ieee.org/document/9282778/)                                  | [:octocat:](https://github.com/lbzwoaini/CSEM.git)       |
123 | |  4 | How could Neural Networks understand Programs?                                       |   2021 | ICML                                                | Clone Detection                                 | Transformer           | OJClone               | [📑](http://arxiv.org/abs/2105.04297)                                                | [:octocat:](https://github.com/pdlan/OSCAR)              |
124 | # YANCFG Dataset
125 | |    | title                                                                                                |   year | venue                                                                             | task              | model   | dataset                        | pdf                                                | code   |
126 | |---:|:-----------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------|:--------|:-------------------------------|:---------------------------------------------------|:-------|
127 | |  0 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network |   2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | DGCNN   | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) |        |
128 | # Defects4J
129 | |    | title                                             |   year | venue                                     | task           | model   | dataset                    | pdf                                   | code                              |
130 | |---:|:--------------------------------------------------|-------:|:------------------------------------------|:---------------|:--------|:---------------------------|:--------------------------------------|:----------------------------------|
131 | |  0 | CODIT: Code Editing with Tree-Based Neural Models |   2020 | IEEE Transactions on Software Engineering | Program Repair | LSTM    | Defects4J,Code-Change-Data | [📑](http://arxiv.org/abs/1810.00314) | [:octocat:](https://git.io/JJGwU) |
132 | # PY150
133 | |    | title                                                                   |   year | venue                                           | task            | model         | dataset      | pdf                                                  | code   |
134 | |---:|:------------------------------------------------------------------------|-------:|:------------------------------------------------|:----------------|:--------------|:-------------|:-----------------------------------------------------|:-------|
135 | |  0 | Code Completion by Modeling Flattened Abstract Syntax Trees as Graphs   |   2021 | AAAI                                            | Code Generation | GAT           | JS150, PY150 | [📑](http://arxiv.org/abs/2103.09499)                |        |
136 | |  1 | Learning semantic program embeddings with graph interval neural network |   2020 | Proceedings of the ACM on Programming Languages | Program Repair  | GINN          | PY150        | [📑](https://arxiv.org/abs/2005.09997v2)             |        |
137 | |  2 | Probabilistic model for code with decision trees                        |   2016 | SIGPLAN                                         | Code Generation | Decision tree | PY150, JS150 | [📑](https://dl.acm.org/doi/10.1145/2983990.2984041) |        |
138 | # iclr18-prog-graphs-dataset
139 | |    | title                                      |   year | venue   | task                              | model   | dataset                    | pdf                                    | code                                                                         |
140 | |---:|:-------------------------------------------|-------:|:--------|:----------------------------------|:--------|:---------------------------|:---------------------------------------|:-----------------------------------------------------------------------------|
141 | |  0 | Learning to represent programs with graphs |   2018 | ICLR    | Defect Prediction,Code Generation | GGNN    | iclr18-prog-graphs-dataset | [📑](https://arxiv.org/abs/1711.00740) | [:octocat:](https://github.com/Microsoft/gated-graph-neural-network-samples) |
142 | # TL-CodeSum
143 | |    | title                                                                                                      |   year | venue   | task               | model         | dataset    | pdf                                   | code                                                 |
144 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------|:-----------|:--------------------------------------|:-----------------------------------------------------|
145 | |  0 | CAST: Enhancing Code Summarization with Hierarchical Splitting and Reconstruction of Abstract Syntax Trees |   2021 | EMNLP   | Code Summarization | RNN,attention | TL-CodeSum | [📑](http://arxiv.org/abs/2108.12987) | [:octocat:](https://anonymous.4open.science/r/CAST/) |
146 | # CoCoNet
147 | |    | title                                                                             |   year | venue   | task               | model                                             | dataset                | pdf                                      | code   |
148 | |---:|:----------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------------------------------------------|:-----------------------|:-----------------------------------------|:-------|
149 | |  0 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network |   2021 | arXiv   | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet | [📑](https://arxiv.org/abs/2107.01933v1) |        |
150 | # MSKCFG Dataset
151 | |    | title                                                                                                |   year | venue                                                                             | task              | model   | dataset                        | pdf                                                | code   |
152 | |---:|:-----------------------------------------------------------------------------------------------------|-------:|:----------------------------------------------------------------------------------|:------------------|:--------|:-------------------------------|:---------------------------------------------------|:-------|
153 | |  0 | Classifying Malware Represented as Control Flow Graphs using Deep Graph Convolutional Neural Network |   2019 | Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN | Defect Prediction | DGCNN   | MSKCFG Dataset, YANCFG Dataset | [📑](https://ieeexplore.ieee.org/document/8809504) |        |
154 | # C Program Dataset
155 | |    | title                                                                |   year | venue   | task               | model   | dataset           | pdf                                      | code                                                                                |
156 | |---:|:---------------------------------------------------------------------|-------:|:--------|:-------------------|:--------|:------------------|:-----------------------------------------|:------------------------------------------------------------------------------------|
157 | |  0 | Retrieval-Augmented Generation for Code Summarization via Hybrid GNN |   2021 | ICLR    | Code Summarization | GNN     | C Program Dataset | [📑](https://arxiv.org/abs/2006.05405v5) | [:octocat:](https://github.com/shangqing-liu/CCSD-benchmark-for-code-summarization) |
158 | # Java method-comment pairs
159 | |    | title                                                  |   year | venue   | task               | model   | dataset                   | pdf                                      | code   |
160 | |---:|:-------------------------------------------------------|-------:|:--------|:-------------------|:--------|:--------------------------|:-----------------------------------------|:-------|
161 | |  0 | Improved code summarization via a graph neural network |   2020 | ICPC    | Code Summarization | ConvGNN | Java method-comment pairs | [📑](https://arxiv.org/abs/2004.02843v2) |        |
162 | # BigCloneBench
163 | |    | title                                                                            |   year | venue                                               | task                                            | model                       | dataset                              | pdf                                           | code                                                 |
164 | |---:|:---------------------------------------------------------------------------------|-------:|:----------------------------------------------------|:------------------------------------------------|:----------------------------|:-------------------------------------|:----------------------------------------------|:-----------------------------------------------------|
165 | |  0 | Capturing Source Code Semantics via Tree-based Convolution over API-enhanced AST |   2019 | ACM International Conference on Computing Frontiers | Clone Detection,Code Search, Code Summarization | tree-based LSTM             | OJClone,BigCloneBench                | [📑](https://doi.org/10.1145/3310273.3321560) | [:octocat:](https://github.com/milkfan/TBCAA)        |
166 | |  1 | DeepSim: Deep Learning Code Functional Similarity                                |   2018 | ESEC/FSE                                            | Clone Detection                                 | Feed-forward neural network | Google Code Jam (GCJ), BigCloneBench | [📑](https://doi.org/10.1145/3236024.3236068) | [:octocat:](https://github.com/parasol-aser/deepsim) |
167 | # Firmware image dataset
168 | |    | title                                                                                   |   year | venue    | task                    | model   | dataset                                                                                | pdf                                                                        | code   |
169 | |---:|:----------------------------------------------------------------------------------------|-------:|:---------|:------------------------|:--------|:---------------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:-------|
170 | |  0 | BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network |   2021 | ASIA CCS | Vulnerability Detection | GTN     | Validation dataset, Syntax similar dataset, ARM binary dataset, Firmware image dataset | [📑](https://www2.seas.gwu.edu/~howie/publications/BugGraph-ASIACCS21.pdf) |        |
171 | # C# dataset
172 | |    | title                                |   year | venue   | task                           | model    | dataset                                                                     | pdf                                      | code                                                                     |
173 | |---:|:-------------------------------------|-------:|:--------|:-------------------------------|:---------|:----------------------------------------------------------------------------|:-----------------------------------------|:-------------------------------------------------------------------------|
174 | |  0 | Structured neural summarization      |   2019 | ICLR    | Code Summarization             | GGNN     | C# dataset,JAVA method naming datasets, Python method documentation dataset | [📑](https://arxiv.org/abs/1811.01824v4) | [:octocat:](https://github.com/CoderPat/structured-neural-summarization) |
175 | |  1 | Generative code modeling with graphs |   2019 | ICLR    | Program Repair,Code Generation | GRU,GGNN | C# dataset                                                                  | [📑](https://arxiv.org/abs/1805.08490v2) | [:octocat:](https://github.com/Microsoft/graph-based-code-modelling)     |
176 | # CodeSearchNet
177 | |    | title                                                                             |   year | venue   | task               | model                                             | dataset                               | pdf                                      | code                                        |
178 | |---:|:----------------------------------------------------------------------------------|-------:|:--------|:-------------------|:--------------------------------------------------|:--------------------------------------|:-----------------------------------------|:--------------------------------------------|
179 | |  0 | Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting       |   2021 | ICPC    | Code Summarization | Tree-LSTM                                         | CodeSearchNet, Hybrid-DeepCom Dataset | [📑](https://arxiv.org/abs/2103.07845v2) | [:octocat:](https://github.com/XMUDM/BASTS) |
180 | |  1 | CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network |   2021 | arXiv   | Code Summarization | Transformer,Multi-Relational Graph Neural Network | CodeSearchNet, CoCoNet                | [📑](https://arxiv.org/abs/2107.01933v1) |                                             |
181 | # Java Dataset collected in this work
182 | |    | title                                                                                                      |   year | venue                                           | task              | model                         | dataset                             | pdf                                              | code   |
183 | |---:|:-----------------------------------------------------------------------------------------------------------|-------:|:------------------------------------------------|:------------------|:------------------------------|:------------------------------------|:-------------------------------------------------|:-------|
184 | |  0 | Improving bug detection via context-based code representation learning and attention-based neural networks |   2019 | Proceedings of the ACM on Programming Languages | Defect Prediction | GRU, CNN, Attention mechanism | Java Dataset collected in this work | [📑](https://dl.acm.org/doi/abs/10.1145/3360588) |        |
185 | # C dataset
186 | |    | title                                                                     |   year | venue   | task        | model           | dataset   | pdf                                      | code   |
187 | |---:|:--------------------------------------------------------------------------|-------:|:--------|:------------|:----------------|:----------|:-----------------------------------------|:-------|
188 | |  0 | Multi-modal attention network learning for semantic source code retrieval |   2019 | ASE     | Code Search | GGNN, Tree-LSTM | C dataset | [📑](https://arxiv.org/abs/1909.13516v1) |        |
189 | 


--------------------------------------------------------------------------------