# Flow2Vec: Value-Flow-Based Precise Code Embedding

This package contains the implementation of Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a. value-flows), from our paper:

> Yulei Sui, Xiao Cheng, Guanqin Zhang, Haoyu Wang, "Flow2Vec: Value-Flow-Based Precise Code Embedding".

This artifact is packaged as a Docker image for users to reproduce the results in the “Evaluation” section (Section 5) of the associated paper.

## Table of Contents

- [Flow2Vec: Value-Flow-Based Precise Code Embedding](#flow2vec-value-flow-based-precise-code-embedding)
  - [Table of Contents](#table-of-contents)
  - [Requirements](#requirements)
  - [Size of the artifact](#size-of-the-artifact)
  - [1. Getting Started](#1-getting-started)
    - [1.1 Get benchmark statistics (Table 1)](#11-get-benchmark-statistics-table-1)
    - [1.2 Precision, recall and RMSE (Figure 7)](#12-precision-recall-and-rmse-figure-7)
    - [1.3 Efficiency (Figure 8)](#13-efficiency-figure-8)
  - [2. Step by step](#2-step-by-step)
    - [2.1 Code classification and summarization (Table 2)](#21-code-classification-and-summarization-table-2)
    - [2.2 F1 score under different lengths of code (Figure 9)](#22-f1-score-under-different-lengths-of-code-figure-9)
    - [2.3 F1 score under different embedding dimensions (Figure 10)](#23-f1-score-under-different-embedding-dimensions-figure-10)
    - [2.4 Ablation analysis (Figure 11)](#24-ablation-analysis-figure-11)
    - [2.5 Case study (Figure 12)](#25-case-study-figure-12)
  - [License](#license)
  - [Checksum](#checksum)

## Requirements

The experiments were conducted on Ubuntu 18.04 Linux on a machine with Intel Xeon Gold 6132 CPUs @ 2.60GHz and 128 GB of RAM. Note that:

* A machine with less memory and fewer CPUs may take much longer (than reported in the paper) or may not scale to evaluate all 32 benchmarks. We have therefore also provided a subset of only 8 small benchmarks as a fast experiment that reproduces their data in the paper (as suggested by the artifact submission website). To run this fast experiment, please allocate as much RAM to Docker as possible (>8 GB) on your local machine; otherwise the run might be killed by the underlying OS.
* Some of the results are performance data (e.g., Figure 8), so the exact numbers depend on the particular machine; the trends, however, should be consistent.
* The embedding performs random path sampling, which may cause slightly different results when producing the trained model (e.g., Figure 7).

## Size of the artifact

The Docker image size is around 2.85 GB (8.12 GB after decompression).


## 1. Getting Started

```sh
docker pull flow2vec/oopsla
docker run -i -t flow2vec/oopsla /bin/bash
cd /home
export LLVM_DIR=/home/clang10
```
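If your Docker installation caps container memory, you can set an explicit memory limit for the container so the >8 GB recommendation in the [Requirements](#requirements) above is not silently undercut. Below is a minimal sketch using Docker's standard `--memory` flag; the `8g` value is only our suggestion and is not part of the original instructions (on Docker Desktop you may also need to raise the VM memory in the Docker settings):

```sh
# Suggested variant of the run command above: give the container a memory limit of at least 8 GB
docker run -i -t --memory=8g flow2vec/oopsla /bin/bash
```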
The artifact package contains:

- `README.md`: detailed instructions for running and testing the artifact.
- `/home/script`: scripts to reproduce the experiments.
- `/home/clang10`: LLVM pre-built binary package.
- `SVF/`: source code for sparse value-flow analysis and graph representations.
- `flow2vec`: binary implementation of code embedding.
- `classification` and `summarization`: binary implementations of code classification and summarization.
- `resources/classification/`: value-flow path context datasets for code classification.
- `resources/summarization/`: value-flow path context datasets for code summarization.
- `resources/model/`: well-trained model for the case study.
- `resources/data/`: the value-flow graphs and call graphs of the benchmarks.
- `resources/case/`: the value-flow graphs and call graphs of the code fragments for the case study.
- `resources/open-source/`: LLVM IR of the benchmarks.
- `result/`: reproduced results in JSON format.


### 1.1 Get benchmark statistics (Table 1)

To show the statistics of the benchmarks shown in Table 1:

```sh
# #LOI, #Method, #Pointers, #Objects, #Call
bash /home/script/statics.sh
# nodes |V| and edges |E| on the value-flow graph
bash /home/script/V_E_count.sh
```

The results are also available in `result/statistics.txt` and `result/computed_nodes_edges.txt`.

Note that the statistics are produced by the latest SVF and the recent LLVM/Clang-10.0.0 for later integration/release purposes. The numbers might differ a bit from those in Table 1 (for #Pointers, #Objects, |V| and |E|) since we previously used an earlier version of SVF and LLVM-9.0 (1st sentence in Section 5.1 of our paper) as the front-end.


### 1.2 Precision, recall and RMSE (Figure 7)

1.2.1 Run embedding only on the eight small benchmarks: bc, convert, dc, gzip, echogs, less, mkromfs_0 and ctypes_test. (The running time includes pointer analysis, IVFG generation, context-sensitive reachability and matrix transformation.)

```sh
# This command triggers multiple runs for each benchmark under different embedding dimensions 10, 60, 110, 160 and 210. The estimated run time is around 20 mins.
bash ./script/embedding.sh small
# Store the results under the folder `/home/result/small` and print them to the terminal.
bash /home/script/evaluation.sh /home/result/small
```

1.2.2 Run embedding on a particular benchmark (e.g., gzip). The result will be stored in `/home/gzip`.

```sh
bash /home/script/embedding_single.sh /home/resources/open-source/gzip.orig /home/gzip
# To see the results of a particular benchmark (e.g., gzip).
bash /home/script/evaluation.sh /home/gzip
```

1.2.3 Run embedding on all benchmarks (it took around 35 hours, including I/O read/write, on a machine with 128 GB RAM). The pre-run results on our machine are saved under `/home/result/embed-time` for your reference.

```sh
# To see the pre-run results directly (stored in `/home/result/evaluation.txt`)
bash /home/script/evaluation.sh /home/result/embed-time
# Re-run all the benchmarks
bash /home/script/embedding.sh all
```


### 1.3 Efficiency (Figure 8)

Obtain the running time, including value-flow construction time and high-order embedding time.

```sh
bash /home/script/efficiency.sh
cat /home/result/efficiency.txt
```

## 2. Step by step

### 2.1 Code classification and summarization (Table 2)

Run the following command to obtain the results for code classification and summarization (Table 2). It may take 30-40 minutes on a local machine with 8 GB RAM.
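Note that `&` in the command below starts `classification` in the background so that the two binaries run concurrently. If your machine is short on memory, running them one after the other is a reasonable alternative; the following is only a sketch of that sequential variant (same binaries and flags as the artifact's command; sequential execution is our suggestion, not part of the original workflow):

```sh
# Sequential alternative (suggestion only): run classification to completion, then summarization
cd /home
./classification -op 1 && ./summarization -op 1
```

The artifact's original (concurrent) command is: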
```sh
cd /home
./classification -op 1 & ./summarization -op 1
```

The result will be stored in `result/classification.json` and `result/summarization.json` for code classification and summarization respectively.

**(Optional)** To regenerate the (composite) datasets for code classification and code summarization:

```sh
cd /home
./flow2vec -op 1
```

### 2.2 F1 score under different lengths of code (Figure 9)

Run the following command to directly reproduce the result under different lengths of code (Figure 9):

```sh
cd /home
./classification -op 2 & ./summarization -op 2
```

The result will be stored in `result/classification-lengths.json` and `result/summarization-lengths.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets under different lengths of code, run:

```sh
cd /home
./flow2vec -op 2
```

### 2.3 F1 score under different embedding dimensions (Figure 10)

To get the result in Figure 10:

```sh
cd /home
./classification -op 3 & ./summarization -op 3
```

The result will be stored in `result/classification-dimensions.json` and `result/summarization-dimensions.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets under different dimensions for code classification and summarization:

```sh
cd /home
./flow2vec -op 3
```

### 2.4 Ablation analysis (Figure 11)

To get the ablation analysis result:

```sh
cd /home
./classification -op 4 & ./summarization -op 4
```

The result will be stored in `result/classification-ablation.json` and `result/summarization-ablation.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets for the ablation analysis for code classification and summarization:

```sh
cd /home
./flow2vec -op 4
```



### 2.5 Case study (Figure 12)

The four code fragments (Source code A-D) in Figure 12 are placed in `resources/case/case.cpp`, and their value-flow graph is stored in `resources/case/svfg_final.dot`. We have also provided another example under `resources/case-ex/case-ex.cpp`. (Note that, due to the nature of distributed code representations, the results predicted by a model can be imprecise and differ from the ground truth, since the model is trained on a limited number of available training samples.)

To reproduce the result for Figure 12:

```sh
cd /home
./summarization -op 5
```

The result will be stored in `result/case.json`.

**(Optional)** To regenerate the datasets for the case study:

```sh
cd /home
./flow2vec -op 5
```


## License

The artifact is available under the GNU Lesser General Public License v3.0 or later.

## Checksum

```
digest: sha256:cfd49586716d688ac19a38c1442ebc53cfbe93138ad35252c6e250a5029faa9f
```
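To check that the image you pulled matches this digest, one option (assuming the digest above refers to the `flow2vec/oopsla` image on Docker Hub) is to list the local image together with its repository digest, or to pull the image pinned to the exact digest:

```sh
# Show the locally pulled image together with its repository digest
docker images --digests flow2vec/oopsla
# Or pull the image pinned to this exact digest
docker pull flow2vec/oopsla@sha256:cfd49586716d688ac19a38c1442ebc53cfbe93138ad35252c6e250a5029faa9f
```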