# Flow2Vec: Value-Flow-Based Precise Code Embedding

This package contains the implementation of Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a. value-flows), from our paper:

> Yulei Sui, Xiao Cheng, Guanqin Zhang, Haoyu Wang, "Flow2Vec: Value-Flow-Based Precise Code Embedding".

This artifact is packaged as a Docker image for users to reproduce the results in the “Evaluation” section (Section 5) of the associated paper.

## Table of Contents

- [Flow2Vec: Value-Flow-Based Precise Code Embedding](#flow2vec-value-flow-based-precise-code-embedding)
  - [Table of Contents](#table-of-contents)
  - [Requirements](#requirements)
  - [Size of the artifact](#size-of-the-artifact)
  - [1. Getting Started](#1-getting-started)
    - [1.1 Get benchmark statistics (Table 1)](#11-get-benchmark-statistics-table-1)
    - [1.2 Precision, recall and RMSE (Figure 7)](#12-precision-recall-and-rmse-figure-7)
    - [1.3 Efficiency (Figure 8)](#13-efficiency-figure-8)
  - [2. Step by step](#2-step-by-step)
    - [2.1 Code classification and summarization (Table 2)](#21-code-classification-and-summarization-table-2)
    - [2.2 F1 score under different lengths of code (Figure 9)](#22-f1-score-under-different-lengths-of-code-figure-9)
    - [2.3 F1 score under different embedding dimensions (Figure 10)](#23-f1-score-under-different-embedding-dimensions-figure-10)
    - [2.4 Ablation analysis (Figure 11)](#24-ablation-analysis-figure-11)
    - [2.5 Case study (Figure 12)](#25-case-study-figure-12)
  - [License](#license)
  - [Checksum](#checksum)

## Requirements

The experiments were conducted on Ubuntu 18.04 Linux on a machine with Intel Xeon Gold 6132 CPUs @ 2.60GHz and 128 GB of RAM. Note that:

* A machine with less memory and fewer CPUs may take much longer (than reported in the paper) or may not scale to evaluate all 32 benchmarks. We have therefore also provided a subset of only 8 small benchmarks as a fast experiment that reproduces their data in the paper (as suggested by the artifact submission website). To run this fast experiment, please allocate as much RAM to Docker as possible (>8 GB) on your local machine; otherwise the run might be killed by the underlying OS.
* Some of the results are performance data (e.g., Figure 8), so the exact numbers depend on the particular machine; the trends, however, should be consistent.
* The embedding performs random path sampling, which may cause slightly different results when producing the trained model (e.g., Figure 7).

## Size of the artifact

The Docker image size is around 2.85 GB (8.12 GB after decompression).


## 1. Getting Started

```sh
docker pull flow2vec/oopsla
docker run -i -t flow2vec/oopsla /bin/bash
cd /home
export LLVM_DIR=/home/clang10
```
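If your Docker installation caps container memory, you can set an explicit memory limit for the container so the >8 GB recommendation in the [Requirements](#requirements) above is not silently undercut. Below is a minimal sketch using Docker's standard `--memory` flag; the `8g` value is only our suggestion and is not part of the original instructions (on Docker Desktop you may also need to raise the VM memory in the Docker settings):

```sh
# Suggested variant of the run command above: give the container a memory limit of at least 8 GB
docker run -i -t --memory=8g flow2vec/oopsla /bin/bash
```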
The artifact package contains:

- `README.md`: detailed instructions for running and testing the artifact.
- `/home/script`: scripts to reproduce the experiments.
- `/home/clang10`: LLVM pre-built binary package.
- `SVF/`: source code for sparse value-flow analysis and graph representations.
- `flow2vec`: binary implementation of code embedding.
- `classification` and `summarization`: binary implementations of code classification and summarization.
- `resources/classification/`: value-flow path context datasets for code classification.
- `resources/summarization/`: value-flow path context datasets for code summarization.
- `resources/model/`: well-trained model for the case study.
- `resources/data/`: the value-flow graphs and call graphs of the benchmarks.
- `resources/case/`: the value-flow graphs and call graphs of the code fragments for the case study.
- `resources/open-source/`: LLVM IR of the benchmarks.
- `result/`: reproduced results in JSON format.


### 1.1 Get benchmark statistics (Table 1)

To show the statistics of the benchmarks shown in Table 1:

```sh
# #LOI, #Method, #Pointers, #Objects, #Call
bash /home/script/statics.sh
# nodes |V| and edges |E| on the value-flow graph
bash /home/script/V_E_count.sh
```

The results are also available in `result/statistics.txt` and `result/computed_nodes_edges.txt`.

Note that the statistics are produced by the latest SVF and the recent LLVM/Clang-10.0.0 for later integration/release purposes. The numbers might differ a bit from those in Table 1 (for #Pointers, #Objects, |V| and |E|) since we previously used an earlier version of SVF and LLVM-9.0 (1st sentence in Section 5.1 of our paper) as the front-end.


### 1.2 Precision, recall and RMSE (Figure 7)

1.2.1 Run embedding only on the eight small benchmarks: bc, convert, dc, gzip, echogs, less, mkromfs_0 and ctypes_test. (The running time includes pointer analysis, IVFG generation, context-sensitive reachability and matrix transformation.)

```sh
# This command triggers multiple runs for each benchmark under different embedding dimensions 10, 60, 110, 160 and 210. The estimated run time is around 20 mins.
bash ./script/embedding.sh small
# Store the results under the folder `/home/result/small` and print them to the terminal.
bash /home/script/evaluation.sh /home/result/small
```

1.2.2 Run embedding on a particular benchmark (e.g., gzip). The result will be stored in `/home/gzip`.

```sh
bash /home/script/embedding_single.sh /home/resources/open-source/gzip.orig /home/gzip
# To see the results of a particular benchmark (e.g., gzip).
bash /home/script/evaluation.sh /home/gzip
```

1.2.3 Run embedding on all benchmarks (it took around 35 hours, including I/O read/write, on a machine with 128 GB RAM). The pre-run results on our machine are saved under `/home/result/embed-time` for your reference.

```sh
# To see the pre-run results directly (stored in `/home/result/evaluation.txt`)
bash /home/script/evaluation.sh /home/result/embed-time
# Re-run all the benchmarks
bash /home/script/embedding.sh all
```


### 1.3 Efficiency (Figure 8)

Obtain the running time, including value-flow construction time and high-order embedding time.

```sh
bash /home/script/efficiency.sh
cat /home/result/efficiency.txt
```

## 2. Step by step

### 2.1 Code classification and summarization (Table 2)

Run the following command to obtain the results for code classification and summarization (Table 2). It may take 30-40 minutes on a local machine with 8 GB RAM.
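Note that `&` in the command below starts `classification` in the background so that the two binaries run concurrently. If your machine is short on memory, running them one after the other is a reasonable alternative; the following is only a sketch of that sequential variant (same binaries and flags as the artifact's command; sequential execution is our suggestion, not part of the original workflow):

```sh
# Sequential alternative (suggestion only): run classification to completion, then summarization
cd /home
./classification -op 1 && ./summarization -op 1
```

The artifact's original (concurrent) command is: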
```sh
cd /home
./classification -op 1 & ./summarization -op 1
```

The result will be stored in `result/classification.json` and `result/summarization.json` for code classification and summarization respectively.

**(Optional)** To regenerate the (composite) datasets for code classification and code summarization:

```sh
cd /home
./flow2vec -op 1
```

### 2.2 F1 score under different lengths of code (Figure 9)

Run the following command to directly reproduce the result under different lengths of code (Figure 9):

```sh
cd /home
./classification -op 2 & ./summarization -op 2
```

The result will be stored in `result/classification-lengths.json` and `result/summarization-lengths.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets under different lengths of code, run:

```sh
cd /home
./flow2vec -op 2
```

### 2.3 F1 score under different embedding dimensions (Figure 10)

To get the result in Figure 10:

```sh
cd /home
./classification -op 3 & ./summarization -op 3
```

The result will be stored in `result/classification-dimensions.json` and `result/summarization-dimensions.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets under different dimensions for code classification and summarization:

```sh
cd /home
./flow2vec -op 3
```

### 2.4 Ablation analysis (Figure 11)

To get the ablation analysis result:

```sh
cd /home
./classification -op 4 & ./summarization -op 4
```

The result will be stored in `result/classification-ablation.json` and `result/summarization-ablation.json` for code classification and summarization respectively.

**(Optional)** To regenerate the datasets for the ablation analysis for code classification and summarization:

```sh
cd /home
./flow2vec -op 4
```



### 2.5 Case study (Figure 12)

The four code fragments (Source code A-D) in Figure 12 are placed in `resources/case/case.cpp`, and their value-flow graph is stored in `resources/case/svfg_final.dot`. We have also provided another example under `resources/case-ex/case-ex.cpp`. (Note that, due to the nature of distributed code representations, the results predicted by a model can be imprecise and differ from the ground truth, since the model is trained on a limited number of available training samples.)

To reproduce the result for Figure 12:

```sh
cd /home
./summarization -op 5
```

The result will be stored in `result/case.json`.

**(Optional)** To regenerate the datasets for the case study:

```sh
cd /home
./flow2vec -op 5
```


## License

The artifact is available under the GNU Lesser General Public License v3.0 or later.

## Checksum

```
digest: sha256:cfd49586716d688ac19a38c1442ebc53cfbe93138ad35252c6e250a5029faa9f
```
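To check that the image you pulled matches this digest, one option (assuming the digest above refers to the `flow2vec/oopsla` image on Docker Hub) is to list the local image together with its repository digest, or to pull the image pinned to the exact digest:

```sh
# Show the locally pulled image together with its repository digest
docker images --digests flow2vec/oopsla
# Or pull the image pinned to this exact digest
docker pull flow2vec/oopsla@sha256:cfd49586716d688ac19a38c1442ebc53cfbe93138ad35252c6e250a5029faa9f
```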