├── scan_rst_mwphotobrowser.png ├── LICENSE ├── README.md └── SameCodeFinder.py /scan_rst_mwphotobrowser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/startry/SameCodeFinder/HEAD/scan_rst_mwphotobrowser.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2016 Xing Chen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SameCodeFinder 2 | 3 | SameCodeFinder is a static code text scanner which can find the similar or the same code file in a big directory. 
4 | 5 | ## Feature 6 | 7 | SameCodeFinder can detect identical or near-identical functions in source code files and report the `Hamming Distance` between any two of them. 8 | 9 | * Find duplicated code that should be extracted for reuse 10 | * Show the Hamming distance between source code files (supports all source file types) 11 | * Show the Hamming distance between source code functions (supports Java and Objective-C for now) 12 | 13 | The screenshot below shows the scan result for [MWPhotoBrowser](https://github.com/mwaterfall/MWPhotoBrowser) 14 | ![Scan result of MWPhotoBrowser](./scan_rst_mwphotobrowser.png) 15 | 16 | The result comes from the command 17 | ``` 18 | python SameCodeFinder.py ~/Projects/opensource/MWPhotoBrowser/ .m --max-distance=10 --min-linecount=3 --functions --detail 19 | ``` 20 | 21 | ## Usage 22 | 23 | Install the Python implementation of [SimHash](https://github.com/leonsunliang/simhash) 24 | 25 | ``` 26 | pip install simhash 27 | ``` 28 | 29 | Visit [A Python Implementation of Simhash Algorithm](http://leons.im/posts/a-python-implementation-of-simhash-algorithm/) if you want to know more about the module. 
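For a quick intuition of what a SimHash fingerprint and a Hamming distance are, here is a minimal sketch in Python 3. It is a textbook version written for illustration, not the code of the `simhash` package and not part of SameCodeFinder; `toy_simhash` and `hamming` are hypothetical helper names, and using MD5 as the per-feature hash is an assumption. Only `get_features` follows the function of the same name in `SameCodeFinder.py`.

```python
import hashlib
import re


def get_features(text, width=3):
    """Character n-grams of the lowercased text with non-word characters removed
    (same idea as get_features() in SameCodeFinder.py)."""
    text = re.sub(r"[^\w]+", "", text.lower())
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]


def toy_simhash(features, bits=64):
    """Hypothetical helper: every feature votes +1 or -1 on each bit position,
    and the sign of the total decides the bit in the fingerprint."""
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(bits):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "int add(int a, int b) { return a + b; }"
renamed = "int add(int x, int b) { return x + b; }"
unrelated = "void log(String msg) { System.out.println(msg); }"

h_original = toy_simhash(get_features(original))
same = hamming(h_original, h_original)
near = hamming(h_original, toy_simhash(get_features(renamed)))
far = hamming(h_original, toy_simhash(get_features(unrelated)))
print(same, near, far)
```

Identical input always gives a distance of 0. Snippets that share most of their 3-grams tend to get a small distance, which is why `--max-distance` works as a similarity threshold.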
30 | 31 | ``` 32 | python SameCodeFinder.py [arg0] [arg1] 33 | ``` 34 | 35 | #### Arguments and options 36 | * ```[arg0]``` 37 | * Target directory containing the files to scan 38 | * ```[arg1]``` 39 | * Suffix of the files to scan, e.g. 40 | * .m - Objective-C file 41 | * .swift - Swift file 42 | * .java - Java file 43 | * ```--detail``` 44 | * Show the progress details of the scan 45 | * ```--functions``` 46 | * Compare individual functions instead of whole files (Java and Objective-C only) 47 | * ```--max-distance=[input]``` 48 | * Maximum Hamming distance to keep, default is 20 49 | * ```--min-linecount=[input]``` 50 | * For function scans, ignore any function with fewer lines than min-linecount, default is 3 51 | * ```--output=[input]``` 52 | * Customize the output file, default is "out.txt" 53 | 54 | 55 | ## Requirement 56 | 57 | Python 2.6+ (Python 2 only; the script uses Python 2 `print` statements), Pip 9.0+, [simhash](https://github.com/leonsunliang/simhash) 58 | 59 | ## License 60 | 61 | SameCodeFinder is available under the MIT license. See the LICENSE file for more info. -------------------------------------------------------------------------------- /SameCodeFinder.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #coding=utf-8 3 | 4 | #--------------------------------------------- 5 | # Name: SameCodeFinder.py 6 | # Author: Xing Chen 7 | # Date: 2016-12-02 8 | # Description: SameCodeFinder is a static code 9 | # text scanner which can find the similar or 10 | # the same code file in a big directory. 
11 | #---------------------------------------------- 12 | 13 | import re 14 | import os 15 | import sys 16 | import fileinput 17 | import datetime 18 | from simhash import Simhash, SimhashIndex 19 | 20 | gb_detail = 0 21 | gb_max_dis = 20 22 | gb_min_linecount = 3 23 | gb_output = "out.txt" 24 | 25 | def main(): 26 | global gb_detail 27 | global gb_max_dis 28 | global gb_min_linecount 29 | global gb_output 30 | 31 | if len(sys.argv) <= 2: 32 | print_help() 33 | return 34 | 35 | # The first two arguments are the root path to scan and the file suffix 36 | root_path = sys.argv[1] 37 | suffix = sys.argv[2] 38 | 39 | function_standard = 0 40 | 41 | for arg in sys.argv: 42 | if arg == "--help": 43 | print_help() 44 | return 45 | elif arg == "--detail": 46 | gb_detail = 1 47 | elif arg == "--functions": 48 | function_standard = 1 49 | elif arg.startswith("--max-distance="): 50 | arg_arr = arg.split("=") 51 | gb_max_dis = int(arg_arr[1]) 52 | elif arg.startswith("--min-linecount="): 53 | arg_arr = arg.split("=") 54 | gb_min_linecount = int(arg_arr[1]) 55 | elif arg.startswith("--output="): 56 | arg_arr = arg.split("=", 1) 57 | gb_output = arg_arr[1] # the output path is a string, not an int 58 | 59 | if not suffix: 60 | print "You must specify a suffix, e.g. \".m\" or \".java\"" 61 | return 62 | 63 | if not os.path.isdir(root_path): 64 | print "The first argument must be a directory" 65 | return 66 | 67 | if function_standard == 1: 68 | print "Hashing all the functions..." 69 | hashed_arr = hash_funcs(root_path, suffix) 70 | else: 71 | print "Hashing all the files..." 72 | hashed_arr = hash_files(root_path, suffix) 73 | 74 | if len(hashed_arr) == 0: 75 | return 76 | 77 | print "Ranking all the hash results..." 78 | ranked_arr = rank_hash(hashed_arr) 79 | 80 | print "Sorting all the ranked results..." 81 | sorted_arr = sorted(ranked_arr, key=lambda x: x[2]) 82 | 83 | output_file = open(gb_output, 'w+') 84 | for obj in sorted_arr: 85 | print >>output_file, obj 86 | 87 | print "Finished! 
Result saved in %s" % (gb_output) 88 | 89 | 90 | def hash_files(root_path, suffix): 91 | hashed_arr = [] 92 | for file_path in scan_files(root_path, None, suffix): 93 | single_file_name = file_name(file_path) 94 | if gb_detail == 1: 95 | print "Start Hash File %s" % (single_file_name) 96 | with open(file_path, 'r') as single_file: 97 | single_file_content = single_file.read() 98 | single_hash_result = Simhash(get_features(single_file_content)) 99 | hashed_arr.append((single_file_name, single_hash_result)) 100 | return hashed_arr 101 | 102 | def hash_funcs(root_path, suffix): 103 | hashed_arr = [] 104 | 105 | if not is_suffix_supported(suffix): 106 | print "The --functions mode supports only Objective-C and Java for now. Please use \".m\" or \".java\"" 107 | return hashed_arr 108 | 109 | for file_path in scan_files(root_path, None, suffix): 110 | single_file_name = file_name(file_path) 111 | single_beauti_name = "" 112 | single_func_content = "" 113 | single_line_count = 0 114 | single_bracket_count = 0 115 | is_function_started = 0 116 | for line in fileinput.input(file_path): 117 | strip_line = line.strip() 118 | ## Skip comment lines 119 | if strip_line.startswith("//") or strip_line.startswith("/*") or strip_line.startswith("*"): 120 | continue 121 | 122 | if not is_function_started and re.findall(grammar_regex_by_suffix(suffix), line): 123 | ## For now, SameCodeFinder supports Objective-C and Java only 124 | if is_function_start_line(line, suffix): 125 | single_beauti_name = beautify_func_name(line, suffix) 126 | 127 | # Reset the accumulated function content and counters 128 | single_func_content = "" 129 | single_line_count = 0 130 | single_bracket_count = 0 131 | 132 | single_func_content = single_func_content + line 133 | single_line_count = single_line_count + 1 134 | single_bracket_count = single_bracket_count + count_func_bracket(line) 135 | if single_bracket_count > 0 and single_beauti_name: 136 | is_function_started = 1 137 | 138 | if single_bracket_count == 0 and is_function_started == 1: 
139 | single_func_name = "%s(%s)" % (single_file_name, single_beauti_name) 140 | if gb_detail == 1: 141 | print "Start Hash Func %s" % (single_func_name) 142 | 143 | single_hash_result = Simhash(get_features(single_func_content.strip())) 144 | if single_line_count >= gb_min_linecount: 145 | hashed_arr.append((single_func_name, single_hash_result)) 146 | 147 | is_function_started = 0 148 | 149 | return hashed_arr 150 | 151 | def count_func_bracket(line): 152 | count = 0 153 | for word in line: 154 | if word=="{": 155 | count=count+1 156 | elif word=="}": 157 | count=count-1 158 | 159 | return count; 160 | 161 | def rank_hash(hashed_arr): 162 | global gb_detail 163 | global gb_max_dis 164 | 165 | ranked_arr = [] 166 | count = len(hashed_arr) 167 | for i in range(0, count): 168 | obj1 = hashed_arr[i] 169 | name1 = obj1[0] 170 | min_distance = 1000 171 | same_obj2_name = "" 172 | if gb_detail == 1: 173 | print "Start Rank %s" % (name1) 174 | for j in range(i + 1, count): 175 | obj2 = hashed_arr[j] 176 | name2 = obj2[0] 177 | distance = obj1[1].distance(obj2[1]) 178 | if distance < min_distance: 179 | min_distance = distance 180 | same_obj2_name = name2 181 | 182 | if name1 != same_obj2_name and same_obj2_name != "": 183 | if min_distance < gb_max_dis: 184 | ranked_arr.append((name1, same_obj2_name, min_distance)) 185 | 186 | return ranked_arr 187 | 188 | ################################################## 189 | ################ Start Util Funcs ################ 190 | def get_features(s): 191 | width = 3 192 | s = s.lower() 193 | s = re.sub(r'[^\w]+', '', s) 194 | return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))] 195 | 196 | ############################################# 197 | ############### File Processor ############## 198 | def scan_files(directory,prefix=None,postfix=None): 199 | files_list=[] 200 | 201 | for root, sub_dirs, files in os.walk(directory): 202 | for special_file in files: 203 | if postfix: 204 | if special_file.endswith(postfix): 205 | 
files_list.append(os.path.join(root,special_file)) 206 | elif prefix: 207 | if special_file.startswith(prefix): 208 | files_list.append(os.path.join(root,special_file)) 209 | else: 210 | files_list.append(os.path.join(root,special_file)) 211 | 212 | return files_list 213 | 214 | def file_name(file_path): 215 | name_arr=file_path.split("/") 216 | file_name=name_arr[len(name_arr) - 1] 217 | return file_name 218 | 219 | ############################################ 220 | ############# Language Diversity ########### 221 | def grammar_regex_by_suffix(suffix): 222 | if suffix == ".java": 223 | return ur"(public|private)(.*)\)\s?{"; 224 | elif suffix == ".m": 225 | return ur"(\-|\+)\s?\(.*\).*(\:\s?\(.*\).*)?{?"; 226 | 227 | return 228 | 229 | def is_suffix_supported(suffix): 230 | if suffix == ".m" or suffix == ".java": 231 | return 1 232 | else: 233 | return 0 234 | 235 | def is_function_start_line(line, suffix): 236 | if suffix == ".java": 237 | if line.strip().startswith("public") or line.strip().startswith("private"): 238 | return 1 239 | elif suffix == ".m": 240 | if line.startswith("+") or line.startswith("-"): 241 | return 1 242 | 243 | return 0 244 | 245 | ############################################# 246 | ########## Beautify function's name ######### 247 | def beautify_func_name(line, suffix): 248 | if suffix == ".m": 249 | return beautify_object_c_func_name(line) 250 | elif suffix == ".java": 251 | return beautify_java_func_name(line) 252 | else: 253 | return 254 | 255 | def beautify_java_func_name(func_name): 256 | name_arr = func_name.split("("); 257 | name_func_main = name_arr[0] 258 | name_func_arr = name_func_main.split(" ") 259 | b_name = name_func_arr[len(name_func_arr) - 1] 260 | if b_name: 261 | return b_name 262 | else: 263 | return func_name 264 | 265 | def beautify_object_c_func_name(func_name): 266 | has_startLeft = 0 267 | func_new_name = "" 268 | for char in func_name: 269 | if char=="(": 270 | has_startLeft = 1 271 | elif char == ')': 272 | 
has_startLeft = 0 273 | 274 | if has_startLeft == 0 and char != ')' and char != "+" and char != "-" and char != " " and char != "{": 275 | func_new_name = func_new_name + char 276 | 277 | func_new_name = func_new_name.strip() 278 | func_new_name = func_new_name.replace(",...NS_REQUIRES_NIL_TERMINATION", "") 279 | 280 | return func_new_name 281 | 282 | ########################################## 283 | ############ Help ######################## 284 | def print_help(): 285 | print "Usage:\n" 286 | print "\tpython SameCodeFinder.py [arg0] [arg1]\n" 287 | print "Args:\n" 288 | print "\t[arg0] - Target directory containing the files to scan" 289 | print "\t[arg1] - Suffix of the files to scan, e.g." 290 | print "\t\t .m - Objective-C file" 291 | print "\t\t .swift - Swift file" 292 | print "\t\t .java - Java file\n" 293 | print "\t--max-distance=[input] - Maximum Hamming distance to keep, default is 20" 294 | print "\t--min-linecount=[input] - For function scans, ignore any function with fewer lines than min-linecount, default is 3" 295 | print "\t--functions - Compare individual functions instead of whole files" 296 | print "\t Attention: \"--functions\" supports only Objective-C and Java for now" 297 | print "\t--detail - Show the progress details of the scan\n" 298 | print "\t--output=[input] - Customize the output file, default is \"out.txt\"" 299 | 300 | ################ End Util Funcs ################ 301 | 302 | main() 303 | --------------------------------------------------------------------------------