├── .gitignore ├── README.md ├── compute_recency_metric.py ├── create_dataset.py ├── evaluate.py ├── extract_date_developer.py ├── extract_date_developer_new.py ├── faulty_lines_ml.py ├── get_developer_names.py ├── matrix_coverage.py ├── reweight.py ├── s2l_suspiciousness.py ├── sort_csv.py └── suspiciousness.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # Idea 107 | .idea 108 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Fault Localization from Defects4J 2 | 3 | Fault localization is a technique to help identify buggy lines from source code based on failing test cases. We develop a hybrid technique based on different kinds of formulas that generate statement suspiciousness. This project is part of the CS5704: Software Engineering (Spring 2018). 4 | 5 | These are: 6 | 7 | * Tarantula 8 | * Ochiai 9 | * Opt2 10 | * Barinel 11 | * Dstar2 12 | * Muse 13 | * Jaccard 14 | 15 | ## Dataset 16 | 17 | We use the Defects4J dataset available [here](https://github.com/rjust/defects4j). The dataset has 395 real bugs from 6 Java projects - Lang, Closure, Chart, Mockito, Time, and Math. 18 | 19 | Specifically, we use only Lang and Math for our research results. 20 | 21 | ## Setup 22 | 23 | Download the data by running the following command. 24 | 25 | ```bash 26 | wget --recursive --no-parent --accept gzoltar-files.tar.gz http://fault-localization.cs.washington.edu/data 27 | ``` 28 | 29 | Subsequently extract the matrix and coverage to a directory using the `matrix_coverage.py` file. 30 | 31 | ```text 32 | usage: matrix_coverage.py [-h] --data-dir DATA_DIR --output-dir OUTPUT_DIR 33 | 34 | optional arguments: 35 | -h, --help show this help message and exit 36 | --data-dir DATA_DIR data directory that holds data 37 | --output-dir OUTPUT_DIR 38 | file to write coverage and spectra matrices to 39 | ``` 40 | 41 | For example, run `matrix_coverage.py` as: 42 | 43 | ```bash 44 | mkdir $HOME/fault-localization.cs.washington.edu/coverage 45 | python matrix_coverage.py --data-dir $HOME/fault-localization.cs.washington.edu/data --output-dir $HOME/fault-localization.cs.washington.edu/coverage 46 | ``` 47 | 48 | The suspiciousness values can be generated using `suspiciousness.py` file. 49 | 50 | ```text 51 | usage: suspiciousness.py [-h] --formula 52 | {muse,all,ochiai,tarantula,dstar2,jaccard,barinel,opt2} 53 | --data-dir DATA_DIR --output-dir OUTPUT_DIR 54 | 55 | optional arguments: 56 | -h, --help show this help message and exit 57 | --formula {muse,all,ochiai,tarantula,dstar2,jaccard,barinel,opt2} 58 | formula to use for suspiciousness calculation 59 | --data-dir DATA_DIR data directory that holds coverage and spectra files 60 | --output-dir OUTPUT_DIR 61 | file to write suspiciousness vector to 62 | ``` 63 | 64 | Run `suspiciousness.py` as follows: 65 | 66 | ```bash 67 | mkdir $HOME/fault-localization.cs.washington.edu/suspiciousness 68 | python suspiciousness.py --data-dir $HOME/fault-localization.cs.washington.edu/coverage --output-dir $HOME/fault-localization.cs.washington.edu/suspiciousness --formula all 69 | ``` 70 | 71 | Subsequently, the statement suspiciousness values need to be converted to line suspiciousness values. This can be done using the `s2l_suspiciousness.py`. The usage is as follows: 72 | 73 | ```text 74 | usage: s2l_suspiciousness.py [-h] -d SUSPICIOUSNESS_DATA_DIR -s 75 | SOURCE_CODE_LINES -o OUTPUT_DIR 76 | [-f {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2}] 77 | 78 | optional arguments: 79 | -h, --help show this help message and exit 80 | -d SUSPICIOUSNESS_DATA_DIR, --suspiciousness-data-dir SUSPICIOUSNESS_DATA_DIR 81 | Suspiciousness data directory 82 | -s SOURCE_CODE_LINES, --source-code-lines SOURCE_CODE_LINES 83 | Source code lines directory 84 | -o OUTPUT_DIR, --output-dir OUTPUT_DIR 85 | Output directory 86 | -f {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2}, --formula {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2} 87 | Formula to convert for 88 | ``` 89 | 90 | Finally, sort the resulting csv files using the `sort_csv.py` file. 91 | 92 | ```text 93 | usage: sort_csv.py [-h] -d SUSPICIOUSNESS_DATA_DIR -o OUTPUT_DIR 94 | 95 | optional arguments: 96 | -h, --help show this help message and exit 97 | -d SUSPICIOUSNESS_DATA_DIR, --suspiciousness-data-dir SUSPICIOUSNESS_DATA_DIR 98 | Suspiciousness data directory 99 | -o OUTPUT_DIR, --output-dir OUTPUT_DIR 100 | Output directory 101 | ``` 102 | -------------------------------------------------------------------------------- /compute_recency_metric.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import csv 4 | import operator 5 | import os 6 | import re 7 | import datetime 8 | from subprocess import call 9 | 10 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 11 | 12 | PROJECT_BUGS = [ 13 | [str(x) for x in range(1, 134)], 14 | [str(x) for x in range(1, 66)], 15 | [str(x) for x in range(1, 27)], 16 | [str(x) for x in range(1, 107)], 17 | [str(x) for x in range(1, 39)], 18 | [str(x) for x in range(1, 28)] 19 | ] 20 | 21 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'] 22 | FORMULA = ['tarantula'] 23 | 24 | 25 | def find_recency(input_file, output_file, project, bug, formula): 26 | """ 27 | find the recency of the last update for every suspiciouss line 28 | 29 | Parameters 30 | ---------- 31 | input_file : str (file contains sorted suspicousness lines with date) 32 | output_file: str (file contains sorted suspiciousness lines file with date, author and recency) 33 | project: str (project name) 34 | bug: str (bug id) 35 | formula: str (fault localization technique) 36 | commit_id: str (commit id of the buggy vesion) 37 | 38 | """ 39 | 40 | input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file 41 | sorted_susp_lines = read_susp_lines_from_file(input_file) 42 | 43 | # output file 44 | output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file 45 | 46 | dates = [] 47 | 48 | for susp_line in sorted_susp_lines: 49 | date = susp_line[-1].strip() 50 | if date != "NOT_FOUND": 51 | datetime_obj = datetime.datetime.strptime(date, "%Y-%m-%d") 52 | bug_date = datetime_obj.date() 53 | else: 54 | return 55 | dates.append(bug_date) 56 | 57 | if max(dates) == datetime.date.today(): 58 | first_max_date = max(dates) 59 | try: 60 | second_max_date = sorted(set(dates))[-2] 61 | except IndexError: 62 | return 63 | dates = [date if date != first_max_date else second_max_date for date in dates] 64 | 65 | min_date = min(dates) 66 | max_date = max(dates) 67 | 68 | no_of_days_elapsed = [] 69 | 70 | for date in dates: 71 | no_of_days_elapsed.append((max_date - date).days) 72 | 73 | min_days = min(no_of_days_elapsed) 74 | max_days = max(no_of_days_elapsed) 75 | diff_days = max_days - min_days 76 | 77 | line_counter = 0 78 | 79 | for susp_line in sorted_susp_lines: 80 | if diff_days == 0: 81 | print("Divide by Zero: Max and Min are same") 82 | continue 83 | normalized_time = (no_of_days_elapsed[line_counter] - min_days)/diff_days 84 | recency = 1 - normalized_time 85 | add_recency_to_file(output_file, susp_line, recency) 86 | line_counter += 1 87 | 88 | 89 | def add_recency_to_file(output_file, susp_line, recency): 90 | """ 91 | appends the author and date to the output file containing suspiciousness lines 92 | 93 | Paramaeters: 94 | ------------ 95 | output_file: str 96 | susp_line: str 97 | recency: str 98 | 99 | """ 100 | susp_line = ", ".join(susp_line) 101 | with open(output_file, mode="a", encoding="utf-8") as myFile: 102 | myFile.write(f"{susp_line}, {recency}\n") 103 | 104 | 105 | def read_susp_lines_from_file(input_file): 106 | """ 107 | reads the suspiciousness lines data from the sorted suspiciousness file 108 | 109 | Parameters: 110 | ---------- 111 | input_file: str 112 | 113 | return: 114 | ------ 115 | sorted_susp_lines: list (2D) 116 | 117 | """ 118 | susp_data = csv.reader(open(input_file), delimiter=',') 119 | sorted_susp_lines = [susp_line for susp_line in susp_data] 120 | 121 | return sorted_susp_lines 122 | 123 | 124 | if __name__ == '__main__': 125 | parser = argparse.ArgumentParser() 126 | parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory') 127 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory') 128 | 129 | args = parser.parse_args() 130 | 131 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 132 | for bug in bugs: 133 | for formula in FORMULA: 134 | input_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date" 135 | output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-recency" 136 | find_recency(os.path.join(args.suspiciousness_data_dir, input_csv), 137 | os.path.join(args.output_dir, output_csv), project, bug, formula) 138 | -------------------------------------------------------------------------------- /create_dataset.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import csv 3 | import os 4 | 5 | 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 7 | PROJECT_BUGS = [ 8 | [str(x) for x in range(1, 134)], 9 | [str(x) for x in range(1, 66)], 10 | [str(x) for x in range(1, 27)], 11 | [str(x) for x in range(1, 107)], 12 | [str(x) for x in range(1, 39)], 13 | [str(x) for x in range(1, 28)] 14 | ] 15 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'} 16 | 17 | 18 | class Dataset(object): 19 | def __init__(self, formula, num_lines): 20 | self.base_formula = formula 21 | self.rows = {} 22 | self.num_lines = num_lines 23 | for project in PROJECTS: 24 | self.rows[project] = {} 25 | 26 | def to_csv(self, output_csv): 27 | """ 28 | Write dataset to output csv file 29 | 30 | Parameters 31 | ---------- 32 | output_csv : str 33 | Output file to write the dataset to 34 | """ 35 | output = [] 36 | columns = 'project,bug,' 37 | columns += ','.join(['line_%s' % (i+1) for i in range(self.num_lines)]) 38 | for formula in FORMULA: 39 | for i in range(self.num_lines): 40 | columns += ',line_%s_%s' % (i+1, formula) 41 | output.append(columns) 42 | for project in PROJECTS: 43 | for bug in self.rows[project]: 44 | output.append(self.rows[project][bug].to_csv()) 45 | with open(output_csv, 'w') as fwriter: 46 | fwriter.write('\n'.join(output)) 47 | 48 | def __len__(self): 49 | return self.num_lines 50 | 51 | 52 | class Row(object): 53 | """ 54 | The row class holds information about a specific row. Mainly, it holds information about the project, bug_id, 55 | lines affected, suspiciousness scores for each line given by that formula 56 | """ 57 | def __init__(self, project, bug_id): 58 | self.project = project 59 | self.bug_id = bug_id 60 | self.lines = [] 61 | self.data = {} 62 | 63 | for formula in FORMULA: 64 | self.data[formula] = [] 65 | 66 | def to_csv(self): 67 | lines_output = ','.join(self.lines) 68 | suspiciousness = [] 69 | for formula in FORMULA: 70 | susp = ','.join([str(x) for x in self.data[formula]]) 71 | suspiciousness.append(susp) 72 | return '%s,%s,%s,' % (self.project, self.bug_id, lines_output) + ','.join(suspiciousness) 73 | 74 | 75 | def add_rows_for_formula(dataset, data_dir, formula): 76 | """ 77 | Add rows for a particular formula to the dataset 78 | 79 | Parameters 80 | ---------- 81 | dataset : Dataset 82 | an object of the type dataset with dataset.lines already populated 83 | data_dir : str 84 | the data directory for the suspiciousness files 85 | formula : str 86 | formula to add for 87 | 88 | Returns 89 | ------- 90 | None 91 | """ 92 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 93 | for bug in bugs: 94 | data = {} 95 | input_file = os.path.join(data_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula)) 96 | with open(input_file) as freader: 97 | csvreader = csv.DictReader(freader) 98 | for line in csvreader: 99 | data[line['Line']] = float(line['Suspiciousness']) 100 | for line in dataset.rows[project][bug].lines: 101 | if line in data: 102 | dataset.rows[project][bug].data[formula].append(data[line]) 103 | else: 104 | dataset.rows[project][bug].data[formula].append(0.0) 105 | 106 | 107 | def create_dataset(data_dir, formula, num_lines): 108 | """ 109 | Create a dataset object and add rows from the sorted suspiciousness value file to it 110 | 111 | Parameters 112 | ---------- 113 | data_dir : str 114 | the data directory for the suspiciousness files 115 | formula : str 116 | formula to use as the base 117 | num_lines : str 118 | number of lines to read from the csv 119 | 120 | Returns 121 | ------- 122 | Dataset 123 | a dataset object 124 | """ 125 | dataset = Dataset(formula, num_lines) 126 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 127 | for bug in bugs: 128 | row = Row(project=project, bug_id=bug) 129 | input_file = os.path.join(data_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula)) 130 | with open(input_file) as freader: 131 | csvreader = csv.DictReader(freader) 132 | for line in csvreader: 133 | row.lines.append(line['Line']) 134 | row.data[formula].append(float(line['Suspiciousness'])) 135 | if len(row.lines) == num_lines: 136 | break 137 | dataset.rows[project][bug] = row 138 | return dataset 139 | 140 | 141 | if __name__ == '__main__': 142 | parser = argparse.ArgumentParser() 143 | parser.add_argument('-f', '--formula', required=True, choices=FORMULA, 144 | help='Base formula to use while creating the dataset') 145 | parser.add_argument('-d', '--data-dir', required=True, help='Data directory with all sorted suspiciousness values') 146 | parser.add_argument('-n', '--num-lines', required=True, type=int, help='Number of lines to consider') 147 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory to write dataset to') 148 | 149 | args = parser.parse_args() 150 | 151 | dataset = create_dataset(args.data_dir, args.formula, args.num_lines) 152 | for formula in FORMULA: 153 | if formula != args.formula: 154 | add_rows_for_formula(dataset, args.data_dir, formula) 155 | dataset.to_csv(os.path.join(args.output_dir, 'dataset-%s-%s.csv' % (args.formula, args.num_lines))) 156 | -------------------------------------------------------------------------------- /evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import csv 3 | import os 4 | 5 | 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 7 | PROJECT_BUGS = [ 8 | [str(x) for x in range(1, 134)], 9 | [str(x) for x in range(1, 66)], 10 | [str(x) for x in range(1, 27)], 11 | [str(x) for x in range(1, 107)], 12 | [str(x) for x in range(1, 39)], 13 | [str(x) for x in range(1, 28)] 14 | ] 15 | BUGGY_LINES_SUFFIX = 'buggy.lines' 16 | 17 | 18 | def get_reweighted_lines(input_file, top_n): 19 | lines = set() 20 | index = 0 21 | with open(input_file) as freader: 22 | csvreader = csv.DictReader(freader) 23 | for row in csvreader: 24 | lines.add(row['Line']) 25 | index += 1 26 | if index >= top_n: 27 | break 28 | return lines 29 | 30 | 31 | def get_buggy_lines(input_file): 32 | buggy_lines = set() 33 | with open(input_file) as freader: 34 | for line in freader: 35 | buggy_lines.add('#'.join(line.split('#')[0:2])) 36 | return buggy_lines 37 | 38 | 39 | def calculate_accuracy(input_dir, bug_dir, top_n): 40 | accuracies = [] 41 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 42 | for bug in bugs: 43 | try: 44 | input_file = os.path.join(input_dir, '%s-%s-sorted-susp' % (project, bug)) 45 | reweighted_lines = get_reweighted_lines(input_file, top_n) 46 | bug_file = os.path.join(bug_dir, '%s-%s.%s' % (project, bug, BUGGY_LINES_SUFFIX)) 47 | buggy_lines = get_buggy_lines(bug_file) 48 | accuracy = len(reweighted_lines.intersection(buggy_lines)) / len(buggy_lines) 49 | accuracies.append(accuracy) 50 | except: 51 | continue 52 | return sum(accuracies) / len(accuracies) 53 | 54 | 55 | def calculate_accuracy_formula(input_dir, bug_dir, formula, top_n): 56 | accuracies = [] 57 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 58 | for bug in bugs: 59 | try: 60 | input_file = os.path.join(input_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula)) 61 | reweighted_lines = get_reweighted_lines(input_file, top_n) 62 | bug_file = os.path.join(bug_dir, '%s-%s.%s' % (project, bug, BUGGY_LINES_SUFFIX)) 63 | buggy_lines = get_buggy_lines(bug_file) 64 | accuracy = len(reweighted_lines.intersection(buggy_lines)) / len(buggy_lines) 65 | accuracies.append(accuracy) 66 | except: 67 | continue 68 | return sum(accuracies) / len(accuracies) 69 | 70 | 71 | if __name__ == '__main__': 72 | parser = argparse.ArgumentParser() 73 | parser.add_argument('-d', '--input-dir', required=True, help='Path to input directory') 74 | parser.add_argument('-b', '--bug-dir', required=True, help='Path to directory containing bugs') 75 | parser.add_argument('-n', '--top-n', required=True, type=int, help='Top-n accuracy') 76 | parser.add_argument('-f', '--formula', required=False, help='Supply a formula to check against') 77 | 78 | args = parser.parse_args() 79 | 80 | if args.formula is None: 81 | accuracy = calculate_accuracy(args.input_dir, args.bug_dir, args.top_n) 82 | print(accuracy) 83 | else: 84 | accuracy = calculate_accuracy_formula(args.input_dir, args.bug_dir, args.formula, args.top_n) 85 | print(accuracy) 86 | -------------------------------------------------------------------------------- /extract_date_developer.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import csv 4 | import operator 5 | import os 6 | import re 7 | from subprocess import call 8 | 9 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 10 | 11 | PROJECT_BUGS = [ 12 | [str(x) for x in range(1, 134)], 13 | [str(x) for x in range(1, 66)], 14 | [str(x) for x in range(1, 27)], 15 | [str(x) for x in range(1, 107)], 16 | [str(x) for x in range(1, 39)], 17 | [str(x) for x in range(1, 28)] 18 | ] 19 | 20 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'] 21 | FORMULA = ['tarantula'] 22 | 23 | 24 | def find_author_date(input_file, output_file, project, bug, formula, commit_id): 25 | """ 26 | find the author and date of the last update for every suspiciouss line 27 | 28 | Parameters 29 | ---------- 30 | input_file : str (sorted suspicousness lines file) 31 | output_file: str (sorted suspiciousness lines file with date field and author field appended) 32 | project: str (project name) 33 | bug: str (bug id) 34 | formula: str (fault localization technique) 35 | 36 | """ 37 | 38 | # reading the suspiciousness values from the input file 39 | input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file 40 | sorted_susp_lines = read_susp_lines_from_file(input_file) 41 | 42 | # output file 43 | output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file 44 | 45 | # Running git checkout buggy_version 46 | checkout_project_git(project, bug) 47 | 48 | git_blame_output = f"/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/Git_blame_output/git_blame_{project}_{bug}" 49 | 50 | line_counter = 0 51 | prev_file_name = "" 52 | git_blame_lines = None 53 | 54 | for susp_line in sorted_susp_lines: 55 | file_name_full, line_number = susp_line[0].split("#") 56 | line_number = int(line_number) 57 | file_name = extract_file_name_from_path(file_name_full) 58 | 59 | if prev_file_name != file_name: 60 | checkout_project_git_using_tag(project, bug) 61 | git_blame_lines = extract_git_blame_lines(file_name, git_blame_output) 62 | prev_file_name = file_name 63 | 64 | # BUG FIX 65 | if line_number > len(git_blame_lines): 66 | print(" ########## ERROR ########### ") 67 | print(f"Line number {line_number} from the suspiciousness file is not present in Git_blame_output_file") 68 | print(f"Line number to be searched: {line_number} ; Number of lines in the Git_blame_output: {len(git_blame_lines)}") 69 | 70 | add_author_date_to_file(output_file, susp_line, "NOT_FOUND", "NOT_FOUND") 71 | 72 | print(f"Not collecting date and author for the project: {project} and bug: {bug}") 73 | return 74 | 75 | blame_line = git_blame_lines[line_number] # picking the line 76 | 77 | # find the author and date 78 | author, date = extract_author_date(blame_line[0]) 79 | 80 | ## To handle the "defects4j" author 81 | if author == "defects4j": 82 | author, date = find_correct_author_date(blame_line[0], commit_id, file_name, git_blame_output) 83 | checkout_project_git_using_tag(project, bug) 84 | 85 | # adding to the output file 86 | add_author_date_to_file(output_file, susp_line, author, date) 87 | 88 | line_counter += 1 89 | 90 | # print(f"=================LINE : {line_counter}==========\n\n\n") 91 | 92 | 93 | def extract_file_name_from_path(file_name): 94 | """ 95 | take the file name from the path 96 | 97 | Parameters: 98 | ----------- 99 | file_name: str 100 | """ 101 | return file_name.split("/")[-1] 102 | 103 | 104 | def add_author_date_to_file(output_file, susp_line, author, date): 105 | """ 106 | appends the author and date to the output file containing suspiciousness lines 107 | 108 | Paramaeters: 109 | ------------ 110 | output_file: str 111 | susp_line: str 112 | author: str 113 | date: str 114 | 115 | """ 116 | susp_line = ", ".join(susp_line) 117 | with open(output_file, mode="a", encoding="utf-8") as myFile: 118 | myFile.write(f"{susp_line}, {author}, {date}\n") 119 | 120 | 121 | 122 | def read_susp_lines_from_file(input_file): 123 | """ 124 | reads the suspiciousness lines data from the sorted suspiciousness file 125 | 126 | Parameters: 127 | ---------- 128 | input_file: str 129 | 130 | return: 131 | ------ 132 | sorted_susp_lines: list (2D) 133 | 134 | """ 135 | susp_data = csv.reader(open(input_file), delimiter=',') 136 | sorted_susp_lines_all = [susp_line for susp_line in susp_data] 137 | sorted_susp_lines_all = sorted_susp_lines_all[1:] # header line is not needed 138 | 139 | # pick top 100 lines 140 | sorted_susp_lines = [susp_line for susp_line in sorted_susp_lines_all] 141 | sorted_susp_lines = sorted_susp_lines[:500] 142 | 143 | return sorted_susp_lines 144 | 145 | 146 | def checkout_project_git_using_tag(project, bug): 147 | """ 148 | Checkout to the Defects4j buggy version 149 | 150 | Parameters: 151 | ---------- 152 | project: str 153 | bug: str 154 | """ 155 | commit_tag = f"D4J_{project}_{bug}_BUGGY_VERSION" 156 | os.system(f"git checkout {commit_tag}") 157 | 158 | 159 | def checkout_project_git(project, bug): 160 | """ 161 | checkout to the project using git commands 162 | 163 | Parameters: 164 | ---------- 165 | project: str 166 | bug: str 167 | """ 168 | 169 | checkout_directory = f"/tmp/{project}_{bug}_buggy_ver" 170 | command_git_checkout = f"defects4j checkout -p {project} -v {bug}b -w {checkout_directory}" 171 | os.system(command_git_checkout) 172 | os.chdir(checkout_directory) 173 | 174 | 175 | def extract_author_date(line): 176 | """ 177 | parses the line and extract the author and date 178 | 179 | Parameters: 180 | ---------- 181 | line: str (line from the git output file) 182 | 183 | return: 184 | ------- 185 | author: author of the line 186 | date: date of change 187 | """ 188 | 189 | name_start_index = line.find("(") + 1 190 | end_index = line.find(")") - 1 191 | 192 | author = "" 193 | i = name_start_index 194 | while not (line[i].isdigit() and line[i+1].isdigit() and line[i+2].isdigit() and line[i+3].isdigit()) : 195 | author += line[i] 196 | i += 1 197 | 198 | author = author.strip() # removing sorrounding white spaces 199 | date = "".join([line[indx] for indx in range(i, i+10)]) 200 | 201 | return author, date 202 | 203 | 204 | def find_correct_author_date(line, commit_id, file_name, git_blame_output): 205 | """ 206 | Checkout the original buggy version of the project and collect the 207 | author and data for the lines 208 | 209 | Parameters: 210 | ---------- 211 | line: str 212 | commit_id: str 213 | file_name: str 214 | git_blame_output: str 215 | """ 216 | os.system(f"git checkout {commit_id}") 217 | file_path = find_file_path(file_name) 218 | git_blame_output += "_orig" 219 | os.system(f"git blame {file_path} > {git_blame_output}") 220 | git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n') 221 | git_blame_list = list(git_blame_data) 222 | git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))} 223 | source_code = extract_src_code_from_line(line) 224 | orig_line_id, orig_line = find_line_in_orig_buggy_version(git_blame_lines, source_code) 225 | 226 | if orig_line_id != -1: 227 | author, date = extract_author_date(orig_line) 228 | else: 229 | author = "defects4j" 230 | date = "2018-04-25" 231 | return author, date 232 | 233 | 234 | def find_line_in_orig_buggy_version(git_blame_lines, source_code): 235 | """ 236 | Search the line in the original buggy version and return found line id 237 | 238 | Parameters: 239 | ---------- 240 | git_blame_lines: str 241 | source_code: str 242 | """ 243 | for line_id, git_line in git_blame_lines.items(): 244 | if git_line[0].find(source_code) != -1: 245 | return line_id, git_line[0] 246 | 247 | print("#### LINE NOT FOUND IN THE BUGGY VERSION") 248 | return -1, "NOT-FOUND" 249 | 250 | 251 | def extract_src_code_from_line(line): 252 | """ 253 | Extract the source code text from the git blame line 254 | 255 | Parameters: 256 | ---------- 257 | line: str 258 | """ 259 | 260 | loc1 = line.find(")") 261 | i = loc1 262 | source_code = "" 263 | 264 | while i < len(line): 265 | source_code += line[i] 266 | i += 1 267 | 268 | return source_code 269 | 270 | 271 | def find_file_path(file_name): 272 | """ 273 | Find the full path of the file 274 | 275 | Parameters: 276 | ---------- 277 | file_name: str 278 | """ 279 | find_command = f"find -name {file_name}" 280 | file_path = None 281 | os.system(f"{find_command} > find_output.txt") 282 | with open("find_output.txt") as file: 283 | file_path = file.readline() 284 | 285 | file_path = file_path.strip("\n") 286 | return file_path 287 | 288 | 289 | def extract_git_blame_lines(file_name, git_blame_output): 290 | """ 291 | run git blame on the given file and return the git blame lines 292 | Parameters: 293 | ---------- 294 | file_name: str (file name passed to git blame command) 295 | git_blame_output : str (output file where git blame command output is dumped) 296 | 297 | """ 298 | file_path = find_file_path(file_name) 299 | os.system(f"git blame {file_path} > {git_blame_output}") 300 | git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n') 301 | git_blame_list = list(git_blame_data) 302 | git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))} 303 | 304 | return git_blame_lines 305 | 306 | 307 | def get_commit_ids(project): 308 | """ 309 | finds the commit data for all the bugs for the given defects4j project 310 | 311 | Parameters: 312 | ---------- 313 | project: str (name of the project) 314 | 315 | return: 316 | ------ 317 | commit_ids : dictionary (key:value pair is bug id:commit id) 318 | 319 | """ 320 | 321 | # Getting the commit data for each project 322 | path_to_commit_db = "/home/kanag23/Desktop/Defects4j_v2/defects4j/framework/projects/" 323 | commit_db_file = os.path.join(path_to_commit_db, project, "commit-db") 324 | commit_db_data = csv.reader(open(commit_db_file), delimiter=',') 325 | commit_db_list = list(commit_db_data) 326 | commit_ids = {commit_id_line[0]:commit_id_line[1] for commit_id_line in commit_db_list} 327 | 328 | return commit_ids 329 | 330 | 331 | if __name__ == '__main__': 332 | parser = argparse.ArgumentParser() 333 | parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory') 334 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory') 335 | 336 | args = parser.parse_args() 337 | 338 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 339 | commit_ids = get_commit_ids(project) 340 | for bug in bugs: 341 | for formula in FORMULA: 342 | input_csv = f"{project}-{bug}-{formula}-sorted-susp" 343 | output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date" 344 | find_author_date(os.path.join(args.suspiciousness_data_dir, input_csv), 345 | os.path.join(args.output_dir, output_csv), project, bug, formula, commit_ids[bug]) 346 | -------------------------------------------------------------------------------- /extract_date_developer_new.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import csv 4 | import operator 5 | import os 6 | import re 7 | from subprocess import call 8 | 9 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 10 | 11 | PROJECT_BUGS = [ 12 | [str(x) for x in range(1, 134)], 13 | [str(x) for x in range(1, 66)], 14 | [str(x) for x in range(1, 27)], 15 | [str(x) for x in range(1, 107)], 16 | [str(x) for x in range(1, 39)], 17 | [str(x) for x in range(1, 28)] 18 | ] 19 | 20 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'] 21 | FORMULA = ['tarantula'] 22 | 23 | 24 | def find_author_date(input_file, output_file, project, bug, formula, commit_id): 25 | """ 26 | find the author and date of the last update for every suspiciouss line 27 | 28 | Parameters 29 | ---------- 30 | input_file : str (sorted suspicousness lines file) 31 | output_file: str (sorted suspiciousness lines file with date field and author field appended) 32 | project: str (project name) 33 | bug: str (bug id) 34 | formula: str (fault localization technique) 35 | 36 | """ 37 | 38 | # reading the suspiciousness values from the input file 39 | input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file 40 | sorted_susp_lines = read_susp_lines_from_file(input_file) 41 | 42 | # output file 43 | output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file 44 | 45 | # Running git checkout buggy_version 46 | checkout_project_git(project, bug) 47 | 48 | git_blame_output = f"/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/Git_blame_output/git_blame_{project}_{bug}" 49 | 50 | line_counter = 0 51 | prev_file_name = "" 52 | git_blame_lines = None 53 | 54 | for susp_line in sorted_susp_lines: 55 | file_name_full, line_number = susp_line[0].split("#") 56 | line_number = int(line_number) 57 | file_name = extract_file_name_from_path(file_name_full) 58 | 59 | if prev_file_name != file_name: 60 | checkout_project_git_using_tag(project, bug) 61 | git_blame_lines = extract_git_blame_lines(file_name, file_name_full, git_blame_output) 62 | prev_file_name = file_name 63 | 64 | # BUG FIX 65 | if line_number > len(git_blame_lines): 66 | print(" ########## ERROR ########### ") 67 | print(f"Line number {line_number} from the suspiciousness file is not present in Git_blame_output_file") 68 | print(f"Line number to be searched: {line_number} ; Number of lines in the Git_blame_output: {len(git_blame_lines)}") 69 | 70 | add_author_date_to_file(output_file, susp_line, "NOT_FOUND", "NOT_FOUND") 71 | 72 | print(f"Not collecting date and author for the project: {project} and bug: {bug}") 73 | return 74 | 75 | blame_line = git_blame_lines[line_number] # picking the line 76 | 77 | # find the author and date 78 | author, date = extract_author_date(blame_line[0]) 79 | 80 | ## To handle the "defects4j" author 81 | if author == "defects4j": 82 | author, date = find_correct_author_date(blame_line[0], commit_id, file_name, file_name_full, git_blame_output) 83 | checkout_project_git_using_tag(project, bug) 84 | 85 | # adding to the output file 86 | add_author_date_to_file(output_file, susp_line, author, date) 87 | 88 | line_counter += 1 89 | 90 | # print(f"=================LINE : {line_counter}==========\n\n\n") 91 | 92 | 93 | def extract_file_name_from_path(file_name): 94 | """ 95 | take the file name from the path 96 | 97 | Parameters: 98 | ----------- 99 | file_name: str 100 | """ 101 | return file_name.split("/")[-1] 102 | 103 | 104 | def add_author_date_to_file(output_file, susp_line, author, date): 105 | """ 106 | appends the author and date to the output file containing suspiciousness lines 107 | 108 | Paramaeters: 109 | ------------ 110 | output_file: str 111 | susp_line: str 112 | author: str 113 | date: str 114 | 115 | """ 116 | susp_line = ", ".join(susp_line) 117 | with open(output_file, mode="a", encoding="utf-8") as myFile: 118 | myFile.write(f"{susp_line}, {author}, {date}\n") 119 | 120 | 121 | 122 | def read_susp_lines_from_file(input_file): 123 | """ 124 | reads the suspiciousness lines data from the sorted suspiciousness file 125 | 126 | Parameters: 127 | ---------- 128 | input_file: str 129 | 130 | return: 131 | ------ 132 | sorted_susp_lines: list (2D) 133 | 134 | """ 135 | susp_data = csv.reader(open(input_file), delimiter=',') 136 | sorted_susp_lines_all = [susp_line for susp_line in susp_data] 137 | sorted_susp_lines_all = sorted_susp_lines_all[1:] # header line is not needed 138 | 139 | # pick top 100 lines 140 | sorted_susp_lines = [susp_line for susp_line in sorted_susp_lines_all] 141 | # sorted_susp_lines = sorted_susp_lines[:500] 142 | 143 | return sorted_susp_lines 144 | 145 | 146 | def checkout_project_git_using_tag(project, bug): 147 | """ 148 | Checkout to the Defects4j buggy version 149 | 150 | Parameters: 151 | ---------- 152 | project: str 153 | bug: str 154 | """ 155 | commit_tag = f"D4J_{project}_{bug}_BUGGY_VERSION" 156 | os.system(f"git checkout {commit_tag}") 157 | 158 | 159 | def checkout_project_git(project, bug): 160 | """ 161 | checkout to the project using git commands 162 | 163 | Parameters: 164 | ---------- 165 | project: str 166 | bug: str 167 | """ 168 | 169 | checkout_directory = f"/tmp/{project}_{bug}_buggy_ver" 170 | command_git_checkout = f"defects4j checkout -p {project} -v {bug}b -w {checkout_directory}" 171 | os.system(command_git_checkout) 172 | os.chdir(checkout_directory) 173 | 174 | 175 | def extract_author_date(line): 176 | """ 177 | parses the line and extract the author and date 178 | 179 | Parameters: 180 | ---------- 181 | line: str (line from the git output file) 182 | 183 | return: 184 | ------- 185 | author: author of the line 186 | date: date of change 187 | """ 188 | 189 | name_start_index = line.find("(") + 1 190 | end_index = line.find(")") - 1 191 | 192 | author = "" 193 | i = name_start_index 194 | while not (line[i].isdigit() and line[i+1].isdigit() and line[i+2].isdigit() and line[i+3].isdigit()) : 195 | author += line[i] 196 | i += 1 197 | 198 | author = author.strip() # removing sorrounding white spaces 199 | date = "".join([line[indx] for indx in range(i, i+10)]) 200 | 201 | return author, date 202 | 203 | 204 | def find_correct_author_date(line, commit_id, file_name, file_name_full, git_blame_output): 205 | """ 206 | Checkout the original buggy version of the project and collect the 207 | author and data for the lines 208 | 209 | Parameters: 210 | ---------- 211 | line: str 212 | commit_id: str 213 | file_name: str 214 | git_blame_output: str 215 | """ 216 | os.system(f"git checkout {commit_id}") 217 | file_path = find_file_path(file_name, file_name_full) 218 | git_blame_output += "_orig" 219 | os.system(f"git blame {file_path} > {git_blame_output}") 220 | git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n') 221 | git_blame_list = list(git_blame_data) 222 | git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))} 223 | source_code = extract_src_code_from_line(line) 224 | orig_line_id, orig_line = find_line_in_orig_buggy_version(git_blame_lines, source_code) 225 | 226 | if orig_line_id != -1: 227 | author, date = extract_author_date(orig_line) 228 | else: 229 | author = "defects4j" 230 | date = "2018-04-25" 231 | return author, date 232 | 233 | 234 | def find_line_in_orig_buggy_version(git_blame_lines, source_code): 235 | """ 236 | Search the line in the original buggy version and return found line id 237 | 238 | Parameters: 239 | ---------- 240 | git_blame_lines: str 241 | source_code: str 242 | """ 243 | for line_id, git_line in git_blame_lines.items(): 244 | if git_line[0].find(source_code) != -1: 245 | return line_id, git_line[0] 246 | 247 | print("#### LINE NOT FOUND IN THE BUGGY VERSION") 248 | return -1, "NOT-FOUND" 249 | 250 | 251 | def extract_src_code_from_line(line): 252 | """ 253 | Extract the source code text from the git blame line 254 | 255 | Parameters: 256 | ---------- 257 | line: str 258 | """ 259 | 260 | loc1 = line.find(")") 261 | i = loc1 262 | source_code = "" 263 | 264 | while i < len(line): 265 | source_code += line[i] 266 | i += 1 267 | 268 | return source_code 269 | 270 | 271 | def find_file_path(file_name, susp_file_path): 272 | """ 273 | Find the full path of the file 274 | 275 | Parameters: 276 | ---------- 277 | file_name: str 278 | susp_file_path: str 279 | """ 280 | find_command = f"find -name {file_name}" 281 | os.system(f"{find_command} > find_output.txt") 282 | with open("find_output.txt") as file: 283 | file_paths = file.readlines() 284 | 285 | if len(file_paths) == 1: 286 | return file_paths[0].strip("\n") 287 | 288 | for file_path in file_paths: 289 | if susp_file_path in file_path: 290 | return file_path.strip("\n") 291 | 292 | 293 | # def find_file_path(file_name): 294 | # """ 295 | # Find the full path of the file 296 | 297 | # Parameters: 298 | # ---------- 299 | # file_name: str 300 | # """ 301 | # find_command = f"find -name {file_name}" 302 | # file_path = None 303 | # os.system(f"{find_command} > find_output.txt") 304 | # with open("find_output.txt") as file: 305 | # file_path = file.readline() 306 | 307 | # file_path = file_path.strip("\n") 308 | # return file_path 309 | 310 | 311 | def extract_git_blame_lines(file_name, susp_file_path, git_blame_output): 312 | """ 313 | run git blame on the given file and return the git blame lines 314 | Parameters: 315 | ---------- 316 | file_name: str (file name passed to git blame command) 317 | susp_file_path: str (file path in the suspiciousness output) 318 | git_blame_output : str (output file where git blame command output is dumped) 319 | 320 | """ 321 | file_path = find_file_path(file_name, susp_file_path) 322 | os.system(f"git blame {file_path} > {git_blame_output}") 323 | git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n') 324 | git_blame_list = list(git_blame_data) 325 | git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))} 326 | 327 | return git_blame_lines 328 | 329 | 330 | def get_commit_ids(project): 331 | """ 332 | finds the commit data for all the bugs for the given defects4j project 333 | 334 | Parameters: 335 | ---------- 336 | project: str (name of the project) 337 | 338 | return: 339 | ------ 340 | commit_ids : dictionary (key:value pair is bug id:commit id) 341 | 342 | """ 343 | 344 | # Getting the commit data for each project 345 | path_to_commit_db = "/home/kanag23/Desktop/Defects4j_v2/defects4j/framework/projects/" 346 | commit_db_file = os.path.join(path_to_commit_db, project, "commit-db") 347 | commit_db_data = csv.reader(open(commit_db_file), delimiter=',') 348 | commit_db_list = list(commit_db_data) 349 | commit_ids = {commit_id_line[0]:commit_id_line[1] for commit_id_line in commit_db_list} 350 | 351 | return commit_ids 352 | 353 | 354 | if __name__ == '__main__': 355 | parser = argparse.ArgumentParser() 356 | parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory') 357 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory') 358 | 359 | args = parser.parse_args() 360 | 361 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 362 | commit_ids = get_commit_ids(project) 363 | for bug in bugs: 364 | for formula in FORMULA: 365 | input_csv = f"{project}-{bug}-{formula}-sorted-susp" 366 | output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date" 367 | find_author_date(os.path.join(args.suspiciousness_data_dir, input_csv), 368 | os.path.join(args.output_dir, output_csv), project, bug, formula, commit_ids[bug]) 369 | -------------------------------------------------------------------------------- /faulty_lines_ml.py: -------------------------------------------------------------------------------- 1 | from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids 2 | from sklearn.ensemble import RandomForestClassifier 3 | from sklearn.linear_model import LogisticRegression 4 | from sklearn.svm import LinearSVC 5 | from sklearn.model_selection import StratifiedKFold 6 | from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score 7 | from collections import defaultdict 8 | 9 | import csv 10 | import numpy as np 11 | 12 | 13 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 14 | BUGS = [ 15 | [str(x) for x in range(1, 134)], 16 | [str(x) for x in range(1, 66)], 17 | [str(x) for x in range(1, 27)], 18 | [str(x) for x in range(1, 107)], 19 | [str(x) for x in range(1, 39)], 20 | [str(x) for x in range(1, 28)] 21 | ] 22 | FORMULAE = ['barinel', 'jaccard', 'opt2', 'tarantula', 'ochiai', 'dstar2', 'muse'] 23 | 24 | 25 | dataset = defaultdict(lambda: defaultdict(lambda: defaultdict(list))) 26 | filename = '/Users/ashish/code/cs5704/software-engineering/fault-localization.cs.washington.edu/' \ 27 | 'susp/%s-%s-%s-line-suspiciousness' 28 | 29 | for project, bugs in zip(PROJECTS, BUGS): 30 | for bug in bugs: 31 | for formula in FORMULAE: 32 | file = filename % (project, bug, formula) 33 | with open(file) as freader: 34 | csvreader = csv.DictReader(freader) 35 | for row in csvreader: 36 | dataset[project][bug][row['Line']].append(row['Suspiciousness']) 37 | 38 | 39 | buggy_lines = '/Users/ashish/code/cs5704/software-engineering/fault-localization-data/analysis/pipeline-scripts/' \ 40 | 'buggy-lines/%s-%s.buggy.lines' 41 | faults = defaultdict(lambda: defaultdict(list)) 42 | 43 | for project, bugs in zip(PROJECTS, BUGS): 44 | for bug in bugs: 45 | buggy_lines_file = buggy_lines % (project, bug) 46 | with open(buggy_lines_file) as freader: 47 | try: 48 | for line in freader: 49 | faulty_line = '#'.join(line.split('#')[:2]) 50 | faults[project][bug].append(faulty_line) 51 | except: 52 | pass 53 | 54 | recency = '/Users/ashish/code/cs5704/recency/%s-%s-tarantula-sorted-susp-with-recency' 55 | for project, bugs in zip(PROJECTS, BUGS): 56 | for bug in bugs: 57 | recency_file = recency % (project, bug) 58 | try: 59 | with open(recency_file) as freader: 60 | for line in freader: 61 | split = line.split(',') 62 | code_line = split[0].strip() 63 | recency_value = split[-1].strip() 64 | if code_line in dataset[project][bug]: 65 | dataset[project][bug][code_line].append(recency_value) 66 | except: 67 | pass 68 | 69 | 70 | for project, bugs in zip(PROJECTS, BUGS): 71 | for bug in bugs: 72 | for line in dataset[project][bug]: 73 | if line in faults[project][bug]: 74 | dataset[project][bug][line].append('1') 75 | else: 76 | dataset[project][bug][line].append('0') 77 | 78 | 79 | delete_list = [] 80 | for project, bugs in zip(PROJECTS, BUGS): 81 | for bug in bugs: 82 | for line in dataset[project][bug]: 83 | if len(dataset[project][bug][line]) != 9: 84 | delete_list.append((project, bug, line)) 85 | 86 | for item in delete_list: 87 | del dataset[item[0]][item[1]][item[2]] 88 | 89 | data = [] 90 | for project, bugs in zip(PROJECTS, BUGS): 91 | for bug in bugs: 92 | for line in dataset[project][bug]: 93 | data.append(dataset[project][bug][line]) 94 | 95 | data = np.array(data, dtype=np.float32) 96 | X = data[:, :8] 97 | y = data[:, 8] 98 | 99 | rus = RandomUnderSampler(random_state=42) 100 | X_res, y_res = rus.fit_sample(X, y) 101 | 102 | skf = StratifiedKFold(n_splits=5, random_state=42) 103 | 104 | score_rfc = [] 105 | score_svm = [] 106 | score_lr = [] 107 | 108 | precision_rfc = [] 109 | precision_svm = [] 110 | precision_lr = [] 111 | 112 | recall_rfc = [] 113 | recall_svm = [] 114 | recall_lr = [] 115 | 116 | f1_rfc = [] 117 | f1_svm = [] 118 | f1_lr = [] 119 | 120 | for train_index, val_index in skf.split(X_res, y_res): 121 | X_train, X_test = X_res[train_index], X_res[val_index] 122 | y_train, y_test = y_res[train_index], y_res[val_index] 123 | 124 | # Random Forest Classifier 125 | rfc = RandomForestClassifier(random_state=42) 126 | rfc.fit(X_train, y_train) 127 | y_pred = rfc.predict(X_test) 128 | score_rfc.append(accuracy_score(y_test, y_pred)) 129 | precision_rfc.append(precision_score(y_test, y_pred)) 130 | recall_rfc.append(recall_score(y_test, y_pred)) 131 | f1_rfc.append(f1_score(y_test, y_pred)) 132 | print("Random Forest Accuracy: %s" % accuracy_score(y_pred, y_test)) 133 | 134 | # Random Forest Classifier 135 | lr = LogisticRegression(random_state=42) 136 | lr.fit(X_train, y_train) 137 | y_pred = lr.predict(X_test) 138 | score_lr.append(accuracy_score(y_pred, y_test)) 139 | precision_lr.append(precision_score(y_test, y_pred)) 140 | recall_lr.append(recall_score(y_test, y_pred)) 141 | f1_lr.append(f1_score(y_test, y_pred)) 142 | print("Logistic Regression Accuracy: %s" % accuracy_score(y_pred, y_test)) 143 | 144 | # Random Forest Classifier 145 | svm = LinearSVC(random_state=42) 146 | svm.fit(X_train, y_train) 147 | y_pred = svm.predict(X_test) 148 | score_svm.append(accuracy_score(y_pred, y_test)) 149 | precision_svm.append(precision_score(y_test, y_pred)) 150 | recall_svm.append(recall_score(y_test, y_pred)) 151 | f1_svm.append(f1_score(y_test, y_pred)) 152 | print("SVM Accuracy: %s" % accuracy_score(y_pred, y_test)) 153 | print() 154 | 155 | print("******** Random Forest ********") 156 | print("Accuracy : %s" % np.average(score_rfc)) 157 | print("Precision : %s" % np.average(precision_rfc)) 158 | print("Recall : %s" % np.average(recall_rfc)) 159 | print("F1-Score : %s" % np.average(f1_rfc)) 160 | print("*******************************\n") 161 | 162 | print("******** SVM ********") 163 | print("Accuracy : %s" % np.average(score_svm)) 164 | print("Precision : %s" % np.average(precision_svm)) 165 | print("Recall : %s" % np.average(recall_svm)) 166 | print("F1-Score : %s" % np.average(f1_svm)) 167 | print("*********************\n") 168 | 169 | print("******** Logistic Regression ********") 170 | print("Accuracy : %s" % np.average(score_lr)) 171 | print("Precision : %s" % np.average(precision_lr)) 172 | print("Recall : %s" % np.average(recall_lr)) 173 | print("F1-Score : %s" % np.average(f1_lr)) 174 | print("*************************************\n") 175 | 176 | 177 | -------------------------------------------------------------------------------- /get_developer_names.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | 3 | from pygit2 import Repository 4 | from pygit2 import GIT_SORT_TOPOLOGICAL 5 | 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 7 | PROJECT_BUGS = [ 8 | [str(x) for x in range(1, 134)], 9 | [str(x) for x in range(1, 66)], 10 | [str(x) for x in range(1, 27)], 11 | [str(x) for x in range(1, 107)], 12 | [str(x) for x in range(1, 39)], 13 | [str(x) for x in range(1, 28)] 14 | ] 15 | 16 | developer_details = [] 17 | 18 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 19 | for bug in bugs: 20 | defects4j_checkout_command = ['defects4j', 'checkout', '-p', project, '-v', '%sf' % bug, '-w', 21 | '/Users/ashish/code/cs5704/temp/%s_%s' % (project, bug)] 22 | subprocess.call(defects4j_checkout_command) 23 | 24 | repo_dir = '/Users/ashish/code/cs5704/temp/%s_%s/.git' % (project, bug) 25 | repo = Repository(repo_dir) 26 | 27 | author_name = None 28 | for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL): 29 | if commit.author.name != 'defects4j': 30 | author_name = commit.author.name 31 | break 32 | 33 | if author_name is not None: 34 | developer_details.append('%s,%s,%s\n' % (project, bug, author_name)) 35 | 36 | 37 | with open('developers.csv', 'w') as fwriter: 38 | for row in developer_details: 39 | fwriter.write(row) 40 | -------------------------------------------------------------------------------- /matrix_coverage.py: -------------------------------------------------------------------------------- 1 | import tarfile 2 | import os 3 | import argparse 4 | 5 | 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 7 | PROJECT_BUGS = [ 8 | [str(x) for x in range(1, 134)], 9 | [str(x) for x in range(1, 66)], 10 | [str(x) for x in range(1, 27)], 11 | [str(x) for x in range(1, 107)], 12 | [str(x) for x in range(1, 39)], 13 | [str(x) for x in range(1, 28)] 14 | ] 15 | TAR_FILE = 'gzoltar-files.tar.gz' 16 | 17 | 18 | def extract_files(data_dir, output_dir): 19 | """ 20 | Extract tar gz files from data directory to get coverage and spectra files 21 | 22 | Parameters 23 | ---------- 24 | data_dir : str 25 | the main data directory which holds all projects 26 | 27 | output_dir : str 28 | the output directory to write coverage and spectra files to 29 | this directory must exist 30 | 31 | Returns 32 | ------- 33 | None 34 | """ 35 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 36 | for bug in bugs: 37 | tar = os.path.join(data_dir, project, bug, TAR_FILE) 38 | coverage, spectra = extract_tar_file(tar) 39 | if coverage is None or spectra is None: 40 | print('Could not get coverage/spectra for %s' % tar) 41 | else: 42 | coverage_file = os.path.join(output_dir, '%s-%s-%s' % (project, bug, 'coverage')) 43 | spectra_file = os.path.join(output_dir, '%s-%s-%s' % (project, bug, 'spectra')) 44 | write_file(coverage_file, coverage) 45 | write_file(spectra_file, spectra) 46 | 47 | 48 | def extract_tar_file(tar): 49 | """ 50 | Extract a tar file and return the matrix and spectra content 51 | 52 | Parameters 53 | ---------- 54 | tar : str 55 | the path to the tarfile 56 | 57 | Returns 58 | tuple(str, str) 59 | the contents of the coverage matrix and spectra 60 | """ 61 | coverage, spectra = None, None 62 | tar = tarfile.open(tar) 63 | for member in tar.getmembers(): 64 | if 'matrix' in member.get_info()['name']: 65 | f = tar.extractfile(member) 66 | if f is not None: 67 | coverage = f.read() 68 | if 'spectra' in member.get_info()['name']: 69 | f = tar.extractfile(member) 70 | if f is not None: 71 | spectra = f.read() 72 | return coverage, spectra 73 | 74 | 75 | def write_file(filename, content): 76 | """ 77 | Write content to the given filename 78 | 79 | Parameters 80 | ---------- 81 | filename : str 82 | content : str 83 | """ 84 | with open(filename, 'wb') as fwriter: 85 | fwriter.write(content) 86 | 87 | 88 | if __name__ == '__main__': 89 | parser = argparse.ArgumentParser() 90 | parser.add_argument('--data-dir', required=True, help='data directory that holds data') 91 | parser.add_argument('--output-dir', required=True, help='file to write coverage and spectra matrices to') 92 | 93 | args = parser.parse_args() 94 | 95 | extract_files(args.data_dir, args.output_dir) 96 | -------------------------------------------------------------------------------- /reweight.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import csv 3 | import os 4 | 5 | 6 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'} 7 | 8 | 9 | def get_lines(data, num_lines): 10 | """ 11 | Get the line numbers from the dataset 12 | 13 | Parameters 14 | ---------- 15 | data : OrderedDict 16 | an OrderedDict containing information about lines present in a bug 17 | num_lines : int 18 | number of lines to consider for each bug 19 | 20 | Returns 21 | ------- 22 | list : an array of lines 23 | """ 24 | lines = [] 25 | for i in range(1, num_lines+1): 26 | lines.append(data['line_%s' % i]) 27 | return lines 28 | 29 | 30 | def multiplicative_reweighting(data, num_lines): 31 | """ 32 | Simple reweight multiplies all the values of suspiciousness across all formulae 33 | 34 | Parameters 35 | ---------- 36 | data : OrderedDict 37 | each row of the csv file represents an OrderedDict called data 38 | num_lines : int 39 | number of lines to consider for each bug 40 | 41 | Returns 42 | ------- 43 | list : a reweighted suspiciousness value for each line in the bug 44 | """ 45 | hybrid_suspiciousness = [1000 for _ in range(num_lines)] 46 | for i in range(1, num_lines+1): 47 | for formula in FORMULA: 48 | key = 'line_%s_%s' % (i, formula) 49 | if data[key] is None: 50 | return None 51 | hybrid_suspiciousness[i-1] *= float(data[key]) 52 | return hybrid_suspiciousness 53 | 54 | 55 | def reweight_dataset(dataset, num_lines, output_dir): 56 | """ 57 | Reweight a dataset based on the multiplicative principle 58 | 59 | Parameters 60 | ---------- 61 | dataset : str 62 | a csv file containing the dataset 63 | num_lines : int 64 | number of lines to consider for each bug 65 | output_dir : str 66 | output directory for storing the csv files 67 | """ 68 | with open(dataset) as freader: 69 | csvreader = csv.DictReader(freader) 70 | for row in csvreader: 71 | suspiciousness = multiplicative_reweighting(row, num_lines) 72 | if suspiciousness is None: 73 | continue 74 | lines = get_lines(row, num_lines) 75 | output = os.path.join(output_dir, 'reweighted-%s-%s.csv' % (row['project'], row['bug'])) 76 | with open(output, 'w') as fwriter: 77 | fwriter.write('Line,Suspiciousness\n') 78 | for line, susp in zip(lines, suspiciousness): 79 | fwriter.write('%s,%s\n' % (line, susp)) 80 | 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('-d', '--dataset', required=True, help='path to the dataset csv') 85 | parser.add_argument('-n', '--num-lines', required=True, type=int, help='number of lines for each bug') 86 | parser.add_argument('-o', '--output-dir', required=True, 87 | help='output directory to write reweighted suspiciousness to') 88 | 89 | args = parser.parse_args() 90 | 91 | reweight_dataset(args.dataset, args.num_lines, args.output_dir) 92 | -------------------------------------------------------------------------------- /s2l_suspiciousness.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import csv 3 | import os 4 | 5 | 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 7 | PROJECT_BUGS = [ 8 | [str(x) for x in range(1, 134)], 9 | [str(x) for x in range(1, 66)], 10 | [str(x) for x in range(1, 27)], 11 | [str(x) for x in range(1, 107)], 12 | [str(x) for x in range(1, 39)], 13 | [str(x) for x in range(1, 28)] 14 | ] 15 | SOURCE_CODE_SUFFIX = '.source-code.lines' 16 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'} 17 | 18 | 19 | def classname_to_filename(classname): 20 | """ 21 | Convert classname to filename 22 | 23 | Parameters 24 | ---------- 25 | classname : str 26 | the value from of the classname from suspiciouss spectra file 27 | 28 | Returns 29 | ------- 30 | str 31 | filename 32 | """ 33 | if '$' in classname: 34 | classname = classname[:classname.find('$')] 35 | return classname.replace('.', '/') + '.java' 36 | 37 | 38 | def stmt_to_line(statement): 39 | """ 40 | Convert statement to line 41 | 42 | Parameters 43 | ---------- 44 | statement : str 45 | the statement number along with the classname 46 | 47 | Returns 48 | ------- 49 | str 50 | line number in file 51 | """ 52 | classname, line_number = statement.rsplit('#', 1) 53 | return '{}#{}'.format(classname_to_filename(classname), line_number) 54 | 55 | 56 | def convert_statement_to_line(source_code_lines_file, statement_suspiciousness, output_file): 57 | source_code = dict() 58 | with open(source_code_lines_file) as f: 59 | for line in f: 60 | line = line.strip() 61 | entry = line.split(':') 62 | key = entry[0] 63 | if key in source_code: 64 | source_code[key].append(entry[1]) 65 | else: 66 | source_code[key] = [] 67 | source_code[key].append(entry[1]) 68 | 69 | with open(statement_suspiciousness) as fin: 70 | reader = csv.DictReader(fin) 71 | with open(output_file, 'w') as f: 72 | writer = csv.DictWriter(f, ['Line','Suspiciousness']) 73 | writer.writeheader() 74 | for row in reader: 75 | line = stmt_to_line(row['Statement']) 76 | susps = row['Suspiciousness'] 77 | 78 | writer.writerow({ 79 | 'Line': line, 80 | 'Suspiciousness': susps}) 81 | 82 | # check whether there are any sub-lines 83 | if line in source_code: 84 | for additional_line in source_code[line]: 85 | writer.writerow({'Line': additional_line, 'Suspiciousness': susps}) 86 | f.close() 87 | fin.close() 88 | 89 | 90 | if __name__ == '__main__': 91 | parser = argparse.ArgumentParser() 92 | parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory') 93 | parser.add_argument('-s', '--source-code-lines', required=True, help='Source code lines directory') 94 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory') 95 | parser.add_argument('-f', '--formula', choices=FORMULA, required=False, help='Formula to convert for') 96 | 97 | args = parser.parse_args() 98 | 99 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 100 | for bug in bugs: 101 | source_code_lines_file = os.path.join(args.source_code_lines, 102 | '%s-%sb%s' % (project, bug, SOURCE_CODE_SUFFIX)) 103 | if args.formula == None: 104 | for key in FORMULA: 105 | statement_suspiciousness_file = os.path.join(args.suspiciousness_data_dir, 106 | '%s-%s-%s-suspiciousness' % (project, bug, key)) 107 | output_file = os.path.join(args.output_dir, '%s-%s-%s-line-suspiciousness' % (project, bug, key)) 108 | convert_statement_to_line(source_code_lines_file, statement_suspiciousness_file, output_file) 109 | else: 110 | statement_suspiciousness_file = os.path.join(args.suspiciousness_data_dir, 111 | '%s-%s-%s-suspiciousness' % (project, bug, args.formula)) 112 | output_file = os.path.join(args.output_dir, '%s-%s-%s-line-suspiciousness' % (project, bug, args.formula)) 113 | convert_statement_to_line(source_code_lines_file, statement_suspiciousness_file, output_file) 114 | -------------------------------------------------------------------------------- /sort_csv.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import sys 3 | import csv 4 | import operator 5 | import os 6 | 7 | 8 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time'] 9 | PROJECT_BUGS = [ 10 | [str(x) for x in range(1, 134)], 11 | [str(x) for x in range(1, 66)], 12 | [str(x) for x in range(1, 27)], 13 | [str(x) for x in range(1, 107)], 14 | [str(x) for x in range(1, 39)], 15 | [str(x) for x in range(1, 28)] 16 | ] 17 | FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'] 18 | 19 | 20 | def sort(input_csv, output_csv, column=1): 21 | """ 22 | Sort a csv based on a column 23 | 24 | Parameters 25 | ---------- 26 | input_csv : str 27 | input csv file 28 | output_csv : str 29 | output csv file 30 | column : int 31 | the column number to sort the input csv file on 32 | """ 33 | data = csv.reader(open(input_csv),delimiter=',') 34 | sortedlist = sorted(data, key=operator.itemgetter(column)) 35 | sortedlist.reverse() 36 | 37 | with open(output_csv, 'w') as f: 38 | fileWriter = csv.writer(f, delimiter=',') 39 | for row in sortedlist: 40 | fileWriter.writerow(row) 41 | 42 | 43 | if __name__ == '__main__': 44 | parser = argparse.ArgumentParser() 45 | parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory') 46 | parser.add_argument('-o', '--output-dir', required=True, help='Output directory') 47 | 48 | args = parser.parse_args() 49 | 50 | for project, bugs in zip(PROJECTS, PROJECT_BUGS): 51 | for bug in bugs: 52 | for formula in FORMULA: 53 | input_csv = '%s-%s-%s-line-suspiciousness' % (project, bug, formula) 54 | output_csv = '%s-%s-%s-sorted-susp' % (project, bug, formula) 55 | sort(os.path.join(args.suspiciousness_data_dir, input_csv), 56 | os.path.join(args.output_dir, output_csv)) 57 | -------------------------------------------------------------------------------- /suspiciousness.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | from glob import glob 4 | 5 | import collections 6 | import re 7 | import argparse 8 | import csv 9 | import os 10 | import sys 11 | 12 | 13 | def eprint(*args, **kwargs): 14 | """ 15 | Print to stderr 16 | """ 17 | print(*args, file=sys.stderr, **kwargs) 18 | 19 | 20 | def tarantula(passed, failed, totalpassed, totalfailed): 21 | """ 22 | Calculates the Tarantula suspiciousness score for the line 23 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 24 | in file analysis/pipeline-scripts/crush-matrix 25 | 26 | Parameters 27 | ---------- 28 | passed : int 29 | number of test cases passed with the line 30 | failed : int 31 | number of test cases failed with the line 32 | totalpassed : int 33 | total number of test cases passed 34 | totalfailed : int 35 | total number of test cases failed 36 | 37 | Returns 38 | ------- 39 | float 40 | the Tarantula suspiciousness value 41 | """ 42 | if totalpassed == 0 or totalfailed == 0 or (passed+failed == 0): 43 | return 0 44 | return (failed/totalfailed)/(failed/totalfailed + passed/totalpassed) 45 | 46 | 47 | def tarantula_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 48 | """ 49 | Calculates the Tarantula suspiciousness score for the line 50 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 51 | in file analysis/pipeline-scripts/crush-matrix 52 | 53 | Parameters 54 | ---------- 55 | passed : int 56 | number of test cases passed with the line 57 | failed : int 58 | number of test cases failed with the line 59 | totalpassed : int 60 | total number of test cases passed 61 | totalfailed : int 62 | total number of test cases failed 63 | was_covered : bool 64 | 65 | Returns 66 | ------- 67 | float 68 | the Tarantula hybrid suspiciousness value 69 | """ 70 | if totalpassed == 0 or totalfailed == 0 or (passed+failed == 0): 71 | return 0 72 | return (passed/totalpassed + (1 if was_covered else 0)/(passed/totalpassed + failed/totalfailed)) 73 | 74 | 75 | def ochiai(passed, failed, totalpassed, totalfailed): 76 | """ 77 | Calculates the Ochiai suspiciousness score for the line 78 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 79 | in file analysis/pipeline-scripts/crush-matrix 80 | 81 | Parameters 82 | ---------- 83 | passed : int 84 | number of test cases passed with the line 85 | failed : int 86 | number of test cases failed with the line 87 | totalpassed : int 88 | total number of test cases passed 89 | totalfailed : int 90 | total number of test cases failed 91 | 92 | Returns 93 | ------- 94 | float 95 | the Ochiai hybrid suspiciousness value 96 | """ 97 | if totalfailed == 0 or (passed+failed == 0): 98 | return 0 99 | return failed/(totalfailed*(failed+passed))**0.5 100 | 101 | 102 | def ochiai_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 103 | """ 104 | Calculates the Ochiai suspiciousness score for the line 105 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 106 | in file analysis/pipeline-scripts/crush-matrix 107 | 108 | Parameters 109 | ---------- 110 | passed : int 111 | number of test cases passed with the line 112 | failed : int 113 | number of test cases failed with the line 114 | totalpassed : int 115 | total number of test cases passed 116 | totalfailed : int 117 | total number of test cases failed 118 | was_covered : bool 119 | 120 | Returns 121 | ------- 122 | float 123 | the Ochiai hybrid suspiciousness value 124 | """ 125 | if totalfailed == 0 or (passed+failed == 0): 126 | return 0 127 | return (failed + (1 if was_covered else 0))/(totalfailed*(failed+passed))**0.5 128 | 129 | 130 | def opt2(passed, failed, totalpassed, totalfailed): 131 | """ 132 | Calculates the Opt2 suspiciousness score for the line 133 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 134 | in file analysis/pipeline-scripts/crush-matrix 135 | 136 | Parameters 137 | ---------- 138 | passed : int 139 | number of test cases passed with the line 140 | failed : int 141 | number of test cases failed with the line 142 | totalpassed : int 143 | total number of test cases passed 144 | totalfailed : int 145 | total number of test cases failed 146 | 147 | Returns 148 | ------- 149 | float 150 | the Opt2 hybrid suspiciousness value 151 | """ 152 | return failed - passed/(totalpassed+1) 153 | 154 | 155 | def opt2_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 156 | """ 157 | Calculates the Opt2 suspiciousness score for the line 158 | 159 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 160 | in file analysis/pipeline-scripts/crush-matrix 161 | 162 | Parameters 163 | ---------- 164 | passed : int 165 | number of test cases passed with the line 166 | failed : int 167 | number of test cases failed with the line 168 | totalpassed : int 169 | total number of test cases passed 170 | totalfailed : int 171 | total number of test cases failed 172 | was_covered : bool 173 | 174 | Returns 175 | ------- 176 | float 177 | the Opt2 hybrid suspiciousness value 178 | """ 179 | return failed - (passed-(1 if was_covered else 0))/(totalpassed+1) 180 | 181 | 182 | def barinel(passed, failed, totalpassed, totalfailed): 183 | """ 184 | Calculates the barinel suspiciousness score for the line 185 | 186 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 187 | in file analysis/pipeline-scripts/crush-matrix 188 | 189 | Parameters 190 | ---------- 191 | passed : int 192 | number of test cases passed with the line 193 | failed : int 194 | number of test cases failed with the line 195 | totalpassed : int 196 | total number of test cases passed 197 | totalfailed : int 198 | total number of test cases failed 199 | 200 | Returns 201 | ------- 202 | float 203 | the barinel hybrid suspiciousness value 204 | """ 205 | if failed == 0: 206 | return 0 207 | h = passed/(passed+failed) 208 | return 1-h 209 | 210 | 211 | def barinel_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 212 | """ 213 | Calculates the Barinel suspiciousness score for the line 214 | 215 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 216 | in file analysis/pipeline-scripts/crush-matrix 217 | 218 | Parameters 219 | ---------- 220 | passed : int 221 | number of test cases passed with the line 222 | failed : int 223 | number of test cases failed with the line 224 | totalpassed : int 225 | total number of test cases passed 226 | totalfailed : int 227 | total number of test cases failed 228 | was_covered : bool 229 | 230 | Returns 231 | ------- 232 | float 233 | the Barinel hybrid suspiciousness value 234 | """ 235 | if failed == 0: 236 | return 0 237 | h = (passed - (1 if was_covered else 0))/(passed+failed) 238 | return 1-h 239 | 240 | def dstar2(passed, failed, totalpassed, totalfailed): 241 | """ 242 | Calculates the dstar2 suspiciousness score for the line 243 | 244 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 245 | in file analysis/pipeline-scripts/crush-matrix 246 | 247 | Parameters 248 | ---------- 249 | passed : int 250 | number of test cases passed with the line 251 | failed : int 252 | number of test cases failed with the line 253 | totalpassed : int 254 | total number of test cases passed 255 | totalfailed : int 256 | total number of test cases failed 257 | 258 | Returns 259 | ------- 260 | float 261 | the dstar2 hybrid suspiciousness value 262 | """ 263 | if passed + totalfailed - failed == 0: 264 | assert passed==0 and failed==totalfailed 265 | return totalfailed**2 + 1 # slightly higher than otherwise possible 266 | return failed**2 / (passed + totalfailed - failed) 267 | 268 | 269 | def dstar2_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 270 | """ 271 | Calculates the dstar2 hybrid suspiciousness score for the line 272 | 273 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 274 | in file analysis/pipeline-scripts/crush-matrix 275 | 276 | Parameters 277 | ---------- 278 | passed : int 279 | number of test cases passed with the line 280 | failed : int 281 | number of test cases failed with the line 282 | totalpassed : int 283 | total number of test cases passed 284 | totalfailed : int 285 | total number of test cases failed 286 | was_covered : bool 287 | 288 | Returns 289 | ------- 290 | float 291 | the dstar2 hybrid suspiciousness value 292 | """ 293 | if passed + totalfailed - failed == 0: 294 | assert passed==0 and failed==totalfailed 295 | return totalfailed**2 + 2 # slightly higher than otherwise possible 296 | return (failed**2 + (1 if was_covered else 0)) / (passed + totalfailed - failed) 297 | 298 | 299 | def muse(passed, failed, totalpassed, totalfailed): 300 | """ 301 | Calculates the dstar2 suspiciousness score for the line 302 | 303 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 304 | in file analysis/pipeline-scripts/crush-matrix 305 | 306 | Parameters 307 | ---------- 308 | passed : int 309 | number of test cases passed with the line 310 | failed : int 311 | number of test cases failed with the line 312 | totalpassed : int 313 | total number of test cases passed 314 | totalfailed : int 315 | total number of test cases failed 316 | Returns 317 | ------- 318 | float 319 | the dstar2 hybrid suspiciousness value 320 | """ 321 | if totalpassed == 0: 322 | return 0 323 | return failed - totalfailed/totalpassed * passed 324 | 325 | 326 | def muse_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 327 | """ 328 | Calculates the dstar2 suspiciousness score for the line 329 | 330 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 331 | in file analysis/pipeline-scripts/crush-matrix 332 | 333 | Parameters 334 | ---------- 335 | passed : int 336 | number of test cases passed with the line 337 | failed : int 338 | number of test cases failed with the line 339 | totalpassed : int 340 | total number of test cases passed 341 | totalfailed : int 342 | total number of test cases failed 343 | was_covered : bool 344 | 345 | Returns 346 | ------- 347 | float 348 | the dstar2 hybrid suspiciousness value 349 | """ 350 | if totalpassed == 0: 351 | return 0 352 | return failed - (totalfailed-(1 if was_covered else 0))/totalpassed * passed 353 | 354 | 355 | def jaccard(passed, failed, totalpassed, totalfailed): 356 | """ 357 | Calculates the dstar2 suspiciousness score for the line 358 | 359 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 360 | in file analysis/pipeline-scripts/crush-matrix 361 | 362 | Parameters 363 | ---------- 364 | passed : int 365 | number of test cases passed with the line 366 | failed : int 367 | number of test cases failed with the line 368 | totalpassed : int 369 | total number of test cases passed 370 | totalfailed : int 371 | total number of test cases failed 372 | 373 | Returns 374 | ------- 375 | float 376 | the dstar2 hybrid suspiciousness value 377 | """ 378 | if totalfailed + passed == 0: 379 | return failed 380 | return failed / (totalfailed + passed) 381 | 382 | 383 | def jaccard_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered): 384 | """ 385 | Calculates the dstar2 suspiciousness score for the line 386 | 387 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 388 | in file analysis/pipeline-scripts/crush-matrix 389 | 390 | Parameters 391 | ---------- 392 | passed : int 393 | number of test cases passed with the line 394 | failed : int 395 | number of test cases failed with the line 396 | totalpassed : int 397 | total number of test cases passed 398 | totalfailed : int 399 | total number of test cases failed 400 | was_covered : bool 401 | 402 | Returns 403 | ------- 404 | float 405 | the dstar2 hybrid suspiciousness value 406 | """ 407 | if totalfailed + passed == 0: 408 | return (failed + (1 if was_covered else 0)) 409 | return (failed + (1 if was_covered else 0)) / (totalfailed + passed) 410 | 411 | 412 | # Formula to function 413 | FORMULAS = { 414 | 'tarantula': tarantula, 415 | 'ochiai': ochiai, 416 | 'opt2': opt2, 417 | 'barinel': barinel, 418 | 'dstar2': dstar2, 419 | 'muse': muse, 420 | 'jaccard': jaccard 421 | } 422 | 423 | # Hybrid formula to function 424 | HYBRID_NUMERATOR_FORMULAS = { 425 | 'tarantula': tarantula_hybrid_numerator, 426 | 'ochiai': ochiai_hybrid_numerator, 427 | 'opt2': opt2_hybrid_numerator, 428 | 'barinel': barinel_hybrid_numerator, 429 | 'dstar2': dstar2_hybrid_numerator, 430 | 'muse': muse_hybrid_numerator, 431 | 'jaccard': jaccard_hybrid_numerator 432 | } 433 | 434 | 435 | def crush_row(formula, hybrid_scheme, passed, failed, totalpassed, totalfailed, 436 | passed_covered=None, failed_covered=None, totalpassed_covered=0, 437 | totalfailed_covered=0): 438 | """ 439 | Calculates the suspiciousness of a statement or mutant based on the forumula 440 | given 441 | 442 | Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/ 443 | in file analysis/pipeline-scripts/crush-matrix 444 | 445 | Parameters 446 | ---------- 447 | formula : str 448 | the name of the formula to plug passed/failed/totalpassed/totalfailed into. 449 | hybrid_scheme : str 450 | one of ("numerator", "constant", or "mirror") 451 | useful for MBFL only, specifies how the formula should be modified to 452 | incorporate the number of passing/failing tests that *cover* the mutant 453 | (rather than kill it). 454 | """ 455 | if hybrid_scheme is None: 456 | return FORMULAS[formula](passed, failed, totalpassed, totalfailed) 457 | elif hybrid_scheme == 'numerator': 458 | return HYBRID_NUMERATOR_FORMULAS[formula](passed, failed, totalpassed, totalfailed, failed_covered > 0) 459 | elif hybrid_scheme == 'constant': 460 | return FORMULAS[formula](passed, failed, totalpassed, totalfailed) + (1 if failed_covered > 0 else 0) 461 | elif hybrid_scheme == 'mirror': 462 | return (FORMULAS[formula](passed, failed, totalpassed, totalfailed) + 463 | FORMULAS[formula](passed_covered, failed_covered, totalpassed_covered, totalfailed_covered))/2. 464 | elif hybrid_scheme == 'coverage-only': 465 | return FORMULAS[formula](passed_covered, failed_covered, totalpassed_covered, totalfailed_covered) 466 | raise ValueError('unrecognized hybrid scheme name: {!r}'.format(hybrid_scheme)) 467 | 468 | 469 | # PassFailTally is a container class for number of test cases passed, failed, and total counts 470 | PassFailTally = collections.namedtuple('PassFailTally', ('n_elements', 'passed', 'failed', 'totalpassed', 'totalfailed')) 471 | 472 | 473 | def suspiciousnesses_from_tallies(formula, hybrid_scheme, tally, hybrid_coverage_tally): 474 | """ 475 | Returns a dict mapping element-number to suspiciousness. 476 | 477 | Parameters 478 | ---------- 479 | formula : str 480 | the formula to use to calculate suspiciousness 481 | hybrid_scheme : str 482 | numerator, constant, or mirror 483 | tally : PassFailTally 484 | an object of the PassFailTally class 485 | hybrid_coverage_tally : PassFailTally 486 | an object of the PassFailTally class 487 | 488 | Returns 489 | ------- 490 | dict 491 | a dictionary of element-number to suspiciousness value 492 | """ 493 | if hybrid_coverage_tally is None: 494 | passed_covered = failed_covered = collections.defaultdict(lambda: None) 495 | totalpassed_covered = totalfailed_covered = 0 496 | else: 497 | passed_covered = hybrid_coverage_tally.passed 498 | failed_covered = hybrid_coverage_tally.failed 499 | totalpassed_covered = hybrid_coverage_tally.totalpassed 500 | totalfailed_covered = hybrid_coverage_tally.totalfailed 501 | return { 502 | element: crush_row( 503 | formula=formula, hybrid_scheme=hybrid_scheme, 504 | passed=tally.passed[element], failed=tally.failed[element], 505 | totalpassed=tally.totalpassed, totalfailed=tally.totalfailed, 506 | passed_covered=passed_covered[element], failed_covered=failed_covered[element], 507 | totalpassed_covered=totalpassed_covered, totalfailed_covered=totalfailed_covered) 508 | for element in range(tally.n_elements) 509 | } 510 | 511 | 512 | # TestSummary is a container class for test summary information 513 | TestSummary = collections.namedtuple('TestSummary', ('triggering', 'covered_elements')) 514 | 515 | 516 | def parse_test_summary(line, n_elements): 517 | words = line.strip().split(' ') 518 | coverages, sign = words[:-1], words[-1] 519 | if len(coverages) != n_elements: 520 | raise ValueError("expected {expected} elements in each row, got {actual} in {line!r}".format(expected=n_elements, actual=len(coverages), line=line)) 521 | return TestSummary( 522 | triggering=(sign == '-'), 523 | covered_elements=set(i for i in range(len(words)) if words[i]=='1')) 524 | 525 | 526 | def tally_matrix(matrix_file, total_defn, n_elements): 527 | """ 528 | Returns a PassFailTally describing how many passing/failing tests there are, 529 | and how many of each cover each code element. 530 | 531 | Parameters 532 | ---------- 533 | 534 | total_defn : str 535 | may be "tests" (in which case the tally's ``totalpassed`` will be the 536 | number of passing tests) or "elements" (in which case it'll be the 537 | number of times a passing test covers a code element) 538 | (and same for ``totalfailed``). 539 | n_elements : str 540 | is the number of code elements that each row of the matrix 541 | should indicate coverage for. 542 | 543 | Returns 544 | ------- 545 | namedtuple 546 | 547 | """ 548 | summaries = (parse_test_summary(line, n_elements) for line in matrix_file) 549 | 550 | passed = {i: 0 for i in range(n_elements)} 551 | failed = {i: 0 for i in range(n_elements)} 552 | totalpassed = 0 553 | totalfailed = 0 554 | for summary in summaries: 555 | if summary.triggering: 556 | totalfailed += (1 if total_defn == 'tests' else len(summary.covered_elements)) 557 | for element_number in summary.covered_elements: 558 | failed[element_number] += 1 559 | else: 560 | totalpassed += (1 if total_defn == 'tests' else len(summary.covered_elements)) 561 | for element_number in summary.covered_elements: 562 | passed[element_number] += 1 563 | 564 | return PassFailTally(n_elements, passed, failed, totalpassed, totalfailed) 565 | 566 | 567 | def parse_bug_from_file_name(coverage_file): 568 | """ 569 | Parse bug from file name by looking 570 | 571 | Parameters 572 | ---------- 573 | coverage_file : str 574 | the path to the coverage file 575 | 576 | Returns 577 | ------- 578 | str 579 | returns the project and bug eg Closure-11 580 | """ 581 | return '-'.join(coverage_file.split('/')[-1].split('-')[:-1]) 582 | 583 | 584 | def generate_suspiciousness(formula, coverage_file, spectra_file, output_dir): 585 | """ 586 | Generates the suspiciousness value for each bug in the dataset. 587 | 588 | The method reads the coverage file and the spectra file, and uses 589 | the method given in the fault-localization-data to generate a value 590 | between 0 and 1. 591 | 592 | Parameters 593 | ---------- 594 | formula : str 595 | which formula to use for computing suspiciousness 596 | coverage_file : str 597 | path to the coverage file 598 | spectra_file : str 599 | path to the spectra file 600 | output_dir : str 601 | 602 | 603 | Returns 604 | ------- 605 | dict 606 | suspiciousness value for all lines 607 | """ 608 | with open(spectra_file) as name_file: 609 | element_names = {i: name.strip() for i, name in enumerate(name_file)} 610 | 611 | n_elements = len(element_names) 612 | 613 | with open(coverage_file) as matrix_file: 614 | tally = tally_matrix(matrix_file, 'tests', n_elements=n_elements) 615 | 616 | suspiciousnesses = suspiciousnesses_from_tallies( 617 | formula=formula, hybrid_scheme=None, 618 | tally=tally, hybrid_coverage_tally=None) 619 | 620 | bug = parse_bug_from_file_name(coverage_file) 621 | 622 | with open(os.path.join(output_dir, '%s-%s-suspiciousness' % (bug, formula)), 'w') as output_file: 623 | writer = csv.DictWriter(output_file, ['Statement','Suspiciousness']) 624 | writer.writeheader() 625 | for element in range(n_elements): 626 | writer.writerow({ 627 | 'Statement': element_names[element], 628 | 'Suspiciousness': suspiciousnesses[element]}) 629 | 630 | return suspiciousnesses 631 | 632 | 633 | if __name__ == '__main__': 634 | # Slight hack to allow for 'all' as a choice 635 | formula_choices = set(FORMULAS.keys()) 636 | formula_choices.add('all') 637 | 638 | parser = argparse.ArgumentParser() 639 | parser.add_argument('--formula', required=True, choices=formula_choices, help='formula to use for suspiciousness calculation') 640 | parser.add_argument('--data-dir', required=True, help='data directory that holds coverage and spectra files') 641 | parser.add_argument('--output-dir', required=True, help='file to write suspiciousness vector to') 642 | 643 | args = parser.parse_args() 644 | 645 | coverage_files = sorted(glob(os.path.join(args.data_dir, '*coverage'))) 646 | spectra_files = sorted(glob(os.path.join(args.data_dir, '*spectra'))) 647 | 648 | if len(coverage_files) != len(spectra_files): 649 | eprint('Number of coverage files is not equal to the number of spectra files') 650 | sys.exit(-1) 651 | 652 | if args.formula == 'all': 653 | for formula in FORMULAS.keys(): 654 | for coverage_file, spectra_file in zip(coverage_files, spectra_files): 655 | generate_suspiciousness(formula, coverage_file, spectra_file, args.output_dir) 656 | else: 657 | for coverage_file, spectra_file in zip(coverage_files, spectra_files): 658 | generate_suspiciousness(args.formula, coverage_file, spectra_file, args.output_dir) --------------------------------------------------------------------------------