├── .gitignore
├── README.md
├── compute_recency_metric.py
├── create_dataset.py
├── evaluate.py
├── extract_date_developer.py
├── extract_date_developer_new.py
├── faulty_lines_ml.py
├── get_developer_names.py
├── matrix_coverage.py
├── reweight.py
├── s2l_suspiciousness.py
├── sort_csv.py
└── suspiciousness.py


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | *.egg-info/
 24 | .installed.cfg
 25 | *.egg
 26 | MANIFEST
 27 | 
 28 | # PyInstaller
 29 | #  Usually these files are written by a python script from a template
 30 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 31 | *.manifest
 32 | *.spec
 33 | 
 34 | # Installer logs
 35 | pip-log.txt
 36 | pip-delete-this-directory.txt
 37 | 
 38 | # Unit test / coverage reports
 39 | htmlcov/
 40 | .tox/
 41 | .coverage
 42 | .coverage.*
 43 | .cache
 44 | nosetests.xml
 45 | coverage.xml
 46 | *.cover
 47 | .hypothesis/
 48 | .pytest_cache/
 49 | 
 50 | # Translations
 51 | *.mo
 52 | *.pot
 53 | 
 54 | # Django stuff:
 55 | *.log
 56 | local_settings.py
 57 | db.sqlite3
 58 | 
 59 | # Flask stuff:
 60 | instance/
 61 | .webassets-cache
 62 | 
 63 | # Scrapy stuff:
 64 | .scrapy
 65 | 
 66 | # Sphinx documentation
 67 | docs/_build/
 68 | 
 69 | # PyBuilder
 70 | target/
 71 | 
 72 | # Jupyter Notebook
 73 | .ipynb_checkpoints
 74 | 
 75 | # pyenv
 76 | .python-version
 77 | 
 78 | # celery beat schedule file
 79 | celerybeat-schedule
 80 | 
 81 | # SageMath parsed files
 82 | *.sage.py
 83 | 
 84 | # Environments
 85 | .env
 86 | .venv
 87 | env/
 88 | venv/
 89 | ENV/
 90 | env.bak/
 91 | venv.bak/
 92 | 
 93 | # Spyder project settings
 94 | .spyderproject
 95 | .spyproject
 96 | 
 97 | # Rope project settings
 98 | .ropeproject
 99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 
106 | # Idea
107 | .idea
108 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Fault Localization from Defects4J
  2 | 
  3 | Fault localization is a technique to help identify buggy lines from source code based on failing test cases. We develop a hybrid technique based on different kinds of formulas that generate statement suspiciousness. This project is part of the CS5704: Software Engineering (Spring 2018).
  4 | 
  5 | These are:
  6 | 
  7 | * Tarantula
  8 | * Ochiai
  9 | * Opt2
 10 | * Barinel
 11 | * Dstar2
 12 | * Muse
 13 | * Jaccard
 14 | 
 15 | ## Dataset
 16 | 
 17 | We use the Defects4J dataset available [here](https://github.com/rjust/defects4j). The dataset has 395 real bugs from 6 Java projects - Lang, Closure, Chart, Mockito, Time, and Math.
 18 | 
 19 | Specifically, we use only Lang and Math for our research results.
 20 | 
 21 | ## Setup
 22 | 
 23 | Download the data by running the following command.
 24 | 
 25 | ```bash
 26 | wget --recursive --no-parent --accept gzoltar-files.tar.gz http://fault-localization.cs.washington.edu/data
 27 | ```
 28 | 
 29 | Subsequently extract the matrix and coverage to a directory using the `matrix_coverage.py` file.
 30 | 
 31 | ```text
 32 | usage: matrix_coverage.py [-h] --data-dir DATA_DIR --output-dir OUTPUT_DIR
 33 | 
 34 | optional arguments:
 35 |   -h, --help            show this help message and exit
 36 |   --data-dir DATA_DIR   data directory that holds data
 37 |   --output-dir OUTPUT_DIR
 38 |                         file to write coverage and spectra matrices to
 39 | ```
 40 | 
 41 | For example, run `matrix_coverage.py` as:
 42 | 
 43 | ```bash
 44 | mkdir $HOME/fault-localization.cs.washington.edu/coverage
 45 | python matrix_coverage.py --data-dir $HOME/fault-localization.cs.washington.edu/data --output-dir $HOME/fault-localization.cs.washington.edu/coverage
 46 | ```
 47 | 
 48 | The suspiciousness values can be generated using `suspiciousness.py` file.
 49 | 
 50 | ```text
 51 | usage: suspiciousness.py [-h] --formula
 52 |                          {muse,all,ochiai,tarantula,dstar2,jaccard,barinel,opt2}
 53 |                          --data-dir DATA_DIR --output-dir OUTPUT_DIR
 54 | 
 55 | optional arguments:
 56 |   -h, --help            show this help message and exit
 57 |   --formula {muse,all,ochiai,tarantula,dstar2,jaccard,barinel,opt2}
 58 |                         formula to use for suspiciousness calculation
 59 |   --data-dir DATA_DIR   data directory that holds coverage and spectra files
 60 |   --output-dir OUTPUT_DIR
 61 |                         file to write suspiciousness vector to
 62 | ```
 63 | 
 64 | Run `suspiciousness.py` as follows:
 65 | 
 66 | ```bash
 67 | mkdir $HOME/fault-localization.cs.washington.edu/suspiciousness
 68 | python suspiciousness.py --data-dir $HOME/fault-localization.cs.washington.edu/coverage --output-dir $HOME/fault-localization.cs.washington.edu/suspiciousness --formula all
 69 | ```
 70 | 
 71 | Subsequently, the statement suspiciousness values need to be converted to line suspiciousness values. This can be done using the `s2l_suspiciousness.py`. The usage is as follows:
 72 | 
 73 | ```text
 74 | usage: s2l_suspiciousness.py [-h] -d SUSPICIOUSNESS_DATA_DIR -s
 75 |                              SOURCE_CODE_LINES -o OUTPUT_DIR
 76 |                              [-f {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2}]
 77 | 
 78 | optional arguments:
 79 |   -h, --help            show this help message and exit
 80 |   -d SUSPICIOUSNESS_DATA_DIR, --suspiciousness-data-dir SUSPICIOUSNESS_DATA_DIR
 81 |                         Suspiciousness data directory
 82 |   -s SOURCE_CODE_LINES, --source-code-lines SOURCE_CODE_LINES
 83 |                         Source code lines directory
 84 |   -o OUTPUT_DIR, --output-dir OUTPUT_DIR
 85 |                         Output directory
 86 |   -f {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2}, --formula {jaccard,tarantula,muse,dstar2,ochiai,barinel,opt2}
 87 |                         Formula to convert for
 88 | ```
 89 | 
 90 | Finally, sort the resulting csv files using the `sort_csv.py` file.
 91 | 
 92 | ```text
 93 | usage: sort_csv.py [-h] -d SUSPICIOUSNESS_DATA_DIR -o OUTPUT_DIR
 94 | 
 95 | optional arguments:
 96 |   -h, --help            show this help message and exit
 97 |   -d SUSPICIOUSNESS_DATA_DIR, --suspiciousness-data-dir SUSPICIOUSNESS_DATA_DIR
 98 |                         Suspiciousness data directory
 99 |   -o OUTPUT_DIR, --output-dir OUTPUT_DIR
100 |                         Output directory
101 | ```
102 | 


--------------------------------------------------------------------------------
/compute_recency_metric.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import sys
  3 | import csv
  4 | import operator
  5 | import os
  6 | import re
  7 | import datetime
  8 | from subprocess import call
  9 | 
 10 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 11 | 
 12 | PROJECT_BUGS = [
 13 |     [str(x) for x in range(1, 134)],
 14 |     [str(x) for x in range(1, 66)],
 15 |     [str(x) for x in range(1, 27)],
 16 |     [str(x) for x in range(1, 107)],
 17 |     [str(x) for x in range(1, 39)],
 18 |     [str(x) for x in range(1, 28)]
 19 | ]
 20 | 
 21 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula']
 22 | FORMULA = ['tarantula']
 23 | 
 24 | 
 25 | def find_recency(input_file, output_file, project, bug, formula):
 26 |     """
 27 |     find the recency of the last update for every suspiciouss line
 28 | 
 29 |     Parameters
 30 |     ----------
 31 |     input_file : str (file contains sorted suspicousness lines with date)
 32 |     output_file: str (file contains sorted suspiciousness lines file with date, author and recency)
 33 |     project: str (project name)
 34 |     bug: str (bug id) 
 35 |     formula: str (fault localization technique)
 36 |     commit_id: str (commit id of the buggy vesion)
 37 | 
 38 |     """
 39 |    
 40 |     input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file
 41 |     sorted_susp_lines = read_susp_lines_from_file(input_file)
 42 |     
 43 |     # output file
 44 |     output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file
 45 | 
 46 |     dates = []
 47 | 
 48 |     for susp_line in sorted_susp_lines:
 49 |         date = susp_line[-1].strip()
 50 |         if date != "NOT_FOUND":
 51 |             datetime_obj = datetime.datetime.strptime(date, "%Y-%m-%d")
 52 |             bug_date = datetime_obj.date()
 53 |         else:
 54 |             return
 55 |         dates.append(bug_date)
 56 | 
 57 |     if max(dates) == datetime.date.today():
 58 |         first_max_date = max(dates)
 59 |         try:
 60 |             second_max_date = sorted(set(dates))[-2]
 61 |         except IndexError:
 62 |             return
 63 |         dates = [date if date != first_max_date else second_max_date  for date in dates]
 64 | 
 65 |     min_date = min(dates)
 66 |     max_date = max(dates)
 67 | 
 68 |     no_of_days_elapsed = []
 69 | 
 70 |     for date in dates:
 71 |         no_of_days_elapsed.append((max_date - date).days)
 72 | 
 73 |     min_days = min(no_of_days_elapsed)
 74 |     max_days = max(no_of_days_elapsed)
 75 |     diff_days = max_days - min_days
 76 | 
 77 |     line_counter = 0
 78 | 
 79 |     for susp_line in sorted_susp_lines:
 80 |         if diff_days == 0:
 81 |             print("Divide by Zero: Max and Min are same")
 82 |             continue
 83 |         normalized_time = (no_of_days_elapsed[line_counter] - min_days)/diff_days
 84 |         recency = 1 - normalized_time      
 85 |         add_recency_to_file(output_file, susp_line, recency)
 86 |         line_counter += 1
 87 |         
 88 | 
 89 | def add_recency_to_file(output_file, susp_line, recency):
 90 |     """
 91 |     appends the author and date to the output file containing suspiciousness lines
 92 |     
 93 |     Paramaeters:
 94 |     ------------
 95 |     output_file: str 
 96 |     susp_line: str
 97 |     recency: str
 98 | 
 99 |     """
100 |     susp_line = ", ".join(susp_line)
101 |     with open(output_file, mode="a", encoding="utf-8") as myFile:
102 |         myFile.write(f"{susp_line}, {recency}\n")
103 | 
104 | 
105 | def read_susp_lines_from_file(input_file):
106 |     """
107 |     reads the suspiciousness lines data from the sorted suspiciousness file
108 | 
109 |     Parameters:
110 |     ----------
111 |     input_file: str
112 | 
113 |     return:
114 |     ------
115 |     sorted_susp_lines: list (2D)
116 | 
117 |     """
118 |     susp_data = csv.reader(open(input_file), delimiter=',')
119 |     sorted_susp_lines = [susp_line for susp_line in susp_data]
120 |     
121 |     return sorted_susp_lines   
122 |         
123 | 
124 | if __name__ == '__main__':
125 |     parser = argparse.ArgumentParser()
126 |     parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory')
127 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory')
128 | 
129 |     args = parser.parse_args()
130 | 
131 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
132 |         for bug in bugs:
133 |             for formula in FORMULA:
134 |                 input_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date"
135 |                 output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-recency"
136 |                 find_recency(os.path.join(args.suspiciousness_data_dir, input_csv),
137 |                      os.path.join(args.output_dir, output_csv), project, bug, formula)
138 | 


--------------------------------------------------------------------------------
/create_dataset.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import csv
  3 | import os
  4 | 
  5 | 
  6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
  7 | PROJECT_BUGS = [
  8 |     [str(x) for x in range(1, 134)],
  9 |     [str(x) for x in range(1, 66)],
 10 |     [str(x) for x in range(1, 27)],
 11 |     [str(x) for x in range(1, 107)],
 12 |     [str(x) for x in range(1, 39)],
 13 |     [str(x) for x in range(1, 28)]
 14 | ]
 15 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'}
 16 | 
 17 | 
 18 | class Dataset(object):
 19 |     def __init__(self, formula, num_lines):
 20 |         self.base_formula = formula
 21 |         self.rows = {}
 22 |         self.num_lines = num_lines
 23 |         for project in PROJECTS:
 24 |             self.rows[project] = {}
 25 | 
 26 |     def to_csv(self, output_csv):
 27 |         """
 28 |         Write dataset to output csv file
 29 | 
 30 |         Parameters
 31 |         ----------
 32 |         output_csv : str
 33 |             Output file to write the dataset to
 34 |         """
 35 |         output = []
 36 |         columns = 'project,bug,'
 37 |         columns += ','.join(['line_%s' % (i+1) for i in range(self.num_lines)])
 38 |         for formula in FORMULA:
 39 |             for i in range(self.num_lines):
 40 |                 columns += ',line_%s_%s' % (i+1, formula)
 41 |         output.append(columns)
 42 |         for project in PROJECTS:
 43 |             for bug in self.rows[project]:
 44 |                 output.append(self.rows[project][bug].to_csv())
 45 |         with open(output_csv, 'w') as fwriter:
 46 |             fwriter.write('\n'.join(output))
 47 | 
 48 |     def __len__(self):
 49 |         return self.num_lines
 50 | 
 51 | 
 52 | class Row(object):
 53 |     """
 54 |     The row class holds information about a specific row. Mainly, it holds information about the project, bug_id,
 55 |     lines affected, suspiciousness scores for each line given by that formula
 56 |     """
 57 |     def __init__(self, project, bug_id):
 58 |         self.project = project
 59 |         self.bug_id = bug_id
 60 |         self.lines = []
 61 |         self.data = {}
 62 | 
 63 |         for formula in FORMULA:
 64 |             self.data[formula] = []
 65 | 
 66 |     def to_csv(self):
 67 |         lines_output = ','.join(self.lines)
 68 |         suspiciousness = []
 69 |         for formula in FORMULA:
 70 |             susp = ','.join([str(x) for x in self.data[formula]])
 71 |             suspiciousness.append(susp)
 72 |         return '%s,%s,%s,' % (self.project, self.bug_id, lines_output) + ','.join(suspiciousness)
 73 | 
 74 | 
 75 | def add_rows_for_formula(dataset, data_dir, formula):
 76 |     """
 77 |     Add rows for a particular formula to the dataset
 78 | 
 79 |     Parameters
 80 |     ----------
 81 |     dataset : Dataset
 82 |         an object of the type dataset with dataset.lines already populated
 83 |     data_dir : str
 84 |         the data directory for the suspiciousness files
 85 |     formula : str
 86 |         formula to add for
 87 | 
 88 |     Returns
 89 |     -------
 90 |     None
 91 |     """
 92 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
 93 |         for bug in bugs:
 94 |             data = {}
 95 |             input_file = os.path.join(data_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula))
 96 |             with open(input_file) as freader:
 97 |                 csvreader = csv.DictReader(freader)
 98 |                 for line in csvreader:
 99 |                     data[line['Line']] = float(line['Suspiciousness'])
100 |             for line in dataset.rows[project][bug].lines:
101 |                 if line in data:
102 |                     dataset.rows[project][bug].data[formula].append(data[line])
103 |                 else:
104 |                     dataset.rows[project][bug].data[formula].append(0.0)
105 | 
106 | 
107 | def create_dataset(data_dir, formula, num_lines):
108 |     """
109 |     Create a dataset object and add rows from the sorted suspiciousness value file to it
110 | 
111 |     Parameters
112 |     ----------
113 |     data_dir : str
114 |         the data directory for the suspiciousness files
115 |     formula : str
116 |         formula to use as the base
117 |     num_lines : str
118 |         number of lines to read from the csv
119 | 
120 |     Returns
121 |     -------
122 |     Dataset
123 |         a dataset object
124 |     """
125 |     dataset = Dataset(formula, num_lines)
126 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
127 |         for bug in bugs:
128 |             row = Row(project=project, bug_id=bug)
129 |             input_file = os.path.join(data_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula))
130 |             with open(input_file) as freader:
131 |                 csvreader = csv.DictReader(freader)
132 |                 for line in csvreader:
133 |                     row.lines.append(line['Line'])
134 |                     row.data[formula].append(float(line['Suspiciousness']))
135 |                     if len(row.lines) == num_lines:
136 |                         break
137 |             dataset.rows[project][bug] = row
138 |     return dataset
139 | 
140 | 
141 | if __name__ == '__main__':
142 |     parser = argparse.ArgumentParser()
143 |     parser.add_argument('-f', '--formula', required=True, choices=FORMULA,
144 |                         help='Base formula to use while creating the dataset')
145 |     parser.add_argument('-d', '--data-dir', required=True, help='Data directory with all sorted suspiciousness values')
146 |     parser.add_argument('-n', '--num-lines', required=True, type=int, help='Number of lines to consider')
147 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory to write dataset to')
148 | 
149 |     args = parser.parse_args()
150 | 
151 |     dataset = create_dataset(args.data_dir, args.formula, args.num_lines)
152 |     for formula in FORMULA:
153 |         if formula != args.formula:
154 |             add_rows_for_formula(dataset, args.data_dir, formula)
155 |     dataset.to_csv(os.path.join(args.output_dir, 'dataset-%s-%s.csv' % (args.formula, args.num_lines)))
156 | 


--------------------------------------------------------------------------------
/evaluate.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import csv
 3 | import os
 4 | 
 5 | 
 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 7 | PROJECT_BUGS = [
 8 |     [str(x) for x in range(1, 134)],
 9 |     [str(x) for x in range(1, 66)],
10 |     [str(x) for x in range(1, 27)],
11 |     [str(x) for x in range(1, 107)],
12 |     [str(x) for x in range(1, 39)],
13 |     [str(x) for x in range(1, 28)]
14 | ]
15 | BUGGY_LINES_SUFFIX = 'buggy.lines'
16 | 
17 | 
18 | def get_reweighted_lines(input_file, top_n):
19 |     lines = set()
20 |     index = 0
21 |     with open(input_file) as freader:
22 |         csvreader = csv.DictReader(freader)
23 |         for row in csvreader:
24 |             lines.add(row['Line'])
25 |             index += 1
26 |             if index >= top_n:
27 |                 break
28 |     return lines
29 | 
30 | 
31 | def get_buggy_lines(input_file):
32 |     buggy_lines = set()
33 |     with open(input_file) as freader:
34 |         for line in freader:
35 |             buggy_lines.add('#'.join(line.split('#')[0:2]))
36 |     return buggy_lines
37 | 
38 | 
39 | def calculate_accuracy(input_dir, bug_dir, top_n):
40 |     accuracies = []
41 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
42 |         for bug in bugs:
43 |             try:
44 |                 input_file = os.path.join(input_dir, '%s-%s-sorted-susp' % (project, bug))
45 |                 reweighted_lines = get_reweighted_lines(input_file, top_n)
46 |                 bug_file = os.path.join(bug_dir, '%s-%s.%s' % (project, bug, BUGGY_LINES_SUFFIX))
47 |                 buggy_lines = get_buggy_lines(bug_file)
48 |                 accuracy = len(reweighted_lines.intersection(buggy_lines)) / len(buggy_lines)
49 |                 accuracies.append(accuracy)
50 |             except:
51 |                 continue
52 |     return sum(accuracies) / len(accuracies)
53 | 
54 | 
55 | def calculate_accuracy_formula(input_dir, bug_dir, formula, top_n):
56 |     accuracies = []
57 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
58 |         for bug in bugs:
59 |             try:
60 |                 input_file = os.path.join(input_dir, '%s-%s-%s-sorted-susp' % (project, bug, formula))
61 |                 reweighted_lines = get_reweighted_lines(input_file, top_n)
62 |                 bug_file = os.path.join(bug_dir, '%s-%s.%s' % (project, bug, BUGGY_LINES_SUFFIX))
63 |                 buggy_lines = get_buggy_lines(bug_file)
64 |                 accuracy = len(reweighted_lines.intersection(buggy_lines)) / len(buggy_lines)
65 |                 accuracies.append(accuracy)
66 |             except:
67 |                 continue
68 |     return sum(accuracies) / len(accuracies)
69 | 
70 | 
71 | if __name__ == '__main__':
72 |     parser = argparse.ArgumentParser()
73 |     parser.add_argument('-d', '--input-dir', required=True, help='Path to input directory')
74 |     parser.add_argument('-b', '--bug-dir', required=True, help='Path to directory containing bugs')
75 |     parser.add_argument('-n', '--top-n', required=True, type=int, help='Top-n accuracy')
76 |     parser.add_argument('-f', '--formula', required=False, help='Supply a formula to check against')
77 | 
78 |     args = parser.parse_args()
79 | 
80 |     if args.formula is None:
81 |         accuracy = calculate_accuracy(args.input_dir, args.bug_dir, args.top_n)
82 |         print(accuracy)
83 |     else:
84 |         accuracy = calculate_accuracy_formula(args.input_dir, args.bug_dir, args.formula, args.top_n)
85 |         print(accuracy)
86 | 


--------------------------------------------------------------------------------
/extract_date_developer.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import sys
  3 | import csv
  4 | import operator
  5 | import os
  6 | import re
  7 | from subprocess import call
  8 | 
  9 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 10 | 
 11 | PROJECT_BUGS = [
 12 |     [str(x) for x in range(1, 134)],
 13 |     [str(x) for x in range(1, 66)],
 14 |     [str(x) for x in range(1, 27)],
 15 |     [str(x) for x in range(1, 107)],
 16 |     [str(x) for x in range(1, 39)],
 17 |     [str(x) for x in range(1, 28)]
 18 | ]
 19 | 
 20 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula']
 21 | FORMULA = ['tarantula']
 22 | 
 23 | 
 24 | def find_author_date(input_file, output_file, project, bug, formula, commit_id):
 25 |     """
 26 |     find the author and date of the last update for every suspiciouss line
 27 | 
 28 |     Parameters
 29 |     ----------
 30 |     input_file : str (sorted suspicousness lines file)
 31 |     output_file: str (sorted suspiciousness lines file with date field and author field appended)
 32 |     project: str (project name)
 33 |     bug: str (bug id) 
 34 |     formula: str (fault localization technique)
 35 | 
 36 |     """
 37 |     
 38 |     # reading the suspiciousness values from the input file 
 39 |     input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file
 40 |     sorted_susp_lines = read_susp_lines_from_file(input_file)
 41 |     
 42 |     # output file
 43 |     output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file
 44 | 
 45 |     # Running git checkout buggy_version
 46 |     checkout_project_git(project, bug)
 47 | 
 48 |     git_blame_output = f"/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/Git_blame_output/git_blame_{project}_{bug}"
 49 | 
 50 |     line_counter = 0
 51 |     prev_file_name = ""
 52 |     git_blame_lines = None
 53 | 
 54 |     for susp_line in sorted_susp_lines:
 55 |         file_name_full, line_number = susp_line[0].split("#")
 56 |         line_number = int(line_number)
 57 |         file_name = extract_file_name_from_path(file_name_full)
 58 |         
 59 |         if prev_file_name != file_name:
 60 |             checkout_project_git_using_tag(project, bug)
 61 |             git_blame_lines = extract_git_blame_lines(file_name, git_blame_output)
 62 |             prev_file_name = file_name
 63 | 
 64 |         # BUG FIX
 65 |         if line_number > len(git_blame_lines):
 66 |             print(" ########## ERROR ########### ")
 67 |             print(f"Line number {line_number} from the suspiciousness file is not present in Git_blame_output_file")
 68 |             print(f"Line number to be searched: {line_number} ; Number of lines in the Git_blame_output: {len(git_blame_lines)}")
 69 |             
 70 |             add_author_date_to_file(output_file, susp_line, "NOT_FOUND", "NOT_FOUND")
 71 | 
 72 |             print(f"Not collecting date and author for the project: {project} and bug: {bug}")
 73 |             return
 74 | 
 75 |         blame_line = git_blame_lines[line_number] # picking the line
 76 | 
 77 |         # find the author and date
 78 |         author, date = extract_author_date(blame_line[0])  
 79 | 
 80 |         ## To handle the "defects4j" author
 81 |         if author == "defects4j":
 82 |             author, date = find_correct_author_date(blame_line[0], commit_id, file_name, git_blame_output)
 83 |             checkout_project_git_using_tag(project, bug)
 84 | 
 85 |         # adding to the output file
 86 |         add_author_date_to_file(output_file, susp_line, author, date)
 87 |         
 88 |         line_counter += 1 
 89 | 
 90 |         # print(f"=================LINE : {line_counter}==========\n\n\n")
 91 | 
 92 | 
 93 | def extract_file_name_from_path(file_name):
 94 |     """
 95 |     take the file name from the path
 96 |  
 97 |     Parameters:
 98 |     -----------
 99 |     file_name: str
100 |     """
101 |     return file_name.split("/")[-1]
102 | 
103 | 
104 | def add_author_date_to_file(output_file, susp_line, author, date):
105 |     """
106 |     appends the author and date to the output file containing suspiciousness lines
107 |     
108 |     Paramaeters:
109 |     ------------
110 |     output_file: str 
111 |     susp_line: str
112 |     author: str
113 |     date: str
114 | 
115 |     """
116 |     susp_line = ", ".join(susp_line)
117 |     with open(output_file, mode="a", encoding="utf-8") as myFile:
118 |         myFile.write(f"{susp_line}, {author}, {date}\n")
119 | 
120 | 
121 | 
122 | def read_susp_lines_from_file(input_file):
123 |     """
124 |     reads the suspiciousness lines data from the sorted suspiciousness file
125 | 
126 |     Parameters:
127 |     ----------
128 |     input_file: str
129 | 
130 |     return:
131 |     ------
132 |     sorted_susp_lines: list (2D)
133 | 
134 |     """
135 |     susp_data = csv.reader(open(input_file), delimiter=',')
136 |     sorted_susp_lines_all = [susp_line for susp_line in susp_data]
137 |     sorted_susp_lines_all = sorted_susp_lines_all[1:] # header line is not needed
138 |     
139 |      # pick top 100 lines
140 |     sorted_susp_lines = [susp_line for susp_line in sorted_susp_lines_all]
141 |     sorted_susp_lines = sorted_susp_lines[:500]
142 |    
143 |     return  sorted_susp_lines   
144 | 
145 | 
146 | def checkout_project_git_using_tag(project, bug):
147 |     """ 
148 |     Checkout to the Defects4j buggy version
149 | 
150 |     Parameters:
151 |     ----------
152 |     project: str
153 |     bug: str
154 |     """
155 |     commit_tag = f"D4J_{project}_{bug}_BUGGY_VERSION"
156 |     os.system(f"git checkout {commit_tag}")
157 | 
158 | 
159 | def checkout_project_git(project, bug):
160 |     """
161 |     checkout to the project using git commands
162 | 
163 |     Parameters:
164 |     ----------
165 |     project: str
166 |     bug: str
167 |     """
168 | 
169 |     checkout_directory = f"/tmp/{project}_{bug}_buggy_ver"
170 |     command_git_checkout = f"defects4j checkout -p {project} -v {bug}b -w {checkout_directory}" 
171 |     os.system(command_git_checkout)
172 |     os.chdir(checkout_directory)
173 | 
174 | 
175 | def extract_author_date(line):
176 |     """
177 |     parses the line and extract the author and date
178 | 
179 |     Parameters:
180 |     ----------
181 |     line: str (line from the git output file)
182 | 
183 |     return:
184 |     -------
185 |     author: author of the line
186 |     date: date of change
187 |     """
188 | 
189 |     name_start_index = line.find("(") + 1
190 |     end_index = line.find(")") - 1
191 | 
192 |     author = ""
193 |     i = name_start_index
194 |     while not (line[i].isdigit() and line[i+1].isdigit() and line[i+2].isdigit() and line[i+3].isdigit()) :
195 |         author += line[i]
196 |         i += 1
197 | 
198 |     author = author.strip()     # removing sorrounding white spaces
199 |     date = "".join([line[indx] for indx in range(i, i+10)])
200 | 
201 |     return author, date
202 | 
203 | 
204 | def find_correct_author_date(line, commit_id, file_name, git_blame_output):
205 |     """
206 |     Checkout the original buggy version of the project and collect the 
207 |     author and data for the lines
208 | 
209 |     Parameters:
210 |     ----------
211 |     line: str
212 |     commit_id: str
213 |     file_name: str
214 |     git_blame_output: str
215 |     """
216 |     os.system(f"git checkout {commit_id}")
217 |     file_path = find_file_path(file_name)
218 |     git_blame_output += "_orig"
219 |     os.system(f"git blame {file_path} > {git_blame_output}")
220 |     git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n')
221 |     git_blame_list =  list(git_blame_data)
222 |     git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))}        
223 |     source_code = extract_src_code_from_line(line)
224 |     orig_line_id, orig_line = find_line_in_orig_buggy_version(git_blame_lines, source_code)
225 |  
226 |     if orig_line_id != -1:
227 |         author, date = extract_author_date(orig_line)
228 |     else:
229 |         author = "defects4j"
230 |         date = "2018-04-25"
231 |     return author, date
232 | 
233 | 
234 | def find_line_in_orig_buggy_version(git_blame_lines, source_code):
235 |     """
236 |     Search the line in the original buggy version and return found line id 
237 |     
238 |     Parameters:
239 |     ----------
240 |     git_blame_lines: str
241 |     source_code: str
242 |     """
243 |     for line_id, git_line in git_blame_lines.items():
244 |         if git_line[0].find(source_code) != -1:
245 |             return line_id, git_line[0]
246 | 
247 |     print("#### LINE NOT FOUND IN THE BUGGY VERSION")
248 |     return -1, "NOT-FOUND"
249 | 
250 | 
251 | def extract_src_code_from_line(line):
252 |     """
253 |     Extract the source code text from the git blame line
254 | 
255 |     Parameters:
256 |     ----------
257 |     line: str    
258 |     """
259 | 
260 |     loc1 = line.find(")")
261 |     i = loc1
262 |     source_code = ""
263 | 
264 |     while i < len(line):
265 |         source_code += line[i]
266 |         i += 1
267 | 
268 |     return source_code
269 | 
270 | 
271 | def find_file_path(file_name):
272 |     """
273 |     Find the full path of the file
274 | 
275 |     Parameters:
276 |     ----------
277 |     file_name: str
278 |     """
279 |     find_command = f"find -name {file_name}"
280 |     file_path = None
281 |     os.system(f"{find_command} > find_output.txt")
282 |     with open("find_output.txt") as file:
283 |         file_path = file.readline()
284 | 
285 |     file_path = file_path.strip("\n")
286 |     return file_path    
287 | 
288 | 
289 | def extract_git_blame_lines(file_name, git_blame_output):
290 |     """
291 |     run git blame on the given file and return the git blame lines    
292 |     Parameters:
293 |     ----------
294 |     file_name: str (file name passed to git blame command)
295 |     git_blame_output : str (output file where git blame command output is dumped)
296 | 
297 |     """
298 |     file_path = find_file_path(file_name)
299 |     os.system(f"git blame {file_path} > {git_blame_output}")
300 |     git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n')
301 |     git_blame_list =  list(git_blame_data)
302 |     git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))}    
303 |   
304 |     return git_blame_lines
305 | 
306 | 
307 | def get_commit_ids(project):
308 |     """
309 |     finds the commit data for all the bugs for the given defects4j project
310 | 
311 |     Parameters:
312 |     ----------
313 |     project: str (name of the project)
314 |     
315 |     return:
316 |     ------
317 |     commit_ids : dictionary (key:value pair is bug id:commit id)
318 | 
319 |     """
320 |     
321 |     # Getting the commit data for each project 
322 |     path_to_commit_db = "/home/kanag23/Desktop/Defects4j_v2/defects4j/framework/projects/"
323 |     commit_db_file = os.path.join(path_to_commit_db, project, "commit-db")
324 |     commit_db_data = csv.reader(open(commit_db_file), delimiter=',')
325 |     commit_db_list = list(commit_db_data)
326 |     commit_ids = {commit_id_line[0]:commit_id_line[1] for commit_id_line in commit_db_list}
327 | 
328 |     return commit_ids
329 | 
330 | 
331 | if __name__ == '__main__':
332 |     parser = argparse.ArgumentParser()
333 |     parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory')
334 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory')
335 | 
336 |     args = parser.parse_args()
337 | 
338 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
339 |         commit_ids = get_commit_ids(project)
340 |         for bug in bugs:
341 |             for formula in FORMULA:
342 |                 input_csv = f"{project}-{bug}-{formula}-sorted-susp"
343 |                 output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date"
344 |                 find_author_date(os.path.join(args.suspiciousness_data_dir, input_csv),
345 |                      os.path.join(args.output_dir, output_csv), project, bug, formula, commit_ids[bug])
346 | 


--------------------------------------------------------------------------------
/extract_date_developer_new.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import sys
  3 | import csv
  4 | import operator
  5 | import os
  6 | import re
  7 | from subprocess import call
  8 | 
  9 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 10 | 
 11 | PROJECT_BUGS = [
 12 |     [str(x) for x in range(1, 134)],
 13 |     [str(x) for x in range(1, 66)],
 14 |     [str(x) for x in range(1, 27)],
 15 |     [str(x) for x in range(1, 107)],
 16 |     [str(x) for x in range(1, 39)],
 17 |     [str(x) for x in range(1, 28)]
 18 | ]
 19 | 
 20 | #FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula']
 21 | FORMULA = ['tarantula']
 22 | 
 23 | 
 24 | def find_author_date(input_file, output_file, project, bug, formula, commit_id):
 25 |     """
 26 |     find the author and date of the last update for every suspiciouss line
 27 | 
 28 |     Parameters
 29 |     ----------
 30 |     input_file : str (sorted suspicousness lines file)
 31 |     output_file: str (sorted suspiciousness lines file with date field and author field appended)
 32 |     project: str (project name)
 33 |     bug: str (bug id) 
 34 |     formula: str (fault localization technique)
 35 | 
 36 |     """
 37 |     
 38 |     # reading the suspiciousness values from the input file 
 39 |     input_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + input_file
 40 |     sorted_susp_lines = read_susp_lines_from_file(input_file)
 41 |     
 42 |     # output file
 43 |     output_file = "/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/" + output_file
 44 | 
 45 |     # Running git checkout buggy_version
 46 |     checkout_project_git(project, bug)
 47 | 
 48 |     git_blame_output = f"/home/kanag23/Desktop/Fault_loc/Python_scripts_Apr_10/Git_blame_output/git_blame_{project}_{bug}"
 49 | 
 50 |     line_counter = 0
 51 |     prev_file_name = ""
 52 |     git_blame_lines = None
 53 | 
 54 |     for susp_line in sorted_susp_lines:
 55 |         file_name_full, line_number = susp_line[0].split("#")
 56 |         line_number = int(line_number)
 57 |         file_name = extract_file_name_from_path(file_name_full)
 58 |         
 59 |         if prev_file_name != file_name:
 60 |             checkout_project_git_using_tag(project, bug)
 61 |             git_blame_lines = extract_git_blame_lines(file_name, file_name_full, git_blame_output)
 62 |             prev_file_name = file_name
 63 | 
 64 |         # BUG FIX
 65 |         if line_number > len(git_blame_lines):
 66 |             print(" ########## ERROR ########### ")
 67 |             print(f"Line number {line_number} from the suspiciousness file is not present in Git_blame_output_file")
 68 |             print(f"Line number to be searched: {line_number} ; Number of lines in the Git_blame_output: {len(git_blame_lines)}")
 69 |             
 70 |             add_author_date_to_file(output_file, susp_line, "NOT_FOUND", "NOT_FOUND")
 71 | 
 72 |             print(f"Not collecting date and author for the project: {project} and bug: {bug}")
 73 |             return
 74 | 
 75 |         blame_line = git_blame_lines[line_number] # picking the line
 76 | 
 77 |         # find the author and date
 78 |         author, date = extract_author_date(blame_line[0])  
 79 | 
 80 |         ## To handle the "defects4j" author
 81 |         if author == "defects4j":
 82 |             author, date = find_correct_author_date(blame_line[0], commit_id, file_name, file_name_full, git_blame_output)
 83 |             checkout_project_git_using_tag(project, bug)
 84 | 
 85 |         # adding to the output file
 86 |         add_author_date_to_file(output_file, susp_line, author, date)
 87 |         
 88 |         line_counter += 1 
 89 | 
 90 |         # print(f"=================LINE : {line_counter}==========\n\n\n")
 91 | 
 92 | 
 93 | def extract_file_name_from_path(file_name):
 94 |     """
 95 |     take the file name from the path
 96 |  
 97 |     Parameters:
 98 |     -----------
 99 |     file_name: str
100 |     """
101 |     return file_name.split("/")[-1]
102 | 
103 | 
104 | def add_author_date_to_file(output_file, susp_line, author, date):
105 |     """
106 |     appends the author and date to the output file containing suspiciousness lines
107 |     
108 |     Paramaeters:
109 |     ------------
110 |     output_file: str 
111 |     susp_line: str
112 |     author: str
113 |     date: str
114 | 
115 |     """
116 |     susp_line = ", ".join(susp_line)
117 |     with open(output_file, mode="a", encoding="utf-8") as myFile:
118 |         myFile.write(f"{susp_line}, {author}, {date}\n")
119 | 
120 | 
121 | 
122 | def read_susp_lines_from_file(input_file):
123 |     """
124 |     reads the suspiciousness lines data from the sorted suspiciousness file
125 | 
126 |     Parameters:
127 |     ----------
128 |     input_file: str
129 | 
130 |     return:
131 |     ------
132 |     sorted_susp_lines: list (2D)
133 | 
134 |     """
135 |     susp_data = csv.reader(open(input_file), delimiter=',')
136 |     sorted_susp_lines_all = [susp_line for susp_line in susp_data]
137 |     sorted_susp_lines_all = sorted_susp_lines_all[1:] # header line is not needed
138 |     
139 |      # pick top 100 lines
140 |     sorted_susp_lines = [susp_line for susp_line in sorted_susp_lines_all]
141 |     # sorted_susp_lines = sorted_susp_lines[:500]
142 |    
143 |     return  sorted_susp_lines   
144 | 
145 | 
146 | def checkout_project_git_using_tag(project, bug):
147 |     """ 
148 |     Checkout to the Defects4j buggy version
149 | 
150 |     Parameters:
151 |     ----------
152 |     project: str
153 |     bug: str
154 |     """
155 |     commit_tag = f"D4J_{project}_{bug}_BUGGY_VERSION"
156 |     os.system(f"git checkout {commit_tag}")
157 | 
158 | 
159 | def checkout_project_git(project, bug):
160 |     """
161 |     checkout to the project using git commands
162 | 
163 |     Parameters:
164 |     ----------
165 |     project: str
166 |     bug: str
167 |     """
168 | 
169 |     checkout_directory = f"/tmp/{project}_{bug}_buggy_ver"
170 |     command_git_checkout = f"defects4j checkout -p {project} -v {bug}b -w {checkout_directory}" 
171 |     os.system(command_git_checkout)
172 |     os.chdir(checkout_directory)
173 | 
174 | 
175 | def extract_author_date(line):
176 |     """
177 |     parses the line and extract the author and date
178 | 
179 |     Parameters:
180 |     ----------
181 |     line: str (line from the git output file)
182 | 
183 |     return:
184 |     -------
185 |     author: author of the line
186 |     date: date of change
187 |     """
188 | 
189 |     name_start_index = line.find("(") + 1
190 |     end_index = line.find(")") - 1
191 | 
192 |     author = ""
193 |     i = name_start_index
194 |     while not (line[i].isdigit() and line[i+1].isdigit() and line[i+2].isdigit() and line[i+3].isdigit()) :
195 |         author += line[i]
196 |         i += 1
197 | 
198 |     author = author.strip()     # removing sorrounding white spaces
199 |     date = "".join([line[indx] for indx in range(i, i+10)])
200 | 
201 |     return author, date
202 | 
203 | 
204 | def find_correct_author_date(line, commit_id, file_name, file_name_full, git_blame_output):
205 |     """
206 |     Checkout the original buggy version of the project and collect the 
207 |     author and data for the lines
208 | 
209 |     Parameters:
210 |     ----------
211 |     line: str
212 |     commit_id: str
213 |     file_name: str
214 |     git_blame_output: str
215 |     """
216 |     os.system(f"git checkout {commit_id}")
217 |     file_path = find_file_path(file_name, file_name_full)
218 |     git_blame_output += "_orig"
219 |     os.system(f"git blame {file_path} > {git_blame_output}")
220 |     git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n')
221 |     git_blame_list =  list(git_blame_data)
222 |     git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))}        
223 |     source_code = extract_src_code_from_line(line)
224 |     orig_line_id, orig_line = find_line_in_orig_buggy_version(git_blame_lines, source_code)
225 |  
226 |     if orig_line_id != -1:
227 |         author, date = extract_author_date(orig_line)
228 |     else:
229 |         author = "defects4j"
230 |         date = "2018-04-25"
231 |     return author, date
232 | 
233 | 
234 | def find_line_in_orig_buggy_version(git_blame_lines, source_code):
235 |     """
236 |     Search the line in the original buggy version and return found line id 
237 |     
238 |     Parameters:
239 |     ----------
240 |     git_blame_lines: str
241 |     source_code: str
242 |     """
243 |     for line_id, git_line in git_blame_lines.items():
244 |         if git_line[0].find(source_code) != -1:
245 |             return line_id, git_line[0]
246 | 
247 |     print("#### LINE NOT FOUND IN THE BUGGY VERSION")
248 |     return -1, "NOT-FOUND"
249 | 
250 | 
251 | def extract_src_code_from_line(line):
252 |     """
253 |     Extract the source code text from the git blame line
254 | 
255 |     Parameters:
256 |     ----------
257 |     line: str    
258 |     """
259 | 
260 |     loc1 = line.find(")")
261 |     i = loc1
262 |     source_code = ""
263 | 
264 |     while i < len(line):
265 |         source_code += line[i]
266 |         i += 1
267 | 
268 |     return source_code
269 | 
270 | 
271 | def find_file_path(file_name, susp_file_path):
272 |     """
273 |     Find the full path of the file
274 | 
275 |     Parameters:
276 |     ----------
277 |     file_name: str
278 |     susp_file_path: str
279 |     """
280 |     find_command = f"find -name {file_name}"
281 |     os.system(f"{find_command} > find_output.txt")
282 |     with open("find_output.txt") as file:
283 |         file_paths = file.readlines()
284 | 
285 |     if len(file_paths) == 1:
286 |         return file_paths[0].strip("\n")
287 | 
288 |     for file_path in file_paths:
289 |         if susp_file_path in file_path:
290 |             return file_path.strip("\n")
291 | 
292 | 
293 | # def find_file_path(file_name):
294 | #     """
295 | #     Find the full path of the file
296 | 
297 | #     Parameters:
298 | #     ----------
299 | #     file_name: str
300 | #     """
301 | #     find_command = f"find -name {file_name}"
302 | #     file_path = None
303 | #     os.system(f"{find_command} > find_output.txt")
304 | #     with open("find_output.txt") as file:
305 | #         file_path = file.readline()
306 | 
307 | #     file_path = file_path.strip("\n")
308 | #     return file_path    
309 | 
310 | 
311 | def extract_git_blame_lines(file_name, susp_file_path, git_blame_output):
312 |     """
313 |     run git blame on the given file and return the git blame lines    
314 |     Parameters:
315 |     ----------
316 |     file_name: str (file name passed to git blame command)
317 |     susp_file_path: str (file path in the suspiciousness output)
318 |     git_blame_output : str (output file where git blame command output is dumped)
319 | 
320 |     """
321 |     file_path = find_file_path(file_name, susp_file_path)
322 |     os.system(f"git blame {file_path} > {git_blame_output}")
323 |     git_blame_data = csv.reader(open(git_blame_output, encoding='ISO-8859-1'), delimiter='\n')
324 |     git_blame_list =  list(git_blame_data)
325 |     git_blame_lines = {(i+1):git_blame_list[i] for i in range(len(git_blame_list))}    
326 |   
327 |     return git_blame_lines
328 | 
329 | 
330 | def get_commit_ids(project):
331 |     """
332 |     finds the commit data for all the bugs for the given defects4j project
333 | 
334 |     Parameters:
335 |     ----------
336 |     project: str (name of the project)
337 |     
338 |     return:
339 |     ------
340 |     commit_ids : dictionary (key:value pair is bug id:commit id)
341 | 
342 |     """
343 |     
344 |     # Getting the commit data for each project 
345 |     path_to_commit_db = "/home/kanag23/Desktop/Defects4j_v2/defects4j/framework/projects/"
346 |     commit_db_file = os.path.join(path_to_commit_db, project, "commit-db")
347 |     commit_db_data = csv.reader(open(commit_db_file), delimiter=',')
348 |     commit_db_list = list(commit_db_data)
349 |     commit_ids = {commit_id_line[0]:commit_id_line[1] for commit_id_line in commit_db_list}
350 | 
351 |     return commit_ids
352 | 
353 | 
354 | if __name__ == '__main__':
355 |     parser = argparse.ArgumentParser()
356 |     parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory')
357 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory')
358 | 
359 |     args = parser.parse_args()
360 | 
361 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
362 |         commit_ids = get_commit_ids(project)
363 |         for bug in bugs:
364 |             for formula in FORMULA:
365 |                 input_csv = f"{project}-{bug}-{formula}-sorted-susp"
366 |                 output_csv = f"{project}-{bug}-{formula}-sorted-susp-with-date"
367 |                 find_author_date(os.path.join(args.suspiciousness_data_dir, input_csv),
368 |                      os.path.join(args.output_dir, output_csv), project, bug, formula, commit_ids[bug])
369 | 


--------------------------------------------------------------------------------
/faulty_lines_ml.py:
--------------------------------------------------------------------------------
  1 | from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
  2 | from sklearn.ensemble import RandomForestClassifier
  3 | from sklearn.linear_model import LogisticRegression
  4 | from sklearn.svm import LinearSVC
  5 | from sklearn.model_selection import StratifiedKFold
  6 | from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
  7 | from collections import defaultdict
  8 | 
  9 | import csv
 10 | import numpy as np
 11 | 
 12 | 
 13 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 14 | BUGS = [
 15 |     [str(x) for x in range(1, 134)],
 16 |     [str(x) for x in range(1, 66)],
 17 |     [str(x) for x in range(1, 27)],
 18 |     [str(x) for x in range(1, 107)],
 19 |     [str(x) for x in range(1, 39)],
 20 |     [str(x) for x in range(1, 28)]
 21 | ]
 22 | FORMULAE = ['barinel', 'jaccard', 'opt2', 'tarantula', 'ochiai', 'dstar2', 'muse']
 23 | 
 24 | 
 25 | dataset = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
 26 | filename = '/Users/ashish/code/cs5704/software-engineering/fault-localization.cs.washington.edu/' \
 27 |            'susp/%s-%s-%s-line-suspiciousness'
 28 | 
 29 | for project, bugs in zip(PROJECTS, BUGS):
 30 |     for bug in bugs:
 31 |         for formula in FORMULAE:
 32 |             file = filename % (project, bug, formula)
 33 |             with open(file) as freader:
 34 |                 csvreader = csv.DictReader(freader)
 35 |                 for row in csvreader:
 36 |                     dataset[project][bug][row['Line']].append(row['Suspiciousness'])
 37 | 
 38 | 
 39 | buggy_lines = '/Users/ashish/code/cs5704/software-engineering/fault-localization-data/analysis/pipeline-scripts/' \
 40 |               'buggy-lines/%s-%s.buggy.lines'
 41 | faults = defaultdict(lambda: defaultdict(list))
 42 | 
 43 | for project, bugs in zip(PROJECTS, BUGS):
 44 |     for bug in bugs:
 45 |         buggy_lines_file = buggy_lines % (project, bug)
 46 |         with open(buggy_lines_file) as freader:
 47 |             try:
 48 |                 for line in freader:
 49 |                     faulty_line = '#'.join(line.split('#')[:2])
 50 |                     faults[project][bug].append(faulty_line)
 51 |             except:
 52 |                 pass
 53 | 
 54 | recency = '/Users/ashish/code/cs5704/recency/%s-%s-tarantula-sorted-susp-with-recency'
 55 | for project, bugs in zip(PROJECTS, BUGS):
 56 |     for bug in bugs:
 57 |         recency_file = recency % (project, bug)
 58 |         try:
 59 |             with open(recency_file) as freader:
 60 |                 for line in freader:
 61 |                     split = line.split(',')
 62 |                     code_line = split[0].strip()
 63 |                     recency_value = split[-1].strip()
 64 |                     if code_line in dataset[project][bug]:
 65 |                         dataset[project][bug][code_line].append(recency_value)
 66 |         except:
 67 |             pass
 68 | 
 69 | 
 70 | for project, bugs in zip(PROJECTS, BUGS):
 71 |     for bug in bugs:
 72 |         for line in dataset[project][bug]:
 73 |             if line in faults[project][bug]:
 74 |                 dataset[project][bug][line].append('1')
 75 |             else:
 76 |                 dataset[project][bug][line].append('0')
 77 | 
 78 | 
 79 | delete_list = []
 80 | for project, bugs in zip(PROJECTS, BUGS):
 81 |     for bug in bugs:
 82 |         for line in dataset[project][bug]:
 83 |             if len(dataset[project][bug][line]) != 9:
 84 |                 delete_list.append((project, bug, line))
 85 | 
 86 | for item in delete_list:
 87 |     del dataset[item[0]][item[1]][item[2]]
 88 | 
 89 | data = []
 90 | for project, bugs in zip(PROJECTS, BUGS):
 91 |     for bug in bugs:
 92 |         for line in dataset[project][bug]:
 93 |             data.append(dataset[project][bug][line])
 94 | 
 95 | data = np.array(data, dtype=np.float32)
 96 | X = data[:, :8]
 97 | y = data[:, 8]
 98 | 
 99 | rus = RandomUnderSampler(random_state=42)
100 | X_res, y_res = rus.fit_sample(X, y)
101 | 
102 | skf = StratifiedKFold(n_splits=5, random_state=42)
103 | 
104 | score_rfc = []
105 | score_svm = []
106 | score_lr = []
107 | 
108 | precision_rfc = []
109 | precision_svm = []
110 | precision_lr = []
111 | 
112 | recall_rfc = []
113 | recall_svm = []
114 | recall_lr = []
115 | 
116 | f1_rfc = []
117 | f1_svm = []
118 | f1_lr = []
119 | 
120 | for train_index, val_index in skf.split(X_res, y_res):
121 |     X_train, X_test = X_res[train_index], X_res[val_index]
122 |     y_train, y_test = y_res[train_index], y_res[val_index]
123 | 
124 |     # Random Forest Classifier
125 |     rfc = RandomForestClassifier(random_state=42)
126 |     rfc.fit(X_train, y_train)
127 |     y_pred = rfc.predict(X_test)
128 |     score_rfc.append(accuracy_score(y_test, y_pred))
129 |     precision_rfc.append(precision_score(y_test, y_pred))
130 |     recall_rfc.append(recall_score(y_test, y_pred))
131 |     f1_rfc.append(f1_score(y_test, y_pred))
132 |     print("Random Forest Accuracy: %s" % accuracy_score(y_pred, y_test))
133 | 
134 |     # Random Forest Classifier
135 |     lr = LogisticRegression(random_state=42)
136 |     lr.fit(X_train, y_train)
137 |     y_pred = lr.predict(X_test)
138 |     score_lr.append(accuracy_score(y_pred, y_test))
139 |     precision_lr.append(precision_score(y_test, y_pred))
140 |     recall_lr.append(recall_score(y_test, y_pred))
141 |     f1_lr.append(f1_score(y_test, y_pred))
142 |     print("Logistic Regression Accuracy: %s" % accuracy_score(y_pred, y_test))
143 | 
144 |     # Random Forest Classifier
145 |     svm = LinearSVC(random_state=42)
146 |     svm.fit(X_train, y_train)
147 |     y_pred = svm.predict(X_test)
148 |     score_svm.append(accuracy_score(y_pred, y_test))
149 |     precision_svm.append(precision_score(y_test, y_pred))
150 |     recall_svm.append(recall_score(y_test, y_pred))
151 |     f1_svm.append(f1_score(y_test, y_pred))
152 |     print("SVM Accuracy: %s" % accuracy_score(y_pred, y_test))
153 |     print()
154 | 
155 | print("******** Random Forest ********")
156 | print("Accuracy  :  %s" % np.average(score_rfc))
157 | print("Precision :  %s" % np.average(precision_rfc))
158 | print("Recall    :  %s" % np.average(recall_rfc))
159 | print("F1-Score  :  %s" % np.average(f1_rfc))
160 | print("*******************************\n")
161 | 
162 | print("******** SVM ********")
163 | print("Accuracy  :  %s" % np.average(score_svm))
164 | print("Precision :  %s" % np.average(precision_svm))
165 | print("Recall    :  %s" % np.average(recall_svm))
166 | print("F1-Score  :  %s" % np.average(f1_svm))
167 | print("*********************\n")
168 | 
169 | print("******** Logistic Regression ********")
170 | print("Accuracy  :  %s" % np.average(score_lr))
171 | print("Precision :  %s" % np.average(precision_lr))
172 | print("Recall    :  %s" % np.average(recall_lr))
173 | print("F1-Score  :  %s" % np.average(f1_lr))
174 | print("*************************************\n")
175 | 
176 | 
177 | 


--------------------------------------------------------------------------------
/get_developer_names.py:
--------------------------------------------------------------------------------
 1 | import subprocess
 2 | 
 3 | from pygit2 import Repository
 4 | from pygit2 import GIT_SORT_TOPOLOGICAL
 5 | 
 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 7 | PROJECT_BUGS = [
 8 |     [str(x) for x in range(1, 134)],
 9 |     [str(x) for x in range(1, 66)],
10 |     [str(x) for x in range(1, 27)],
11 |     [str(x) for x in range(1, 107)],
12 |     [str(x) for x in range(1, 39)],
13 |     [str(x) for x in range(1, 28)]
14 | ]
15 | 
16 | developer_details = []
17 | 
18 | for project, bugs in zip(PROJECTS, PROJECT_BUGS):
19 |     for bug in bugs:
20 |         defects4j_checkout_command = ['defects4j', 'checkout', '-p', project, '-v', '%sf' % bug, '-w',
21 |                                       '/Users/ashish/code/cs5704/temp/%s_%s' % (project, bug)]
22 |         subprocess.call(defects4j_checkout_command)
23 | 
24 |         repo_dir = '/Users/ashish/code/cs5704/temp/%s_%s/.git' % (project, bug)
25 |         repo = Repository(repo_dir)
26 | 
27 |         author_name = None
28 |         for commit in repo.walk(repo.head.target, GIT_SORT_TOPOLOGICAL):
29 |             if commit.author.name != 'defects4j':
30 |                 author_name = commit.author.name
31 |                 break
32 | 
33 |         if author_name is not None:
34 |             developer_details.append('%s,%s,%s\n' % (project, bug, author_name))
35 | 
36 | 
37 | with open('developers.csv', 'w') as fwriter:
38 |     for row in developer_details:
39 |         fwriter.write(row)
40 | 


--------------------------------------------------------------------------------
/matrix_coverage.py:
--------------------------------------------------------------------------------
 1 | import tarfile
 2 | import os
 3 | import argparse
 4 | 
 5 | 
 6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 7 | PROJECT_BUGS = [
 8 |     [str(x) for x in range(1, 134)],
 9 |     [str(x) for x in range(1, 66)],
10 |     [str(x) for x in range(1, 27)],
11 |     [str(x) for x in range(1, 107)],
12 |     [str(x) for x in range(1, 39)],
13 |     [str(x) for x in range(1, 28)]
14 | ]
15 | TAR_FILE = 'gzoltar-files.tar.gz'
16 | 
17 | 
18 | def extract_files(data_dir, output_dir):
19 |     """
20 |     Extract tar gz files from data directory to get coverage and spectra files
21 | 
22 |     Parameters
23 |     ----------
24 |     data_dir : str
25 |         the main data directory which holds all projects
26 |     
27 |     output_dir : str
28 |         the output directory to write coverage and spectra files to
29 |         this directory must exist
30 |     
31 |     Returns
32 |     -------
33 |     None
34 |     """
35 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
36 |         for bug in bugs:
37 |             tar = os.path.join(data_dir, project, bug, TAR_FILE)
38 |             coverage, spectra = extract_tar_file(tar)
39 |             if coverage is None or spectra is None:
40 |                 print('Could not get coverage/spectra for %s' % tar)
41 |             else:
42 |                 coverage_file = os.path.join(output_dir, '%s-%s-%s' % (project, bug, 'coverage'))
43 |                 spectra_file = os.path.join(output_dir, '%s-%s-%s' % (project, bug, 'spectra'))
44 |                 write_file(coverage_file, coverage)
45 |                 write_file(spectra_file, spectra)
46 | 
47 | 
48 | def extract_tar_file(tar):
49 |     """
50 |     Extract a tar file and return the matrix and spectra content
51 | 
52 |     Parameters
53 |     ----------
54 |     tar : str
55 |         the path to the tarfile
56 |     
57 |     Returns
58 |     tuple(str, str)
59 |         the contents of the coverage matrix and spectra
60 |     """
61 |     coverage, spectra = None, None
62 |     tar = tarfile.open(tar)
63 |     for member in tar.getmembers():
64 |         if 'matrix' in member.get_info()['name']:
65 |             f = tar.extractfile(member)
66 |             if f is not None:
67 |                 coverage = f.read()
68 |         if 'spectra' in member.get_info()['name']:
69 |             f = tar.extractfile(member)
70 |             if f is not None:
71 |                 spectra = f.read()
72 |     return coverage, spectra
73 | 
74 | 
75 | def write_file(filename, content):
76 |     """
77 |     Write content to the given filename
78 | 
79 |     Parameters
80 |     ----------
81 |     filename : str
82 |     content : str
83 |     """
84 |     with open(filename, 'wb') as fwriter:
85 |         fwriter.write(content)
86 | 
87 | 
88 | if __name__ == '__main__':
89 |     parser = argparse.ArgumentParser()
90 |     parser.add_argument('--data-dir', required=True, help='data directory that holds data')
91 |     parser.add_argument('--output-dir', required=True, help='file to write coverage and spectra matrices to')
92 | 
93 |     args = parser.parse_args()
94 |     
95 |     extract_files(args.data_dir, args.output_dir)
96 | 


--------------------------------------------------------------------------------
/reweight.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import csv
 3 | import os
 4 | 
 5 | 
 6 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'}
 7 | 
 8 | 
 9 | def get_lines(data, num_lines):
10 |     """
11 |     Get the line numbers from the dataset
12 | 
13 |     Parameters
14 |     ----------
15 |     data : OrderedDict
16 |         an OrderedDict containing information about lines present in a bug
17 |     num_lines : int
18 |         number of lines to consider for each bug
19 | 
20 |     Returns
21 |     -------
22 |     list : an array of lines
23 |     """
24 |     lines = []
25 |     for i in range(1, num_lines+1):
26 |         lines.append(data['line_%s' % i])
27 |     return lines
28 | 
29 | 
30 | def multiplicative_reweighting(data, num_lines):
31 |     """
32 |     Simple reweight multiplies all the values of suspiciousness across all formulae
33 | 
34 |     Parameters
35 |     ----------
36 |     data : OrderedDict
37 |         each row of the csv file represents an OrderedDict called data
38 |     num_lines : int
39 |         number of lines to consider for each bug
40 | 
41 |     Returns
42 |     -------
43 |     list : a reweighted suspiciousness value for each line in the bug
44 |     """
45 |     hybrid_suspiciousness = [1000 for _ in range(num_lines)]
46 |     for i in range(1, num_lines+1):
47 |         for formula in FORMULA:
48 |             key = 'line_%s_%s' % (i, formula)
49 |             if data[key] is None:
50 |                 return None
51 |             hybrid_suspiciousness[i-1] *= float(data[key])
52 |     return hybrid_suspiciousness
53 | 
54 | 
55 | def reweight_dataset(dataset, num_lines, output_dir):
56 |     """
57 |     Reweight a dataset based on the multiplicative principle
58 | 
59 |     Parameters
60 |     ----------
61 |     dataset : str
62 |         a csv file containing the dataset
63 |     num_lines : int
64 |         number of lines to consider for each bug
65 |     output_dir : str
66 |         output directory for storing the csv files
67 |     """
68 |     with open(dataset) as freader:
69 |         csvreader = csv.DictReader(freader)
70 |         for row in csvreader:
71 |             suspiciousness = multiplicative_reweighting(row, num_lines)
72 |             if suspiciousness is None:
73 |                 continue
74 |             lines = get_lines(row, num_lines)
75 |             output = os.path.join(output_dir, 'reweighted-%s-%s.csv' % (row['project'], row['bug']))
76 |             with open(output, 'w') as fwriter:
77 |                 fwriter.write('Line,Suspiciousness\n')
78 |                 for line, susp in zip(lines, suspiciousness):
79 |                     fwriter.write('%s,%s\n' % (line, susp))
80 | 
81 | 
82 | if __name__ == '__main__':
83 |     parser = argparse.ArgumentParser()
84 |     parser.add_argument('-d', '--dataset', required=True, help='path to the dataset csv')
85 |     parser.add_argument('-n', '--num-lines', required=True, type=int, help='number of lines for each bug')
86 |     parser.add_argument('-o', '--output-dir', required=True,
87 |                         help='output directory to write reweighted suspiciousness to')
88 | 
89 |     args = parser.parse_args()
90 | 
91 |     reweight_dataset(args.dataset, args.num_lines, args.output_dir)
92 | 


--------------------------------------------------------------------------------
/s2l_suspiciousness.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import csv
  3 | import os
  4 | 
  5 | 
  6 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
  7 | PROJECT_BUGS = [
  8 |     [str(x) for x in range(1, 134)],
  9 |     [str(x) for x in range(1, 66)],
 10 |     [str(x) for x in range(1, 27)],
 11 |     [str(x) for x in range(1, 107)],
 12 |     [str(x) for x in range(1, 39)],
 13 |     [str(x) for x in range(1, 28)]
 14 | ]
 15 | SOURCE_CODE_SUFFIX = '.source-code.lines'
 16 | FORMULA = {'barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula'}
 17 | 
 18 | 
 19 | def classname_to_filename(classname):
 20 |     """
 21 |     Convert classname to filename
 22 | 
 23 |     Parameters
 24 |     ----------
 25 |     classname : str
 26 |         the value from of the classname from suspiciouss spectra file
 27 |     
 28 |     Returns
 29 |     -------
 30 |     str
 31 |         filename
 32 |     """
 33 |     if '$' in classname:
 34 |         classname = classname[:classname.find('$')]
 35 |     return classname.replace('.', '/') + '.java'
 36 | 
 37 | 
 38 | def stmt_to_line(statement):
 39 |     """
 40 |     Convert statement to line
 41 | 
 42 |     Parameters
 43 |     ----------
 44 |     statement : str
 45 |         the statement number along with the classname
 46 |     
 47 |     Returns
 48 |     -------
 49 |     str
 50 |         line number in file
 51 |     """
 52 |     classname, line_number = statement.rsplit('#', 1)
 53 |     return '{}#{}'.format(classname_to_filename(classname), line_number)
 54 | 
 55 | 
 56 | def convert_statement_to_line(source_code_lines_file, statement_suspiciousness, output_file):
 57 |     source_code = dict()
 58 |     with open(source_code_lines_file) as f:
 59 |         for line in f:
 60 |             line = line.strip()
 61 |             entry = line.split(':')
 62 |             key = entry[0]
 63 |             if key in source_code:
 64 |                 source_code[key].append(entry[1])
 65 |             else:
 66 |                 source_code[key] = []
 67 |                 source_code[key].append(entry[1])
 68 | 
 69 |     with open(statement_suspiciousness) as fin:
 70 |         reader = csv.DictReader(fin)
 71 |         with open(output_file, 'w') as f:
 72 |             writer = csv.DictWriter(f, ['Line','Suspiciousness'])
 73 |             writer.writeheader()
 74 |             for row in reader:
 75 |                 line = stmt_to_line(row['Statement'])
 76 |                 susps = row['Suspiciousness']
 77 | 
 78 |                 writer.writerow({
 79 |                     'Line': line,
 80 |                     'Suspiciousness': susps})
 81 | 
 82 |                 # check whether there are any sub-lines
 83 |                 if line in source_code:
 84 |                     for additional_line in source_code[line]:
 85 |                         writer.writerow({'Line': additional_line, 'Suspiciousness': susps})
 86 |         f.close()
 87 |     fin.close()
 88 | 
 89 | 
 90 | if __name__ == '__main__':
 91 |     parser = argparse.ArgumentParser()
 92 |     parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory')
 93 |     parser.add_argument('-s', '--source-code-lines', required=True, help='Source code lines directory')
 94 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory')
 95 |     parser.add_argument('-f', '--formula', choices=FORMULA, required=False, help='Formula to convert for')
 96 | 
 97 |     args = parser.parse_args()
 98 | 
 99 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
100 |         for bug in bugs:
101 |             source_code_lines_file = os.path.join(args.source_code_lines, 
102 |                     '%s-%sb%s' % (project, bug, SOURCE_CODE_SUFFIX))
103 |             if args.formula == None:
104 |                 for key in FORMULA:
105 |                     statement_suspiciousness_file = os.path.join(args.suspiciousness_data_dir, 
106 |                         '%s-%s-%s-suspiciousness' % (project, bug, key))
107 |                     output_file = os.path.join(args.output_dir, '%s-%s-%s-line-suspiciousness' % (project, bug, key))
108 |                     convert_statement_to_line(source_code_lines_file, statement_suspiciousness_file, output_file)
109 |             else:
110 |                 statement_suspiciousness_file = os.path.join(args.suspiciousness_data_dir, 
111 |                         '%s-%s-%s-suspiciousness' % (project, bug, args.formula))
112 |                 output_file = os.path.join(args.output_dir, '%s-%s-%s-line-suspiciousness' % (project, bug, args.formula))
113 |                 convert_statement_to_line(source_code_lines_file, statement_suspiciousness_file, output_file)
114 | 


--------------------------------------------------------------------------------
/sort_csv.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import sys
 3 | import csv
 4 | import operator
 5 | import os
 6 | 
 7 | 
 8 | PROJECTS = ['Closure', 'Lang', 'Chart', 'Math', 'Mockito', 'Time']
 9 | PROJECT_BUGS = [
10 |     [str(x) for x in range(1, 134)],
11 |     [str(x) for x in range(1, 66)],
12 |     [str(x) for x in range(1, 27)],
13 |     [str(x) for x in range(1, 107)],
14 |     [str(x) for x in range(1, 39)],
15 |     [str(x) for x in range(1, 28)]
16 | ]
17 | FORMULA = ['barinel', 'dstar2', 'jaccard', 'muse', 'ochiai', 'opt2', 'tarantula']
18 | 
19 | 
20 | def sort(input_csv, output_csv, column=1):
21 |     """
22 |     Sort a csv based on a column
23 | 
24 |     Parameters
25 |     ----------
26 |     input_csv : str
27 |         input csv file
28 |     output_csv : str
29 |         output csv file
30 |     column : int
31 |         the column number to sort the input csv file on
32 |     """
33 |     data = csv.reader(open(input_csv),delimiter=',')
34 |     sortedlist = sorted(data, key=operator.itemgetter(column))
35 |     sortedlist.reverse()
36 |     
37 |     with open(output_csv, 'w') as f:
38 |         fileWriter = csv.writer(f, delimiter=',')
39 |         for row in sortedlist:
40 |             fileWriter.writerow(row)
41 | 
42 | 
43 | if __name__ == '__main__':
44 |     parser = argparse.ArgumentParser()
45 |     parser.add_argument('-d', '--suspiciousness-data-dir', required=True, help='Suspiciousness data directory')
46 |     parser.add_argument('-o', '--output-dir', required=True, help='Output directory')
47 |     
48 |     args = parser.parse_args()
49 | 
50 |     for project, bugs in zip(PROJECTS, PROJECT_BUGS):
51 |         for bug in bugs:
52 |             for formula in FORMULA:
53 |                 input_csv = '%s-%s-%s-line-suspiciousness' % (project, bug, formula)
54 |                 output_csv = '%s-%s-%s-sorted-susp' % (project, bug, formula)
55 |                 sort(os.path.join(args.suspiciousness_data_dir, input_csv), 
56 |                      os.path.join(args.output_dir, output_csv))
57 | 


--------------------------------------------------------------------------------
/suspiciousness.py:
--------------------------------------------------------------------------------
  1 | from __future__ import division
  2 | from __future__ import print_function
  3 | from glob import glob
  4 | 
  5 | import collections
  6 | import re
  7 | import argparse
  8 | import csv
  9 | import os
 10 | import sys
 11 | 
 12 | 
 13 | def eprint(*args, **kwargs):
 14 |     """
 15 |     Print to stderr
 16 |     """
 17 |     print(*args, file=sys.stderr, **kwargs)
 18 | 
 19 | 
 20 | def tarantula(passed, failed, totalpassed, totalfailed):
 21 |     """
 22 |     Calculates the Tarantula suspiciousness score for the line
 23 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
 24 |     in file analysis/pipeline-scripts/crush-matrix
 25 | 
 26 |     Parameters
 27 |     ----------
 28 |     passed : int
 29 |         number of test cases passed with the line
 30 |     failed : int
 31 |         number of test cases failed with the line
 32 |     totalpassed : int
 33 |         total number of test cases passed
 34 |     totalfailed : int
 35 |         total number of test cases failed
 36 | 
 37 |     Returns
 38 |     -------
 39 |     float
 40 |         the Tarantula suspiciousness value
 41 |     """
 42 |     if totalpassed == 0 or totalfailed == 0 or (passed+failed == 0):
 43 |         return 0
 44 |     return (failed/totalfailed)/(failed/totalfailed + passed/totalpassed)
 45 | 
 46 | 
 47 | def tarantula_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
 48 |     """
 49 |     Calculates the Tarantula suspiciousness score for the line
 50 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
 51 |     in file analysis/pipeline-scripts/crush-matrix
 52 | 
 53 |     Parameters
 54 |     ----------
 55 |     passed : int
 56 |         number of test cases passed with the line
 57 |     failed : int
 58 |         number of test cases failed with the line
 59 |     totalpassed : int
 60 |         total number of test cases passed
 61 |     totalfailed : int
 62 |         total number of test cases failed
 63 |     was_covered : bool
 64 | 
 65 |     Returns
 66 |     -------
 67 |     float
 68 |         the Tarantula hybrid suspiciousness value
 69 |     """
 70 |     if totalpassed == 0 or totalfailed == 0 or (passed+failed == 0):
 71 |         return 0
 72 |     return (passed/totalpassed + (1 if was_covered else 0)/(passed/totalpassed + failed/totalfailed))
 73 | 
 74 | 
 75 | def ochiai(passed, failed, totalpassed, totalfailed):
 76 |     """
 77 |     Calculates the Ochiai suspiciousness score for the line
 78 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
 79 |     in file analysis/pipeline-scripts/crush-matrix
 80 | 
 81 |     Parameters
 82 |     ----------
 83 |     passed : int
 84 |         number of test cases passed with the line
 85 |     failed : int
 86 |         number of test cases failed with the line
 87 |     totalpassed : int
 88 |         total number of test cases passed
 89 |     totalfailed : int
 90 |         total number of test cases failed
 91 |     
 92 |     Returns
 93 |     -------
 94 |     float
 95 |         the Ochiai hybrid suspiciousness value
 96 |     """
 97 |     if totalfailed == 0 or (passed+failed == 0):
 98 |         return 0
 99 |     return failed/(totalfailed*(failed+passed))**0.5
100 | 
101 | 
102 | def ochiai_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
103 |     """
104 |     Calculates the Ochiai suspiciousness score for the line
105 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
106 |     in file analysis/pipeline-scripts/crush-matrix
107 | 
108 |     Parameters
109 |     ----------
110 |     passed : int
111 |         number of test cases passed with the line
112 |     failed : int
113 |         number of test cases failed with the line
114 |     totalpassed : int
115 |         total number of test cases passed
116 |     totalfailed : int
117 |         total number of test cases failed
118 |     was_covered : bool
119 |     
120 |     Returns
121 |     -------
122 |     float
123 |         the Ochiai hybrid suspiciousness value
124 |     """
125 |     if totalfailed == 0 or (passed+failed == 0):
126 |         return 0
127 |     return (failed + (1 if was_covered else 0))/(totalfailed*(failed+passed))**0.5
128 | 
129 | 
130 | def opt2(passed, failed, totalpassed, totalfailed):
131 |     """
132 |     Calculates the Opt2 suspiciousness score for the line
133 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
134 |     in file analysis/pipeline-scripts/crush-matrix
135 | 
136 |     Parameters
137 |     ----------
138 |     passed : int
139 |         number of test cases passed with the line
140 |     failed : int
141 |         number of test cases failed with the line
142 |     totalpassed : int
143 |         total number of test cases passed
144 |     totalfailed : int
145 |         total number of test cases failed
146 |     
147 |     Returns
148 |     -------
149 |     float
150 |         the Opt2 hybrid suspiciousness value
151 |     """
152 |     return failed - passed/(totalpassed+1)
153 | 
154 | 
155 | def opt2_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
156 |     """
157 |     Calculates the Opt2 suspiciousness score for the line
158 | 
159 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
160 |     in file analysis/pipeline-scripts/crush-matrix
161 | 
162 |     Parameters
163 |     ----------
164 |     passed : int
165 |         number of test cases passed with the line
166 |     failed : int
167 |         number of test cases failed with the line
168 |     totalpassed : int
169 |         total number of test cases passed
170 |     totalfailed : int
171 |         total number of test cases failed
172 |     was_covered : bool
173 |     
174 |     Returns
175 |     -------
176 |     float
177 |         the Opt2 hybrid suspiciousness value
178 |     """
179 |     return failed - (passed-(1 if was_covered else 0))/(totalpassed+1)
180 | 
181 | 
182 | def barinel(passed, failed, totalpassed, totalfailed):
183 |     """
184 |     Calculates the barinel suspiciousness score for the line
185 | 
186 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
187 |     in file analysis/pipeline-scripts/crush-matrix
188 | 
189 |     Parameters
190 |     ----------
191 |     passed : int
192 |         number of test cases passed with the line
193 |     failed : int
194 |         number of test cases failed with the line
195 |     totalpassed : int
196 |         total number of test cases passed
197 |     totalfailed : int
198 |         total number of test cases failed
199 |     
200 |     Returns
201 |     -------
202 |     float
203 |         the barinel hybrid suspiciousness value
204 |     """
205 |     if failed == 0:
206 |         return 0
207 |     h = passed/(passed+failed)
208 |     return 1-h
209 | 
210 | 
211 | def barinel_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
212 |     """
213 |     Calculates the Barinel suspiciousness score for the line
214 | 
215 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
216 |     in file analysis/pipeline-scripts/crush-matrix
217 | 
218 |     Parameters
219 |     ----------
220 |     passed : int
221 |         number of test cases passed with the line
222 |     failed : int
223 |         number of test cases failed with the line
224 |     totalpassed : int
225 |         total number of test cases passed
226 |     totalfailed : int
227 |         total number of test cases failed
228 |     was_covered : bool
229 |     
230 |     Returns
231 |     -------
232 |     float
233 |         the Barinel hybrid suspiciousness value
234 |     """
235 |     if failed == 0:
236 |         return 0
237 |     h = (passed - (1 if was_covered else 0))/(passed+failed)
238 |     return 1-h
239 | 
240 | def dstar2(passed, failed, totalpassed, totalfailed):
241 |     """
242 |     Calculates the dstar2 suspiciousness score for the line
243 | 
244 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
245 |     in file analysis/pipeline-scripts/crush-matrix
246 | 
247 |     Parameters
248 |     ----------
249 |     passed : int
250 |         number of test cases passed with the line
251 |     failed : int
252 |         number of test cases failed with the line
253 |     totalpassed : int
254 |         total number of test cases passed
255 |     totalfailed : int
256 |         total number of test cases failed
257 |     
258 |     Returns
259 |     -------
260 |     float
261 |         the dstar2 hybrid suspiciousness value
262 |     """
263 |     if passed + totalfailed - failed == 0:
264 |         assert passed==0 and failed==totalfailed
265 |         return totalfailed**2 + 1 # slightly higher than otherwise possible
266 |     return failed**2 / (passed + totalfailed - failed)
267 | 
268 | 
269 | def dstar2_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
270 |     """
271 |     Calculates the dstar2 hybrid suspiciousness score for the line
272 | 
273 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
274 |     in file analysis/pipeline-scripts/crush-matrix
275 | 
276 |     Parameters
277 |     ----------
278 |     passed : int
279 |         number of test cases passed with the line
280 |     failed : int
281 |         number of test cases failed with the line
282 |     totalpassed : int
283 |         total number of test cases passed
284 |     totalfailed : int
285 |         total number of test cases failed
286 |     was_covered : bool
287 |     
288 |     Returns
289 |     -------
290 |     float
291 |         the dstar2 hybrid suspiciousness value
292 |     """
293 |     if passed + totalfailed - failed == 0:
294 |         assert passed==0 and failed==totalfailed
295 |         return totalfailed**2 + 2 # slightly higher than otherwise possible
296 |     return (failed**2 + (1 if was_covered else 0)) / (passed + totalfailed - failed)
297 | 
298 | 
299 | def muse(passed, failed, totalpassed, totalfailed):
300 |     """
301 |     Calculates the dstar2 suspiciousness score for the line
302 | 
303 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
304 |     in file analysis/pipeline-scripts/crush-matrix
305 | 
306 |     Parameters
307 |     ----------
308 |     passed : int
309 |         number of test cases passed with the line
310 |     failed : int
311 |         number of test cases failed with the line
312 |     totalpassed : int
313 |         total number of test cases passed
314 |     totalfailed : int
315 |         total number of test cases failed
316 |     Returns
317 |     -------
318 |     float
319 |         the dstar2 hybrid suspiciousness value
320 |     """
321 |     if totalpassed == 0:
322 |         return 0
323 |     return failed - totalfailed/totalpassed * passed
324 | 
325 | 
326 | def muse_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
327 |     """
328 |     Calculates the dstar2 suspiciousness score for the line
329 | 
330 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
331 |     in file analysis/pipeline-scripts/crush-matrix
332 | 
333 |     Parameters
334 |     ----------
335 |     passed : int
336 |         number of test cases passed with the line
337 |     failed : int
338 |         number of test cases failed with the line
339 |     totalpassed : int
340 |         total number of test cases passed
341 |     totalfailed : int
342 |         total number of test cases failed
343 |     was_covered : bool
344 |     
345 |     Returns
346 |     -------
347 |     float
348 |         the dstar2 hybrid suspiciousness value
349 |     """
350 |     if totalpassed == 0:
351 |         return 0
352 |     return failed - (totalfailed-(1 if was_covered else 0))/totalpassed * passed
353 | 
354 | 
355 | def jaccard(passed, failed, totalpassed, totalfailed):
356 |     """
357 |     Calculates the dstar2 suspiciousness score for the line
358 | 
359 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
360 |     in file analysis/pipeline-scripts/crush-matrix
361 | 
362 |     Parameters
363 |     ----------
364 |     passed : int
365 |         number of test cases passed with the line
366 |     failed : int
367 |         number of test cases failed with the line
368 |     totalpassed : int
369 |         total number of test cases passed
370 |     totalfailed : int
371 |         total number of test cases failed
372 |     
373 |     Returns
374 |     -------
375 |     float
376 |         the dstar2 hybrid suspiciousness value
377 |     """
378 |     if totalfailed + passed == 0:
379 |         return failed
380 |     return failed / (totalfailed + passed)
381 | 
382 | 
383 | def jaccard_hybrid_numerator(passed, failed, totalpassed, totalfailed, was_covered):
384 |     """
385 |     Calculates the dstar2 suspiciousness score for the line
386 | 
387 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
388 |     in file analysis/pipeline-scripts/crush-matrix
389 | 
390 |     Parameters
391 |     ----------
392 |     passed : int
393 |         number of test cases passed with the line
394 |     failed : int
395 |         number of test cases failed with the line
396 |     totalpassed : int
397 |         total number of test cases passed
398 |     totalfailed : int
399 |         total number of test cases failed
400 |     was_covered : bool
401 |     
402 |     Returns
403 |     -------
404 |     float
405 |         the dstar2 hybrid suspiciousness value
406 |     """
407 |     if totalfailed + passed == 0:
408 |         return (failed + (1 if was_covered else 0))
409 |     return (failed + (1 if was_covered else 0)) / (totalfailed + passed)
410 | 
411 | 
412 | # Formula to function
413 | FORMULAS = {
414 |     'tarantula': tarantula,
415 |     'ochiai': ochiai,
416 |     'opt2': opt2,
417 |     'barinel': barinel,
418 |     'dstar2': dstar2,
419 |     'muse': muse,
420 |     'jaccard': jaccard
421 | }
422 | 
423 | # Hybrid formula to function
424 | HYBRID_NUMERATOR_FORMULAS = {
425 |     'tarantula': tarantula_hybrid_numerator,
426 |     'ochiai': ochiai_hybrid_numerator,
427 |     'opt2': opt2_hybrid_numerator,
428 |     'barinel': barinel_hybrid_numerator,
429 |     'dstar2': dstar2_hybrid_numerator,
430 |     'muse': muse_hybrid_numerator,
431 |     'jaccard': jaccard_hybrid_numerator
432 | }
433 | 
434 | 
435 | def crush_row(formula, hybrid_scheme, passed, failed, totalpassed, totalfailed, 
436 |               passed_covered=None, failed_covered=None, totalpassed_covered=0, 
437 |               totalfailed_covered=0):
438 |     """
439 |     Calculates the suspiciousness of a statement or mutant based on the forumula
440 |     given
441 | 
442 |     Adapted directly from: https://bitbucket.org/rjust/fault-localization-data/
443 |     in file analysis/pipeline-scripts/crush-matrix
444 | 
445 |     Parameters
446 |     ----------
447 |     formula : str 
448 |         the name of the formula to plug passed/failed/totalpassed/totalfailed into.
449 |     hybrid_scheme : str 
450 |         one of ("numerator", "constant", or "mirror")
451 |         useful for MBFL only, specifies how the formula should be modified to 
452 |         incorporate the number of passing/failing tests that *cover* the mutant 
453 |         (rather than kill it).
454 |     """
455 |     if hybrid_scheme is None:
456 |         return FORMULAS[formula](passed, failed, totalpassed, totalfailed)
457 |     elif hybrid_scheme == 'numerator':
458 |         return HYBRID_NUMERATOR_FORMULAS[formula](passed, failed, totalpassed, totalfailed, failed_covered > 0)
459 |     elif hybrid_scheme == 'constant':
460 |         return FORMULAS[formula](passed, failed, totalpassed, totalfailed) + (1 if failed_covered > 0 else 0)
461 |     elif hybrid_scheme == 'mirror':
462 |         return (FORMULAS[formula](passed, failed, totalpassed, totalfailed) +
463 |                 FORMULAS[formula](passed_covered, failed_covered, totalpassed_covered, totalfailed_covered))/2.
464 |     elif hybrid_scheme == 'coverage-only':
465 |         return FORMULAS[formula](passed_covered, failed_covered, totalpassed_covered, totalfailed_covered)
466 |     raise ValueError('unrecognized hybrid scheme name: {!r}'.format(hybrid_scheme))
467 | 
468 | 
469 | # PassFailTally is a container class for number of test cases passed, failed, and total counts
470 | PassFailTally = collections.namedtuple('PassFailTally', ('n_elements', 'passed', 'failed', 'totalpassed', 'totalfailed'))
471 | 
472 | 
473 | def suspiciousnesses_from_tallies(formula, hybrid_scheme, tally, hybrid_coverage_tally):
474 |     """
475 |     Returns a dict mapping element-number to suspiciousness.
476 | 
477 |     Parameters
478 |     ----------
479 |     formula : str
480 |         the formula to use to calculate suspiciousness
481 |     hybrid_scheme : str
482 |         numerator, constant, or mirror
483 |     tally : PassFailTally
484 |         an object of the PassFailTally class
485 |     hybrid_coverage_tally : PassFailTally
486 |         an object of the PassFailTally class
487 | 
488 |     Returns
489 |     -------
490 |     dict
491 |         a dictionary of element-number to suspiciousness value
492 |     """
493 |     if hybrid_coverage_tally is None:
494 |         passed_covered = failed_covered = collections.defaultdict(lambda: None)
495 |         totalpassed_covered = totalfailed_covered = 0
496 |     else:
497 |         passed_covered = hybrid_coverage_tally.passed
498 |         failed_covered = hybrid_coverage_tally.failed
499 |         totalpassed_covered = hybrid_coverage_tally.totalpassed
500 |         totalfailed_covered = hybrid_coverage_tally.totalfailed
501 |     return {
502 |         element: crush_row(
503 |             formula=formula, hybrid_scheme=hybrid_scheme,
504 |             passed=tally.passed[element], failed=tally.failed[element],
505 |             totalpassed=tally.totalpassed, totalfailed=tally.totalfailed,
506 |             passed_covered=passed_covered[element], failed_covered=failed_covered[element],
507 |             totalpassed_covered=totalpassed_covered, totalfailed_covered=totalfailed_covered)
508 |             for element in range(tally.n_elements)
509 |         }
510 | 
511 | 
512 | # TestSummary is a container class for test summary information
513 | TestSummary = collections.namedtuple('TestSummary', ('triggering', 'covered_elements'))
514 | 
515 | 
516 | def parse_test_summary(line, n_elements):
517 |     words = line.strip().split(' ')
518 |     coverages, sign = words[:-1], words[-1]
519 |     if len(coverages) != n_elements:
520 |         raise ValueError("expected {expected} elements in each row, got {actual} in {line!r}".format(expected=n_elements, actual=len(coverages), line=line))
521 |     return TestSummary(
522 |         triggering=(sign == '-'),
523 |         covered_elements=set(i for i in range(len(words)) if words[i]=='1'))
524 | 
525 | 
526 | def tally_matrix(matrix_file, total_defn, n_elements):
527 |     """
528 |     Returns a PassFailTally describing how many passing/failing tests there are, 
529 |     and how many of each cover each code element.
530 | 
531 |     Parameters
532 |     ----------
533 | 
534 |     total_defn : str 
535 |         may be "tests" (in which case the tally's ``totalpassed`` will be the 
536 |         number of passing tests) or "elements" (in which case it'll be the 
537 |         number of times a passing test covers a code element) 
538 |         (and same for ``totalfailed``).
539 |     n_elements : str 
540 |         is the number of code elements that each row of the matrix 
541 |         should indicate coverage for.
542 |     
543 |     Returns
544 |     -------
545 |     namedtuple
546 | 
547 |     """
548 |     summaries = (parse_test_summary(line, n_elements) for line in matrix_file)
549 | 
550 |     passed = {i: 0 for i in range(n_elements)}
551 |     failed = {i: 0 for i in range(n_elements)}
552 |     totalpassed = 0
553 |     totalfailed = 0
554 |     for summary in summaries:
555 |         if summary.triggering:
556 |             totalfailed += (1 if total_defn == 'tests' else len(summary.covered_elements))
557 |             for element_number in summary.covered_elements:
558 |                 failed[element_number] += 1
559 |         else:
560 |             totalpassed += (1 if total_defn == 'tests' else len(summary.covered_elements))
561 |             for element_number in summary.covered_elements:
562 |                 passed[element_number] += 1
563 | 
564 |     return PassFailTally(n_elements, passed, failed, totalpassed, totalfailed)
565 | 
566 | 
567 | def parse_bug_from_file_name(coverage_file):
568 |     """
569 |     Parse bug from file name by looking
570 | 
571 |     Parameters
572 |     ----------
573 |     coverage_file : str
574 |         the path to the coverage file
575 |     
576 |     Returns
577 |     -------
578 |     str
579 |         returns the project and bug eg Closure-11
580 |     """
581 |     return '-'.join(coverage_file.split('/')[-1].split('-')[:-1])
582 | 
583 | 
584 | def generate_suspiciousness(formula, coverage_file, spectra_file, output_dir):
585 |     """
586 |     Generates the suspiciousness value for each bug in the dataset.
587 | 
588 |     The method reads the coverage file and the spectra file, and uses 
589 |     the method given in the fault-localization-data to generate a value
590 |     between 0 and 1.
591 | 
592 |     Parameters
593 |     ----------
594 |     formula : str
595 |         which formula to use for computing suspiciousness
596 |     coverage_file : str
597 |         path to the coverage file
598 |     spectra_file : str
599 |         path to the spectra file
600 |     output_dir : str
601 | 
602 |     
603 |     Returns
604 |     -------
605 |     dict
606 |         suspiciousness value for all lines
607 |     """
608 |     with open(spectra_file) as name_file:
609 |         element_names = {i: name.strip() for i, name in enumerate(name_file)}
610 | 
611 |     n_elements = len(element_names)
612 | 
613 |     with open(coverage_file) as matrix_file:
614 |         tally = tally_matrix(matrix_file, 'tests', n_elements=n_elements)
615 | 
616 |     suspiciousnesses = suspiciousnesses_from_tallies(
617 |         formula=formula, hybrid_scheme=None,
618 |         tally=tally, hybrid_coverage_tally=None)
619 | 
620 |     bug = parse_bug_from_file_name(coverage_file)
621 | 
622 |     with open(os.path.join(output_dir, '%s-%s-suspiciousness' % (bug, formula)), 'w') as output_file:
623 |         writer = csv.DictWriter(output_file, ['Statement','Suspiciousness'])
624 |         writer.writeheader()
625 |         for element in range(n_elements):
626 |             writer.writerow({
627 |                 'Statement': element_names[element],
628 |                 'Suspiciousness': suspiciousnesses[element]})
629 |     
630 |     return suspiciousnesses
631 | 
632 | 
633 | if __name__ == '__main__':
634 |     # Slight hack to allow for 'all' as a choice
635 |     formula_choices = set(FORMULAS.keys())
636 |     formula_choices.add('all')
637 | 
638 |     parser = argparse.ArgumentParser()
639 |     parser.add_argument('--formula', required=True, choices=formula_choices, help='formula to use for suspiciousness calculation')
640 |     parser.add_argument('--data-dir', required=True, help='data directory that holds coverage and spectra files')
641 |     parser.add_argument('--output-dir', required=True, help='file to write suspiciousness vector to')
642 | 
643 |     args = parser.parse_args()
644 | 
645 |     coverage_files = sorted(glob(os.path.join(args.data_dir, '*coverage')))
646 |     spectra_files = sorted(glob(os.path.join(args.data_dir, '*spectra')))
647 | 
648 |     if len(coverage_files) != len(spectra_files):
649 |         eprint('Number of coverage files is not equal to the number of spectra files')
650 |         sys.exit(-1)
651 |     
652 |     if args.formula == 'all':
653 |         for formula in FORMULAS.keys():
654 |             for coverage_file, spectra_file in zip(coverage_files, spectra_files):
655 |                 generate_suspiciousness(formula, coverage_file, spectra_file, args.output_dir)
656 |     else:
657 |         for coverage_file, spectra_file in zip(coverage_files, spectra_files):
658 |             generate_suspiciousness(args.formula, coverage_file, spectra_file, args.output_dir)


--------------------------------------------------------------------------------