├── .gitignore ├── LICENSE ├── README.md ├── columns.json ├── requirements.txt ├── run_algorithm.py ├── run_evaluation.py ├── run_experiment.sh ├── squeeze ├── __init__.py ├── anomaly_amount_fileter.py ├── clustering │ ├── __init__.py │ ├── cluster.py │ └── density_cluster.py ├── squeeze.py └── squeeze_option.py └── utility ├── __init__.py └── attribute_combination.py /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__/ 2 | output/ 3 | B0/ 4 | pB1/ 5 | B2/ 6 | B3/ 7 | B4/ 8 | A/ 9 | D/ 10 | B0.json 11 | B1.json 12 | B2.json 13 | B3.json 14 | B4.json 15 | A.json 16 | D.json 17 | B0 18 | B1 19 | B2 20 | B3 21 | B4 22 | A 23 | D 24 | .DS_Store 25 | .venv/ 26 | output.csv 27 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 NetManAIOps 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Squeeze 2 | Implementation and datasets for ISSRE 2019 REG paper 'Generic and Robust Localization of Multi-Dimensional Root Cause'. 3 | 4 | ## Requirements 5 | At least `python>=3.6, <3.7` is required. Though Python should be backward-compatible, there is no built wheel for some requirements like SciPy for a higher Python version. 6 | ``` bash 7 | pip install -r requirements.txt 8 | ``` 9 | 10 | ## Datasets 11 | 12 | Datasets `A, B0, B1, B2, B3, B4, D` in Table VII are on [Zenodo](https://zenodo.org/record/8153367) (updated on 2023-07-17). 13 | The ground truth root cause sets are in `injection_info.csv` in each subfolder. 14 | 15 | ## Usage 16 | 17 | ``` 18 | $python run_algorithm.py --help 19 | Usage: run_algorithm.py [OPTIONS] 20 | 21 | :param name: :param input_path: :param output_path: :param num_workers: 22 | :param kwargs: :return: 23 | 24 | Options: 25 | --name TEXT name of this setting 26 | --input-path TEXT will read data from {input_path}/{name} 27 | --output-path TEXT if {output_path} is a dir, save to 28 | {output_path}/{name}.json; otherwise save to 29 | {output_path} 30 | --num-workers INTEGER num of processes 31 | --derived means we should read {timestamp}.a.csv and 32 | {timestamp}.b.csv 33 | --help Show this message and exit. 
34 | ``` 35 | 36 | ``` 37 | $python run_evaluation.py --help 38 | Usage: run_evaluation.py [OPTIONS] 39 | 40 | Options: 41 | -i, --injection-info TEXT injection_info.csv file 42 | -p, --predict TEXT output json file 43 | -c, --config TEXT config json file 44 | -o, --output-path TEXT output path 45 | --help Show this message and exit. 46 | ``` 47 | 48 | The config json file should contain the attribute names, e.g.: 49 | 50 | ``` 51 | { 52 | "columns": [ 53 | "a", "b", "c", "d" 54 | ] 55 | } 56 | ``` 57 | 58 | 59 | 60 | ## Example 61 | 62 | 1. Download `B3.tgz` and extract `B3.tgz` into `B3`. 63 | 64 | 2. Run this command: 65 | 66 | ``` 67 | python run_algorithm.py --name B_cuboid_layer_2_n_ele_2 --input-path B3 --output-path output/ --num-workers 10 68 | ``` 69 | 70 | ​ Then the results are summarized in `output/B_cuboid_layer_2_n_ele_2.json`: 71 | 72 | ```json 73 | [ 74 | { 75 | "timestamp": 1450653900, 76 | "elapsed_time": 10.794443607330322, 77 | "root_cause": "b=b31&d=d2;a=a1&b=b11" 78 | }, 79 | { 80 | "timestamp": 1450666800, 81 | "elapsed_time": 15.272005081176758, 82 | "root_cause": "b=b21&c=c1;a=a4&b=b9&c=c4" 83 | }, 84 | { 85 | "timestamp": 1450667700, 86 | "elapsed_time": 15.22673487663269, 87 | "root_cause": "b=b11&c=c4;a=a2&d=d1" 88 | }, 89 | ... 90 | ] 91 | ``` 92 | 93 | 3. Run evaluation scripts 94 | 95 | ``` bash 96 | python run_evaluation.py -i B3/B_cuboid_layer_2_n_ele_2/injection_info.csv -p output/B_cuboid_layer_2_n_ele_2.json -c columns.json 97 | ``` 98 | 99 | `columns.json` should contain all the attributes. 100 | 101 | ``` 102 | { 103 | "columns": [ 104 | "a", "b", "c", "d" 105 | ] 106 | } 107 | ``` 108 | 109 | Then we get the output (F1-score, precision, recall): 110 | 111 | ``` 112 | ...... 113 | 0.7858942065491183 0.7918781725888325 0.78 114 | ``` 115 | 116 | ## Known Issues 117 | This version of codes is faithful to the published version. 118 | However, two known severe issues are harming the localization performance. 119 | 1. The calculation of `_a1` and `_a2` in `squeeze/squeeze.py:184` is incorrect, which is not following the description in the paper. 120 | It should be corrected as follows 121 | ``` python 122 | reduced_data_p, _ = self.get_derived_dataframe( 123 | frozenset(elements[:partition]), cuboid=cuboid, 124 | reduction="sum", return_complement=True, 125 | subset_indices=np.concatenate([indices, self.normal_indices])) 126 | if len(reduced_data_p): 127 | _a1, _a2 = data_p.predict.values * ( 128 | reduced_data_p.real.item() / reduced_data_p.predict.item() 129 | ), data_n.predict.values 130 | else: 131 | # print(elements[:partition], data_p, reduced_data_p) 132 | assert len(data_p) == 0 133 | _a1 = 0 134 | _a2 = data_n.predict.values 135 | ``` 136 | 2. The calculation of `score_weight` in `squeeze/suqeeze.py:256` may produce negative values, which will cause incorrect localization results. Different from 1, the calculation here is faithful to the paper. 
See https://github.com/NetManAIOps/Squeeze/issues/6 137 | 138 | See also our [extended version](https://github.com/netmanaiops/psqueeze) 139 | 140 | ## Citation 141 | 142 | ``` 143 | @inproceedings{squeeze, 144 | title={Generic and Robust Localization of Multi-Dimensional Root Causes}, 145 | author={Li, Zeyan and Luo, Chengyang and Zhao, Yiwei and Sun, Yongqian and Sui, Kaixin and Wang, Xiping and Liu, Dapeng and Jin, Xing and Wang, Qi and Pei, Dan}, 146 | booktitle={2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE)}, 147 | year={2019}, 148 | organization={IEEE} 149 | } 150 | ``` 151 | -------------------------------------------------------------------------------- /columns.json: -------------------------------------------------------------------------------- 1 | { 2 | "columns": [ 3 | "a", "b", "c", "d" 4 | ] 5 | } 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | -i https://pypi.org/simple 2 | bidict==0.18.0 3 | click==7.0 4 | cycler==0.10.0 5 | joblib==0.13.2 6 | kiwisolver==1.1.0 7 | kneed==0.4.1 8 | loguru==0.3.2 9 | matplotlib==3.1.1 10 | numpy==1.17.0 11 | pandas==0.25.0 12 | pyparsing==2.4.2 13 | python-dateutil==2.8.0 14 | pytz==2019.2 15 | scikit-learn==0.21.3 16 | scipy==1.3.1 17 | seaborn==0.9.0 18 | six==1.12.0 19 | -------------------------------------------------------------------------------- /run_algorithm.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import time 3 | from pathlib import Path 4 | import click 5 | from functools import reduce 6 | from typing import Dict, List 7 | import json 8 | import numpy as np 9 | # from run_apriori import run 10 | from joblib import Parallel, delayed 11 | # noinspection PyProtectedMember 12 | from loguru._defaults import LOGURU_FORMAT 13 | 14 | from utility import AC, AttributeCombination 15 | from squeeze import Squeeze, SqueezeOption 16 | import pandas as pd 17 | from loguru import logger 18 | 19 | import os 20 | 21 | 22 | @click.command('Runner') 23 | @click.option("--name", default="", help="name of this setting") 24 | @click.option("--input-path", help="will read data from {input_path}/{name}") 25 | @click.option("--output-path", help="if {output_path} is a dir, save to {output_path}/{name}.json; \ 26 | otherwise save to {output_path}") 27 | @click.option("--num-workers", default=1, help="num of processes") 28 | @click.option("--derived", is_flag=True, help="means we should read {timestamp}.a.csv and {timestamp}.b.csv") 29 | def main(name, input_path, output_path, num_workers, **kwargs): 30 | """ 31 | :param name: 32 | :param input_path: 33 | :param output_path: 34 | :param num_workers: 35 | :param kwargs: 36 | :return: 37 | """ 38 | logger.remove() 39 | logger.add( 40 | sys.stdout, level="INFO", 41 | format="[{time}, {level}] {message}" 42 | ) 43 | dervied = kwargs.pop('derived') 44 | 45 | input_path = Path(input_path) 46 | assert input_path.exists(), f"{input_path} does not exist" 47 | output_path = Path(output_path) 48 | logger.info(f"read data from {input_path / name}") 49 | if output_path.is_dir(): 50 | output_path = output_path / f"{name}.json" 51 | elif not output_path.exists(): 52 | logger.info(f"create {output_path}") 53 | output_path.mkdir() 54 | output_path = output_path / f"{name}.json" 55 | logger.info(f"save to {output_path}") 56 | injection_info = pd.read_csv(input_path / name / 
'injection_info.csv', engine='c') 57 | timestamps = sorted(injection_info['timestamp']) 58 | # results = list( 59 | # executor(file_path, output_path.parent, **kwargs) 60 | # for file_path in map(lambda x: input_path / name / f'{x}.csv', timestamps) 61 | # ) 62 | if not dervied: 63 | results = Parallel(n_jobs=num_workers, backend="multiprocessing", verbose=100)( 64 | delayed(executor)(file_path, output_path.parent, **kwargs) 65 | for file_path in map(lambda x: input_path / name / f'{x}.csv', timestamps)) 66 | else: 67 | results = Parallel(n_jobs=num_workers, backend="multiprocessing", verbose=100)( 68 | delayed(executor_derived)(file_path_list, output_path.parent, **kwargs) 69 | for file_path_list in map( 70 | lambda x: [input_path / name / f'{x}.a.csv', input_path / name / f'{x}.b.csv'], 71 | timestamps 72 | ) 73 | ) 74 | with open(str(output_path.resolve()), "w+") as f: 75 | json.dump(results, f, indent=4) 76 | logger.info(results) 77 | 78 | 79 | def executor(file_path: Path, output_path: Path, **kwargs) -> Dict: 80 | debug = kwargs.pop('debug', False), 81 | logger.remove() 82 | logger.add( 83 | sys.stdout, level='DEBUG', 84 | format=f"{file_path.name} - {LOGURU_FORMAT}", 85 | backtrace=True 86 | ) 87 | logger.info(f"running squeeze for {file_path}") 88 | df = pd.read_csv(file_path.resolve(), engine='python', dtype='str', delimiter=r"\s*,\s*") 89 | df['real'] = df['real'].astype(float) 90 | df['predict'] = df['predict'].astype(float) 91 | try: 92 | timestamp = int(file_path.name.rstrip('.csv')) 93 | except ValueError: 94 | timestamp = file_path.name.rstrip('.csv') 95 | logger.warning(f"Unresolved timestamp: {timestamp}") 96 | tic = time.time() 97 | 98 | model = Squeeze( 99 | data_list=[df], 100 | op=lambda x: x, 101 | option=SqueezeOption( 102 | debug=debug, 103 | fig_save_path=f"{output_path.resolve()}/{timestamp}" + "{suffix}" + ".pdf", 104 | **kwargs, 105 | ) 106 | ) 107 | model.run() 108 | logger.info("\n" + model.report) 109 | try: 110 | root_cause = AC.batch_to_string( 111 | frozenset(reduce(lambda x, y: x.union(y), model.root_cause, set()))) # type: 112 | except IndexError: 113 | root_cause = "" 114 | 115 | toc = time.time() 116 | elapsed_time = toc - tic 117 | return { 118 | 'timestamp': timestamp, 119 | 'elapsed_time': elapsed_time, 120 | 'root_cause': root_cause, 121 | } 122 | 123 | 124 | def executor_derived(file_path_list: List[Path], output_path: Path, **kwargs) -> Dict: 125 | debug = kwargs.pop('debug', False), 126 | logger.remove() 127 | ts = file_path_list[0].name.rstrip('.a.csv') 128 | logger.add( 129 | sys.stdout, level='DEBUG', 130 | format=f"{ts} - {LOGURU_FORMAT}", 131 | backtrace=True 132 | ) 133 | logger.info(f"running squeeze for {ts}") 134 | dfa = pd.read_csv(file_path_list[0].resolve(), engine='python', dtype='str', delimiter=r"\s*,\s*") 135 | dfa['real'] = dfa['real'].astype(float) 136 | dfa['predict'] = dfa['predict'].astype(float) 137 | dfb = pd.read_csv(file_path_list[1].resolve(), engine='python', dtype='str', delimiter=r"\s*,\s*") 138 | dfb['real'] = dfb['real'].astype(float) 139 | dfb['predict'] = dfb['predict'].astype(float) 140 | zero_index = (dfa.real == 0) & (dfa.predict == 0) & (dfb.real == 0) & (dfb.predict == 0) 141 | dfa = dfa[~zero_index] 142 | dfb = dfb[~zero_index] 143 | try: 144 | timestamp = int(ts) 145 | except ValueError: 146 | timestamp = ts 147 | logger.warning(f"Unresolved timestamp: {timestamp}") 148 | tic = time.time() 149 | 150 | divide = lambda x, y: np.divide(x, y, out=np.zeros_like(x), where=y != 0) 151 | model = Squeeze( 152 | 
data_list=[dfa, dfb], 153 | op=divide, 154 | option=SqueezeOption( 155 | debug=debug, 156 | fig_save_path=f"{output_path.resolve()}/{timestamp}" + "{suffix}" + ".pdf", 157 | enable_filter=True, 158 | **kwargs, 159 | ) 160 | ) 161 | model.run() 162 | logger.info("\n" + model.report) 163 | try: 164 | root_cause = AC.batch_to_string( 165 | frozenset(reduce(lambda x, y: x.union(y), model.root_cause, set()))) # type: 166 | except IndexError: 167 | root_cause = "" 168 | 169 | toc = time.time() 170 | elapsed_time = toc - tic 171 | return { 172 | 'timestamp': timestamp, 173 | 'elapsed_time': elapsed_time, 174 | 'root_cause': root_cause, 175 | } 176 | 177 | if __name__ == '__main__': 178 | main() 179 | -------------------------------------------------------------------------------- /run_evaluation.py: -------------------------------------------------------------------------------- 1 | import click 2 | import pandas as pd 3 | import json 4 | from utility import AttributeCombination as AC 5 | import numpy as np 6 | 7 | 8 | @click.command() 9 | @click.option("--injection-info", '-i', help='injection_info.csv file') 10 | @click.option("--predict", '-p', help='output json file') 11 | @click.option("--config", '-c', help='config json file') 12 | @click.option("--output-path", '-o', help="output path", default="./output.csv") 13 | def main(*args, **kwargs): 14 | evaluate(*args, **kwargs) 15 | 16 | 17 | def evaluate(injection_info, predict, config, output_path, verbose=True, return_detail=False): 18 | injection_info = pd.read_csv(injection_info) 19 | with open(predict, 'r') as f: 20 | predict = json.load(f) 21 | with open(config, 'r') as f: 22 | config = json.load(f) 23 | injection_info.set_index(['timestamp'], inplace=True) 24 | for idx, item in enumerate(predict): 25 | try: 26 | label = predict[idx]['label'] = AC.batch_from_string( 27 | injection_info.loc(axis=0)[int(item['timestamp']), 'set'], 28 | attribute_names=config['columns'] 29 | ) 30 | try: 31 | ret = AC.batch_from_string( 32 | item['root_cause'].replace('|', ';'), 33 | attribute_names=config['columns'] 34 | ) 35 | pred = predict[idx]['pred'] = ret 36 | except Exception as e: 37 | print(item, e) 38 | continue 39 | _fn = len(label) 40 | _tp, _fp = 0, 0 41 | for rc_item in pred: 42 | if rc_item in label: 43 | _fn -= 1 44 | _tp += 1 45 | else: 46 | _fp += 1 47 | except KeyError: 48 | continue 49 | predict[idx]['tp'] = _tp 50 | predict[idx]['fp'] = _fp 51 | predict[idx]['fn'] = _fn 52 | predict[idx]['cuboid_layer'] = len(list(label)[0].non_any_values) 53 | predict[idx]['num_elements'] = len(label) 54 | predict[idx]['significance'] = injection_info.loc(axis=0)[int(item['timestamp']), 'significance'] 55 | if verbose: 56 | print("========================================") 57 | print(f"timestamp:{item['timestamp']}") 58 | print(f"label:{AC.batch_to_string(label)}") 59 | print(f"pred :{AC.batch_to_string(pred)}") 60 | print(f"tp: {_tp}, fp: {_fp}, fn: {_fn}") 61 | del predict[idx]['root_cause'] 62 | df = pd.DataFrame.from_records(predict) 63 | total_fscore = 2 * np.sum(df.tp) / (2 * np.sum(df.tp) + np.sum(df.fp) + np.sum(df.fn)) 64 | total_precision = np.sum(df.tp) / (np.sum(df.tp) + np.sum(df.fp)) 65 | total_recall = np.sum(df.tp) / (np.sum(df.tp) + np.sum(df.fn)) 66 | df_total = pd.DataFrame.from_dict( 67 | {"tp": [np.sum(df.tp)], 68 | "fp": [np.sum(df.fp)], 69 | "fn": [np.sum(df.fn)], 70 | "F1-Score": [total_fscore], 71 | "Precision": [total_precision], 72 | "Recall": [total_recall], 73 | 'Time Cost (s)': [np.mean(df['elapsed_time'])], 74 | 
'time_std': [np.std(df['elapsed_time'])], 75 | 'Total Time Cost (s)': [np.sum(df['elapsed_time'])], 76 | 'length': len(predict), 77 | # 'time_list': df['elapsed_time'].values, 78 | } 79 | ) 80 | if verbose: 81 | print(df_total) 82 | if output_path is not None: 83 | df_total.to_csv(output_path, index=False) 84 | if verbose: 85 | print(total_fscore, total_precision, total_recall) 86 | if return_detail: 87 | return df 88 | return df_total 89 | 90 | 91 | if __name__ == '__main__': 92 | main() 93 | 94 | 95 | -------------------------------------------------------------------------------- /run_experiment.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | DATASET=${1} 3 | SETTING=${2} 4 | NUM_WORKER=20 5 | python run_algorithm.py --name ${SETTING} --input-path ${DATASET} --output-path output/${DATASET}/ --num-workers ${NUM_WORKER} 6 | python run_evaluation.py -i ${DATASET}/${SETTING}/injection_info.csv -p output/${DATASET}/${SETTING}.json -c ${DATASET}.json 7 | -------------------------------------------------------------------------------- /squeeze/__init__.py: -------------------------------------------------------------------------------- 1 | from .squeeze import * 2 | from .squeeze_option import * 3 | -------------------------------------------------------------------------------- /squeeze/anomaly_amount_fileter.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from kneed import KneeLocator 3 | from loguru import logger 4 | from scipy.stats import gaussian_kde 5 | 6 | 7 | class KPIFilter: 8 | def __init__(self, real_array, predict_array): 9 | # self.select_metrics = np.log(np.abs(real_array - predict_array) + 1) / 10 10 | self.select_metrics = np.abs(real_array - predict_array) 11 | # self.select_metrics = np.abs(predict_array - real_array) / np.abs(real_array + predict_array) 12 | kernel = gaussian_kde(self.select_metrics) 13 | _x = sorted(np.linspace(np.min(self.select_metrics), np.max(self.select_metrics), 1000)) 14 | _y = np.cumsum(kernel(_x)) 15 | knee = KneeLocator(_x, _y, curve='concave', direction='increasing').knee 16 | logger.info(f"kneed: {knee}") 17 | if knee is None: 18 | logger.warning("no knee point found") 19 | knee = np.min(self.select_metrics) 20 | self.filtered_indices = np.where(self.select_metrics > knee) 21 | 22 | self.original_indices = np.arange(len(real_array))[self.filtered_indices] 23 | 24 | def inverse_map(self, indices): 25 | return self.original_indices[indices] 26 | -------------------------------------------------------------------------------- /squeeze/clustering/__init__.py: -------------------------------------------------------------------------------- 1 | from .cluster import * 2 | from .density_cluster import * 3 | 4 | 5 | def cluster_factory(option: SqueezeOption): 6 | method_map = { 7 | "density": DensityBased1dCluster, 8 | } 9 | return method_map[option.cluster_method](option) 10 | -------------------------------------------------------------------------------- /squeeze/clustering/cluster.py: -------------------------------------------------------------------------------- 1 | from ..squeeze_option import SqueezeOption 2 | from typing import List 3 | import numpy as np 4 | 5 | 6 | class Cluster: 7 | """ 8 | one dim cluster, give a 1d-array, return each clusters indices 9 | """ 10 | 11 | def __init__(self, option: SqueezeOption): 12 | self.option = option 13 | 14 | def __call__(self, array) -> List[np.ndarray]: 15 | raise 
NotImplementedError() 16 | 17 | -------------------------------------------------------------------------------- /squeeze/clustering/density_cluster.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | import seaborn as sns 3 | import numpy as np 4 | from loguru import logger 5 | from scipy.stats import gaussian_kde 6 | from scipy.signal import argrelextrema 7 | import matplotlib.pyplot as plt 8 | from squeeze.clustering.cluster import Cluster 9 | from squeeze.squeeze_option import SqueezeOption 10 | from kneed import KneeLocator 11 | 12 | 13 | def smooth(arr, window_size): 14 | new_arr = np.convolve(arr, np.ones(window_size), mode="valid") / window_size 15 | new_arr = np.concatenate([arr[:window_size - 1], new_arr]) 16 | assert np.shape(new_arr) == np.shape(arr) 17 | return new_arr 18 | 19 | 20 | class DensityBased1dCluster(Cluster): 21 | def __init__(self, option: SqueezeOption): 22 | super().__init__(option) 23 | assert option.density_estimation_method in {'kde', 'histogram'} 24 | self.density_estimation_func = { 25 | "kde": self._kde, 26 | "histogram": self._histogram, 27 | }[option.density_estimation_method] 28 | 29 | def _kde(self, array: np.ndarray): 30 | kernel = gaussian_kde(array, bw_method=self.option.kde_bw_method, weights=self.option.kde_weights) 31 | samples = np.arange(np.min(array), np.max(array), 0.01) 32 | kde_sample = kernel(points=samples) 33 | conv_kernel = self.option.density_smooth_conv_kernel 34 | kde_sample_smoothed = np.convolve(kde_sample, conv_kernel, 'full') / np.sum(conv_kernel) 35 | return kde_sample_smoothed, samples 36 | 37 | def _histogram(self, array: np.ndarray): 38 | def _get_hist(_width): 39 | if _width == 'auto': 40 | _edges = np.histogram_bin_edges(array, 'auto').tolist() 41 | _edges = [_edges[0] - 0.1 * i for i in range(-5, 0, -1)] + _edges + [_edges[-1] + 0.1 * i for i in 42 | range(1, 6)] 43 | else: 44 | _edges = np.arange(array_range[0] - _width * 6, array_range[1] + _width * 5, _width) 45 | h, edges = np.histogram(array, bins=_edges, density=True) 46 | h /= 100. 
47 | # conv_kernel = self.option.density_smooth_conv_kernel 48 | # h = np.convolve(h, conv_kernel, 'full') / np.sum(conv_kernel) 49 | return h, np.convolve(edges, [1, 1], 'valid') / 2 50 | 51 | def _get_score(_clusters): 52 | if len(_clusters) <= 0: 53 | return float('-inf') 54 | _mu = np.concatenate([np.repeat(np.mean(array[idx]), np.size(idx)) for idx in _clusters]) 55 | _sigma = np.concatenate([np.repeat(np.std(array[idx]), np.size(idx)) for idx in _clusters]) + 1e-8 56 | # _arrays = np.concatenate([array[idx] for idx in _clusters]) 57 | # _scores = np.sum(- np.log(_sigma) - np.square((_arrays - _mu) / _sigma)) 58 | _scores = np.max(_sigma) 59 | return _scores 60 | 61 | array_range = np.min(array), np.max(array) 62 | width = self.option.histogram_bar_width 63 | # if width == 'auto': 64 | # x_list = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10] 65 | # hists = [_get_hist(_width) for _width in x_list] 66 | # # y_list = [len(argrelextrema( 67 | # # _get_hist(_width=_width)[0], comparator=np.greater_equal, 68 | # # axis=0, order=self.option.cluster_smooth_window_size, mode='clip')[0]) for _width in x_list] 69 | # clusters_list = [self._cluster(array, density_array, bins) for density_array, bins in hists] 70 | # y_list = [_get_score(clusters) for clusters in clusters_list] 71 | # split = KneeLocator(x_list, y_list, curve='concave', direction='increasing').knee 72 | # if split is None: 73 | # split = x_list[0] 74 | # # elbow = x_list[np.argmax(y_list)] 75 | # logger.debug(f"{x_list}, {y_list}, {split}") 76 | # width = split 77 | 78 | return _get_hist(width) 79 | 80 | def _cluster(self, array, density_array: np.ndarray, bins, plot=False): 81 | def significant_greater(a, b): 82 | return (a - b) / (a + b) > 0.1 83 | 84 | order = 1 85 | extreme_max_indices = argrelextrema( 86 | density_array, comparator=lambda x, y: x > y, 87 | axis=0, order=order, mode='wrap')[0] 88 | extreme_min_indices = argrelextrema( 89 | density_array, comparator=lambda x, y: x <= y, 90 | axis=0, order=order, mode='wrap')[0] 91 | extreme_max_indices = list(filter(lambda x: density_array[x] > 0, extreme_max_indices)) 92 | if plot: 93 | for idx in extreme_max_indices: 94 | plt.axvline(bins[idx], linestyle="-", color="red", label="relmax", alpha=0.5, linewidth=0.8) 95 | for idx in extreme_min_indices: 96 | plt.axvline(bins[idx], linestyle="--", color="blue", label="relmin", alpha=0.5, linewidth=0.8) 97 | 98 | cluster_list = [] 99 | boundaries = np.asarray([float('-inf')] + [bins[index] for index in extreme_min_indices] + [float('+inf')]) 100 | if self.option.max_normal_deviation == 'auto': 101 | mu = np.mean(np.abs(array)) 102 | max_normal = mu 103 | logger.debug(f"max normal {max_normal}") 104 | self.option.max_normal_deviation = max_normal 105 | for index in extreme_max_indices: 106 | left_boundary = boundaries[np.searchsorted(boundaries, bins[index], side='right') - 1] 107 | right_boundary = boundaries[np.searchsorted(boundaries, bins[index], side='left')] 108 | cluster_indices = np.where( 109 | np.logical_and( 110 | array <= right_boundary, 111 | array >= left_boundary, 112 | ) 113 | )[0] 114 | cluster = array[cluster_indices] 115 | mu = np.mean(np.abs(cluster)) 116 | logger.debug(f"({left_boundary, right_boundary}, {mu})") 117 | if np.abs(mu) < self.option.max_normal_deviation or len(cluster) <= 0: 118 | continue 119 | cluster_list.append(cluster_indices) 120 | return cluster_list 121 | 122 | def __call__(self, array): 123 | array = array.copy() 124 | density_array, bins = self.density_estimation_func(array) 
125 | # normal_idxes = self._find_normal_indices(array, density_array, bins) 126 | # density_array, bins = self.density_estimation_func(array[~normal_idxes]) 127 | density_array = np.copy(density_array) 128 | if self.option.cluster_smooth_window_size == "auto": 129 | # window_size = max(int(np.log(np.count_nonzero(bins[density_array > 0.])) / np.log(10)), 1) 130 | window_size = max(np.count_nonzero(density_array > 0) // 10, 1) 131 | logger.debug(f"auto window size: {window_size} {np.count_nonzero(density_array > 0)}") 132 | else: 133 | window_size = self.option.cluster_smooth_window_size 134 | smoothed_density_array = smooth(density_array, window_size) 135 | if self.option.debug: 136 | fig, ax1 = plt.subplots(figsize=(3.6, 1.8)) 137 | sns.distplot(array, bins='auto', label="density", hist=True, kde=False, norm_hist=True, ax=ax1) 138 | ax1.set_ylim([0, None]) 139 | # ax2 = ax1.twinx() 140 | # ax2.plot(bins, smoothed_density_array, label="smoothed", linestyle="-.") 141 | # ax2.set_ylim([0, None]) 142 | clusters = self._cluster(array, smoothed_density_array, bins, plot=self.option.debug) 143 | if self.option.debug: 144 | for cluster in clusters: 145 | left_boundary, right_boundary = np.min(array[cluster]), np.max(array[cluster]) 146 | # plt.axvline(left_boundary, c='C0', alpha=0.5, linestyle='--') 147 | # plt.axvline(right_boundary, c='C1', alpha=0.5, linestyle=':') 148 | logger.debug(f"cluster: [{left_boundary}, {right_boundary}]") 149 | by_label1 = dict(zip(*reversed(ax1.get_legend_handles_labels()))) 150 | # by_label2 = dict(zip(*reversed(ax2.get_legend_handles_labels()))) 151 | by_label2 = {} 152 | # logger.debug(f"{by_label1}, {by_label2}") 153 | plt.legend( 154 | list(by_label1.values()) + list(by_label2.values()), 155 | list(by_label1.keys()) + list(by_label2.keys()), bbox_to_anchor=(0.47, 0.5) 156 | ) 157 | plt.xlim([-0.9, 1]) 158 | # plt.title(self.option.density_estimation_method) 159 | plt.xlabel('deviation score') 160 | plt.ylabel('pdf') 161 | plt.tight_layout() 162 | # plt.show() 163 | plt.savefig(self.option.fig_save_path.format(suffix="_density_cluster")) 164 | plt.close() 165 | return clusters 166 | 167 | -------------------------------------------------------------------------------- /squeeze/squeeze.py: -------------------------------------------------------------------------------- 1 | from functools import lru_cache 2 | from itertools import combinations 3 | import pandas as pd 4 | from typing import List, FrozenSet, Dict, Union 5 | from loguru import logger 6 | from scipy.stats import entropy, norm 7 | from sklearn.metrics import log_loss 8 | from typing import Tuple 9 | from utility import AttributeCombination as AC, AttributeCombination 10 | from bidict import bidict 11 | import numpy as np 12 | from squeeze.anomaly_amount_fileter import KPIFilter 13 | from squeeze.squeeze_option import SqueezeOption 14 | from squeeze.clustering import cluster_factory 15 | from scipy.spatial.distance import cityblock, euclidean 16 | 17 | 18 | class Squeeze: 19 | def __init__(self, data_list: List[pd.DataFrame], op=lambda x: x, option: SqueezeOption = SqueezeOption()): 20 | """ 21 | :param data_list: dataframe without index, 22 | must have 'real' and 'predict' columns, other columns are considered as attributes 23 | all elements in this list must have exactly the same attribute combinations in the same order 24 | """ 25 | self.option = option 26 | 27 | self.one_dim_cluster = cluster_factory(self.option) 28 | self.cluster_list = [] # type: List[np.ndarray] 29 | 30 | valid_idx = 
np.logical_and.reduce( 31 | [_.predict > 0 for _ in data_list], 32 | ) 33 | 34 | self.data_list = list(_[valid_idx] for _ in data_list) 35 | self.op = op 36 | self.derived_data = self.get_derived_dataframe(None) # type: pd.DataFrame 37 | # There is an error in injection 38 | self.derived_data.real -= min(np.min(self.derived_data.real), 0) 39 | 40 | self.attribute_names = list(sorted(set(self.derived_data.columns) - {'real', 'predict'})) 41 | logger.debug(f"available attributes: {self.attribute_names}") 42 | 43 | self.derived_data.sort_values(by=self.attribute_names, inplace=True) 44 | self.data_list = list(map(lambda x: x.sort_values(by=self.attribute_names), self.data_list)) 45 | 46 | self.attribute_values = list(list(set(self.derived_data.loc[:, name].values)) for name in self.attribute_names) 47 | logger.debug(f"available values: {self.attribute_values}") 48 | 49 | self.ac_array = np.asarray( 50 | [AC(**record) for record in self.derived_data[self.attribute_names].to_dict(orient='records')]) 51 | 52 | self._v = self.derived_data['real'].values 53 | self._f = self.derived_data['predict'].values 54 | assert all(self._v >= 0) and all(self._f >= 0), \ 55 | f"currently we assume that KPIs are non-negative, {self.derived_data[~(self._f >= 0)]}" 56 | 57 | self.__finished = False 58 | self._root_cause = [] 59 | 60 | self.filtered_indices = None 61 | 62 | @property 63 | @lru_cache() 64 | def root_cause(self): 65 | return self._root_cause 66 | 67 | @property 68 | @lru_cache() 69 | def report(self) -> str: 70 | cluster_impacts = [ 71 | np.sum(np.abs(self._f[idx] - self._v[idx])) for idx in self.cluster_list 72 | ] 73 | unique_root_cause, rc_indies = np.unique(self.root_cause, return_index=True) 74 | cluster_impacts = [ 75 | np.sum(cluster_impacts[idx]) for idx in rc_indies 76 | ] 77 | logger.debug(f"{unique_root_cause}, {cluster_impacts}") 78 | report_df = pd.DataFrame(columns=['root_cause', 'impact']) 79 | report_df['root_cause'] = list(AC.batch_to_string(_) for _ in unique_root_cause) 80 | report_df['impact'] = cluster_impacts 81 | report_df.sort_values(by=['impact'], inplace=True, ascending=False) 82 | return report_df.to_csv(index=False) 83 | 84 | @lru_cache() 85 | def get_cuboid_ac_array(self, cuboid: Tuple[str, ...]): 86 | return np.asarray(list(map(lambda x: x.mask(cuboid), self.ac_array))) 87 | 88 | @lru_cache() 89 | def get_indexed_data(self, cuboid: Tuple[str, ...]): 90 | return self.derived_data.set_index(list(cuboid)) 91 | 92 | @property 93 | @lru_cache() 94 | def normal_indices(self): 95 | abnormal = np.sort(np.concatenate(self.cluster_list)) 96 | idx = np.argsort(np.abs(self.leaf_deviation_score[abnormal])) 97 | abnormal = abnormal[idx] 98 | normal = np.where(np.abs(self.leaf_deviation_score) < self.leaf_deviation_score[abnormal[0]])[0] 99 | # normal = np.setdiff1d(np.arange(len(self.derived_data)), abnormal, assume_unique=True) 100 | # return np.intersect1d(normal, self.filtered_indices, assume_unique=True) 101 | return normal 102 | 103 | def run(self): 104 | if self.__finished: 105 | logger.warning(f"try to rerun {self}") 106 | return self 107 | if self.option.enable_filter: 108 | kpi_filter = KPIFilter(self._v, self._f) 109 | self.filtered_indices = kpi_filter.filtered_indices 110 | cluster_list = self.one_dim_cluster(self.leaf_deviation_score[self.filtered_indices]) 111 | cluster_list = list( 112 | [kpi_filter.inverse_map(_) for _ in cluster_list] 113 | ) 114 | cluster_list = list( 115 | [list( 116 | filter(lambda x: np.min(self.leaf_deviation_score[_]) <= 
self.leaf_deviation_score[x] <= np.max( 117 | self.leaf_deviation_score[_]), np.arange(len(self._f))) 118 | ) 119 | for _ in cluster_list] 120 | ) 121 | self.cluster_list = cluster_list 122 | else: 123 | self.filtered_indices = np.ones(len(self._v), dtype=bool) 124 | self.cluster_list = self.one_dim_cluster(self.leaf_deviation_score) 125 | 126 | self.locate_root_cause() 127 | self.__finished = True 128 | self._root_cause = self._root_cause 129 | return self 130 | 131 | def _locate_in_cuboid(self, cuboid, indices, **params) -> Tuple[FrozenSet[AC], float]: 132 | """ 133 | :param cuboid: try to find root cause in this cuboid 134 | :param indices: anomaly leaf nodes' indices 135 | :return: root causes and their score 136 | """ 137 | # mu = params.get("mu") 138 | # sigma = params.get("sigma") 139 | data_cuboid_indexed = self.get_indexed_data(cuboid) 140 | logger.debug(f"current cuboid: {cuboid}") 141 | 142 | abnormal_cuboid_ac_arr = self.get_cuboid_ac_array(cuboid)[indices] 143 | elements, num_elements = np.unique(abnormal_cuboid_ac_arr, return_counts=True) 144 | 145 | num_ele_descents = np.asarray(list( 146 | np.count_nonzero( 147 | _.index_dataframe(data_cuboid_indexed), 148 | ) for _ in elements 149 | )) 150 | # sort reversely by descent score 151 | descent_score = num_elements / np.maximum(num_ele_descents, 1e-4) 152 | idx = np.argsort(descent_score)[::-1] 153 | elements = elements[idx] 154 | num_ele_descents = num_ele_descents[idx] 155 | num_elements = num_elements[idx] 156 | 157 | # descent_score = descent_score[idx] 158 | del descent_score 159 | 160 | logger.debug(f"elements: {';'.join(str(_) for _ in elements)}") 161 | 162 | def _root_cause_score(partition: int) -> float: 163 | dis_f = cityblock 164 | data_p, data_n = self.get_derived_dataframe( 165 | frozenset(elements[:partition]), cuboid=cuboid, 166 | reduction=lambda x: x, return_complement=True, 167 | subset_indices=np.concatenate([indices, self.normal_indices])) 168 | assert len(data_p) + len(data_n) == len(indices) + len(self.normal_indices), \ 169 | f'{len(data_n)} {len(data_p)} {len(indices)} {len(self.normal_indices)}' 170 | # dp = self.__deviation_score(data_p['real'].values, data_p['predict'].values) 171 | # dn = self.__deviation_score(data_n['real'].values, data_n['predict'].values) if len(data_n) else [] 172 | # log_ll = np.mean(norm.pdf(dp, loc=mu, scale=sigma)) \ 173 | # + np.mean(norm.pdf(dn, loc=0, scale=self.option.normal_deviation_std)) 174 | _abnormal_descent_score = np.sum(num_elements[:partition]) / np.sum(num_ele_descents[:partition]) 175 | _normal_descent_score = 1 - np.sum(num_elements[partition:] / np.sum(num_ele_descents[partition:])) 176 | _ds = _normal_descent_score * _abnormal_descent_score 177 | succinct = partition + len(cuboid) * len(cuboid) 178 | _pv, _pf = np.sum(data_p.real.values), np.sum(data_p.predict.values) 179 | _lp = len(data_p) 180 | _v1, _v2 = data_p.real.values, data_n.real.values 181 | _v = np.concatenate([_v1, _v2]) 182 | _f1, _f2 = data_p.predict.values, data_n.predict.values 183 | _f = np.concatenate([_f1, _f2]) 184 | _a1, _a2 = data_p.predict.values * (_pv / _pf), data_n.predict.values 185 | _a = np.concatenate([_a1, _a2]) 186 | divide = lambda x, y: x / y if y > 0 else (0 if x == 0 else float('inf')) 187 | _ps = 1 - (divide(dis_f(_v1, _a1), len(_v1)) + divide(dis_f(_v2, _f2), len(_v2))) \ 188 | / (divide(dis_f(_v1, _f1), len(_v1)) + divide(dis_f(_v2, _f2), len(_v2))) 189 | logger.debug( 190 | f"partition:{partition} " 191 | # f"log_ll:{log_ll} " 192 | # f"impact: {impact_score} " 193 
| f"succinct: {succinct} " 194 | f"ps: {_ps}" 195 | ) 196 | # return _p * self.option.score_weight / (-succinct) 197 | return _ps 198 | 199 | partitions = np.arange( 200 | min( 201 | len(elements), 202 | self.option.max_num_elements_single_cluster, 203 | len(set(self.get_indexed_data(cuboid).index.values)) - 1 204 | ) 205 | ) + 1 206 | if len(partitions) <= 0: 207 | return elements, float('-inf') 208 | rc_scores = np.asarray(list(map(_root_cause_score, partitions))) 209 | idx = np.argsort(rc_scores)[::-1] 210 | partitions = partitions[idx] 211 | rc_scores = rc_scores[idx] 212 | 213 | score = rc_scores[0] 214 | rc = elements[:partitions[0].item()] 215 | logger.debug(f"cuboid {cuboid} gives root cause {AC.batch_to_string(rc)} with score {score}") 216 | return rc.tolist(), score 217 | 218 | def _locate_in_cluster(self, indices: np.ndarray): 219 | """ 220 | :param indices: indices of leaf nodes in this cluster 221 | :return: None 222 | """ 223 | mu = np.mean(self.leaf_deviation_score[indices]) 224 | sigma = np.maximum(np.std(self.leaf_deviation_score[indices]), 1e-4) 225 | logger.debug(f"locate in cluster: {mu}(+-{sigma})") 226 | max_cuboid_layer = len(self.attribute_names) 227 | ret_lists = [] 228 | for cuboid_layer in np.arange(max_cuboid_layer) + 1: 229 | layer_ret_lists = list(map( 230 | lambda x, _i=indices, _mu=mu, _sigma=sigma: self._locate_in_cuboid(x, indices=_i, mu=_mu, sigma=_sigma), 231 | combinations(self.attribute_names, cuboid_layer) 232 | )) 233 | ret_lists.extend([ 234 | { 235 | 'rc': x[0], 'score': x[1], 'n_ele': len(x[0]), 'layer': cuboid_layer, 236 | 'rank': x[1] * self.option.score_weight - len(x[0]) * cuboid_layer 237 | } for x in layer_ret_lists 238 | ]) 239 | if len(list(filter(lambda x: x['score'] > self.option.ps_upper_bound, ret_lists))): 240 | break 241 | ret_lists = list(sorted( 242 | ret_lists, 243 | key=lambda x: x['rank'], 244 | reverse=True) 245 | ) 246 | if ret_lists: 247 | ret = ret_lists[0]['rc'] 248 | logger.debug( 249 | f"find root cause: {AC.batch_to_string(ret)}, rank: {ret_lists[0]['rank']}, score: {ret_lists[0]['score']}") 250 | self._root_cause.append(frozenset(ret)) 251 | else: 252 | logger.info("failed to find root cause") 253 | 254 | def locate_root_cause(self): 255 | if not self.cluster_list: 256 | logger.info("We do not have abnormal points") 257 | return 258 | if self.option.score_weight == 'auto': 259 | self.option.score_weight = - np.log( 260 | len(self.cluster_list) * 261 | sum(len(_) for _ in self.cluster_list) / len(self._f)) / np.log( 262 | sum(len(_) for _ in self.attribute_values)) * sum(len(_) for _ in self.attribute_values) 263 | # self.option.score_weight = len(self.cluster_list) * \ 264 | # (np.log(sum(len(_) for _ in self.cluster_list)) + np.sum([np.log(len(_)) for _ in self.attribute_values]) - np.log(len(self.cluster_list)) - np.log(len(self.leaf_deviation_score))) \ 265 | # / np.log(np.mean([len(_) for _ in self.attribute_values])) * 10 266 | logger.debug(f"auto score weight: {self.option.score_weight}") 267 | for indices in self.cluster_list: 268 | self._locate_in_cluster(indices) 269 | 270 | @property 271 | @lru_cache() 272 | def leaf_deviation_score(self): 273 | with np.errstate(divide='ignore', invalid='ignore'): 274 | deviation_scores = self.__deviation_score(self._v, self._f) 275 | assert np.shape(deviation_scores) == np.shape(self._v) == np.shape(self._f) 276 | assert np.sum(np.isnan(deviation_scores)) == 0, \ 277 | f"there are nan in deviation score {np.where(np.isnan(deviation_scores))}" 278 | assert 
np.sum(~np.isfinite(deviation_scores)) == 0, \ 279 | f"there are infinity in deviation score {np.where(~np.isfinite(deviation_scores))}" 280 | logger.debug(f"anomaly ratio ranges in [{np.min(deviation_scores)}, {np.max(deviation_scores)}]") 281 | return deviation_scores 282 | 283 | def get_derived_dataframe(self, ac_set: Union[FrozenSet[AC], None], cuboid: Tuple[str] = None, 284 | reduction=None, return_complement=False, subset_indices=None): 285 | subset = np.zeros(len(self.data_list[0]), dtype=np.bool) 286 | if subset_indices is not None: 287 | subset[subset_indices] = True 288 | else: 289 | subset[:] = True 290 | 291 | if reduction == "sum": 292 | reduce = lambda x, _axis=0: np.sum(x, axis=_axis, keepdims=True) 293 | else: 294 | reduce = lambda x: x 295 | 296 | if ac_set is None: 297 | idx = np.ones(shape=(len(self.data_list[0]),), dtype=np.bool) 298 | else: 299 | idx = AC.batch_index_dataframe(ac_set, self.get_indexed_data(cuboid)) 300 | 301 | def _get_ret(_data_list): 302 | if len(_data_list[0]) == 0: 303 | return pd.DataFrame(data=[], columns=['real', 'predict']) 304 | _values = self.op(*[reduce(_data[["real", "predict"]].values) for _data in _data_list]) 305 | if np.size(_values) == 0: 306 | _values = [] 307 | if reduction == 'sum': 308 | _ret = pd.DataFrame(data=_values, columns=['real', 'predict']) 309 | else: 310 | _ret = _data_list[0].copy(deep=True) 311 | _ret[['real', 'predict']] = _values 312 | return _ret 313 | 314 | data_list = list(_[idx & subset] for _ in self.data_list) 315 | if not return_complement: 316 | return _get_ret(data_list) 317 | complement_data_list = list(_[(~idx) & subset] for _ in self.data_list) 318 | return _get_ret(data_list), _get_ret(complement_data_list) 319 | 320 | @staticmethod 321 | def __deviation_score(v, f): 322 | n = 1 323 | with np.errstate(divide='ignore'): 324 | ret = n * (f - v) / (n * f + v) 325 | # ret = np.log(np.maximum(v, 1e-10)) - np.log(np.maximum(f, 1e-10)) 326 | # ret = (2 * sigmoid(1 - v / f) - 1) 327 | # k = np.log(np.maximum(v, 1e-100)) - np.log(np.maximum(f, 1e-100)) 328 | # ret = (1 - k) / (1 + k) 329 | ret[np.isnan(ret)] = 0. 330 | return ret 331 | -------------------------------------------------------------------------------- /squeeze/squeeze_option.py: -------------------------------------------------------------------------------- 1 | class SqueezeOption: 2 | def __init__(self, **kwargs): 3 | self.debug = False 4 | self.fig_save_path = "/outputs/fig_{suffix}.pdf" 5 | 6 | # Filter 7 | self.enable_filter = True 8 | 9 | # Density Estimation 10 | self.cluster_method = "density" 11 | self.density_estimation_method = 'histogram' 12 | 13 | # KDE 14 | self.density_smooth_conv_kernel = [1.] 
15 | self.kde_bw_method = None 16 | self.kde_weights = None 17 | 18 | # Histogram 19 | self.histogram_bar_width = "auto" 20 | 21 | # relative max 22 | self.max_allowed_deviation_bias = 0.10 23 | self.max_allowed_deviation_std = 0.01 24 | 25 | # Cluster 26 | self.cluster_smooth_window_size = "auto" 27 | self.max_normal_deviation = 0.20 28 | 29 | # Group 30 | # self.least_score = 2.0 31 | self.least_descent_score = 0.6 32 | self.normal_deviation_std = 0.1 33 | self.score_weight = "auto" 34 | self.max_num_elements_single_cluster = 12 35 | self.ps_upper_bound = 0.90 36 | 37 | self.__dict__.update(kwargs) 38 | -------------------------------------------------------------------------------- /utility/__init__.py: -------------------------------------------------------------------------------- 1 | from .attribute_combination import * 2 | -------------------------------------------------------------------------------- /utility/attribute_combination.py: -------------------------------------------------------------------------------- 1 | import copy 2 | from functools import reduce, lru_cache 3 | import numpy as np 4 | import pandas as pd 5 | from loguru import logger 6 | from typing import List, FrozenSet, Sequence, Union, Iterable 7 | 8 | 9 | class AttributeCombination(dict): 10 | ANY = '__ANY__' 11 | 12 | def __init__(self, **kwargs): 13 | super().__init__(**{key: str(value) for key, value in kwargs.items()}) 14 | self.__id = None 15 | self.non_any_keys = tuple() 16 | self.non_any_values = tuple() 17 | self.__is_terminal = False 18 | self.__update() 19 | 20 | def __update(self): 21 | self.__id = tuple((key, self[key]) for key in sorted(self.keys())) 22 | self.non_any_keys = tuple(_ for _ in sorted(self.keys()) if self[_] != self.ANY) 23 | self.non_any_values = tuple(self[_] for _ in sorted(self.keys()) if self[_] != self.ANY) 24 | self.__is_terminal = not any(self.ANY == value for value in self.values()) 25 | 26 | def __eq__(self, other: 'AttributeCombination'): 27 | return self.__id == other.__id 28 | 29 | def __lt__(self, other): 30 | return self.__id < other.__id 31 | 32 | def __le__(self, other): 33 | return self.__id <= other.__id 34 | 35 | def __hash__(self): 36 | return hash(self.__id) 37 | 38 | def __setitem__(self, key, value): 39 | super().__setitem__(key, str(value)) 40 | self.__update() 41 | 42 | def __str__(self): 43 | return "&".join(f"{key}={value}" for key, value in zip(self.non_any_keys, self.non_any_values)) 44 | 45 | @staticmethod 46 | def from_string(string: str, attribute_names) -> 'AttributeCombination': 47 | ret = AttributeCombination.get_root_attribute_combination(attribute_names) 48 | for pair in string.split("&"): 49 | if pair == "": 50 | continue 51 | key, value = pair.split("=") 52 | ret[key] = value 53 | return ret 54 | 55 | @staticmethod 56 | def batch_from_string(string: str, attribute_names) -> 'FrozenSet[AttributeCombination]': 57 | return frozenset({AttributeCombination.from_string(_, attribute_names) for _ in string.split(";")}) 58 | 59 | @staticmethod 60 | def batch_to_string(sets: Iterable['AttributeCombination']) -> str: 61 | return ";".join(str(_) for _ in sets) 62 | 63 | def copy_and_update(self, other): 64 | o = copy.copy(self) 65 | o.update(other) 66 | o.__update() 67 | return o 68 | 69 | @staticmethod 70 | def get_attribute_combination(data: pd.DataFrame): 71 | columns = list(set(data.columns) - {'real', 'predict'}) 72 | _attributes = AttributeCombination() 73 | for column in columns: 74 | _attributes[column] = AttributeCombination.ANY 75 | return 
_attributes 76 | 77 | def index_dataframe_without_index(self, data: pd.DataFrame): 78 | # noinspection PyTypeChecker 79 | return reduce(np.logical_and, 80 | [data[key] == value for key, value in self.items() if value != self.ANY], 81 | np.ones(len(data), dtype=bool)) 82 | 83 | def index_dataframe(self, data: pd.DataFrame): 84 | if len(self.non_any_values) == 0: 85 | return np.ones(len(data), dtype=np.bool) 86 | try: 87 | arr = np.zeros(shape=len(data), dtype=np.bool) 88 | if len(self.non_any_values) == 1: 89 | idx = data.index.get_loc(self.non_any_values[0]) 90 | else: 91 | idx = data.index.get_loc(self.non_any_values) 92 | arr[idx] = True 93 | return arr 94 | except KeyError: 95 | return np.zeros(len(data), dtype=np.bool) 96 | 97 | def is_terminal(self): 98 | return self.__is_terminal 99 | 100 | @staticmethod 101 | def batch_index_dataframe(attribute_combinations, data: pd.DataFrame): 102 | # noinspection PyTypeChecker 103 | index = reduce(np.logical_or, 104 | (_.index_dataframe(data) for _ in attribute_combinations), 105 | np.zeros(len(data), dtype=np.bool)) 106 | return index 107 | 108 | @staticmethod 109 | def batch_index_dataframe_without_index(attribute_combinations, data: pd.DataFrame): 110 | # noinspection PyTypeChecker 111 | index = reduce(np.logical_or, 112 | (_.index_dataframe_without_index(data) for _ in attribute_combinations), 113 | np.zeros(len(data), dtype=np.bool)) 114 | return index 115 | 116 | @staticmethod 117 | def get_root_attribute_combination(attribute_names): 118 | return AttributeCombination(**{key: AttributeCombination.ANY for key in attribute_names}) 119 | 120 | def is_descent(self, other): 121 | return all(self.__attribute_is_descent(sorted(item_a), sorted(item_b)) 122 | for item_a, item_b in zip(self.items(), other.items())) 123 | 124 | @staticmethod 125 | def __attribute_is_descent(a, b): 126 | return a[0] == b[0] and (a[1] == b[1] or b[1] == AttributeCombination.ANY) 127 | 128 | def mask(self, keys): 129 | """ 130 | :param keys: keep which keys 131 | :return: a new attribute combination, keep keys, the others are set ANY 132 | """ 133 | to_fill_keys = set(self.keys()) - set(keys) 134 | return self.copy_and_update({key: self.ANY for key in to_fill_keys}) 135 | 136 | @staticmethod 137 | def from_iops_2019_format(string: str, attribute_names=None) -> FrozenSet['AttributeCombination']: 138 | """ 139 | :param attribute_names: 140 | :param string: 141 | :return: 142 | """ 143 | if attribute_names is None: 144 | attribute_names = ['i', 'e', 'c', 'p', 'l'] 145 | root = AttributeCombination(**{key: AttributeCombination.ANY for key in attribute_names}) 146 | results = {root.copy_and_update({_[0]: _ for _ in case.split('&') if _ != ''}) for case in string.split(';')} 147 | return frozenset(results) 148 | 149 | @staticmethod 150 | def to_iops_2019_format(attribute_combinations: Iterable['AttributeCombination']): 151 | return ";".join("&".join(_.non_any_values) for _ in attribute_combinations) 152 | 153 | 154 | AC = AttributeCombination 155 | --------------------------------------------------------------------------------