├── .gitignore
├── .idea
│   └── vcs.xml
├── AutosklearnModellingLossOverTimeExample.png
├── LICENSE.txt
├── README.md
├── bin
│   ├── dataTransformationProcessing.py
│   ├── evaluate-dataset-Adult.py
│   ├── load-dataset-Adult.py
│   ├── load-dataset-Titanic.py
│   ├── utility.py
│   └── zeroconf.py
├── data
│   ├── Adult.h5
│   ├── Adult.h5.tar.gz
│   ├── adult.data
│   ├── adult.names
│   ├── adult.test
│   └── adult.test.withid
├── log
│   └── .gitignore
├── parameter
│   ├── default.yml
│   ├── logger.yml
│   └── standard.yml
├── requirements.txt
└── work
    └── .gitignore
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by .ignore support plugin (hsz.mobi)
2 | work
3 | log
4 | data/zeroconf-result.csv
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
11 | 
12 | ## Algorithms included
13 | bernoulli_nb,
14 | extra_trees,
15 | gaussian_nb,
16 | adaboost,
17 | gradient_boosting,
18 | k_nearest_neighbors,
19 | lda,
20 | liblinear_svc,
21 | multinomial_nb,
22 | passive_aggressive,
23 | random_forest,
24 | sgd
25 |
26 | plus samplers, scalers, imputers (14 feature processing methods, and 3 data preprocessing
27 | methods, giving rise to a structured hypothesis space with 100+ hyperparameters)
28 |
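
The names above are the keys of auto-sklearn's internal registry of classification components. A minimal sketch of listing them (assuming the auto-sklearn 0.3.0 pinned in requirements.txt):

```python
# Minimal sketch, assuming auto-sklearn 0.3.0 as pinned in requirements.txt.
# _classifiers is the internal registry of classification components; it is the
# same dict that time_single_estimator() in bin/dataTransformationProcessing.py
# loops over to measure per-classifier fit times.
import autosklearn.pipeline.components.classification as classification_components

for clf_name in sorted(classification_components._classifiers):
    print(clf_name)  # adaboost, bernoulli_nb, decision_tree, extra_trees, ...
```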
29 | ## Running autosklearn-zeroconf
30 | To run autosklearn-zeroconf start `python bin/zeroconf.py -d your_dataframe.h5` from the command line.
31 | The script was tested on Ubuntu and RedHat. It won't work on any Windows OS because auto-sklearn doesn't support Windows.
32 | 
33 | ## Data Format
34 | The code uses a pandas dataframe to manage the data. It is stored in an HDF5 .h5 file for convenience (Python module "tables").
35 | 
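A minimal sketch of preparing such an .h5 file, following the pattern of bin/load-dataset-Adult.py and bin/load-dataset-Titanic.py (the input file and its original column names here are hypothetical; `cust_id` and `category` are the defaults from parameter/default.yml):

```python
import pandas as pd

# hypothetical input file and column names - adapt to your data
dataframe = pd.read_csv('your_data.csv')
dataframe = dataframe.rename(columns={'row_id': 'cust_id', 'target': 'category'})

# rows with an empty (NaN/None) category are the ones zeroconf.py will predict
store = pd.HDFStore('your_dataframe.h5')  # requires the "tables" module
store['data'] = dataframe                 # zeroconf.py reads the 'data' key
store.close()
```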
36 | ## Example
37 | As an example you can run autosklearn-zeroconf on the "Census Income" dataset https://archive.ics.uci.edu/ml/datasets/Adult.
38 | `python ./bin/zeroconf.py -d ./data/Adult.h5`
39 | And then, to evaluate the prediction stored in zeroconf-result.csv against the test dataset file adult.test.withid:
40 | `python ./bin/evaluate-dataset-Adult.py`
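For reference, zeroconf-result.csv is a plain two-column CSV holding the id field and the predicted class for every row whose category was empty in the input (the values below are illustrative only):

```
cust_id,prediction
0,0
1,0
2,1
```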
41 | 
42 | ## Installation
43 | The script itself needs no installation; just copy it, with the rest of the files, into your working directory.
44 | Alternatively you could use git clone:
45 | ```
46 | sudo apt-get update && sudo apt-get install git && git clone https://github.com/paypal/autosklearn-zeroconf.git
47 | ```
48 | 
49 | ### Happy path installation on Ubuntu 18.04 LTS
50 | ```
51 | sudo apt-get update && sudo apt-get install git gcc build-essential swig python-pip virtualenv python3-dev
52 | git clone https://github.com/paypal/autosklearn-zeroconf.git
53 | pip install virtualenv
54 | virtualenv zeroconf -p /usr/bin/python3.6
55 | source zeroconf/bin/activate
56 | curl https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/master/requirements.txt | xargs -n 1 -L 1 pip install
57 | git clone https://github.com/paypal/autosklearn-zeroconf.git
58 | cd autosklearn-zeroconf/ && python ./bin/zeroconf.py -d ./data/Adult.h5 2>/dev/null
59 | ```
60 | 
61 | ## License
62 | autosklearn-zeroconf is licensed under the [BSD 3-Clause License (Revised)](LICENSE.txt)
63 | 
64 | ## Example of the output
65 | ```
66 | python zeroconf.py -d ./data/Adult.h5 2>/dev/null | grep [ZEROCONF]
67 | 
68 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO - Program Call Parameter (Arguments and Parameter File Values):
69 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO - basedir: /home/ulrich/PycharmProjects/autosklearn-zeroconf
70 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO - data_file: /home/ulrich/PycharmProjects/autosklearn-zeroconf/data/Adult.h5
71 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - id_field: cust_id
72 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - max_classifier_time_budget: 1200
73 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - max_sample_size: 100000
74 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - memory_limit: 15000
75 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - parameter_file: /home/ulrich/PycharmProjects/autosklearn-zeroconf/parameter/default.yml
76 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - proc: zeroconf.py
77 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - resultfile: /home/ulrich/PycharmProjects/autosklearn-zeroconf/data/zeroconf-result.csv
78 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - runid: 20171011105215
79 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - runtype: Fresh Run Start
80 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - target_field: category
81 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO - workdir: /home/ulrich/PycharmProjects/autosklearn-zeroconf/work/20171011105215
82 | 2017-10-11 10:52:15,944 - [ZEROCONF] - zeroconf.py - INFO - Read dataset from the store
83 | 2017-10-11 10:52:15,945 - [ZEROCONF] - zeroconf.py - INFO - Values of y [ 0. 1. nan]
84 | 2017-10-11 10:52:15,945 - [ZEROCONF] - zeroconf.py - INFO - We need to protect NAs in y from the prediction dataset so we convert them to -1
85 | 2017-10-11 10:52:15,946 - [ZEROCONF] - zeroconf.py - INFO - New values of y [ 0. 1. -1.]
86 | 2017-10-11 10:52:15,946 - [ZEROCONF] - zeroconf.py - INFO - Filling missing values in X with the most frequent values
87 | 2017-10-11 10:52:16,043 - [ZEROCONF] - zeroconf.py - INFO - Factorizing the X
88 | 2017-10-11 10:52:16,176 - [ZEROCONF] - x_y_dataframe_split - INFO - Dataframe split into X and y
89 | 2017-10-11 10:52:16,178 - [ZEROCONF] - zeroconf.py - INFO - Preparing a sample to measure approx classifier run time and select features
90 | 2017-10-11 10:52:16,191 - [ZEROCONF] - zeroconf.py - INFO - train size:21815
91 | 2017-10-11 10:52:16,191 - [ZEROCONF] - zeroconf.py - INFO - test size:10746
92 | 2017-10-11 10:52:16,192 - [ZEROCONF] - zeroconf.py - INFO - Reserved 33% of the training dataset for validation (up to 33k rows)
93 | 2017-10-11 10:52:16,209 - [ZEROCONF] - max_estimators_fit_duration - INFO - Constructing preprocessor pipeline and transforming sample data
94 | 2017-10-11 10:52:18,712 - [ZEROCONF] - max_estimators_fit_duration - INFO - Running estimators on the sample
95 | 2017-10-11 10:52:18,729 - [ZEROCONF] - zeroconf.py - INFO - adaboost starting
96 | 2017-10-11 10:52:18,734 - [ZEROCONF] - zeroconf.py - INFO - bernoulli_nb starting
97 | 2017-10-11 10:52:18,761 - [ZEROCONF] - zeroconf.py - INFO - extra_trees starting
98 | 2017-10-11 10:52:18,769 - [ZEROCONF] - zeroconf.py - INFO - decision_tree starting
99 | 2017-10-11 10:52:18,780 - [ZEROCONF] - zeroconf.py - INFO - gaussian_nb starting
100 | 2017-10-11 10:52:18,800 - [ZEROCONF] - zeroconf.py - INFO - bernoulli_nb training time: 0.06455278396606445
101 | 2017-10-11 10:52:18,802 - [ZEROCONF] - zeroconf.py - INFO - gradient_boosting starting
102 | 2017-10-11 10:52:18,808 - [ZEROCONF] - zeroconf.py - INFO - k_nearest_neighbors starting
103 | 2017-10-11 10:52:18,809 - [ZEROCONF] - zeroconf.py - INFO - decision_tree training time: 0.03273773193359375
104 | 2017-10-11 10:52:18,826 - [ZEROCONF] - zeroconf.py - INFO - lda starting
105 | 2017-10-11 10:52:18,845 - [ZEROCONF] - zeroconf.py - INFO - liblinear_svc starting
106 | 2017-10-11 10:52:18,867 - [ZEROCONF] - zeroconf.py - INFO - gaussian_nb training time: 0.08569979667663574
107 | 2017-10-11 10:52:18,882 - [ZEROCONF] - zeroconf.py - INFO - multinomial_nb starting
108 | 2017-10-11 10:52:18,905 - [ZEROCONF] - zeroconf.py - INFO - passive_aggressive starting
109 | 2017-10-11 10:52:18,943 - [ZEROCONF] - zeroconf.py - INFO - random_forest starting
110 | 2017-10-11 10:52:18,971 - [ZEROCONF] - zeroconf.py - INFO - sgd starting
111 | 2017-10-11 10:52:19,012 - [ZEROCONF] - zeroconf.py - INFO - lda training time: 0.17656564712524414
112 | 2017-10-11 10:52:19,023 - [ZEROCONF] - zeroconf.py - INFO - multinomial_nb training time: 0.13777780532836914
113 | 2017-10-11 10:52:19,124 - [ZEROCONF] - zeroconf.py - INFO - liblinear_svc training time: 0.27405595779418945
114 | 2017-10-11 10:52:19,416 - [ZEROCONF] - zeroconf.py - INFO - passive_aggressive training time: 0.508676290512085
115 | 2017-10-11 10:52:19,473 - [ZEROCONF] - zeroconf.py - INFO - sgd training time: 0.49777913093566895
116 | 2017-10-11 10:52:20,471 - [ZEROCONF] - zeroconf.py - INFO - adaboost training time: 1.7392246723175049
117 | 2017-10-11 10:52:20,625 - [ZEROCONF] - zeroconf.py - INFO - k_nearest_neighbors training time: 1.8141863346099854
118 | 2017-10-11 10:52:22,258 - [ZEROCONF] - zeroconf.py - INFO - extra_trees training time: 3.4934401512145996
119 | 2017-10-11 10:52:22,696 - [ZEROCONF] - zeroconf.py - INFO - random_forest training time: 3.7496204376220703
120 | 2017-10-11 10:52:24,215 - [ZEROCONF] - zeroconf.py - INFO - gradient_boosting training time: 5.41023063659668
121 | 2017-10-11 10:52:24,230 - [ZEROCONF] - max_estimators_fit_duration - INFO - Test classifier fit completed
122 | 2017-10-11 10:52:24,239 - [ZEROCONF] - zeroconf.py - INFO - per_run_time_limit=5
123 | 2017-10-11 10:52:24,239 - [ZEROCONF] - zeroconf.py - INFO - Process pool size=2
124 | 2017-10-11 10:52:24,240 - [ZEROCONF] - zeroconf.py - INFO - Starting autosklearn classifiers fitting on a 67% sample up to 67k rows
125 | 2017-10-11 10:52:24,252 - [ZEROCONF] - train_multicore - INFO - Max time allowance for a model 1 minute(s)
126 | 2017-10-11 10:52:24,252 - [ZEROCONF] - train_multicore - INFO - Overall run time is about 10 minute(s)
127 | 2017-10-11 10:52:24,255 - [ZEROCONF] - train_multicore - INFO - Multicore process 2 started
128 | 2017-10-11 10:52:24,258 - [ZEROCONF] - train_multicore - INFO - Multicore process 3 started
129 | 2017-10-11 10:52:24,276 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Start AutoSklearnClassifier seed=2
130 | 2017-10-11 10:52:24,278 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Start AutoSklearnClassifier seed=3
131 | 2017-10-11 10:52:24,295 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Done AutoSklearnClassifier seed=3
132 | 2017-10-11 10:52:24,297 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Done AutoSklearnClassifier seed=2
133 | 2017-10-11 10:52:26,299 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Starting seed=2
134 | 2017-10-11 10:52:27,298 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Starting seed=3
135 | 2017-10-11 10:56:30,949 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - ####### Finished seed=2
136 | 2017-10-11 10:56:31,600 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - ####### Finished seed=3
137 | 2017-10-11 10:56:31,614 - [ZEROCONF] - train_multicore - INFO - Multicore fit completed
138 | 2017-10-11 10:56:31,626 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Building ensemble
139 | 2017-10-11 10:56:31,626 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Done AutoSklearnClassifier - seed:1
140 | 2017-10-11 10:56:54,017 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Ensemble built - seed:1
141 | 2017-10-11 10:56:54,017 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Show models - seed:1
142 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - [(0.400000, SimpleClassificationPipeline({'classifier:__choice__': 'adaboost', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:select_percentile_classification:percentile': 85.5410729966473, 'classifier:adaboost:n_estimators': 88, 'one_hot_encoding:minimum_fraction': 0.01805038589303469, 'rescaling:__choice__': 'minmax', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'select_percentile_classification', 'classifier:adaboost:max_depth': 1, 'classifier:adaboost:learning_rate': 0.10898092508755285, 'preprocessor:select_percentile_classification:score_func': 'chi2', 'imputation:strategy': 'most_frequent', 'classifier:adaboost:algorithm': 'SAMME.R'},
143 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
144 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'task': 1,
145 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'signed': False,
146 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'sparse': False,
147 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multiclass': False,
148 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'target_type': 'classification',
149 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multilabel': False})),
150 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.300000, SimpleClassificationPipeline({'classifier:__choice__': 'random_forest', 'classifier:random_forest:min_weight_fraction_leaf': 0.0, 'one_hot_encoding:use_minimum_fraction': 'True', 'classifier:random_forest:criterion': 'gini', 'classifier:random_forest:min_samples_leaf': 4, 'classifier:random_forest:max_depth': 'None', 'classifier:random_forest:min_samples_split': 16, 'classifier:random_forest:bootstrap': 'False', 'one_hot_encoding:minimum_fraction': 0.1453954841364777, 'rescaling:__choice__': 'none', 'balancing:strategy': 'none', 'preprocessor:__choice__': 'select_percentile_classification', 'preprocessor:select_percentile_classification:percentile': 96.35414862145892, 'preprocessor:select_percentile_classification:score_func': 'chi2', 'imputation:strategy': 'mean', 'classifier:random_forest:max_leaf_nodes': 'None', 'classifier:random_forest:max_features': 3.342759426984195, 'classifier:random_forest:n_estimators': 100},
151 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
152 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'task': 1,
153 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'signed': False,
154 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'sparse': False,
155 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multiclass': False,
156 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'target_type': 'classification',
157 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multilabel': False})),
158 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.200000, SimpleClassificationPipeline({'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:__choice__': 'extra_trees', 'classifier:extra_trees:n_estimators': 100, 'classifier:extra_trees:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:min_samples_split': 5, 'classifier:extra_trees:min_samples_leaf': 10, 'rescaling:__choice__': 'minmax', 'classifier:extra_trees:max_depth': 'None', 'preprocessor:extra_trees_preproc_for_classification:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:criterion': 'gini', 'classifier:extra_trees:max_features': 4.413198608615693, 'classifier:extra_trees:criterion': 'gini', 'preprocessor:extra_trees_preproc_for_classification:n_estimators': 100, 'classifier:extra_trees:min_samples_split': 16, 'one_hot_encoding:use_minimum_fraction': 'False', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'extra_trees_preproc_for_classification', 'preprocessor:extra_trees_preproc_for_classification:min_samples_leaf': 1, 'preprocessor:extra_trees_preproc_for_classification:max_features': 1.4824479003506632, 'imputation:strategy': 'median', 'preprocessor:extra_trees_preproc_for_classification:min_weight_fraction_leaf': 0.0, 'preprocessor:extra_trees_preproc_for_classification:max_depth': 'None'},
159 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
160 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'task': 1,
161 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'signed': False,
162 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'sparse': False,
163 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multiclass': False,
164 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'target_type': 'classification',
165 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multilabel': False})),
166 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.100000, SimpleClassificationPipeline({'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:__choice__': 'extra_trees', 'classifier:extra_trees:n_estimators': 100, 'classifier:extra_trees:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:min_samples_split': 16, 'classifier:extra_trees:min_samples_leaf': 10, 'rescaling:__choice__': 'minmax', 'classifier:extra_trees:max_depth': 'None', 'preprocessor:extra_trees_preproc_for_classification:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:criterion': 'gini', 'classifier:extra_trees:max_features': 4.16852017424403, 'classifier:extra_trees:criterion': 'gini', 'preprocessor:extra_trees_preproc_for_classification:n_estimators': 100, 'classifier:extra_trees:min_samples_split': 16, 'one_hot_encoding:use_minimum_fraction': 'False', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'extra_trees_preproc_for_classification', 'preprocessor:extra_trees_preproc_for_classification:min_samples_leaf': 1, 'preprocessor:extra_trees_preproc_for_classification:max_features': 1.5781770540350555, 'imputation:strategy': 'median', 'preprocessor:extra_trees_preproc_for_classification:min_weight_fraction_leaf': 0.0, 'preprocessor:extra_trees_preproc_for_classification:max_depth': 'None'},
167 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
168 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'task': 1,
169 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'signed': False,
170 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'sparse': False,
171 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multiclass': False,
172 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'target_type': 'classification',
173 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - 'multilabel': False})),
174 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - ]
175 | 2017-10-11 10:56:54,613 - [ZEROCONF] - zeroconf.py - INFO - Validating
176 | 2017-10-11 10:56:54,613 - [ZEROCONF] - zeroconf.py - INFO - Predicting on validation set
177 | 2017-10-11 10:56:57,373 - [ZEROCONF] - zeroconf.py - INFO - ########################################################################
178 | 2017-10-11 10:56:57,374 - [ZEROCONF] - zeroconf.py - INFO - Accuracy score 84%
179 | 2017-10-11 10:56:57,374 - [ZEROCONF] - zeroconf.py - INFO - The below scores are calculated for predicting '1' category value
180 | 2017-10-11 10:56:57,379 - [ZEROCONF] - zeroconf.py - INFO - Precision: 64%, Recall: 77%, F1: 0.70
181 | 2017-10-11 10:56:57,379 - [ZEROCONF] - zeroconf.py - INFO - Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall
182 | 2017-10-11 10:56:57,386 - [ZEROCONF] - zeroconf.py - INFO - [7058 1100]
183 | 2017-10-11 10:56:57,386 - [ZEROCONF] - zeroconf.py - INFO - [ 603 1985]
184 | 2017-10-11 10:56:57,392 - [ZEROCONF] - zeroconf.py - INFO - Baseline 2588 positives from 10746 overall = 24.1%
185 | 2017-10-11 10:56:57,392 - [ZEROCONF] - zeroconf.py - INFO - ########################################################################
186 | 2017-10-11 10:56:57,404 - [ZEROCONF] - x_y_dataframe_split - INFO - Dataframe split into X and y
187 | 2017-10-11 10:56:57,405 - [ZEROCONF] - zeroconf.py - INFO - Re-fitting the model ensemble on the full known dataset to prepare for prediction. This can take a long time.
188 | 2017-10-11 10:58:39,836 - [ZEROCONF] - zeroconf.py - INFO - Predicting. This can take a long time for a large prediction set.
189 | 2017-10-11 10:58:45,221 - [ZEROCONF] - zeroconf.py - INFO - Prediction done
190 | 2017-10-11 10:58:45,223 - [ZEROCONF] - zeroconf.py - INFO - Exporting the data
191 | 2017-10-11 10:58:45,267 - [ZEROCONF] - zeroconf.py - INFO - ##### Zeroconf Script Completed! #####
192 | 2017-10-11 10:58:45,268 - [ZEROCONF] - zeroconf.py - INFO - Clean up / Delete work directory: /home/ulrich/PycharmProjects/autosklearn-zeroconf/work/20171011105215
193 | 
194 | Process finished with exit code 0
195 | ```
196 | 
197 | ```
198 | python evaluate-dataset-Adult.py
199 | [ZEROCONF]  # 00:37:43 #
200 | [ZEROCONF] ######################################################################## # 00:37:43 #
201 | [ZEROCONF] Accuracy score 85% # 00:37:43 #
202 | [ZEROCONF] The below scores are calculated for predicting '1' category value # 00:37:43 #
203 | [ZEROCONF] Precision: 65%, Recall: 78%, F1: 0.71 # 00:37:43 #
204 | [ZEROCONF] Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall # 00:37:43 #
205 | [ZEROCONF] [[10835  1600] # 00:37:43 #
206 | [ZEROCONF]  [  860  2986]] # 00:37:43 #
207 | [ZEROCONF] Baseline 3846 positives from 16281 overall = 23.6% # 00:37:43 #
208 | [ZEROCONF] ######################################################################## # 00:37:43 #
209 | [ZEROCONF]  # 00:37:43 #
210 | ```
211 | ## Workarounds
212 | These are not related to autosklearn-zeroconf or auto-sklearn, but are rather general issues that depend on your Python and OS installation.
213 | ### xgboost issues
214 | #### complains about ELF header
215 | `pip uninstall xgboost; pip install --no-cache-dir -v xgboost==0.4a30`
216 | #### can not find libraries
217 | `conda install libgcc # for xgboost`
218 | alternatively search for them with
219 | `sudo find / -name libgomp.so.1`
220 | `/usr/lib/x86_64-linux-gnu/libgomp.so.1`
221 | and explicitly add them to the libraries path
222 | `export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libstdc++.so.6":"/usr/lib/x86_64-linux-gnu/libgomp.so.1"; python zeroconf.py Titanic.h5 2>/dev/null|grep ZEROCONF`
223 | Also see https://github.com/automl/auto-sklearn/issues/247
224 | 
225 | ### Install auto-sklearn
226 | ```
227 | # A compiler (gcc) is needed to compile a few things from the auto-sklearn requirements.txt
228 | # Choose just the line for your Linux flavor below
229 | 
230 | # On Ubuntu
231 | sudo apt-get install gcc build-essential swig
232 | 
233 | # On CentOS 7-1611 http://www.osboxes.org/centos/ https://drive.google.com/file/d/0B_HAFnYs6Ur-bl8wUWZfcHVpMm8/view?usp=sharing
234 | sudo yum -y update
235 | sudo reboot
236 | sudo yum install epel-release python34 python34-devel python34-setuptools
237 | sudo yum -y groupinstall 'Development Tools'
238 | 
239 | # auto-sklearn requires swig 3.0
240 | wget downloads.sourceforge.net/project/swig/swig/swig-3.0.12/swig-3.0.12.tar.gz -O swig-3.0.12.tar.gz
241 | tar xf swig-3.0.12.tar.gz
242 | cd swig-3.0.12
243 | ./configure --without-pcre
244 | make
245 | sudo make install
246 | cd ..
247 | 
248 | sudo easy_install-3.4 pip
249 | # if you want to use virtual environments
250 | sudo pip3 install virtualenv
251 | virtualenv zeroconf -p /usr/bin/python3.4
252 | source zeroconf/bin/activate
253 | 
254 | curl https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/master/requirements.txt | xargs -n 1 -L 1 pip install
255 | ```
256 | 
257 | # Contributors
258 | Egor Kobylkin, Ulrich Arndt
259 | 
--------------------------------------------------------------------------------
/bin/dataTransformationProcessing.py:
--------------------------------------------------------------------------------
1 | import inspect
2 | import math
3 | import multiprocessing
4 | import time
5 | import traceback
6 | from time import sleep
7 | 
8 | import autosklearn.pipeline
9 | import autosklearn.pipeline.components.classification
10 | import utility as utl
11 | import warnings
12 | warnings.simplefilter(action='ignore', category=FutureWarning)
13 | import pandas as pd
14 | import psutil
15 | from autosklearn.classification import AutoSklearnClassifier
16 | from autosklearn.constants import *
17 | from autosklearn.pipeline.classification import SimpleClassificationPipeline
18 | 
19 | 
20 | def time_single_estimator(clf_name, clf_class, X, y, max_clf_time, logger):
21 |     if ('libsvm_svc' == clf_name  # doesn't even scale to a 100k rows
22 |             or 'qda' == clf_name):  # crashes
23 |         return 0
24 |     logger.info(clf_name + " starting")
25 |     default = clf_class.get_hyperparameter_search_space().get_default_configuration()
26 |     clf = clf_class(**default._values)
27 |     t0 = time.time()
28 |     try:
29 |         clf.fit(X, y)
30 |     except Exception as e:
31 |         logger.info(e)
32 |     classifier_time = time.time() - t0  # keep time even if classifier crashed
33 |     logger.info(clf_name + " training time: " + str(classifier_time))
34 |     if max_clf_time.value < int(classifier_time):
35 |         max_clf_time.value = int(classifier_time)
36 |     # no return statement here because max_clf_time is a managed object
37 | 
38 | 
39 | def max_estimators_fit_duration(X, y, max_classifier_time_budget, logger, sample_factor=1):
40 |     lo = utl.get_logger(inspect.stack()[0][3])
41 | 
42 |     lo.info("Constructing preprocessor pipeline and transforming sample data")
43 |     # we don't care about the data here but need to preprocess, otherwise the classifiers crash
44 | 
45 |     pipeline = SimpleClassificationPipeline(
46 |         include={'imputation': ['most_frequent'], 'rescaling': ['standardize']})
47 |     default_cs = pipeline.get_hyperparameter_search_space().get_default_configuration()
48 |     pipeline = pipeline.set_hyperparameters(default_cs)
49 | 
50 |     pipeline.fit(X, y)
51 |     X_tr, dummy = pipeline.fit_transformer(X, y)
52 | 
53 |     lo.info("Running estimators on the sample")
54 |     # going over all default classifiers used by auto-sklearn
55 |     clfs = autosklearn.pipeline.components.classification._classifiers
56 | 
57 |     processes = []
58 |     with multiprocessing.Manager() as manager:
59 |         max_clf_time = manager.Value('i', 3)  # default 3 sec
60 |         for clf_name, clf_class in clfs.items():
61 |             pr = multiprocessing.Process(target=time_single_estimator, name=clf_name
62 |                                          , args=(clf_name, clf_class, X_tr, y, max_clf_time, logger))
63 |             pr.start()
64 |             processes.append(pr)
65 |         for pr in processes:
66 |             pr.join(max_classifier_time_budget)  # will block for max_classifier_time_budget or
67 |             # until the classifier fit process finishes. After max_classifier_time_budget
68 |             # we will terminate all still running processes here.
69 |             if pr.is_alive():
70 |                 logger.info("Terminating " + pr.name + " process due to timeout")
71 |                 pr.terminate()
72 |         result_max_clf_time = max_clf_time.value
73 | 
74 |     lo.info("Test classifier fit completed")
75 | 
76 |     per_run_time_limit = int(sample_factor * result_max_clf_time)
77 |     return max_classifier_time_budget if per_run_time_limit > max_classifier_time_budget else per_run_time_limit
78 | 
79 | 
80 | def read_dataframe_h5(filename, logger):
81 |     with pd.HDFStore(filename, mode='r') as store:
82 |         df = store.select('data')
83 |     logger.info("Read dataset from the store")
84 |     return df
85 | 
86 | 
87 | def x_y_dataframe_split(dataframe, parameter, id=False):
88 |     lo = utl.get_logger(inspect.stack()[0][3])
89 | 
90 |     lo.info("Dataframe split into X and y")
91 |     X = dataframe.drop([parameter["id_field"], parameter["target_field"]], axis=1)
92 |     y = pd.np.array(dataframe[parameter["target_field"]], dtype='int')
93 |     if id:
94 |         row_id = dataframe[parameter["id_field"]]
95 |         return X, y, row_id
96 |     else:
97 |         return X, y
98 | 
99 | 
100 | def define_pool_size(memory_limit):
101 |     # some classifiers can use more than one core - so keep this at half memory and cores
102 |     max_pool_size = int(math.ceil(psutil.virtual_memory().total / (memory_limit * 1000000)))
103 |     half_of_cores = int(math.ceil(psutil.cpu_count() / 2.0))
104 | 
105 |     lo = utl.get_logger(inspect.stack()[0][3])
106 |     lo.info("Virtual Memory Size = " + str(psutil.virtual_memory().total))
107 |     lo.info("CPU Count =" + str(psutil.cpu_count()))
108 |     lo.info("Max CPU Pool Size by Memory = " + str(max_pool_size))
109 | 
110 |     return half_of_cores if max_pool_size > half_of_cores else max_pool_size
111 | 
112 | 
113 | def calculate_time_left_for_this_task(pool_size, per_run_time_limit):
114 |     half_cpu_cores = pool_size
115 |     queue_factor = 30
116 |     if queue_factor * half_cpu_cores < 100:  # 100 models to test overall
117 |         queue_factor = 100 / half_cpu_cores
118 | 
119 |     time_left_for_this_task = int(queue_factor * per_run_time_limit)
120 |     return time_left_for_this_task
121 | 
122 | 
123 | def spawn_autosklearn_classifier(X_train, y_train, seed, dataset_name, time_left_for_this_task, per_run_time_limit,
124 |                                  feat_type, memory_limit, atsklrn_tempdir):
125 |     lo = utl.get_logger(inspect.stack()[0][3])
126 | 
127 |     try:
128 |         lo.info("Start AutoSklearnClassifier seed=" + str(seed))
129 |         clf = AutoSklearnClassifier(time_left_for_this_task=time_left_for_this_task,
130 |                                     per_run_time_limit=per_run_time_limit,
131 |                                     ml_memory_limit=memory_limit,
132 |                                     shared_mode=True,
133 |                                     tmp_folder=atsklrn_tempdir,
134 |                                     output_folder=atsklrn_tempdir,
135 |                                     delete_tmp_folder_after_terminate=False,
136 |                                     delete_output_folder_after_terminate=False,
137 |                                     initial_configurations_via_metalearning=0,
138 |                                     ensemble_size=0,
139 |                                     seed=seed)
140 |     except Exception:
141 |         lo.exception("Exception AutoSklearnClassifier seed=" + str(seed))
142 |         raise
143 | 
144 |     lo = utl.get_logger(inspect.stack()[0][3])
145 |     lo.info("Done AutoSklearnClassifier seed=" + str(seed))
146 | 
147 |     sleep(seed)
148 | 
149 |     try:
150 |         lo.info("Starting seed=" + str(seed))
151 |         try:
152 |             clf.fit(X_train, y_train, metric=autosklearn.metrics.f1, feat_type=feat_type, dataset_name=dataset_name)
153 |         except Exception:
154 |             lo = utl.get_logger(inspect.stack()[0][3])
155 |             lo.exception("Error in clf.fit - seed:" + str(seed))
156 |             raise
157 |     except Exception:
158 |         lo = utl.get_logger(inspect.stack()[0][3])
159 |         lo.exception("Exception in seed=" + str(seed) + ". ")
160 |         traceback.print_exc()
161 |         raise
162 |     lo = utl.get_logger(inspect.stack()[0][3])
163 |     lo.info("####### Finished seed=" + str(seed))
164 |     return None
165 | 
166 | 
167 | def train_multicore(X, y, feat_type, memory_limit, atsklrn_tempdir, pool_size=1, per_run_time_limit=60):
168 |     lo = utl.get_logger(inspect.stack()[0][3])
169 | 
170 |     time_left_for_this_task = calculate_time_left_for_this_task(pool_size, per_run_time_limit)
171 | 
172 |     lo.info("Max time allowance for a model " + str(math.ceil(per_run_time_limit / 60.0)) + " minute(s)")
173 |     lo.info("Overall run time is about " + str(2 * math.ceil(time_left_for_this_task / 60.0)) + " minute(s)")
174 | 
175 |     processes = []
176 |     for i in range(2, pool_size + 2):  # reserve seed 1 for the ensemble building
177 |         seed = i
178 |         pr = multiprocessing.Process(target=spawn_autosklearn_classifier
179 |                                      , args=(
180 |             X, y, i, 'foobar', time_left_for_this_task, per_run_time_limit, feat_type, memory_limit, atsklrn_tempdir))
181 |         pr.start()
182 |         lo.info("Multicore process " + str(seed) + " started")
183 |         processes.append(pr)
184 |     for pr in processes:
185 |         pr.join()
186 | 
187 |     lo.info("Multicore fit completed")
188 | 
189 | 
190 | def zeroconf_fit_ensemble(y, atsklrn_tempdir):
191 |     lo = utl.get_logger(inspect.stack()[0][3])
192 | 
193 |     lo.info("Building ensemble")
194 | 
195 |     seed = 1
196 | 
197 |     ensemble = AutoSklearnClassifier(
198 |         time_left_for_this_task=300, per_run_time_limit=150, ml_memory_limit=20240, ensemble_size=50,
199 |         ensemble_nbest=200,
200 |         shared_mode=True, tmp_folder=atsklrn_tempdir, output_folder=atsklrn_tempdir,
201 |         delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False,
202 |         initial_configurations_via_metalearning=0,
203 |         seed=seed)
204 | 
205 |     lo.info("Done AutoSklearnClassifier - seed:" + str(seed))
206 | 
207 |     try:
208 |         lo.debug("Start ensemble.fit_ensemble - seed:" + str(seed))
209 |         ensemble.fit_ensemble(
210 |             task=BINARY_CLASSIFICATION
211 |             , y=y
212 |             , metric=autosklearn.metrics.f1
213 |             , precision='32'
214 |             , dataset_name='foobar'
215 |             , ensemble_size=10
216 |             , ensemble_nbest=15)
217 |     except Exception:
218 |         lo = utl.get_logger(inspect.stack()[0][3])
219 |         lo.exception("Error in ensemble.fit_ensemble - seed:" + str(seed))
220 |         raise
221 | 
222 |     lo = utl.get_logger(inspect.stack()[0][3])
223 |     lo.debug("Done ensemble.fit_ensemble - seed:" + str(seed))
224 | 
225 |     sleep(20)
226 |     lo.info("Ensemble built - seed:" + str(seed))
227 | 
228 |     lo.info("Show models - seed:" + str(seed))
229 |     txtList = str(ensemble.show_models()).split("\n")
230 |     for row in txtList:
231 |         lo.info(row)
232 | 
233 |     return ensemble
234 | 
--------------------------------------------------------------------------------
/bin/evaluate-dataset-Adult.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Copyright 2017 Egor Kobylkin
4 | Created on Sun Apr 23 11:52:59 2017
5 | @author: ekobylkin
6 | This is an example of how to prepare data for autosklearn-zeroconf.
7 | It is using the well known Adult (Salary) dataset from UCI https://archive.ics.uci.edu/ml/datasets/Adult .
8 | """
9 | import pandas as pd
10 | 
11 | test = pd.read_csv(filepath_or_buffer='./data/adult.test.withid', sep=',', error_bad_lines=False, index_col=False)
12 | # print(test)
13 | 
14 | prediction = pd.read_csv(filepath_or_buffer='./data/zeroconf-result.csv', sep=',', error_bad_lines=False, index_col=False)
15 | # print(prediction)
16 | 
17 | df = pd.merge(test, prediction, how='inner', on=['cust_id', ])
18 | 
19 | y_test = df['category']
20 | y_hat = df['prediction']
21 | 
22 | from sklearn.metrics import (confusion_matrix, precision_score
23 |                              , recall_score, f1_score, accuracy_score)
24 | from time import time, sleep, strftime
25 | def p(text):
26 |     for line in str(text).splitlines():
27 |         print('[ZEROCONF] ' + line + " # " + strftime("%H:%M:%S") + " #")
28 | 
29 | p("\n")
30 | p("#" * 72)
31 | p("Accuracy score {0:2.0%}".format(accuracy_score(y_test, y_hat)))
32 | p("The below scores are calculated for predicting '1' category value")
33 | p("Precision: {0:2.0%}, Recall: {1:2.0%}, F1: {2:.2f}".format(
34 |     precision_score(y_test, y_hat), recall_score(y_test, y_hat), f1_score(y_test, y_hat)))
35 | p("Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall")
36 | p(confusion_matrix(y_test, y_hat))
37 | baseline_1 = str(sum(a for a in y_test))
38 | baseline_all = str(len(y_test))
39 | baseline_prcnt = "{0:2.0%}".format(float(sum(a for a in y_test) / len(y_test)))
40 | p("Baseline %s positives from %s overall = %1.1f%%" %
41 |   (sum(a for a in y_test), len(y_test), 100 * sum(a for a in y_test) / len(y_test)))
42 | p("#" * 72)
43 | p("\n")
44 | 
--------------------------------------------------------------------------------
/bin/load-dataset-Adult.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Copyright 2017 Egor Kobylkin
4 | Created on Sun Apr 23 11:52:59 2017
5 | @author: ekobylkin
6 | This is an example of how to prepare data for autosklearn-zeroconf.
7 | It is using the well known Adult (Salary) dataset from UCI https://archive.ics.uci.edu/ml/datasets/Adult .
8 | """ 9 | import pandas as pd 10 | # wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data 11 | # wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test 12 | col_names=[ 13 | 'age', 14 | 'workclass', 15 | 'fnlwgt', 16 | 'education', 17 | 'education-num', 18 | 'marital-status', 19 | 'occupation', 20 | 'relationship', 21 | 'race', 22 | 'sex', 23 | 'capital-gain', 24 | 'capital-loss', 25 | 'hours-per-week', 26 | 'native-country', 27 | 'category' 28 | ] 29 | 30 | train = pd.read_csv(filepath_or_buffer='../data/adult.data',sep=',', error_bad_lines=False, index_col=False, names=col_names) 31 | category_mapping={' >50K':1,' <=50K':0} 32 | train['category']= train['category'].map(category_mapping) 33 | #dataframe=train 34 | 35 | test = pd.read_csv(filepath_or_buffer='../data/adult.test',sep=',', error_bad_lines=False, index_col=False, names=col_names, skiprows=1) 36 | test['set_name']='test' 37 | category_mapping={' >50K.':1,' <=50K.':0} 38 | test['category']= test['category'].map(category_mapping) 39 | 40 | dataframe=train.append(test) 41 | 42 | # autosklearn-zeroconf requires cust_id and category (target or "y" variable) columns, the rest is optional 43 | dataframe['cust_id']=dataframe.index 44 | 45 | # let's save the test with the cus_id and binarized category for the validation of the prediction afterwards 46 | test_df=dataframe.loc[dataframe['set_name']=='test'].drop(['set_name'], axis=1) 47 | test_df.to_csv('../data/adult.test.withid', index=False, header=True) 48 | 49 | # We will use the test.csv data to make a prediction. You can compare the predicted values with the ground truth yourself. 50 | dataframe.loc[dataframe['set_name']=='test','category']=None 51 | dataframe=dataframe.drop(['set_name'], axis=1) 52 | 53 | print(dataframe) 54 | 55 | store = pd.HDFStore('../data/Adult.h5') # this is the file cache for the data 56 | store['data'] = dataframe 57 | store.close() 58 | #Now run 'python zeroconf.py Adult.h5' (python >=3.5) 59 | -------------------------------------------------------------------------------- /bin/load-dataset-Titanic.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Copyright 2017 PayPal 4 | Created on Sun Oct 02 17:13:59 2016 5 | @author: ekobylkin 6 | 7 | This is an example on how to prepare data for autosklearn-zeroconf. 8 | It is using a well known Titanic dataset from Kaggle https://www.kaggle.com/c/titanic . 9 | """ 10 | import pandas as pd 11 | # Dowlnoad these files from Kaggle dataset 12 | #https://www.kaggle.com/c/titanic/download/train.csv 13 | #https://www.kaggle.com/c/titanic/download/test.csv 14 | train = pd.read_csv(filepath_or_buffer='train.csv',sep=',', error_bad_lines=False, index_col=False) 15 | test = pd.read_csv(filepath_or_buffer='test.csv',sep=',', error_bad_lines=False, index_col=False) 16 | 17 | # We will use the test.csv data to make a prediction. You can compare the predicted values with the ground truth yourself. 
18 | test['Survived'] = None  # The empty target column tells autosklearn-zeroconf to use these cases for the prediction
19 | 
20 | dataframe = train.append(test)
21 | 
22 | # autosklearn-zeroconf requires cust_id and category (target or "y" variable) columns, the rest is optional
23 | dataframe.rename(columns={'PassengerId': 'cust_id', 'Survived': 'category'}, inplace=True)
24 | 
25 | store = pd.HDFStore('Titanic.h5')  # this is the file cache for the data
26 | store['data'] = dataframe
27 | store.close()
28 | # Now run 'python zeroconf.py Titanic.h5' (python >=3.5)
29 | 
--------------------------------------------------------------------------------
/bin/utility.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import logging
3 | import logging.config  # setup_logging() below uses logging.config.dictConfig
4 | import os
5 | import ruamel.yaml as yaml
6 | import shutil
7 | import sys  # handle_exception() below uses sys.__excepthook__
8 | 
9 | def init_process(file, basedir=''):
10 |     absfile = os.path.abspath(file)
11 |     if (basedir == ''):
12 |         basedir = os.path.join(*splitall(absfile)[0:(len(splitall(absfile)) - 2)])
13 |     proc = os.path.basename(absfile)
14 |     if not os.path.isdir(basedir + '/work'):
15 |         os.mkdir(basedir + '/work')
16 |     runidfile = basedir + '/work/current_runid.txt'
17 |     runid, runtype = get_runid(runidfile, basedir)
18 | 
19 |     parameter = {}
20 |     parameter["runid"] = runid
21 |     parameter["runtype"] = runtype
22 |     parameter["proc"] = proc
23 |     parameter["workdir"] = basedir + '/work/' + runid
24 |     return parameter
25 | 
26 | 
27 | def get_logger(name):
28 |     ##################
29 |     # the repeating setup of the logging is related to an issue in the sklearn package
30 |     # this resulted in a loss of the logger...
31 |     ##################
32 |     setup_logging()
33 |     logger = logging.getLogger(name)
34 |     return logger
35 | 
36 | 
37 | def handle_exception(exc_type, exc_value, exc_traceback):
38 |     logger = get_logger(__file__)
39 |     if issubclass(exc_type, KeyboardInterrupt):
40 |         sys.__excepthook__(exc_type, exc_value, exc_traceback)
41 |         return
42 |     logger.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))
43 | 
44 | def merge_two_dicts(x, y):
45 |     """Given two dicts, merge them into a new dict as a shallow copy."""
46 |     z = x.copy()
47 |     z.update(y)
48 |     return z
49 | 
50 | def read_parameter(parameter_file, parameter):
51 |     with open(parameter_file, "r") as fr:
52 |         param = yaml.load(fr, yaml.RoundTripLoader)
53 |     return merge_two_dicts(parameter, param)
54 | 
55 | 
56 | def end_proc_success(parameter, logger):
57 |     logger.info("Clean up / Delete work directory: " + parameter["basedir"] + "/work/" + parameter["runid"])
58 |     shutil.rmtree(parameter["basedir"] + "/work/" + parameter["runid"])
59 |     exit(0)
60 | 
61 | 
62 | def setup_logging(
63 |         default_path='./parameter/logger.yml',
64 |         default_level=logging.INFO,
65 |         env_key='LOG_CFG'
66 | ):
67 |     """Setup logging configuration
68 | 
69 |     """
70 |     path = os.path.abspath(default_path)
71 |     value = os.getenv(env_key, None)
72 |     if value:
73 |         path = value
74 |     if os.path.exists(os.path.abspath(path)):
75 |         with open(path, 'rt') as f:
76 |             config = yaml.safe_load(f.read())
77 |         logging.config.dictConfig(config)
78 |     else:
79 |         logging.basicConfig(level=default_level)
80 | 
81 | 
82 | def get_runid(runidfile, basedir):
83 |     now = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
84 |     if os.path.isfile(runidfile):
85 |         rf = open(runidfile, 'r')
86 |         runid = rf.read().rstrip()
87 |         rf.close()
88 |         if os.path.isdir(basedir + '/work/' + runid):
89 |             runtype = 'RESTART'
90 |         else:
91 |             runtype = 'Fresh Run Start'
92 |             rf = open(runidfile, 'w')
93 |             runid = now
94 |             print(runid, file=rf)
95 |             rf.close()
96 |             os.mkdir(basedir + '/work/' + runid)
97 |     else:
98 |         runtype = 'Fresh Run Start - no current_runid file'
99 |         rf = open(runidfile, 'w')
100 |         runid = now
101 |         print(runid, file=rf)
102 |         rf.close()
103 |         os.mkdir(basedir + '/work/' + runid)
104 |     return runid, runtype
105 | 
106 | 
107 | def splitall(path):
108 |     allparts = []
109 |     while 1:
110 |         parts = os.path.split(path)
111 |         if parts[0] == path:  # sentinel for absolute paths
112 |             allparts.insert(0, parts[0])
113 |             break
114 |         elif parts[1] == path:  # sentinel for relative paths
115 |             allparts.insert(0, parts[1])
116 |             break
117 |         else:
118 |             path = parts[0]
119 |             allparts.insert(0, parts[1])
120 |     return allparts
121 | 
--------------------------------------------------------------------------------
/bin/zeroconf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
3 | """
4 | Copyright 2017 PayPal
5 | Created on Mon Feb 27 19:11:59 PST 2017
6 | @author: ekobylkin
7 | @version 0.2
8 | @author: ulrich arndt - data2knowledge
9 | @update: 2017-09-27
10 | """
11 | 
12 | import argparse
13 | import numpy as np
14 | import os
15 | import pandas as pd
16 | import shutil
17 | 
18 | from sklearn.model_selection import train_test_split
19 | from sklearn.metrics import (confusion_matrix, precision_score,
20 |                              recall_score, f1_score, accuracy_score)
21 | 
22 | import utility as utl
23 | import dataTransformationProcessing as dt
24 | 
25 | parameter = utl.init_process(__file__)
26 | 
27 | ###########################################################
28 | # define the command line argument parser
29 | ###########################################################
30 | # https://docs.python.org/2/howto/argparse.html
31 | parser = argparse.ArgumentParser(
32 |     description='zero configuration predictive modeling script; requires a pandas HDF5 dataframe file ' + \
33 |                 'and a yaml parameter file as input')
34 | parser.add_argument('-d',
35 |                     '--data_file',
36 |                     nargs=1,
37 |                     help='input pandas HDF5 dataframe .h5 with a unique identifier and a target column\n' +
38 |                          'as well as additional data columns\n'
39 |                          'default values are cust_id and category or need to be defined in an\n' +
40 |                          'optional parameter file '
41 |                     )
42 | parser.add_argument('-p',
43 |                     '--param_file', nargs=1,  # args.param_file[0] below expects a list
44 |                     help='input yaml parameter file'
45 |                     )
46 | 
47 | args = parser.parse_args()
48 | logger = utl.get_logger(os.path.basename(__file__))
49 | logger.info("Program started with the following arguments:")
50 | logger.info(args)
51 | 
52 | ###########################################################
53 | # set dir to project dir
54 | ###########################################################
55 | abspath = os.path.abspath(__file__)
56 | dname = os.path.dirname(os.path.dirname(abspath))
57 | os.chdir(dname)
58 | 
59 | ###########################################################
60 | # file check for the parameter
61 | ###########################################################
62 | param_file = ''
63 | if args.param_file:
64 |     param_file = args.param_file[0]
65 | else:
66 |     param_file = os.path.abspath("./parameter/default.yml")
67 |     logger.info("Using the default parameter file: " + param_file)
68 | if (not (os.path.isfile(param_file))):
69 |     msg = 'the input parameter file: ' + param_file + ' does not exist!'
70 |     logger.error(msg)
71 |     exit(8)
72 | 
73 | data_file = ''
74 | if args.data_file:
75 |     data_file = args.data_file[0]
76 | else:
77 |     msg = "A data file is mandatory!"
78 |     logger.error(msg)
79 |     exit(8)
80 | if (not (os.path.isfile(data_file))):
81 |     msg = 'the input data file: ' + data_file + ' does not exist!'
82 |     logger.error(msg)
83 |     exit(8)
84 | 
85 | parameter = utl.read_parameter(param_file, parameter)
86 | 
87 | parameter["data_file"] = os.path.abspath(data_file)
88 | parameter["basedir"] = os.path.abspath(parameter["basedir"])
89 | parameter["parameter_file"] = os.path.abspath(param_file)
90 | parameter["resultfile"] = os.path.abspath(parameter["resultfile"])
91 | 
92 | 
93 | ###########################################################
94 | # set base dir
95 | ###########################################################
96 | os.chdir(parameter["basedir"])
97 | logger.info("Set basedir to: " + parameter["basedir"])
98 | 
99 | logger = utl.get_logger(os.path.basename(__file__))
100 | 
101 | logger.info("Program Call Parameter (Arguments and Parameter File Values):")
102 | for key in sorted(parameter.keys()):
103 |     logger.info(" " + key + ": " + str(parameter[key]))
104 | 
105 | work_dir = parameter["workdir"]
106 | result_filename = parameter["resultfile"]
107 | atsklrn_tempdir = os.path.join(work_dir, 'atsklrn_tmp')
108 | shutil.rmtree(atsklrn_tempdir, ignore_errors=True)  # cleanup - remove temp directory
109 | 
110 | 
111 | # if the memory limit is set too low the model can fail and the whole process will crash
112 | memory_limit = parameter["memory_limit"]  # MB
113 | global max_classifier_time_budget
114 | max_classifier_time_budget = parameter["max_classifier_time_budget"]  # but 10 minutes is usually more than enough
115 | max_sample_size = parameter["max_sample_size"]  # so that the classifiers fit method completes in a reasonable time
116 | 
117 | dataframe = dt.read_dataframe_h5(data_file, logger)
118 | 
119 | logger.info("Values of y " + str(dataframe[parameter["target_field"]].unique()))
120 | logger.info("We need to protect NAs in y from the prediction dataset so we convert them to -1")
121 | dataframe[parameter["target_field"]] = dataframe[parameter["target_field"]].fillna(-1)
122 | logger.info("New values of y " + str(dataframe[parameter["target_field"]].unique()))
123 | 
124 | logger.info("Filling missing values in X with the most frequent values")
125 | dataframe = dataframe.fillna(dataframe.mode().iloc[0])
126 | 
127 | logger.info("Factorizing the X")
128 | # we need this list of original dtypes for the Autosklearn fit, create it before categorisation or split
129 | col_dtype_dict = {col: ('Numerical' if np.issubdtype(dataframe[col].dtype, np.number) else 'Categorical')
130 |                   for col in dataframe.columns if col not in [parameter["id_field"], parameter["target_field"]]}
131 | 
132 | # http://stackoverflow.com/questions/25530504/encoding-column-labels-in-pandas-for-machine-learning
133 | # http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn?rq=1
134 | # https://github.com/automl/auto-sklearn/issues/121#issuecomment-251459036
135 | 
136 | for col in dataframe.select_dtypes(exclude=[np.number]).columns:
137 |     if col not in [parameter["id_field"], parameter["target_field"]]:
138 |         dataframe[col] = dataframe[col].astype('category').cat.codes
139 | 
140 | df_unknown = dataframe[dataframe[parameter["target_field"]] == -1]  # 'None' gets categorized into -1
141 | df_known = dataframe[dataframe[parameter["target_field"]] != -1]  # not [0,1] for multiclass labeling compatibility
142 | logger.debug("Length of unknown dataframe:" + str(len(df_unknown)))
143 | logger.debug("Length of known dataframe:" + str(len(df_known)))
144 | 
145 | del dataframe
146 | 
147 | X, y = dt.x_y_dataframe_split(df_known, parameter)
148 | 
149 | logger.info("Preparing a sample to measure approx classifier run time and select features")
150 | dataset_size = df_known.shape[0]
151 | 
152 | if dataset_size > max_sample_size:
153 |     sample_factor = dataset_size / float(max_sample_size)
154 |     logger.info("Sample factor =" + str(sample_factor))
155 |     X_sample, y_sample = dt.x_y_dataframe_split(df_known.sample(max_sample_size, random_state=42), parameter)
156 |     X_train, X_test, y_train, y_test = train_test_split(X.copy(), y, stratify=y, test_size=33000,
157 |                                                         random_state=42)  # no need for a larger test set
158 | else:
159 |     sample_factor = 1
160 |     X_sample, y_sample = X.copy(), y
161 |     X_train, X_test, y_train, y_test = train_test_split(X.copy(), y, stratify=y, test_size=0.33, random_state=42)
162 | logger.info("train size:" + str(len(X_train)))
163 | logger.info("test size:" + str(len(X_test)))
164 | logger.info("Reserved 33% of the training dataset for validation (up to 33k rows)")
165 | 
166 | per_run_time_limit = dt.max_estimators_fit_duration(X_train.values, y_train, max_classifier_time_budget, logger,
167 |                                                     sample_factor)
168 | logger.info("per_run_time_limit=" + str(per_run_time_limit))
169 | pool_size = dt.define_pool_size(int(memory_limit))
170 | logger.info("Process pool size=" + str(pool_size))
171 | feat_type = [col_dtype_dict[col] for col in X.columns]
172 | logger.info("Starting autosklearn classifiers fitting on a 67% sample up to 67k rows")
173 | dt.train_multicore(X_train.values, y_train, feat_type, int(memory_limit), atsklrn_tempdir, pool_size,
174 |                    per_run_time_limit)
175 | 
176 | ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir)
177 | 
178 | logger = utl.get_logger(os.path.basename(__file__))
179 | logger.info("Validating")
180 | logger.info("Predicting on validation set")
181 | y_hat = ensemble.predict(X_test.values)
182 | 
183 | logger.info("#" * 72)
184 | logger.info("Accuracy score {0:2.0%}".format(accuracy_score(y_test, y_hat)))
185 | logger.info("The below scores are calculated for predicting '1' category value")
186 | logger.info("Precision: {0:2.0%}, Recall: {1:2.0%}, F1: {2:.2f}".format(
187 |     precision_score(y_test, y_hat), recall_score(y_test, y_hat), f1_score(y_test, y_hat)))
188 | #############################
189 | ## Print Confusion Matrix
190 | #############################
191 | logger.info("Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall")
192 | cm = confusion_matrix(y_test, y_hat)
193 | for row in cm:
194 |     logger.info(row)
195 | 
196 | baseline_1 = str(sum(a for a in y_test))
197 | baseline_all = str(len(y_test))
198 | baseline_prcnt = "{0:2.0%}".format(float(sum(a for a in y_test) / len(y_test)))
199 | logger.info("Baseline %s positives from %s overall = %1.1f%%" %
200 |             (sum(a for a in y_test), len(y_test), 100 * sum(a for a in y_test) / len(y_test)))
201 | logger.info("#" * 72)
202 | 
203 | if df_unknown.shape[0] == 0:  # if there is nothing to predict we can stop already
204 |     logger.info("##### Nothing to predict. Prediction dataset is empty. #####")
205 |     exit(0)
206 | 
207 | X_unknown, y_unknown, row_id_unknown = dt.x_y_dataframe_split(df_unknown, parameter, id=True)
208 | 
209 | logger.info("Re-fitting the model ensemble on the full known dataset to prepare for prediction. This can take a long time.")
210 | try:
211 |     ensemble.refit(X.copy().values, y)
212 | except Exception as e:
213 |     logger.info("Refit failed, reshuffling the rows, restarting")
214 |     logger.info(e)
215 |     try:
216 |         X2 = X.copy().values
217 |         indices = np.arange(X2.shape[0])
218 |         np.random.shuffle(indices)  # a workaround for algorithm shortcomings
219 |         X2 = X2[indices]
220 |         y = y[indices]
221 |         ensemble.refit(X2, y)
222 |     except Exception as e:
223 |         logger.info("Second refit failed")
224 |         logger.info(e)
225 |         logger.info(
226 |             " WORKAROUND: because refitting fails due to an upstream bug https://github.com/automl/auto-sklearn/issues/263")
227 |         logger.info(" WORKAROUND: we are fitting autosklearn classifiers a second time, now on the full dataset")
228 |         dt.train_multicore(X.values, y, feat_type, int(memory_limit), atsklrn_tempdir, pool_size, per_run_time_limit)
229 |         ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir)
230 | 
231 | logger.info("Predicting. This can take a long time for a large prediction set.")
232 | try:
233 |     y_pred = ensemble.predict(X_unknown.copy().values)
234 |     logger.info("Prediction done")
235 | except Exception as e:
236 |     logger.info(e)
237 |     logger.info(
238 |         " WORKAROUND: because refitting fails due to an upstream bug https://github.com/automl/auto-sklearn/issues/263")
239 |     logger.info(" WORKAROUND: we are fitting autosklearn classifiers a second time, now on the full dataset")
240 |     dt.train_multicore(X.values, y, feat_type, int(memory_limit), atsklrn_tempdir, pool_size, per_run_time_limit)
241 |     ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir)
242 |     logger.info("Predicting. This can take a long time for a large prediction set.")
243 |     try:
244 |         y_pred = ensemble.predict(X_unknown.copy().values)
245 |         logger.info("Prediction done")
246 |     except Exception as e:
247 |         logger.info("##### Prediction failed, exiting! #####")
248 |         logger.info(e)
249 |         exit(2)
250 | 
251 | result_df = pd.DataFrame(
252 |     {parameter["id_field"]: row_id_unknown, 'prediction': pd.Series(y_pred, index=row_id_unknown.index)})
253 | logger.info("Exporting the data")
254 | result_df.to_csv(result_filename, index=False, header=True)
255 | logger.info("##### Zeroconf Script Completed! #####")
256 | utl.end_proc_success(parameter, logger)
257 | 
--------------------------------------------------------------------------------
/data/Adult.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/c61ada6d354a46535b51321ec85e6b3064aa3f4f/data/Adult.h5
--------------------------------------------------------------------------------
/data/Adult.h5.tar.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/c61ada6d354a46535b51321ec85e6b3064aa3f4f/data/Adult.h5.tar.gz
--------------------------------------------------------------------------------
/data/adult.names:
--------------------------------------------------------------------------------
1 | | This data was extracted from the census bureau database found at
2 | | http://www.census.gov/ftp/pub/DES/www/welcome.html
3 | | Donor: Ronny Kohavi and Barry Becker,
4 | |        Data Mining and Visualization
5 | |        Silicon Graphics.
6 | |        e-mail: ronnyk@sgi.com for questions.
7 | | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
8 | | 48842 instances, mix of continuous and discrete (train=32561, test=16281)
9 | | 45222 if instances with unknown values are removed (train=30162, test=15060)
10 | | Duplicate or conflicting instances : 6
11 | | Class probabilities for adult.all file
12 | | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
13 | | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
14 | |
15 | | Extraction was done by Barry Becker from the 1994 Census database. A set of
16 | | reasonably clean records was extracted using the following conditions:
17 | | ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
18 | |
19 | | Prediction task is to determine whether a person makes over 50K
20 | | a year.
21 | |
22 | | First cited in:
23 | | @inproceedings{kohavi-nbtree,
24 | |   author={Ron Kohavi},
25 | |   title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
26 | |          Decision-Tree Hybrid},
27 | |   booktitle={Proceedings of the Second International Conference on
28 | |              Knowledge Discovery and Data Mining},
29 | |   year = 1996,
30 | |   pages={to appear}}
31 | |
32 | | Error Accuracy reported as follows, after removal of unknowns from
33 | | train/test sets):
34 | | C4.5       : 84.46+-0.30
35 | | Naive-Bayes: 83.88+-0.30
36 | | NBTree     : 85.90+-0.28
37 | |
38 | |
39 | | Following algorithms were later run with the following error rates,
40 | | all after removal of unknowns and using the original train/test split.
41 | | All these numbers are straight runs using MLC++ with default values.
42 | |
43 | |    Algorithm               Error
44 | | -- ----------------        -----
45 | | 1  C4.5                    15.54
46 | | 2  C4.5-auto               14.46
47 | | 3  C4.5 rules              14.94
48 | | 4  Voted ID3 (0.6)         15.64
49 | | 5  Voted ID3 (0.8)         16.47
50 | | 6  T2                      16.84
51 | | 7  1R                      19.54
52 | | 8  NBTree                  14.10
53 | | 9  CN2                     16.00
54 | | 10 HOODG                   14.82
55 | | 11 FSS Naive Bayes         14.05
56 | | 12 IDTM (Decision table)   14.46
57 | | 13 Naive-Bayes             16.12
58 | | 14 Nearest-neighbor (1)    21.42
59 | | 15 Nearest-neighbor (3)    20.35
60 | | 16 OC1                     15.04
61 | | 17 Pebls                   Crashed. Unknown why (bounds WERE increased)
62 | |
63 | | Conversion of original data as follows:
64 | | 1. Discretized agrossincome into two ranges with threshold 50,000.
65 | | 2. Convert U.S. to US to avoid periods.
66 | | 3. Convert Unknown to "?"
67 | | 4. Run MLC++ GenCVFiles to generate data,test.
68 | |
69 | | Description of fnlwgt (final weight)
70 | |
71 | | The weights on the CPS files are controlled to independent estimates of the
72 | | civilian noninstitutional population of the US. These are prepared monthly
73 | | for us by Population Division here at the Census Bureau. We use 3 sets of
74 | | controls.
75 | | These are:
76 | | 1. A single cell estimate of the population 16+ for each state.
77 | | 2. Controls for Hispanic Origin by age and sex.
78 | | 3. Controls by Race, age and sex.
79 | |
80 | | We use all three sets of controls in our weighting program and "rake" through
81 | | them 6 times so that by the end we come back to all the controls we used.
82 | |
83 | | The term estimate refers to population totals derived from CPS by creating
84 | | "weighted tallies" of any specified socio-economic characteristics of the
85 | | population.
86 | |
87 | | People with similar demographic characteristics should have
88 | | similar weights. There is one important caveat to remember
89 | | about this statement. That is that since the CPS sample is
90 | | actually a collection of 51 state samples, each with its own
91 | | probability of selection, the statement only applies within
92 | | state.
/log/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
--------------------------------------------------------------------------------
/parameter/default.yml:
--------------------------------------------------------------------------------
 1 | #########################################################################
 2 | ## default parameters - used if no specific parameter set is given at
 3 | ## program call
 4 | #########################################################################
 5 | basedir: .
 6 | resultfile: ./data/zeroconf-result.csv
 7 | memory_limit: 15000 # MB
 8 | max_classifier_time_budget: 1200 # but 10 minutes is usually more than enough
 9 | max_sample_size: 100000 # so that the classifier's fit method completes in a reasonable time
10 | # definition of dataset fields
11 | id_field: cust_id # has to be ignored by the analysis
12 | target_field: category # name of the field with the target variable
--------------------------------------------------------------------------------
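default.yml above is plain YAML that zeroconf.py reads at start-up; its values are echoed in the example log output shown earlier. Below is a minimal sketch of loading such a file, assuming ruamel.yaml as pinned in requirements.txt; load_parameters is a hypothetical helper, not project API, and the real handling in bin/zeroconf.py and bin/utility.py may merge defaults differently.
<pre>
# Sketch only; the repository's own parameter handling lives in bin/zeroconf.py
# and bin/utility.py. load_parameters() is a hypothetical helper, not project API.
from ruamel.yaml import YAML

def load_parameters(path="parameter/default.yml"):
    with open(path) as f:
        return YAML(typ="safe").load(f)

params = load_parameters()
print(params["id_field"], params["target_field"])  # cust_id category
print(params["memory_limit"])                      # 15000 (MB)
</pre>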
/parameter/logger.yml:
--------------------------------------------------------------------------------
 1 | ################################################################
 2 | ## Logger setup parameter
 3 | ################################################################
 4 | ---
 5 | version: 1
 6 | disable_existing_loggers: False
 7 | formatters:
 8 |     simple:
 9 |         format: "%(asctime)s - [ZEROCONF] - %(name)s - %(levelname)s - %(message)s"
10 |
11 | handlers:
12 |     console:
13 |         class: logging.StreamHandler
14 |         level: DEBUG
15 |         formatter: simple
16 |         stream: ext://sys.stdout
17 |
18 |     info_file_handler:
19 |         class: logging.handlers.RotatingFileHandler
20 |         level: INFO
21 |         formatter: simple
22 |         filename: ./log/info.log
23 |         maxBytes: 10485760 # 10MB
24 |         backupCount: 20
25 |         encoding: utf8
26 |
27 |     error_file_handler:
28 |         class: logging.handlers.RotatingFileHandler
29 |         level: ERROR
30 |         formatter: simple
31 |         filename: ./log/errors.log
32 |         maxBytes: 10485760 # 10MB
33 |         backupCount: 20
34 |         encoding: utf8
35 |
36 | loggers:
37 |     my_module:
38 |         level: DEBUG
39 |         handlers: [console]
40 |         propagate: no
41 |
42 | root:
43 |     level: INFO
44 |     handlers: [console, info_file_handler, error_file_handler]
--------------------------------------------------------------------------------
/parameter/standard.yml:
--------------------------------------------------------------------------------
1 | basedir: .
2 | resultfile: ./data/zeroconf-result.csv
3 | memory_limit: 15000 # MB
4 | max_classifier_time_budget: 1200 # but 10 minutes is usually more than enough
5 | max_sample_size: 100000 # so that the classifier's fit method completes in a reasonable time
6 | # definition of dataset fields
7 | id_field: cust_id # has to be ignored by the analysis
8 | target_field: category # name of the field with the target variable
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.16.2
2 | bunch
3 | psutil
4 | tables
5 | ruamel.yaml
6 | cython
7 | auto-sklearn==0.3.0
--------------------------------------------------------------------------------
/work/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
--------------------------------------------------------------------------------
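logger.yml follows the standard logging.config dictConfig schema (version 1 with formatters, handlers, and loggers sections), and its "simple" formatter produces exactly the [ZEROCONF] lines in the example output. A minimal sketch of activating it follows; how bin/zeroconf.py actually wires the logger may differ.
<pre>
# Sketch only; the repository's actual logger setup may differ from this.
# Requires the ./log directory (kept in the repo via log/.gitignore) to exist.
import logging
import logging.config
from ruamel.yaml import YAML

with open("parameter/logger.yml") as f:
    logging.config.dictConfig(YAML(typ="safe").load(f))

# The "simple" formatter yields lines like the example output above:
# 2017-10-11 10:52:15,944 - [ZEROCONF] - zeroconf.py - INFO - Read dataset ...
logging.getLogger("zeroconf.py").info("Read dataset from the store")
</pre>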