├── .gitignore
├── .idea
│   └── vcs.xml
├── AutosklearnModellingLossOverTimeExample.png
├── LICENSE.txt
├── README.md
├── bin
│   ├── dataTransformationProcessing.py
│   ├── evaluate-dataset-Adult.py
│   ├── load-dataset-Adult.py
│   ├── load-dataset-Titanic.py
│   ├── utility.py
│   └── zeroconf.py
├── data
│   ├── Adult.h5
│   ├── Adult.h5.tar.gz
│   ├── adult.data
│   ├── adult.names
│   ├── adult.test
│   └── adult.test.withid
├── log
│   └── .gitignore
├── parameter
│   ├── default.yml
│   ├── logger.yml
│   └── standard.yml
├── requirements.txt
└── work
    └── .gitignore

/.gitignore:
--------------------------------------------------------------------------------
# Created by .ignore support plugin (hsz.mobi)
work
log
data/zeroconf-result.csv
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="$PROJECT_DIR$" vcs="Git" />
  </component>
</project>
--------------------------------------------------------------------------------
/AutosklearnModellingLossOverTimeExample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/c61ada6d354a46535b51321ec85e6b3064aa3f4f/AutosklearnModellingLossOverTimeExample.png
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Copyright 2017 PayPal

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## What is autosklearn-zeroconf
The autosklearn-zeroconf script takes a dataframe of any size and trains an [auto-sklearn](https://github.com/automl/auto-sklearn) binary classifier ensemble. No configuration is needed, as the name suggests.
Auto-sklearn is the winner of the recent [AutoML Challenge](http://www.kdnuggets.com/2016/08/winning-automl-challenge-auto-sklearn.html) ([more @microsoft.com](https://www.microsoft.com/en-us/research/blog/automl-challenge-leap-forward-machine-learning-competitions/)).

As a result of using autosklearn-zeroconf, running auto-sklearn becomes a "fire and forget" type of operation. It greatly increases the utility of, and decreases the turnaround time for, experiments.

The main value proposition is that a data analyst or a data-savvy business user can quickly iterate on the data side (actual sources and feature design) while not a single line has to be changed on the ML side. So it is a great tool for people not doing hardcore data science full time; up to 90% of (marketing) data analysts may currently fall into this target group.

## How Does It Work
To keep the training time reasonable, autosklearn-zeroconf samples the data and tests every model from the auto-sklearn library on it once. The durations measured in this test are used to calculate the per_run_time_limit, time_left_for_this_task and number-of-seeds parameters for auto-sklearn (a simplified sketch of this calculation follows the run command below). The code also converts the pandas dataframe into a form that auto-sklearn can handle (categorical and float datatypes).

## Algorithms included
bernoulli_nb, extra_trees, gaussian_nb, adaboost, gradient_boosting, k_nearest_neighbors, lda, liblinear_svc, multinomial_nb, passive_aggressive, random_forest, sgd

plus samplers, scalers and imputers (14 feature processing methods and 3 data preprocessing methods, giving rise to a structured hypothesis space with 100+ hyperparameters)

## Running autosklearn-zeroconf
Run autosklearn-zeroconf from the command line:
```
python bin/zeroconf.py -d your_dataframe.h5
```
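As described in "How Does It Work", the fit times measured on the sample drive auto-sklearn's time budgets. A simplified sketch of that arithmetic, mirroring calculate_time_left_for_this_task() in bin/dataTransformationProcessing.py (the example numbers are taken from the run log further below):

```python
import math

# Simplified sketch of zeroconf's budget arithmetic; see
# calculate_time_left_for_this_task() in bin/dataTransformationProcessing.py.
def time_left_for_this_task(pool_size, per_run_time_limit):
    queue_factor = 30
    if queue_factor * pool_size < 100:  # aim for at least ~100 models overall
        queue_factor = 100 / pool_size
    return int(queue_factor * per_run_time_limit)

# With the values from the example run below (2 processes, 5s per model):
print(time_left_for_this_task(pool_size=2, per_run_time_limit=5))  # 250 seconds
print(2 * math.ceil(250 / 60.0))  # ~10 minutes overall, as the log reports
```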
The script was tested on Ubuntu and RedHat. It won't work on Windows because auto-sklearn doesn't support Windows.

## Data Format
The code manages the data as a pandas dataframe, stored in an HDF5 .h5 file for convenience (via the Python module "tables").

## Example
As an example you can run autosklearn-zeroconf on the "Census Income" dataset https://archive.ics.uci.edu/ml/datasets/Adult:
```
python ./bin/zeroconf.py -d ./data/Adult.h5
```
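The bundled data/Adult.h5 was produced by bin/load-dataset-Adult.py; the same pattern works for any dataset. A minimal sketch (the feature columns are made up for illustration, while cust_id and category are the default identifier and target column names):

```python
import pandas as pd

# Toy example; cust_id and category are the default id/target column
# names (configurable via the yml files under parameter/).
df = pd.DataFrame({
    'cust_id':  [1, 2, 3, 4],
    'age':      [25, 47, 38, 51],            # numeric feature
    'city':     ['NYC', 'SF', 'NYC', 'LA'],  # categorical feature
    'category': [0, 1, None, 0],             # None marks rows to be predicted
})

store = pd.HDFStore('your_dataframe.h5')  # needs the "tables" module
store['data'] = df  # zeroconf.py reads the dataframe from the 'data' key
store.close()
```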
Then evaluate the prediction stored in zeroconf-result.csv against the test dataset file adult.test.withid:
```
python ./bin/evaluate-dataset-Adult.py
```
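The prediction is exported to data/zeroconf-result.csv with one row per identifier; to simply inspect it, something like this works:

```python
import pandas as pd

# zeroconf.py writes two columns: the id field and the predicted class.
result = pd.read_csv('./data/zeroconf-result.csv')
print(result.head())  # columns: cust_id, prediction
```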
## Installation
The script itself needs no installation; just copy it, together with the rest of the files, into your working directory. Alternatively you can use git clone:
```
sudo apt-get update && sudo apt-get install git && git clone https://github.com/paypal/autosklearn-zeroconf.git
```
### Happy path installation on Ubuntu 18.04 LTS
```
sudo apt-get update && sudo apt-get install git gcc build-essential swig python-pip virtualenv python3-dev
git clone https://github.com/paypal/autosklearn-zeroconf.git
pip install virtualenv
virtualenv zeroconf -p /usr/bin/python3.6
source zeroconf/bin/activate
curl https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/master/requirements.txt | xargs -n 1 -L 1 pip install
cd autosklearn-zeroconf/ && python ./bin/zeroconf.py -d ./data/Adult.h5 2>/dev/null
```

## License
autosklearn-zeroconf is licensed under the [BSD 3-Clause License (Revised)](LICENSE.txt)

## Example of the output
python zeroconf.py -d ./data/Adult.h5 2>/dev/null | grep '\[ZEROCONF\]'
 67 | 
 68 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO - Program Call Parameter (Arguments and Parameter File Values):
 69 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO -    basedir: /home/ulrich/PycharmProjects/autosklearn-zeroconf
 70 | 2017-10-11 10:52:15,893 - [ZEROCONF] - zeroconf.py - INFO -    data_file: /home/ulrich/PycharmProjects/autosklearn-zeroconf/data/Adult.h5
 71 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    id_field: cust_id
 72 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    max_classifier_time_budget: 1200
 73 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    max_sample_size: 100000
 74 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    memory_limit: 15000
 75 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    parameter_file: /home/ulrich/PycharmProjects/autosklearn-zeroconf/parameter/default.yml
 76 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    proc: zeroconf.py
 77 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    resultfile: /home/ulrich/PycharmProjects/autosklearn-zeroconf/data/zeroconf-result.csv
 78 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    runid: 20171011105215
 79 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    runtype: Fresh Run Start
 80 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    target_field: category
 81 | 2017-10-11 10:52:15,894 - [ZEROCONF] - zeroconf.py - INFO -    workdir: /home/ulrich/PycharmProjects/autosklearn-zeroconf/work/20171011105215
 82 | 2017-10-11 10:52:15,944 - [ZEROCONF] - zeroconf.py - INFO - Read dataset from the store
 83 | 2017-10-11 10:52:15,945 - [ZEROCONF] - zeroconf.py - INFO - Values of y [  0.   1.  nan]
 84 | 2017-10-11 10:52:15,945 - [ZEROCONF] - zeroconf.py - INFO - We need to protect NAs in y from the prediction dataset so we convert them to -1
 85 | 2017-10-11 10:52:15,946 - [ZEROCONF] - zeroconf.py - INFO - New values of y [ 0.  1. -1.]
 86 | 2017-10-11 10:52:15,946 - [ZEROCONF] - zeroconf.py - INFO - Filling missing values in X with the most frequent values
 87 | 2017-10-11 10:52:16,043 - [ZEROCONF] - zeroconf.py - INFO - Factorizing the X
 88 | 2017-10-11 10:52:16,176 - [ZEROCONF] - x_y_dataframe_split - INFO - Dataframe split into X and y
 89 | 2017-10-11 10:52:16,178 - [ZEROCONF] - zeroconf.py - INFO - Preparing a sample to measure approx classifier run time and select features
 90 | 2017-10-11 10:52:16,191 - [ZEROCONF] - zeroconf.py - INFO - train size:21815
 91 | 2017-10-11 10:52:16,191 - [ZEROCONF] - zeroconf.py - INFO - test size:10746
 92 | 2017-10-11 10:52:16,192 - [ZEROCONF] - zeroconf.py - INFO - Reserved 33% of the training dataset for validation (upto 33k rows)
 93 | 2017-10-11 10:52:16,209 - [ZEROCONF] - max_estimators_fit_duration - INFO - Constructing preprocessor pipeline and transforming sample data
 94 | 2017-10-11 10:52:18,712 - [ZEROCONF] - max_estimators_fit_duration - INFO - Running estimators on the sample
 95 | 2017-10-11 10:52:18,729 - [ZEROCONF] - zeroconf.py - INFO - adaboost starting
 96 | 2017-10-11 10:52:18,734 - [ZEROCONF] - zeroconf.py - INFO - bernoulli_nb starting
 97 | 2017-10-11 10:52:18,761 - [ZEROCONF] - zeroconf.py - INFO - extra_trees starting
 98 | 2017-10-11 10:52:18,769 - [ZEROCONF] - zeroconf.py - INFO - decision_tree starting
 99 | 2017-10-11 10:52:18,780 - [ZEROCONF] - zeroconf.py - INFO - gaussian_nb starting
100 | 2017-10-11 10:52:18,800 - [ZEROCONF] - zeroconf.py - INFO - bernoulli_nb training time: 0.06455278396606445
101 | 2017-10-11 10:52:18,802 - [ZEROCONF] - zeroconf.py - INFO - gradient_boosting starting
102 | 2017-10-11 10:52:18,808 - [ZEROCONF] - zeroconf.py - INFO - k_nearest_neighbors starting
103 | 2017-10-11 10:52:18,809 - [ZEROCONF] - zeroconf.py - INFO - decision_tree training time: 0.03273773193359375
104 | 2017-10-11 10:52:18,826 - [ZEROCONF] - zeroconf.py - INFO - lda starting
105 | 2017-10-11 10:52:18,845 - [ZEROCONF] - zeroconf.py - INFO - liblinear_svc starting
106 | 2017-10-11 10:52:18,867 - [ZEROCONF] - zeroconf.py - INFO - gaussian_nb training time: 0.08569979667663574
107 | 2017-10-11 10:52:18,882 - [ZEROCONF] - zeroconf.py - INFO - multinomial_nb starting
108 | 2017-10-11 10:52:18,905 - [ZEROCONF] - zeroconf.py - INFO - passive_aggressive starting
109 | 2017-10-11 10:52:18,943 - [ZEROCONF] - zeroconf.py - INFO - random_forest starting
110 | 2017-10-11 10:52:18,971 - [ZEROCONF] - zeroconf.py - INFO - sgd starting
111 | 2017-10-11 10:52:19,012 - [ZEROCONF] - zeroconf.py - INFO - lda training time: 0.17656564712524414
112 | 2017-10-11 10:52:19,023 - [ZEROCONF] - zeroconf.py - INFO - multinomial_nb training time: 0.13777780532836914
113 | 2017-10-11 10:52:19,124 - [ZEROCONF] - zeroconf.py - INFO - liblinear_svc training time: 0.27405595779418945
114 | 2017-10-11 10:52:19,416 - [ZEROCONF] - zeroconf.py - INFO - passive_aggressive training time: 0.508676290512085
115 | 2017-10-11 10:52:19,473 - [ZEROCONF] - zeroconf.py - INFO - sgd training time: 0.49777913093566895
116 | 2017-10-11 10:52:20,471 - [ZEROCONF] - zeroconf.py - INFO - adaboost training time: 1.7392246723175049
117 | 2017-10-11 10:52:20,625 - [ZEROCONF] - zeroconf.py - INFO - k_nearest_neighbors training time: 1.8141863346099854
118 | 2017-10-11 10:52:22,258 - [ZEROCONF] - zeroconf.py - INFO - extra_trees training time: 3.4934401512145996
119 | 2017-10-11 10:52:22,696 - [ZEROCONF] - zeroconf.py - INFO - random_forest training time: 3.7496204376220703
120 | 2017-10-11 10:52:24,215 - [ZEROCONF] - zeroconf.py - INFO - gradient_boosting training time: 5.41023063659668
121 | 2017-10-11 10:52:24,230 - [ZEROCONF] - max_estimators_fit_duration - INFO - Test classifier fit completed
122 | 2017-10-11 10:52:24,239 - [ZEROCONF] - zeroconf.py - INFO - per_run_time_limit=5
123 | 2017-10-11 10:52:24,239 - [ZEROCONF] - zeroconf.py - INFO - Process pool size=2
124 | 2017-10-11 10:52:24,240 - [ZEROCONF] - zeroconf.py - INFO - Starting autosklearn classifiers fiting on a 67% sample up to 67k rows
125 | 2017-10-11 10:52:24,252 - [ZEROCONF] - train_multicore - INFO - Max time allowance for a model 1 minute(s)
126 | 2017-10-11 10:52:24,252 - [ZEROCONF] - train_multicore - INFO - Overal run time is about 10 minute(s)
127 | 2017-10-11 10:52:24,255 - [ZEROCONF] - train_multicore - INFO - Multicore process 2 started
128 | 2017-10-11 10:52:24,258 - [ZEROCONF] - train_multicore - INFO - Multicore process 3 started
129 | 2017-10-11 10:52:24,276 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Start AutoSklearnClassifier seed=2
130 | 2017-10-11 10:52:24,278 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Start AutoSklearnClassifier seed=3
131 | 2017-10-11 10:52:24,295 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Done AutoSklearnClassifier seed=3
132 | 2017-10-11 10:52:24,297 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Done AutoSklearnClassifier seed=2
133 | 2017-10-11 10:52:26,299 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Starting seed=2
134 | 2017-10-11 10:52:27,298 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - Starting seed=3
135 | 2017-10-11 10:56:30,949 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - ####### Finished seed=2
136 | 2017-10-11 10:56:31,600 - [ZEROCONF] - spawn_autosklearn_classifier - INFO - ####### Finished seed=3
137 | 2017-10-11 10:56:31,614 - [ZEROCONF] - train_multicore - INFO - Multicore fit completed
138 | 2017-10-11 10:56:31,626 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Building ensemble
139 | 2017-10-11 10:56:31,626 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Done AutoSklearnClassifier - seed:1
140 | 2017-10-11 10:56:54,017 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Ensemble built - seed:1
141 | 2017-10-11 10:56:54,017 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - Show models - seed:1
142 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - [(0.400000, SimpleClassificationPipeline({'classifier:__choice__': 'adaboost', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:select_percentile_classification:percentile': 85.5410729966473, 'classifier:adaboost:n_estimators': 88, 'one_hot_encoding:minimum_fraction': 0.01805038589303469, 'rescaling:__choice__': 'minmax', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'select_percentile_classification', 'classifier:adaboost:max_depth': 1, 'classifier:adaboost:learning_rate': 0.10898092508755285, 'preprocessor:select_percentile_classification:score_func': 'chi2', 'imputation:strategy': 'most_frequent', 'classifier:adaboost:algorithm': 'SAMME.R'},
143 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
144 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'task': 1,
145 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'signed': False,
146 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'sparse': False,
147 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multiclass': False,
148 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'target_type': 'classification',
149 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multilabel': False})),
150 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.300000, SimpleClassificationPipeline({'classifier:__choice__': 'random_forest', 'classifier:random_forest:min_weight_fraction_leaf': 0.0, 'one_hot_encoding:use_minimum_fraction': 'True', 'classifier:random_forest:criterion': 'gini', 'classifier:random_forest:min_samples_leaf': 4, 'classifier:random_forest:max_depth': 'None', 'classifier:random_forest:min_samples_split': 16, 'classifier:random_forest:bootstrap': 'False', 'one_hot_encoding:minimum_fraction': 0.1453954841364777, 'rescaling:__choice__': 'none', 'balancing:strategy': 'none', 'preprocessor:__choice__': 'select_percentile_classification', 'preprocessor:select_percentile_classification:percentile': 96.35414862145892, 'preprocessor:select_percentile_classification:score_func': 'chi2', 'imputation:strategy': 'mean', 'classifier:random_forest:max_leaf_nodes': 'None', 'classifier:random_forest:max_features': 3.342759426984195, 'classifier:random_forest:n_estimators': 100},
151 | 2017-10-11 10:56:54,596 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
152 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'task': 1,
153 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'signed': False,
154 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'sparse': False,
155 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multiclass': False,
156 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'target_type': 'classification',
157 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multilabel': False})),
158 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.200000, SimpleClassificationPipeline({'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:__choice__': 'extra_trees', 'classifier:extra_trees:n_estimators': 100, 'classifier:extra_trees:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:min_samples_split': 5, 'classifier:extra_trees:min_samples_leaf': 10, 'rescaling:__choice__': 'minmax', 'classifier:extra_trees:max_depth': 'None', 'preprocessor:extra_trees_preproc_for_classification:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:criterion': 'gini', 'classifier:extra_trees:max_features': 4.413198608615693, 'classifier:extra_trees:criterion': 'gini', 'preprocessor:extra_trees_preproc_for_classification:n_estimators': 100, 'classifier:extra_trees:min_samples_split': 16, 'one_hot_encoding:use_minimum_fraction': 'False', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'extra_trees_preproc_for_classification', 'preprocessor:extra_trees_preproc_for_classification:min_samples_leaf': 1, 'preprocessor:extra_trees_preproc_for_classification:max_features': 1.4824479003506632, 'imputation:strategy': 'median', 'preprocessor:extra_trees_preproc_for_classification:min_weight_fraction_leaf': 0.0, 'preprocessor:extra_trees_preproc_for_classification:max_depth': 'None'},
159 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
160 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'task': 1,
161 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'signed': False,
162 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'sparse': False,
163 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multiclass': False,
164 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'target_type': 'classification',
165 | 2017-10-11 10:56:54,597 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multilabel': False})),
166 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - (0.100000, SimpleClassificationPipeline({'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:__choice__': 'extra_trees', 'classifier:extra_trees:n_estimators': 100, 'classifier:extra_trees:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:min_samples_split': 16, 'classifier:extra_trees:min_samples_leaf': 10, 'rescaling:__choice__': 'minmax', 'classifier:extra_trees:max_depth': 'None', 'preprocessor:extra_trees_preproc_for_classification:bootstrap': 'True', 'preprocessor:extra_trees_preproc_for_classification:criterion': 'gini', 'classifier:extra_trees:max_features': 4.16852017424403, 'classifier:extra_trees:criterion': 'gini', 'preprocessor:extra_trees_preproc_for_classification:n_estimators': 100, 'classifier:extra_trees:min_samples_split': 16, 'one_hot_encoding:use_minimum_fraction': 'False', 'balancing:strategy': 'weighting', 'preprocessor:__choice__': 'extra_trees_preproc_for_classification', 'preprocessor:extra_trees_preproc_for_classification:min_samples_leaf': 1, 'preprocessor:extra_trees_preproc_for_classification:max_features': 1.5781770540350555, 'imputation:strategy': 'median', 'preprocessor:extra_trees_preproc_for_classification:min_weight_fraction_leaf': 0.0, 'preprocessor:extra_trees_preproc_for_classification:max_depth': 'None'},
167 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - dataset_properties={
168 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'task': 1,
169 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'signed': False,
170 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'sparse': False,
171 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multiclass': False,
172 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'target_type': 'classification',
173 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO -   'multilabel': False})),
174 | 2017-10-11 10:56:54,598 - [ZEROCONF] - zeroconf_fit_ensemble - INFO - ]
175 | 2017-10-11 10:56:54,613 - [ZEROCONF] - zeroconf.py - INFO - Validating
176 | 2017-10-11 10:56:54,613 - [ZEROCONF] - zeroconf.py - INFO - Predicting on validation set
177 | 2017-10-11 10:56:57,373 - [ZEROCONF] - zeroconf.py - INFO - ########################################################################
178 | 2017-10-11 10:56:57,374 - [ZEROCONF] - zeroconf.py - INFO - Accuracy score 84%
179 | 2017-10-11 10:56:57,374 - [ZEROCONF] - zeroconf.py - INFO - The below scores are calculated for predicting '1' category value
180 | 2017-10-11 10:56:57,379 - [ZEROCONF] - zeroconf.py - INFO - Precision: 64%, Recall: 77%, F1: 0.70
181 | 2017-10-11 10:56:57,379 - [ZEROCONF] - zeroconf.py - INFO - Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall
182 | 2017-10-11 10:56:57,386 - [ZEROCONF] - zeroconf.py - INFO - [7058 1100]
183 | 2017-10-11 10:56:57,386 - [ZEROCONF] - zeroconf.py - INFO - [ 603 1985]
184 | 2017-10-11 10:56:57,392 - [ZEROCONF] - zeroconf.py - INFO - Baseline 2588 positives from 10746 overall = 24.1%
185 | 2017-10-11 10:56:57,392 - [ZEROCONF] - zeroconf.py - INFO - ########################################################################
186 | 2017-10-11 10:56:57,404 - [ZEROCONF] - x_y_dataframe_split - INFO - Dataframe split into X and y
187 | 2017-10-11 10:56:57,405 - [ZEROCONF] - zeroconf.py - INFO - Re-fitting the model ensemble on full known dataset to prepare for prediciton. This can take a long time.
188 | 2017-10-11 10:58:39,836 - [ZEROCONF] - zeroconf.py - INFO - Predicting. This can take a long time for a large prediction set.
189 | 2017-10-11 10:58:45,221 - [ZEROCONF] - zeroconf.py - INFO - Prediction done
190 | 2017-10-11 10:58:45,223 - [ZEROCONF] - zeroconf.py - INFO - Exporting the data
191 | 2017-10-11 10:58:45,267 - [ZEROCONF] - zeroconf.py - INFO - ##### Zeroconf Script Completed! #####
192 | 2017-10-11 10:58:45,268 - [ZEROCONF] - zeroconf.py - INFO - Clean up / Delete work directory: /home/ulrich/PycharmProjects/autosklearn-zeroconf/work/20171011105215
193 | 
194 | Process finished with exit code 0
195 | 
196 | 197 |
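The leading numbers in the show_models() listing above (0.400000, 0.300000, 0.200000, 0.100000) are the ensemble weights; they sum to 1.0. Conceptually, the ensemble combines its member pipelines by weighting their predicted class probabilities. A sketch only, since auto-sklearn does this internally (the pipelines list is hypothetical):

```python
import numpy as np

# Conceptual sketch of a weighted ensemble vote; auto-sklearn builds
# this internally from the weights shown by show_models().
weights = [0.4, 0.3, 0.2, 0.1]

def ensemble_predict(pipelines, X):
    proba = sum(w * p.predict_proba(X) for w, p in zip(weights, pipelines))
    return np.argmax(proba, axis=1)  # class with the highest weighted probability
```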
198 | python evaluate-dataset-Adult.py 
199 | [ZEROCONF]  # 00:37:43 #
200 | [ZEROCONF] ######################################################################## # 00:37:43 #
201 | [ZEROCONF] Accuracy score 85% # 00:37:43 #
202 | [ZEROCONF] The below scores are calculated for predicting '1' category value # 00:37:43 #
203 | [ZEROCONF] Precision: 65%, Recall: 78%, F1: 0.71 # 00:37:43 #
204 | [ZEROCONF] Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall # 00:37:43 #
205 | [ZEROCONF] [[10835  1600] # 00:37:43 #
206 | [ZEROCONF]  [  860  2986]] # 00:37:43 #
207 | [ZEROCONF] Baseline 3846 positives from 16281 overall = 23.6% # 00:37:43 #
208 | [ZEROCONF] ######################################################################## # 00:37:43 #
209 | [ZEROCONF]  # 00:37:43 #
210 | 
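As a sanity check, every score printed above follows directly from the confusion matrix:

```python
# Recomputing the evaluation scores from the confusion matrix above.
# sklearn's confusion_matrix layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 10835, 1600, 860, 2986

precision = tp / (tp + fp)                          # 2986/4586  -> 65%
recall = tp / (tp + fn)                             # 2986/3846  -> 78%
f1 = 2 * precision * recall / (precision + recall)  # -> 0.71
baseline = (tp + fn) / (tn + fp + fn + tp)          # 3846/16281 -> 23.6%

print("Precision: {0:2.0%}, Recall: {1:2.0%}, F1: {2:.2f}".format(precision, recall, f1))
```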
## Workarounds
These are not related to autosklearn-zeroconf or auto-sklearn themselves, but are general issues that depend on your Python and OS installation.
### xgboost issues
#### complains about ELF header
```
pip uninstall xgboost; pip install --no-cache-dir -v xgboost==0.4a30
```
#### cannot find libraries
```
conda install libgcc # for xgboost
```
Alternatively, search for them with
```
sudo find / -name libgomp.so.1
/usr/lib/x86_64-linux-gnu/libgomp.so.1
```
and explicitly add them to the library path:
```
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libstdc++.so.6":"/usr/lib/x86_64-linux-gnu/libgomp.so.1"; python zeroconf.py -d Titanic.h5 2>/dev/null | grep ZEROCONF
```
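To verify from Python that these shared libraries can actually be resolved before preloading them, a quick check (library names as in the Ubuntu example above):

```python
import ctypes

# ctypes.CDLL raises OSError if the shared library cannot be found;
# these are the two libraries xgboost typically complains about.
ctypes.CDLL("libgomp.so.1")
ctypes.CDLL("libstdc++.so.6")
print("shared libraries resolved OK")
```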
Also see https://github.com/automl/auto-sklearn/issues/247

### Install auto-sklearn
```
# A compiler (gcc) is needed to compile a few things from the auto-sklearn requirements.txt
# Choose just the lines for your Linux flavor below

# On Ubuntu
sudo apt-get install gcc build-essential swig

# On CentOS 7-1611 http://www.osboxes.org/centos/ https://drive.google.com/file/d/0B_HAFnYs6Ur-bl8wUWZfcHVpMm8/view?usp=sharing
sudo yum -y update
sudo reboot
sudo yum install epel-release python34 python34-devel python34-setuptools
sudo yum -y groupinstall 'Development Tools'

# auto-sklearn requires swig 3.0
wget downloads.sourceforge.net/project/swig/swig/swig-3.0.12/swig-3.0.12.tar.gz -O swig-3.0.12.tar.gz
tar xf swig-3.0.12.tar.gz
cd swig-3.0.12
./configure --without-pcre
make
sudo make install
cd ..

sudo easy_install-3.4 pip
# if you want to use virtual environments
sudo pip3 install virtualenv
virtualenv zeroconf -p /usr/bin/python3.4
source zeroconf/bin/activate

curl https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/master/requirements.txt | xargs -n 1 -L 1 pip install
```
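After installation, a quick smoke test confirms that auto-sklearn imports cleanly and shows the classifier registry that bin/dataTransformationProcessing.py iterates over when timing the estimators (a minimal sketch, run inside the activated virtualenv):

```python
# Post-install smoke test.
import autosklearn
import autosklearn.pipeline.components.classification as clf_components

print(autosklearn.__version__)
# zeroconf times every classifier in this registry on a data sample:
print(sorted(clf_components._classifiers.keys()))
```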
255 | 
256 | 257 | # Contributors 258 | Egor Kobylkin, Ulrich Arndt 259 | -------------------------------------------------------------------------------- /bin/dataTransformationProcessing.py: -------------------------------------------------------------------------------- 1 | import inspect 2 | import math 3 | import multiprocessing 4 | import time 5 | import traceback 6 | from time import sleep 7 | 8 | import autosklearn.pipeline 9 | import autosklearn.pipeline.components.classification 10 | import utility as utl 11 | import warnings 12 | warnings.simplefilter(action='ignore', category=FutureWarning) 13 | import pandas as pd 14 | import psutil 15 | from autosklearn.classification import AutoSklearnClassifier 16 | from autosklearn.constants import * 17 | from autosklearn.pipeline.classification import SimpleClassificationPipeline 18 | 19 | 20 | def time_single_estimator(clf_name, clf_class, X, y, max_clf_time, logger): 21 | if ('libsvm_svc' == clf_name # doesn't even scale to a 100k rows 22 | or 'qda' == clf_name): # crashes 23 | return 0 24 | logger.info(clf_name + " starting") 25 | default = clf_class.get_hyperparameter_search_space().get_default_configuration() 26 | clf = clf_class(**default._values) 27 | t0 = time.time() 28 | try: 29 | clf.fit(X, y) 30 | except Exception as e: 31 | logger.info(e) 32 | classifier_time = time.time() - t0 # keep time even if classifier crashed 33 | logger.info(clf_name + " training time: " + str(classifier_time)) 34 | if max_clf_time.value < int(classifier_time): 35 | max_clf_time.value = int(classifier_time) 36 | # no return statement here because max_clf_time is a managed object 37 | 38 | 39 | def max_estimators_fit_duration(X, y, max_classifier_time_budget, logger, sample_factor=1): 40 | lo = utl.get_logger(inspect.stack()[0][3]) 41 | 42 | lo.info("Constructing preprocessor pipeline and transforming sample data") 43 | # we don't care about the data here but need to preprocess, otherwise the classifiers crash 44 | 45 | pipeline = SimpleClassificationPipeline( 46 | include={'imputation': ['most_frequent'], 'rescaling': ['standardize']}) 47 | default_cs = pipeline.get_hyperparameter_search_space().get_default_configuration() 48 | pipeline = pipeline.set_hyperparameters(default_cs) 49 | 50 | pipeline.fit(X, y) 51 | X_tr, dummy = pipeline.fit_transformer(X, y) 52 | 53 | lo.info("Running estimators on the sample") 54 | # going over all default classifiers used by auto-sklearn 55 | clfs = autosklearn.pipeline.components.classification._classifiers 56 | 57 | processes = [] 58 | with multiprocessing.Manager() as manager: 59 | max_clf_time = manager.Value('i', 3) # default 3 sec 60 | for clf_name, clf_class in clfs.items(): 61 | pr = multiprocessing.Process(target=time_single_estimator, name=clf_name 62 | , args=(clf_name, clf_class, X_tr, y, max_clf_time, logger)) 63 | pr.start() 64 | processes.append(pr) 65 | for pr in processes: 66 | pr.join(max_classifier_time_budget) # will block for max_classifier_time_budget or 67 | # until the classifier fit process finishes. After max_classifier_time_budget 68 | # we will terminate all still running processes here. 
69 | if pr.is_alive(): 70 | logger.info("Terminating " + pr.name + " process due to timeout") 71 | pr.terminate() 72 | result_max_clf_time = max_clf_time.value 73 | 74 | lo.info("Test classifier fit completed") 75 | 76 | per_run_time_limit = int(sample_factor * result_max_clf_time) 77 | return max_classifier_time_budget if per_run_time_limit > max_classifier_time_budget else per_run_time_limit 78 | 79 | 80 | def read_dataframe_h5(filename, logger): 81 | with pd.HDFStore(filename, mode='r') as store: 82 | df = store.select('data') 83 | logger.info("Read dataset from the store") 84 | return df 85 | 86 | 87 | def x_y_dataframe_split(dataframe, parameter, id=False): 88 | lo = utl.get_logger(inspect.stack()[0][3]) 89 | 90 | lo.info("Dataframe split into X and y") 91 | X = dataframe.drop([parameter["id_field"], parameter["target_field"]], axis=1) 92 | y = pd.np.array(dataframe[parameter["target_field"]], dtype='int') 93 | if id: 94 | row_id = dataframe[parameter["id_field"]] 95 | return X, y, row_id 96 | else: 97 | return X, y 98 | 99 | 100 | def define_pool_size(memory_limit): 101 | # some classifiers can use more than one core - so keep this at half memory and cores 102 | max_pool_size = int(math.ceil(psutil.virtual_memory().total / (memory_limit * 1000000))) 103 | half_of_cores = int(math.ceil(psutil.cpu_count() / 2.0)) 104 | 105 | lo = utl.get_logger(inspect.stack()[0][3]) 106 | lo.info("Virtual Memory Size = " + str(psutil.virtual_memory().total) ) 107 | lo.info("CPU Count =" + str(psutil.cpu_count()) ) 108 | lo.info("Max CPU Pool Size by Memory = " + str(max_pool_size) ) 109 | 110 | return half_of_cores if max_pool_size > half_of_cores else max_pool_size 111 | 112 | 113 | def calculate_time_left_for_this_task(pool_size, per_run_time_limit): 114 | half_cpu_cores = pool_size 115 | queue_factor = 30 116 | if queue_factor * half_cpu_cores < 100: # 100 models to test overall 117 | queue_factor = 100 / half_cpu_cores 118 | 119 | time_left_for_this_task = int(queue_factor * per_run_time_limit) 120 | return time_left_for_this_task 121 | 122 | 123 | def spawn_autosklearn_classifier(X_train, y_train, seed, dataset_name, time_left_for_this_task, per_run_time_limit, 124 | feat_type, memory_limit, atsklrn_tempdir): 125 | lo = utl.get_logger(inspect.stack()[0][3]) 126 | 127 | try: 128 | lo.info("Start AutoSklearnClassifier seed=" + str(seed)) 129 | clf = AutoSklearnClassifier(time_left_for_this_task=time_left_for_this_task, 130 | per_run_time_limit=per_run_time_limit, 131 | ml_memory_limit=memory_limit, 132 | shared_mode=True, 133 | tmp_folder=atsklrn_tempdir, 134 | output_folder=atsklrn_tempdir, 135 | delete_tmp_folder_after_terminate=False, 136 | delete_output_folder_after_terminate=False, 137 | initial_configurations_via_metalearning=0, 138 | ensemble_size=0, 139 | seed=seed) 140 | except Exception: 141 | lo.exception("Exception AutoSklearnClassifier seed=" + str(seed)) 142 | raise 143 | 144 | lo = utl.get_logger(inspect.stack()[0][3]) 145 | lo.info("Done AutoSklearnClassifier seed=" + str(seed)) 146 | 147 | sleep(seed) 148 | 149 | try: 150 | lo.info("Starting seed=" + str(seed)) 151 | try: 152 | clf.fit(X_train, y_train, metric=autosklearn.metrics.f1, feat_type=feat_type, dataset_name=dataset_name) 153 | except Exception: 154 | lo = utl.get_logger(inspect.stack()[0][3]) 155 | lo.exception("Error in clf.fit - seed:" + str(seed)) 156 | raise 157 | except Exception: 158 | lo = utl.get_logger(inspect.stack()[0][3]) 159 | lo.exception("Exception in seed=" + str(seed) + ". 
") 160 | traceback.print_exc() 161 | raise 162 | lo = utl.get_logger(inspect.stack()[0][3]) 163 | lo.info("####### Finished seed=" + str(seed)) 164 | return None 165 | 166 | 167 | def train_multicore(X, y, feat_type, memory_limit, atsklrn_tempdir, pool_size=1, per_run_time_limit=60): 168 | lo = utl.get_logger(inspect.stack()[0][3]) 169 | 170 | time_left_for_this_task = calculate_time_left_for_this_task(pool_size, per_run_time_limit) 171 | 172 | lo.info("Max time allowance for a model " + str(math.ceil(per_run_time_limit / 60.0)) + " minute(s)") 173 | lo.info("Overal run time is about " + str(2 * math.ceil(time_left_for_this_task / 60.0)) + " minute(s)") 174 | 175 | processes = [] 176 | for i in range(2, pool_size + 2): # reserve seed 1 for the ensemble building 177 | seed = i 178 | pr = multiprocessing.Process(target=spawn_autosklearn_classifier 179 | , args=( 180 | X, y, i, 'foobar', time_left_for_this_task, per_run_time_limit, feat_type, memory_limit, atsklrn_tempdir)) 181 | pr.start() 182 | lo.info("Multicore process " + str(seed) + " started") 183 | processes.append(pr) 184 | for pr in processes: 185 | pr.join() 186 | 187 | lo.info("Multicore fit completed") 188 | 189 | 190 | def zeroconf_fit_ensemble(y, atsklrn_tempdir): 191 | lo = utl.get_logger(inspect.stack()[0][3]) 192 | 193 | lo.info("Building ensemble") 194 | 195 | seed = 1 196 | 197 | ensemble = AutoSklearnClassifier( 198 | time_left_for_this_task=300, per_run_time_limit=150, ml_memory_limit=20240, ensemble_size=50, 199 | ensemble_nbest=200, 200 | shared_mode=True, tmp_folder=atsklrn_tempdir, output_folder=atsklrn_tempdir, 201 | delete_tmp_folder_after_terminate=False, delete_output_folder_after_terminate=False, 202 | initial_configurations_via_metalearning=0, 203 | seed=seed) 204 | 205 | lo.info("Done AutoSklearnClassifier - seed:" + str(seed)) 206 | 207 | try: 208 | lo.debug("Start ensemble.fit_ensemble - seed:" + str(seed)) 209 | ensemble.fit_ensemble( 210 | task=BINARY_CLASSIFICATION 211 | , y=y 212 | , metric=autosklearn.metrics.f1 213 | , precision='32' 214 | , dataset_name='foobar' 215 | , ensemble_size=10 216 | , ensemble_nbest=15) 217 | except Exception: 218 | lo = utl.get_logger(inspect.stack()[0][3]) 219 | lo.exception("Error in ensemble.fit_ensemble - seed:" + str(seed)) 220 | raise 221 | 222 | lo = utl.get_logger(inspect.stack()[0][3]) 223 | lo.debug("Done ensemble.fit_ensemble - seed:" + str(seed)) 224 | 225 | sleep(20) 226 | lo.info("Ensemble built - seed:" + str(seed)) 227 | 228 | lo.info("Show models - seed:" + str(seed)) 229 | txtList = str(ensemble.show_models()).split("\n") 230 | for row in txtList: 231 | lo.info(row) 232 | 233 | return ensemble 234 | -------------------------------------------------------------------------------- /bin/evaluate-dataset-Adult.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Copyright 2017 Egor Kobylkin 4 | Created on Sun Apr 23 11:52:59 2017 5 | @author: ekobylkin 6 | This is an example on how to prepare data for autosklearn-zeroconf. 7 | It is using a well known Adult (Salary) dataset from UCI https://archive.ics.uci.edu/ml/datasets/Adult . 
8 | """ 9 | import pandas as pd 10 | 11 | test = pd.read_csv(filepath_or_buffer='./data/adult.test.withid',sep=',', error_bad_lines=False, index_col=False) 12 | #print(test) 13 | 14 | prediction = pd.read_csv(filepath_or_buffer='./data/zeroconf-result.csv',sep=',', error_bad_lines=False, index_col=False) 15 | #print(prediction) 16 | 17 | df=pd.merge(test, prediction, how='inner', on=['cust_id',]) 18 | 19 | y_test=df['category'] 20 | y_hat=df['prediction'] 21 | 22 | from sklearn.metrics import (confusion_matrix, precision_score 23 | , recall_score, f1_score, accuracy_score) 24 | from time import time,sleep,strftime 25 | def p(text): 26 | for line in str(text).splitlines(): 27 | print ('[ZEROCONF] '+line+" # "+strftime("%H:%M:%S")+" #") 28 | 29 | p("\n") 30 | p("#"*72) 31 | p("Accuracy score {0:2.0%}".format(accuracy_score(y_test, y_hat))) 32 | p("The below scores are calculated for predicting '1' category value") 33 | p("Precision: {0:2.0%}, Recall: {1:2.0%}, F1: {2:.2f}".format( 34 | precision_score(y_test, y_hat),recall_score(y_test, y_hat),f1_score(y_test, y_hat))) 35 | p("Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall") 36 | p(confusion_matrix(y_test, y_hat)) 37 | baseline_1 = str(sum(a for a in y_test)) 38 | baseline_all = str(len(y_test)) 39 | baseline_prcnt = "{0:2.0%}".format( float(sum(a for a in y_test)/len(y_test))) 40 | p("Baseline %s positives from %s overall = %1.1f%%" % 41 | (sum(a for a in y_test), len(y_test), 100*sum(a for a in y_test)/len(y_test))) 42 | p("#"*72) 43 | p("\n") 44 | -------------------------------------------------------------------------------- /bin/load-dataset-Adult.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Copyright 2017 Egor Kobylkin 4 | Created on Sun Apr 23 11:52:59 2017 5 | @author: ekobylkin 6 | This is an example on how to prepare data for autosklearn-zeroconf. 7 | It is using a well known Adult (Salary) dataset from UCI https://archive.ics.uci.edu/ml/datasets/Adult . 
8 | """ 9 | import pandas as pd 10 | # wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data 11 | # wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test 12 | col_names=[ 13 | 'age', 14 | 'workclass', 15 | 'fnlwgt', 16 | 'education', 17 | 'education-num', 18 | 'marital-status', 19 | 'occupation', 20 | 'relationship', 21 | 'race', 22 | 'sex', 23 | 'capital-gain', 24 | 'capital-loss', 25 | 'hours-per-week', 26 | 'native-country', 27 | 'category' 28 | ] 29 | 30 | train = pd.read_csv(filepath_or_buffer='../data/adult.data',sep=',', error_bad_lines=False, index_col=False, names=col_names) 31 | category_mapping={' >50K':1,' <=50K':0} 32 | train['category']= train['category'].map(category_mapping) 33 | #dataframe=train 34 | 35 | test = pd.read_csv(filepath_or_buffer='../data/adult.test',sep=',', error_bad_lines=False, index_col=False, names=col_names, skiprows=1) 36 | test['set_name']='test' 37 | category_mapping={' >50K.':1,' <=50K.':0} 38 | test['category']= test['category'].map(category_mapping) 39 | 40 | dataframe=train.append(test) 41 | 42 | # autosklearn-zeroconf requires cust_id and category (target or "y" variable) columns, the rest is optional 43 | dataframe['cust_id']=dataframe.index 44 | 45 | # let's save the test with the cus_id and binarized category for the validation of the prediction afterwards 46 | test_df=dataframe.loc[dataframe['set_name']=='test'].drop(['set_name'], axis=1) 47 | test_df.to_csv('../data/adult.test.withid', index=False, header=True) 48 | 49 | # We will use the test.csv data to make a prediction. You can compare the predicted values with the ground truth yourself. 50 | dataframe.loc[dataframe['set_name']=='test','category']=None 51 | dataframe=dataframe.drop(['set_name'], axis=1) 52 | 53 | print(dataframe) 54 | 55 | store = pd.HDFStore('../data/Adult.h5') # this is the file cache for the data 56 | store['data'] = dataframe 57 | store.close() 58 | #Now run 'python zeroconf.py Adult.h5' (python >=3.5) 59 | -------------------------------------------------------------------------------- /bin/load-dataset-Titanic.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Copyright 2017 PayPal 4 | Created on Sun Oct 02 17:13:59 2016 5 | @author: ekobylkin 6 | 7 | This is an example on how to prepare data for autosklearn-zeroconf. 8 | It is using a well known Titanic dataset from Kaggle https://www.kaggle.com/c/titanic . 9 | """ 10 | import pandas as pd 11 | # Dowlnoad these files from Kaggle dataset 12 | #https://www.kaggle.com/c/titanic/download/train.csv 13 | #https://www.kaggle.com/c/titanic/download/test.csv 14 | train = pd.read_csv(filepath_or_buffer='train.csv',sep=',', error_bad_lines=False, index_col=False) 15 | test = pd.read_csv(filepath_or_buffer='test.csv',sep=',', error_bad_lines=False, index_col=False) 16 | 17 | # We will use the test.csv data to make a prediction. You can compare the predicted values with the ground truth yourself. 
18 | test['Survived']=None # The empty target column tells autosklearn-zeroconf to use these cases for the prediction 19 | 20 | dataframe=train.append(test) 21 | 22 | # autosklearn-zeroconf requires cust_id and category (target or "y" variable) columns, the rest is optional 23 | dataframe.rename(columns = {'PassengerId':'cust_id','Survived':'category'},inplace=True) 24 | 25 | store = pd.HDFStore('Titanic.h5') # this is the file cache for the data 26 | store['data'] = dataframe 27 | store.close() 28 | #Now run 'python zeroconf.py Titanic.h5' (python >=3.5) 29 | -------------------------------------------------------------------------------- /bin/utility.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import logging 3 | 4 | import os 5 | import ruamel.yaml as yaml 6 | import shutil 7 | 8 | 9 | def init_process(file, basedir=''): 10 | absfile = os.path.abspath(file) 11 | if (basedir == ''): 12 | basedir = os.path.join(*splitall(absfile)[0:(len(splitall(absfile)) - 2)]) 13 | proc = os.path.basename(absfile) 14 | if not os.path.isdir(basedir + '/work'): 15 | os.mkdir(basedir + '/work') 16 | runidfile = basedir + '/work/current_runid.txt' 17 | runid, runtype = get_runid(runidfile, basedir) 18 | 19 | parameter = {} 20 | parameter["runid"] = runid 21 | parameter["runtype"] = runtype 22 | parameter["proc"] = proc 23 | parameter["workdir"] = basedir + '/work/' + runid 24 | return parameter 25 | 26 | 27 | def get_logger(name): 28 | ################## 29 | # the repeating setup of the logging is related to an issue in the sklearn package 30 | # this resulted in a lost of the logger... 31 | ################## 32 | setup_logging() 33 | logger = logging.getLogger(name) 34 | return logger 35 | 36 | 37 | def handle_exception(exc_type, exc_value, exc_traceback): 38 | logger = get_logger(__file__) 39 | if issubclass(exc_type, KeyboardInterrupt): 40 | sys.__excepthook__(exc_type, exc_value, exc_traceback) 41 | return 42 | logger.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback)) 43 | 44 | def merge_two_dicts(x, y): 45 | """Given two dicts, merge them into a new dict as a shallow copy.""" 46 | z = x.copy() 47 | z.update(y) 48 | return z 49 | 50 | def read_parameter(parameter_file, parameter): 51 | fr = open(parameter_file, "r") 52 | param = yaml.load(fr, yaml.RoundTripLoader) 53 | return merge_two_dicts(parameter,param) 54 | 55 | 56 | def end_proc_success(parameter, logger): 57 | logger.info("Clean up / Delete work directory: " + parameter["basedir"] + "/work/" + parameter["runid"]) 58 | shutil.rmtree(parameter["basedir"] + "/work/" + parameter["runid"]) 59 | exit(0) 60 | 61 | 62 | def setup_logging( 63 | default_path='./parameter/logger.yml', 64 | default_level=logging.INFO, 65 | env_key='LOG_CFG' 66 | ): 67 | """Setup logging configuration 68 | 69 | """ 70 | path = os.path.abspath(default_path) 71 | value = os.getenv(env_key, None) 72 | if value: 73 | path = value 74 | if os.path.exists(os.path.abspath(path)): 75 | with open(path, 'rt') as f: 76 | config = yaml.safe_load(f.read()) 77 | logging.config.dictConfig(config) 78 | else: 79 | logging.basicConfig(level=default_level) 80 | 81 | 82 | def get_runid(runidfile, basedir): 83 | now = datetime.datetime.now().strftime('%Y%m%d%H%M%S') 84 | if os.path.isfile(runidfile): 85 | rf = open(runidfile, 'r') 86 | runid = rf.read().rstrip() 87 | rf.close() 88 | if os.path.isdir(basedir + '/work/' + runid): 89 | runtype = 'RESTART' 90 | else: 91 | runtype = 'Fresh Run Start' 92 | rf = 
open(runidfile, 'w') 93 | runid = now 94 | print(runid, file=rf) 95 | rf.close() 96 | os.mkdir(basedir + '/work/' + runid) 97 | else: 98 | runtype = 'Fresh Run Start - no current_runid file' 99 | rf = open(runidfile, 'w') 100 | runid = now 101 | print(runid, file=rf) 102 | rf.close() 103 | os.mkdir(basedir + '/work/' + runid) 104 | return runid, runtype 105 | 106 | 107 | def splitall(path): 108 | allparts = [] 109 | while 1: 110 | parts = os.path.split(path) 111 | if parts[0] == path: # sentinel for absolute paths 112 | allparts.insert(0, parts[0]) 113 | break 114 | elif parts[1] == path: # sentinel for relative paths 115 | allparts.insert(0, parts[1]) 116 | break 117 | else: 118 | path = parts[0] 119 | allparts.insert(0, parts[1]) 120 | return allparts 121 | -------------------------------------------------------------------------------- /bin/zeroconf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 3 | """ 4 | Copyright 2017 PayPal 5 | Created on Mon Feb 27 19:11:59 PST 2017 6 | @author: ekobylkin 7 | @version 0.2 8 | @author: ulrich arndt - data2knowledge 9 | @update: 2017-09-27 10 | """ 11 | 12 | import argparse 13 | import numpy as np 14 | import os 15 | import pandas as pd 16 | import shutil 17 | 18 | from sklearn.model_selection import train_test_split 19 | from sklearn.metrics import (confusion_matrix, precision_score, 20 | recall_score, f1_score, accuracy_score) 21 | 22 | import utility as utl 23 | import dataTransformationProcessing as dt 24 | 25 | parameter = utl.init_process(__file__) 26 | 27 | ########################################################### 28 | # define the command line argument parser 29 | ########################################################### 30 | # https://docs.python.org/2/howto/argparse.html 31 | parser = argparse.ArgumentParser( 32 | description='zero configuration predictic modeling script. Requires a pandas HDFS dataframe file ' + \ 33 | 'and a yaml parameter file as input as input') 34 | parser.add_argument('-d', 35 | '--data_file', 36 | nargs=1, 37 | help='input pandas HDFS dataframe .h5 with an unique indentifier and a target column\n' + 38 | 'as well as additional data columns\n' 39 | 'default values are cust_id and category or need to be defined in an\n' + 40 | 'optional parameter file ' 41 | ) 42 | parser.add_argument('-p', 43 | '--param_file', 44 | help='input yaml parameter file' 45 | ) 46 | 47 | args = parser.parse_args() 48 | logger = utl.get_logger(os.path.basename(__file__)) 49 | logger.info("Program started with the following arguments:") 50 | logger.info(args) 51 | 52 | ########################################################### 53 | # set dir to project dir 54 | ########################################################### 55 | abspath = os.path.abspath(__file__) 56 | dname = os.path.dirname(os.path.dirname(abspath)) 57 | os.chdir(dname) 58 | 59 | ########################################################### 60 | # file check for the parameter 61 | ########################################################### 62 | param_file = '' 63 | if args.param_file: 64 | param_file = args.param_file[0] 65 | else: 66 | param_file = os.path.abspath("./parameter/default.yml") 67 | logger.info("Using the default parameter file: " + param_file) 68 | if (not (os.path.isfile(param_file))): 69 | msg = 'the input parameter file: ' + param_file + ' does not exist!' 
70 | logger.error(msg) 71 | exit(8) 72 | 73 | data_file = '' 74 | if args.data_file: 75 | data_file = args.data_file[0] 76 | else: 77 | msg = "A data file is mandatory!" 78 | logger.error(msg) 79 | exit(8) 80 | if (not (os.path.isfile(data_file))): 81 | msg = 'the input parameter file: ' + data_file + ' does not exist!' 82 | logger.error(msg) 83 | exit(8) 84 | 85 | parameter = utl.read_parameter(param_file, parameter) 86 | 87 | parameter["data_file"] = os.path.abspath(data_file) 88 | parameter["basedir"] = os.path.abspath(parameter["basedir"]) 89 | parameter["parameter_file"] = os.path.abspath(param_file) 90 | parameter["resultfile"] = os.path.abspath(parameter["resultfile"]) 91 | 92 | 93 | ########################################################### 94 | # set base dir 95 | ########################################################### 96 | os.chdir(parameter["basedir"]) 97 | logger.info("Set basedir to: " + parameter["basedir"]) 98 | 99 | logger = utl.get_logger(os.path.basename(__file__)) 100 | 101 | logger.info("Program Call Parameter (Arguments and Parameter File Values):") 102 | for key in sorted(parameter.keys()): 103 | logger.info(" " + key + ": " + str(parameter[key])) 104 | 105 | work_dir = parameter["workdir"] 106 | result_filename = parameter["resultfile"] 107 | atsklrn_tempdir = os.path.join(work_dir, 'atsklrn_tmp') 108 | shutil.rmtree(atsklrn_tempdir, ignore_errors=True) # cleanup - remove temp directory 109 | 110 | 111 | # if the memory limit is lower the model can fail and the whole process will crash 112 | memory_limit = parameter["memory_limit"] # MB 113 | global max_classifier_time_budget 114 | max_classifier_time_budget = parameter["max_classifier_time_budget"] # but 10 minutes is usually more than enough 115 | max_sample_size = parameter["max_sample_size"] # so that the classifiers fit method completes in a reasonable time 116 | 117 | dataframe = dt.read_dataframe_h5(data_file, logger) 118 | 119 | logger.info("Values of y " + str(dataframe[parameter["target_field"]].unique())) 120 | logger.info("We need to protect NAs in y from the prediction dataset so we convert them to -1") 121 | dataframe[parameter["target_field"]] = dataframe[parameter["target_field"]].fillna(-1) 122 | logger.info("New values of y " + str(dataframe[parameter["target_field"]].unique())) 123 | 124 | logger.info("Filling missing values in X with the most frequent values") 125 | dataframe = dataframe.fillna(dataframe.mode().iloc[0]) 126 | 127 | logger.info("Factorizing the X") 128 | # we need this list of original dtypes for the Autosklearn fit, create it before categorisation or split 129 | col_dtype_dict = {col: ('Numerical' if np.issubdtype(dataframe[col].dtype, np.number) else 'Categorical') 130 | for col in dataframe.columns if col not in [parameter["id_field"], parameter["target_field"]]} 131 | 132 | # http://stackoverflow.com/questions/25530504/encoding-column-labels-in-pandas-for-machine-learning 133 | # http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn?rq=1 134 | # https://github.com/automl/auto-sklearn/issues/121#issuecomment-251459036 135 | 136 | for col in dataframe.select_dtypes(exclude=[np.number]).columns: 137 | if col not in [parameter["id_field"], parameter["target_field"]]: 138 | dataframe[col] = dataframe[col].astype('category').cat.codes 139 | 140 | df_unknown = dataframe[dataframe[parameter["target_field"]] == -1] # 'None' gets categorzized into -1 141 | df_known = dataframe[dataframe[parameter["target_field"]] != -1] # not [0,1] for 
multiclass labeling compartibility 142 | logger.debug("Length of unknown dataframe:" + str(len(df_unknown))) 143 | logger.debug("Length of known dataframe:" + str(len(df_known))) 144 | 145 | del dataframe 146 | 147 | X, y = dt.x_y_dataframe_split(df_known, parameter) 148 | 149 | logger.info("Preparing a sample to measure approx classifier run time and select features") 150 | dataset_size = df_known.shape[0] 151 | 152 | if dataset_size > max_sample_size: 153 | sample_factor = dataset_size / float(max_sample_size) 154 | logger.info("Sample factor =" + str(sample_factor)) 155 | X_sample, y_sample = dt.x_y_dataframe_split(df_known.sample(max_sample_size, random_state=42), parameter) 156 | X_train, X_test, y_train, y_test = train_test_split(X.copy(), y, stratify=y, test_size=33000, 157 | random_state=42) # no need for larger test 158 | else: 159 | sample_factor = 1 160 | X_sample, y_sample = X.copy(), y 161 | X_train, X_test, y_train, y_test = train_test_split(X.copy(), y, stratify=y, test_size=0.33, random_state=42) 162 | logger.info("train size:" + str(len(X_train))) 163 | logger.info("test size:" + str(len(X_test))) 164 | logger.info("Reserved 33% of the training dataset for validation (upto 33k rows)") 165 | 166 | per_run_time_limit = dt.max_estimators_fit_duration(X_train.values, y_train, max_classifier_time_budget, logger, 167 | sample_factor) 168 | logger.info("per_run_time_limit=" + str(per_run_time_limit)) 169 | pool_size = dt.define_pool_size(int(memory_limit)) 170 | logger.info("Process pool size=" + str(pool_size)) 171 | feat_type = [col_dtype_dict[col] for col in X.columns] 172 | logger.info("Starting autosklearn classifiers fiting on a 67% sample up to 67k rows") 173 | dt.train_multicore(X_train.values, y_train, feat_type, int(memory_limit), atsklrn_tempdir, pool_size, 174 | per_run_time_limit) 175 | 176 | ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir) 177 | 178 | logger = utl.get_logger(os.path.basename(__file__)) 179 | logger.info("Validating") 180 | logger.info("Predicting on validation set") 181 | y_hat = ensemble.predict(X_test.values) 182 | 183 | logger.info("#" * 72) 184 | logger.info("Accuracy score {0:2.0%}".format(accuracy_score(y_test, y_hat))) 185 | logger.info("The below scores are calculated for predicting '1' category value") 186 | logger.info("Precision: {0:2.0%}, Recall: {1:2.0%}, F1: {2:.2f}".format( 187 | precision_score(y_test, y_hat), recall_score(y_test, y_hat), f1_score(y_test, y_hat))) 188 | ############################# 189 | ## Print COnfusion Matrix 190 | ############################# 191 | logger.info("Confusion Matrix: https://en.wikipedia.org/wiki/Precision_and_recall") 192 | cm = confusion_matrix(y_test, y_hat) 193 | for row in cm: 194 | logger.info(row) 195 | 196 | baseline_1 = str(sum(a for a in y_test)) 197 | baseline_all = str(len(y_test)) 198 | baseline_prcnt = "{0:2.0%}".format(float(sum(a for a in y_test) / len(y_test))) 199 | logger.info("Baseline %s positives from %s overall = %1.1f%%" % 200 | (sum(a for a in y_test), len(y_test), 100 * sum(a for a in y_test) / len(y_test))) 201 | logger.info("#" * 72) 202 | 203 | if df_unknown.shape[0] == 0: # if there is nothing to predict we can stop already 204 | logger.info("##### Nothing to predict. Prediction dataset is empty. #####") 205 | exit(0) 206 | 207 | X_unknown, y_unknown, row_id_unknown = dt.x_y_dataframe_split(df_unknown, parameter, id=True) 208 | 209 | logger.info("Re-fitting the model ensemble on full known dataset to prepare for prediciton. 
This can take a long time.") 210 | try: 211 | ensemble.refit(X.copy().values, y) 212 | except Exception as e: 213 | logger.info("Refit failed, reshuffling the rows, restarting") 214 | logger.info(e) 215 | try: 216 | X2 = X.copy().values 217 | indices = np.arange(X2.shape[0]) 218 | np.random.shuffle(indices) # a workaround to algoritm shortcomings 219 | X2 = X2[indices] 220 | y = y[indices] 221 | ensemble.refit(X2, y) 222 | except Exception as e: 223 | logger.info("Second refit failed") 224 | logger.info(e) 225 | logger.info( 226 | " WORKAROUND: because Refitting fails due to an upstream bug https://github.com/automl/auto-sklearn/issues/263") 227 | logger.info(" WORKAROUND: we are fitting autosklearn classifiers a second time, now on the full dataset") 228 | dt.train_multicore(X.values, y, feat_type, int(memory_limit), atsklrn_tempdir, pool_size, per_run_time_limit) 229 | ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir) 230 | 231 | logger.info("Predicting. This can take a long time for a large prediction set.") 232 | try: 233 | y_pred = ensemble.predict(X_unknown.copy().values) 234 | logger.info("Prediction done") 235 | except Exception as e: 236 | logger.info(e) 237 | logger.info( 238 | " WORKAROUND: because REfitting fails due to an upstream bug https://github.com/automl/auto-sklearn/issues/263") 239 | logger.info(" WORKAROUND: we are fitting autosklearn classifiers a second time, now on the full dataset") 240 | dt.train_multicore(X.values, y, feat_type, int(memory_limit), atsklrn_tempdir, pool_size, per_run_time_limit) 241 | ensemble = dt.zeroconf_fit_ensemble(y_train, atsklrn_tempdir) 242 | logger.info("Predicting. This can take a long time for a large prediction set.") 243 | try: 244 | y_pred = ensemble.predict(X_unknown.copy().values) 245 | logger.info("Prediction done") 246 | except Exception as e: 247 | logger.info("##### Prediction failed, exiting! #####") 248 | logger.info(e) 249 | exit(2) 250 | 251 | result_df = pd.DataFrame( 252 | {parameter["id_field"]: row_id_unknown, 'prediction': pd.Series(y_pred, index=row_id_unknown.index)}) 253 | logger.info("Exporting the data") 254 | result_df.to_csv(result_filename, index=False, header=True) 255 | logger.info("##### Zeroconf Script Completed! #####") 256 | utl.end_proc_success(parameter, logger) 257 | -------------------------------------------------------------------------------- /data/Adult.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/c61ada6d354a46535b51321ec85e6b3064aa3f4f/data/Adult.h5 -------------------------------------------------------------------------------- /data/Adult.h5.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/paypal/autosklearn-zeroconf/c61ada6d354a46535b51321ec85e6b3064aa3f4f/data/Adult.h5.tar.gz -------------------------------------------------------------------------------- /data/adult.names: -------------------------------------------------------------------------------- 1 | | This data was extracted from the census bureau database found at 2 | | http://www.census.gov/ftp/pub/DES/www/welcome.html 3 | | Donor: Ronny Kohavi and Barry Becker, 4 | | Data Mining and Visualization 5 | | Silicon Graphics. 6 | | e-mail: ronnyk@sgi.com for questions. 7 | | Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random). 
8 | | 48842 instances, mix of continuous and discrete (train=32561, test=16281)
9 | | 45222 if instances with unknown values are removed (train=30162, test=15060)
10 | | Duplicate or conflicting instances : 6
11 | | Class probabilities for adult.all file
12 | | Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
13 | | Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
14 | |
15 | | Extraction was done by Barry Becker from the 1994 Census database. A set of
16 | | reasonably clean records was extracted using the following conditions:
17 | | ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))
18 | |
19 | | Prediction task is to determine whether a person makes over 50K
20 | | a year.
21 | |
22 | | First cited in:
23 | | @inproceedings{kohavi-nbtree,
24 | |   author={Ron Kohavi},
25 | |   title={Scaling Up the Accuracy of Naive-Bayes Classifiers: a
26 | |          Decision-Tree Hybrid},
27 | |   booktitle={Proceedings of the Second International Conference on
28 | |              Knowledge Discovery and Data Mining},
29 | |   year = 1996,
30 | |   pages={to appear}}
31 | |
32 | | Error Accuracy reported as follows, after removal of unknowns from
33 | |    train/test sets):
34 | |    C4.5       : 84.46+-0.30
35 | |    Naive-Bayes: 83.88+-0.30
36 | |    NBTree     : 85.90+-0.28
37 | |
38 | |
39 | | Following algorithms were later run with the following error rates,
40 | |    all after removal of unknowns and using the original train/test split.
41 | |    All these numbers are straight runs using MLC++ with default values.
42 | |
43 | |    Algorithm               Error
44 | | -- ----------------        -----
45 | | 1  C4.5                    15.54
46 | | 2  C4.5-auto               14.46
47 | | 3  C4.5 rules              14.94
48 | | 4  Voted ID3 (0.6)         15.64
49 | | 5  Voted ID3 (0.8)         16.47
50 | | 6  T2                      16.84
51 | | 7  1R                      19.54
52 | | 8  NBTree                  14.10
53 | | 9  CN2                     16.00
54 | | 10 HOODG                   14.82
55 | | 11 FSS Naive Bayes         14.05
56 | | 12 IDTM (Decision table)   14.46
57 | | 13 Naive-Bayes             16.12
58 | | 14 Nearest-neighbor (1)    21.42
59 | | 15 Nearest-neighbor (3)    20.35
60 | | 16 OC1                     15.04
61 | | 17 Pebls                   Crashed. Unknown why (bounds WERE increased)
62 | |
63 | | Conversion of original data as follows:
64 | | 1. Discretized agrossincome into two ranges with threshold 50,000.
65 | | 2. Convert U.S. to US to avoid periods.
66 | | 3. Convert Unknown to "?"
67 | | 4. Run MLC++ GenCVFiles to generate data,test.
68 | |
69 | | Description of fnlwgt (final weight)
70 | |
71 | | The weights on the CPS files are controlled to independent estimates of the
72 | | civilian noninstitutional population of the US. These are prepared monthly
73 | | for us by Population Division here at the Census Bureau. We use 3 sets of
74 | | controls.
75 | | These are:
76 | |   1. A single cell estimate of the population 16+ for each state.
77 | |   2. Controls for Hispanic Origin by age and sex.
78 | |   3. Controls by Race, age and sex.
79 | |
80 | | We use all three sets of controls in our weighting program and "rake" through
81 | | them 6 times so that by the end we come back to all the controls we used.
82 | |
83 | | The term estimate refers to population totals derived from CPS by creating
84 | | "weighted tallies" of any specified socio-economic characteristics of the
85 | | population.
86 | |
87 | | People with similar demographic characteristics should have
88 | | similar weights. There is one important caveat to remember
89 | | about this statement. That is that since the CPS sample is
90 | | actually a collection of 51 state samples, each with its own
91 | | probability of selection, the statement only applies within
92 | | state.
93 |
94 |
95 | >50K, <=50K.
96 |
97 | age: continuous.
98 | workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
99 | fnlwgt: continuous.
100 | education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
101 | education-num: continuous.
102 | marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
103 | occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
104 | relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
105 | race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
106 | sex: Female, Male.
107 | capital-gain: continuous.
108 | capital-loss: continuous.
109 | hours-per-week: continuous.
110 | native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
111 |
--------------------------------------------------------------------------------
/log/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
--------------------------------------------------------------------------------
/parameter/default.yml:
--------------------------------------------------------------------------------
1 | #########################################################################
2 | ## default parameters - used if no specific parameter set is given at
3 | ## program call
4 | #########################################################################
5 | basedir: .
6 | resultfile: ./data/zeroconf-result.csv
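# (Editor's note, added for clarity; not part of the original file.)
# A rough guide to how the three limits below interact, based on the
# zeroconf.py code above: the script samples up to max_sample_size rows,
# times each classifier once on that sample, and derives auto-sklearn's
# per_run_time_limit from the measured duration, capped by
# max_classifier_time_budget; memory_limit is used to size the process
# pool and is passed to each auto-sklearn training worker.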
7 | memory_limit: 15000 # MB
8 | max_classifier_time_budget: 1200 # seconds; 10 minutes is usually more than enough
9 | max_sample_size: 100000 # so that the classifiers' fit method completes in a reasonable time
10 | # definition of dataset fields
11 | id_field: cust_id # has to be ignored by the analysis
12 | target_field: category # name of the field with the target variable
13 |
--------------------------------------------------------------------------------
/parameter/logger.yml:
--------------------------------------------------------------------------------
1 | ################################################################
2 | ## Logger setup parameter
3 | ################################################################
4 | ---
5 | version: 1
6 | disable_existing_loggers: False
7 | formatters:
8 |     simple:
9 |         format: "%(asctime)s - [ZEROCONF] - %(name)s - %(levelname)s - %(message)s"
10 |
11 | handlers:
12 |     console:
13 |         class: logging.StreamHandler
14 |         level: DEBUG
15 |         formatter: simple
16 |         stream: ext://sys.stdout
17 |
18 |     info_file_handler:
19 |         class: logging.handlers.RotatingFileHandler
20 |         level: INFO
21 |         formatter: simple
22 |         filename: ./log/info.log
23 |         maxBytes: 10485760 # 10MB
24 |         backupCount: 20
25 |         encoding: utf8
26 |
27 |     error_file_handler:
28 |         class: logging.handlers.RotatingFileHandler
29 |         level: ERROR
30 |         formatter: simple
31 |         filename: ./log/errors.log
32 |         maxBytes: 10485760 # 10MB
33 |         backupCount: 20
34 |         encoding: utf8
35 |
36 | loggers:
37 |     my_module:
38 |         level: DEBUG
39 |         handlers: [console]
40 |         propagate: no
41 |
42 | root:
43 |     level: INFO
44 |     handlers: [console, info_file_handler, error_file_handler]
--------------------------------------------------------------------------------
/parameter/standard.yml:
--------------------------------------------------------------------------------
1 | basedir: .
2 | resultfile: ./data/zeroconf-result.csv
3 | memory_limit: 15000 # MB
4 | max_classifier_time_budget: 1200 # seconds; 10 minutes is usually more than enough
5 | max_sample_size: 100000 # so that the classifiers' fit method completes in a reasonable time
6 | # definition of dataset fields
7 | id_field: cust_id # has to be ignored by the analysis
8 | target_field: category # name of the field with the target variable
9 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.16.2
2 | bunch
3 | psutil
4 | tables
5 | ruamel.yaml
6 | cython
7 | auto-sklearn==0.3.0
--------------------------------------------------------------------------------
/work/.gitignore:
--------------------------------------------------------------------------------
1 | # Ignore everything in this directory
2 | *
3 | # Except this file
4 | !.gitignore
--------------------------------------------------------------------------------
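Appendix (editor's sketch, not part of the repository): the parameter files
above expect the input to be a dataframe with an id column ("cust_id"), a
target column ("category") that is filled for known rows and empty for the
rows to be predicted, plus arbitrary feature columns, stored as HDF5 like
data/Adult.h5. A minimal way to produce a compatible file with pandas could
look as follows; the output file name and the HDF5 key are illustrative
assumptions, not values taken from the repository scripts.

    import numpy as np
    import pandas as pd

    # Toy dataframe in the shape the parameter files describe: rows with a
    # 1/0 target form the training ("known") set, rows with NaN in the
    # target are the ones zeroconf.py will predict.
    df = pd.DataFrame({
        "cust_id": np.arange(100),                             # id_field in parameter/*.yml
        "age": np.random.randint(18, 90, size=100),            # example feature
        "hours_per_week": np.random.randint(1, 80, size=100),  # example feature
        "category": [1.0, 0.0] * 40 + [np.nan] * 20,           # target_field; NaN = predict
    })
    df.to_hdf("./data/MyData.h5", key="data", mode="w")        # key name is an assumption

A run would then point zeroconf.py at the new HDF5 file and, per
parameter/default.yml, write its predictions to ./data/zeroconf-result.csv.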