## Machine Learning in Production

This repo is intended to stimulate discussions about how to use machine learning in
production (system components, processes, challenges/pitfalls etc.).
It may lead to some kind of "best practices" blog post or paper eventually.
Discussions/comments are welcome from anyone via github issues.


## Initial ideas/outline

The following figure summarizes the components/processes involved:

![img](https://raw.githubusercontent.com/szilard/MLprod-1slide/master/MLprod-1slide.png)

![License: CC BY 4.0](https://licensebuttons.net/l/by/4.0/80x15.png)


### Historical data

Can be in a database, csv file(s), a data warehouse, HDFS


### Feature engineering

On typical structured/tabular business data it can involve joins and aggregates (e.g. how many clicks from
a given user in a given time period)

This "ETL" is heavy processing, not suited for operational systems (e.g. MySQL); it usually runs
in an "analytical" database (Vertica, Redshift) or maybe Spark

Figuring out good features is trial-and-error/iterative/researchy/exploratory/time consuming (as is, in general,
the whole upper part of the figure above, i.e. FE, model training and evaluation)

Categorical variables: some modeling tools require transformation to numeric (e.g. one-hot encoding)


### Training, tuning

The result of feature engineering is a "data matrix" with features and labels (in the case of supervised
learning)

This data is usually smaller and most often does not require distributed systems

The algos with the best performance are usually: gradient boosting (GBM), random forests,
neural networks (and deep learning), support vector machines (SVM)

In certain cases (sparse data, model interpretability required) linear models must be
used (e.g. logistic regression)

There are good open source tools for all of this (R packages, Python sklearn, xgboost, VW, H2O etc.)

The name of the game is avoiding overfitting (using techniques such as regularization)

Unbiased evaluation is also needed, see the next point

Models can be tuned by searching the hyperparameter space (grid or random search, Bayesian optimization methods etc.)

Performance can often be increased further by ensembling several models (averaging, stacking etc.),
but there are drawbacks/tradeoffs (increased complexity in deploying such models)


### Model evaluation

This is super-important, spend a lot of time here

Unbiased evaluation with a test set, cross validation (some algos have "early stopping" requiring a validation set)

If you did hyperparameter tuning, that also needed a separate validation set (or cross validation)

The real world is non-stationary, use a time-gapped test set

Diagnostics: distribution of probability scores, ROC curves etc.

Also do evaluation using relevant business metrics (impact of the model in business terms)
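To make the training/tuning/evaluation steps above a bit more concrete, here is a minimal Python/scikit-learn
sketch; the file name, column names, dates and parameter grid are all hypothetical, and the data matrix is
assumed to have already been produced by the feature engineering step:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Hypothetical data matrix from the FE step: numeric features, a binary
# label "churned" and a timestamp column "event_time".
df = pd.read_csv("data_matrix.csv", parse_dates=["event_time"])
features = [c for c in df.columns if c not in ("churned", "event_time")]

# Time-gapped split: train on older data, test on newer data, with a gap
# in between to mimic the delay between training and scoring in production.
train = df[df["event_time"] < "2016-01-01"]
test = df[df["event_time"] >= "2016-02-01"]

# Hyperparameter tuning with cross validation on the training data only.
grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    scoring="roc_auc", cv=5)
grid.fit(train[features], train["churned"])

# Unbiased evaluation on the held-out, time-gapped test set.
scores = grid.predict_proba(test[features])[:, 1]
print("test AUC:", roc_auc_score(test["churned"], scores))
print(pd.Series(scores).describe())  # quick look at the score distribution
```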

### Model deployment

Scoring of live data

Usually considered an "engineering" task (thrown over a "wall" from data scientists to software engineers)

Use the same tool to deploy, do not rewrite the model in another "language" or tool (SQL, PMML, Java, C++, a custom
format such as JSON), unless the export is done by the same tool/vendor doing the training (high
risk of subtle bugs in edge cases)

Different servers (training requires more CPU/RAM; scoring requires low latency, high availability, maybe
scalability)

Live data comes from a different system, so often the FE needs to be replicated (duplicate code is evil,
but may be unavoidable); transformations/data cleaning already applied to the historical data might need to be
duplicated here as well

Scoring can be batch (easier: read from a database, score and write the results back to the database) or
real-time (the modern way to do it is via an http REST API providing a separation of concerns)

Better IMO if the data science team owns this part as well (along with as much as possible of the lower
part of the figure above, possibly with some engineering support)
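As one possible shape for the real-time option above, here is a minimal sketch of an HTTP scoring service
using Flask; the model file, feature names and port are hypothetical, and a production setup would add input
validation, logging, a proper WSGI server etc.:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the exact model object produced by the training tool (e.g. saved with
# joblib), rather than a re-implementation in another language.
model = joblib.load("model.joblib")
FEATURES = ["n_clicks_7d", "n_purchases_30d", "days_since_signup"]  # hypothetical

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                   # e.g. {"n_clicks_7d": 12, ...}
    x = pd.DataFrame([payload], columns=FEATURES)  # same feature order as in training
    prob = float(model.predict_proba(x)[0, 1])
    return jsonify({"score": prob})

if __name__ == "__main__":
    app.run(port=5000)
```

Loading the trained model object directly is in line with the "use the same tool to deploy" point above.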

### Taking action

The primary goal of an ML system in a company is to provide some business value
(happy customers, $$$ etc.)

Taking action probably must be owned by the engineering team (so the "wall" moves around here?)

Ability to test live/roll out gradually (A/B testing of models)


### Evaluate & monitor

Models might behave differently in production vs train-test (non-stationarity, changed
conditions, wrong assumptions, bugs etc.)

Crucial to evaluate the models after deployment

Evaluation based on ML metrics (distribution of scores etc.) and business metrics (impact of
taking action)

Evaluation after deployment and continuous monitoring subsequently (dashboards and alerts),
to detect if something external changes/breaks it; models can also slowly degrade over time
(a minimal monitoring sketch is included at the end of this README)

This too should be owned by the data science team (it has the expertise to compare with the model
developed offline)


### Misc

ML creates tight couplings, which is considered evil from an engineering perspective

Some of the problems are identified in [this paper](http://research.google.com/pubs/pub43146.html),
although no silver bullet solutions exist at the moment (keep them
in mind/mitigate as much as possible though)

Some ideas for a framework are
[here](http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/51731)
(also described
[here](https://medium.com/@HarlanH/insights-from-a-predictive-model-pipeline-abstraction-c8b47fd406da))

Example couplings: FE to data schemas (which can change upstream), duplicated FE in scoring,
action taking coupled to lots of engineering/business domain

ML needs to be "sold" to the business side (management/business units in the application domain
of each ML product)

Involving the business in ML's inner workings and showing business impact on an ongoing basis
(reports, dashboards, alerts etc.) can help trust/buy-in


### Learn & improve

Iterate over all the components, learn from the experience of using it in practice (e.g. incorporate
ideas from the business, add new features to the FE, retrain models if performance degrades over time etc.)

For iterations to be fast, as much of the above as possible should use tools that facilitate automation/reproducibility
(e.g. RStudio + R Markdown/Jupyter notebooks, git, docker etc.)
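
### Appendix: monitoring sketch

As referenced in the "Evaluate & monitor" section above, here is a minimal sketch of monitoring the score
distribution for drift; the file names and the alert threshold are hypothetical, and a real setup would feed
this into dashboards/alerts rather than print statements:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Scores from the offline test set vs scores logged in production last week
# (hypothetical file names).
offline = pd.read_csv("test_set_scores.csv")["score"]
live = pd.read_csv("last_week_production_scores.csv")["score"]

# Compare the two score distributions (summary stats plus a two-sample KS test).
print(offline.describe())
print(live.describe())
stat, p_value = ks_2samp(offline, live)
if stat > 0.1:  # arbitrary threshold, tune to the application
    print("ALERT: production score distribution drifted (KS=%.3f)" % stat)
```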