## Machine Learning in Production

This repo is intended to stimulate discussions about how to use machine learning in
production (system components, processes, challenges/pitfalls etc.).
It may lead to some kind of "best practices" blog post or paper eventually.
Discussions/comments are welcome from anyone via github issues.


## Initial ideas/outline

The following figure summarizes the components/processes involved:

![img](https://raw.githubusercontent.com/szilard/MLprod-1slide/master/MLprod-1slide.png)

![License: CC BY 4.0](https://licensebuttons.net/l/by/4.0/80x15.png)


### Historical data

Can be in a database, csv file(s), a data warehouse, HDFS


### Feature engineering

On typical structured/tabular business data it can involve joins and aggregates (e.g. how many clicks from
a given user in a given time period)

This "ETL" is heavy processing, not suited for operational systems (e.g. MySQL); it usually runs
in an "analytical" database (Vertica, Redshift) or maybe Spark

Figuring out good features is trial-and-error/iterative/researchy/exploratory/time consuming (as is, in general,
the whole upper part of the figure above, i.e. FE, model training and evaluation)

Categorical variables: some modeling tools require transformation to numeric (e.g. one-hot encoding)


### Training, tuning

The result of feature engineering is a "data matrix" with features and labels (in the case of supervised
learning)

This data is usually smaller and most often does not require distributed systems

The algos with the best performance are usually: gradient boosting (GBM), random forests,
neural networks (and deep learning), support vector machines (SVM)

In certain cases (sparse data, model interpretability required) linear models must be
used (e.g. logistic regression)

There are good open source tools for all of this (R packages, Python sklearn, xgboost, VW, H2O etc.)

The name of the game is avoiding overfitting (using techniques such as regularization)

Unbiased evaluation is also needed, see the next point

Models can be tuned by searching the hyperparameter space (grid or random search, Bayesian optimization methods etc.)

Performance can often be increased further by ensembling several models (averaging, stacking etc.),
but there are drawbacks/tradeoffs (increased complexity in deploying such models)


### Model evaluation

This is super-important, spend a lot of time here

Unbiased evaluation with a test set, cross validation (some algos have "early stopping" requiring a validation set)

If you did hyperparameter tuning, that also needed a separate validation set (or cross validation)

The real world is non-stationary, use a time-gapped test set

Diagnostics: distribution of probability scores, ROC curves etc.

Also do evaluation using relevant business metrics (impact of the model in business terms)
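To make the training/tuning/evaluation steps above a bit more concrete, here is a minimal Python/scikit-learn
sketch; the file name, column names, dates and parameter grid are all hypothetical, and the data matrix is
assumed to have already been produced by the feature engineering step:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

# Hypothetical data matrix from the FE step: numeric features, a binary
# label "churned" and a timestamp column "event_time".
df = pd.read_csv("data_matrix.csv", parse_dates=["event_time"])
features = [c for c in df.columns if c not in ("churned", "event_time")]

# Time-gapped split: train on older data, test on newer data, with a gap
# in between to mimic the delay between training and scoring in production.
train = df[df["event_time"] < "2016-01-01"]
test = df[df["event_time"] >= "2016-02-01"]

# Hyperparameter tuning with cross validation on the training data only.
grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    scoring="roc_auc", cv=5)
grid.fit(train[features], train["churned"])

# Unbiased evaluation on the held-out, time-gapped test set.
scores = grid.predict_proba(test[features])[:, 1]
print("test AUC:", roc_auc_score(test["churned"], scores))
print(pd.Series(scores).describe())  # quick look at the score distribution
```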

### Model deployment

Scoring of live data

Usually considered an "engineering" task (thrown over a "wall" from data scientists to software engineers)

Use the same tool to deploy, do not rewrite the model in another "language" or tool (SQL, PMML, Java, C++, a custom
format such as JSON), unless the export is done by the same tool/vendor doing the training (high
risk of subtle bugs in edge cases)

Different servers (training requires more CPU/RAM; scoring requires low latency, high availability, maybe
scalability)

Live data comes from a different system, so often the FE needs to be replicated (duplicate code is evil,
but may be unavoidable); transformations/data cleaning already applied to the historical data might need to be
duplicated here as well

Scoring can be batch (easier: read from a database, score and write the results back to the database) or
real-time (the modern way to do it is via an http REST API providing a separation of concerns)

Better IMO if the data science team owns this part as well (along with as much as possible of the lower
part of the figure above, possibly with some engineering support)
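As one possible shape for the real-time option above, here is a minimal sketch of an HTTP scoring service
using Flask; the model file, feature names and port are hypothetical, and a production setup would add input
validation, logging, a proper WSGI server etc.:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the exact model object produced by the training tool (e.g. saved with
# joblib), rather than a re-implementation in another language.
model = joblib.load("model.joblib")
FEATURES = ["n_clicks_7d", "n_purchases_30d", "days_since_signup"]  # hypothetical

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                   # e.g. {"n_clicks_7d": 12, ...}
    x = pd.DataFrame([payload], columns=FEATURES)  # same feature order as in training
    prob = float(model.predict_proba(x)[0, 1])
    return jsonify({"score": prob})

if __name__ == "__main__":
    app.run(port=5000)
```

Loading the trained model object directly is in line with the "use the same tool to deploy" point above.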

### Taking action

The primary goal of an ML system in a company is to provide some business value
(happy customers, $$$ etc.)

Taking action probably must be owned by the engineering team (so the "wall" moves around here?)

Ability to test live/roll out gradually (A/B testing of models)


### Evaluate & monitor

Models might behave differently in production vs train-test (non-stationarity, changed
conditions, wrong assumptions, bugs etc.)

Crucial to evaluate the models after deployment

Evaluation based on ML metrics (distribution of scores etc.) and business metrics (impact of
taking action)

Evaluation after deployment and continuous monitoring subsequently (dashboards and alerts),
to detect if something external changes/breaks it; models can also slowly degrade over time
(a minimal monitoring sketch is included at the end of this README)

This too should be owned by the data science team (it has the expertise to compare with the model
developed offline)


### Misc

ML creates tight couplings, which is considered evil from an engineering perspective

Some of the problems are identified in [this paper](http://research.google.com/pubs/pub43146.html),
although no silver bullet solutions exist at the moment (keep them
in mind/mitigate as much as possible though)

Some ideas for a framework are
[here](http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/51731)
(also described
[here](https://medium.com/@HarlanH/insights-from-a-predictive-model-pipeline-abstraction-c8b47fd406da))

Example couplings: FE to data schemas (which can change upstream), duplicated FE in scoring,
action taking coupled to lots of engineering/business domain

ML needs to be "sold" to the business side (management/business units in the application domain
of each ML product)

Involving the business in ML's inner workings and showing business impact on an ongoing basis
(reports, dashboards, alerts etc.) can help trust/buy-in


### Learn & improve

Iterate over all the components, learn from the experience of using it in practice (e.g. incorporate
ideas from the business, add new features to the FE, retrain models if performance degrades over time etc.)

For iterations to be fast, as much of the above as possible should use tools that facilitate automation/reproducibility
(e.g. RStudio + R Markdown/Jupyter notebooks, git, docker etc.)
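
### Appendix: monitoring sketch

As referenced in the "Evaluate & monitor" section above, here is a minimal sketch of monitoring the score
distribution for drift; the file names and the alert threshold are hypothetical, and a real setup would feed
this into dashboards/alerts rather than print statements:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Scores from the offline test set vs scores logged in production last week
# (hypothetical file names).
offline = pd.read_csv("test_set_scores.csv")["score"]
live = pd.read_csv("last_week_production_scores.csv")["score"]

# Compare the two score distributions (summary stats plus a two-sample KS test).
print(offline.describe())
print(live.describe())
stat, p_value = ks_2samp(offline, live)
if stat > 0.1:  # arbitrary threshold, tune to the application
    print("ALERT: production score distribution drifted (KS=%.3f)" % stat)
```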