├── .gitignore ├── LICENSE ├── README.md ├── blogs ├── blog-20190813-d6tflow-pytorch.html ├── blog-20190813-d6tflow-pytorch.rmd ├── blog-20200426-shapley-report.ipynb ├── blog-20200426-shapley.ipynb ├── datasci-dags-airflow-meetup.md ├── datasci-projects-e2e.md ├── design-ml-e2e.md ├── effective-datasci-workflows.rst ├── images │ ├── d6tflow-filenames.png │ ├── top10-stats-example2.png │ └── top10-stats-example3.png ├── reasons-why-bad-ml-code.rst ├── top10-mistakes-business.md ├── top10-mistakes-coding.md ├── top10-mistakes-statistics.md └── top5-mistakes-vendors.md └── overview.png /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *.csv 3 | *.pq 4 | *.json 5 | data/ 6 | tests/.creds.yml 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | share/python-wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | MANIFEST 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | 59 | # Translations 60 | *.mo 61 | *.pot 62 | 63 | # Django stuff: 64 | *.log 65 | local_settings.py 66 | db.sqlite3 67 | 68 | # Flask stuff: 69 | instance/ 70 | .webassets-cache 71 | 72 | # Scrapy stuff: 73 | .scrapy 74 | 75 | # Sphinx documentation 76 | docs/_build/ 77 | 78 | # PyBuilder 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | .python-version 90 | 91 | # celery beat schedule file 92 | celerybeat-schedule 93 | 94 | # SageMath parsed files 95 | *.sage.py 96 | 97 | # Environments 98 | .env 99 | .venv 100 | env/ 101 | venv/ 102 | ENV/ 103 | env.bak/ 104 | venv.bak/ 105 | 106 | # Spyder project settings 107 | .spyderproject 108 | .spyproject 109 | 110 | # Rope project settings 111 | .ropeproject 112 | 113 | # mkdocs documentation 114 | /site 115 | 116 | # mypy 117 | .mypy_cache/ 118 | .dmypy.json 119 | dmypy.json 120 | 121 | # Pyre type checker 122 | .pyre/ 123 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 d6t 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Accelerate Data Science 2 | 3 | ## Databolt python libraries 4 | 5 | For data scientists and data engineers, DataBolt is a collection of python-based libraries and products to reduce the time it takes to get your data ready for analysis and collaborate with others. 6 | 7 | Majority of time in data science is spent on tedious tasks unrelated to data analysis. DataBolt simplifies those tasks so you can experience up to 10x productivity gains. 8 | 9 | ![Databolt Workflow](overview.png "Databolt Workflow") 10 | 11 | * **manage data workflows**: quickly build highly effective data science workflows 12 | * **push/pull data**: quickly get and share data files like code 13 | * **import data**: quickly ingest messy raw CSV and XLS files to pandas, SQL and more 14 | * **join data**: quickly combine multiple datasets using fuzzy joins 15 | 16 | The libraries are modularized so you can use them individually but they work well together to improve your entire data workflow. 17 | 18 | 19 | ## [Manage data workflows](https://github.com/d6t/d6tflow) 20 | 21 | Easily manage data workflows including complex dependencies and parameters. With d6tflow you can easily chain together complex data flows and intelligently execute them. You can quickly load input and output data for each task. It makes your workflow very clear and intuitive. 22 | 23 | ### What can it do? 24 | 25 | * Build a data workflow made up of tasks with dependencies and parameters 26 | * Intelligently rerun workflow after changing parameters, code or data 27 | * Quickly load task input and output data without manual work 28 | 29 | Learn more at [https://github.com/d6t/d6tflow](https://github.com/d6t/d6tflow) 30 | 31 | 32 | ## [Push/Pull Data](https://github.com/d6t/d6tpipe) 33 | 34 | d6tpipe is a python library which makes it easier to exchange data. It's like git for data! But better because you can include it in your data science code. 35 | 36 | ### What can it do? 37 | 38 | * Quickly create public and private remote file storage on AWS S3 and ftp 39 | * Push/pull data to/from remote file storage to sync files and share with others 40 | * Add schema information so data can be loaded quickly 41 | 42 | Learn more at [https://github.com/d6t/d6tpipe](https://github.com/d6t/d6tpipe) 43 | 44 | 45 | ## [Ingest Data](https://github.com/d6t/d6tstack) 46 | 47 | Quickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and Pandas. d6tstack solves many performance and other problems typically encountered when ingesting raw files. 48 | 49 | ### What can it do? 
50 | 51 | * Fast pd.to_sql() for postgres and mysql 52 | * Check and fix schema problems like added/missing/renamed columns 53 | * Load and process messy Excel files 54 | 55 | Learn more at [https://github.com/d6t/d6tstack](https://github.com/d6t/d6tstack) 56 | 57 | 58 | ## [Join Data](https://github.com/d6t/d6tjoin) 59 | 60 | Easily join different datasets without writing custom code using fuzzy matches. Does similarity joins on strings, dates and numbers. For example you can quickly join similar but not identical stock tickers, addresses, names and dates without manual processing. 61 | 62 | ### What can it do? 63 | 64 | * Identify and diagnose join problems 65 | * Best match fuzzy joins on strings and dates 66 | * Best match substring joins 67 | 68 | Learn more at [https://github.com/d6t/d6tjoin](https://github.com/d6t/d6tjoin) 69 | 70 | 71 | ## [Blog](http://blog.databolt.tech) 72 | 73 | We encourage you to join the Databolt blog to get updates and tips+tricks [http://blog.databolt.tech](http://blog.databolt.tech) 74 | 75 | 76 | ## [About](https://www.databolt.tech) 77 | 78 | [https://www.databolt.tech](https://www.databolt.tech) 79 | 80 | For questions or comments contact: support-at-databolt.tech 81 | -------------------------------------------------------------------------------- /blogs/blog-20190813-d6tflow-pytorch.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "5 Step Guide to Scalable Deep Learning Pipelines with d6tflow" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(knitr) 9 | library(reticulate) 10 | library(kableExtra) 11 | 12 | setwd("d:/dev/blogs-source/dlrm/") 13 | source_python("flow_tasks.py") 14 | 15 | ``` 16 | 17 | *How to turn a typical pytorch script into a scalable d6tflow DAG for faster research & development* 18 | 19 | # Introduction: Why bother? 20 | 21 | Building deep learning models typically involves complex data pipelines as well as a lot of trial and error, tweaking model architecture and parameters whose performance needs to be compared. It is often difficult to keep track of all the experiments, leading at best to confusion and at worst wrong conclusions. 22 | 23 | In [4 reasons why your ML code is bad](https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html) we explored how to organize ML code as DAG workflows to solve that problem. In this guide we will go through a practical case study on turning an existing pytorch script into a scalable deep learning pipeline with [d6tflow](https://github.com/d6t/d6tflow). The starting point is [a pytorch deep recommender model by Facebook](https://github.com/facebookresearch/dlrm) and we will go through the 5 steps of migrating the code into a scalable deep learning pipeline. The steps below are written in partial pseudo code to illustrate concepts, the full code is available also, see instructions at the end of the article. 24 | 25 | Lets get started! 26 | 27 | ## Step 1: Plan your DAG 28 | 29 | To plan your work and help others understand how your pipeline fits together, you want to start by thinking about the data flow, dependencies between tasks and task parameters. This helps you organize your workflow into logical components. You might want to draw a diagram such as this 30 | 31 | ![](https://github.com/d6t/d6tflow/raw/master/docs/d6tflow-docs-graph.png?raw=true) 32 | 33 | Below is the pytorch model training DAG for FB DLRM. 
It shows the training task `TaskModelTrain` with all its dependencies and how the dependencies relate to each other. If you write functional code it is difficult to see how your workflow fits together like this. 34 | 35 | ```{python} 36 | task = TaskModelTrain() 37 | print(d6tflow.preview(task, clip_params=True)) 38 | 39 | ``` 40 | 41 | ## Step 2: Write Tasks instead of functions 42 | 43 | Data science code is typically organized in functions, which leads to a lot of problems as explained in [4 reasons why your ML code is bad](https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html). Instead you want to write d6tflow tasks. The benefits are that you can: 44 | 45 | * chain tasks into a DAG so that required dependencies run automatically 46 | * easily load task input data from dependencies 47 | * easily save task output such as preprocessed data and trained models. That way you don't accidentally rerun long-running training tasks 48 | * parameterize tasks so they can be intelligently managed (see next step) 49 | * save output to [d6tpipe](https://github.com/d6t/d6tpipe) to separate data from code and easily share the data, see [Top 10 Coding Mistakes Made by Data Scientists](https://www.kdnuggets.com/2019/04/top-10-coding-mistakes-data-scientists.html) 50 | 51 | Here is what the before/after looks like for the FB DLRM code after you convert functional code into d6tflow tasks. 52 | 53 | Typical pytorch functional code that does not scale well: 54 | 55 | ```{python, echo=TRUE, eval = FALSE} 56 | # ***BEFORE*** 57 | # see dlrm_s_pytorch.py 58 | 59 | def train_model(): 60 | data = loadData() 61 | dlrm = DLRM_Net([...]) 62 | model = dlrm.train(data) 63 | torch.save(model, 'model.pickle') 64 | 65 | if __name__ == "__main__": 66 | 67 | parser.add_argument("--load-model") 68 | if load_model: 69 | model = torch.load('model.pickle') 70 | else: 71 | model = train_model() 72 | ``` 73 | 74 | 75 | Same logic written using scalable d6tflow tasks: 76 | 77 | ```{python, echo=TRUE, eval = FALSE} 78 | # ***AFTER*** 79 | # see flow_tasks.py 80 | 81 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 82 | 83 | def requires(self): # define dependencies 84 | return {'data': TaskPrepareData(), 'model': TaskBuildNetwork()} 85 | 86 | def run(self): 87 | data = self.input()['data'].load() # easily load input data 88 | dlrm = self.input()['model'].load() 89 | model = dlrm.train(data) 90 | self.save(model) # easily save trained model as pickle 91 | 92 | 93 | if __name__ == "__main__": 94 | if TaskModelTrain().complete(): # load output if task was run 95 | model = TaskModelTrain().output().load() 96 | 97 | ``` 98 | 99 | 100 | ## Step 3: Parameterize tasks 101 | 102 | To improve model performance, you will try different models, parameters and preprocessing settings. To keep track of all this, you can add parameters to tasks. That way you can: 103 | 104 | * keep track of which models have been trained with which parameters 105 | * intelligently rerun tasks as parameters change 106 | * help others understand where in the workflow parameters are introduced 107 | 108 | Below sets up the FB DLRM model training task with parameters. Note how you no longer have to manually specify where to save the trained model and data. 
109 | 110 | ```{python, echo=TRUE, eval = FALSE} 111 | # ***BEFORE*** 112 | # dlrm_s_pytorch.py 113 | 114 | if __name__ == "__main__": 115 | # define model parameters 116 | parser.add_argument("--learning-rate", type=float, default=0.01) 117 | parser.add_argument("--nepochs", type=int, default=1) 118 | # manually specify filename 119 | parser.add_argument("--save-model", type=str, default="") 120 | model = train_model() 121 | torch.save(model, args.save_model) 122 | 123 | # ***AFTER*** 124 | # see flow_tasks.py 125 | 126 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 127 | 128 | # define model parameters 129 | learning_rate = luigi.FloatParameter(default = 0.01) 130 | num_epochs = luigi.IntParameter(default = 1) 131 | # filename is determined automatically 132 | 133 | def run(self): 134 | data = self.input()['data'].load() 135 | dlrm = self.input()['model'].load() 136 | 137 | # use learning_rate param 138 | optimizer = torch.optim.SGD(dlrm.parameters(), lr=self.learning_rate) 139 | # use num_epochs param 140 | while k < self.num_epochs: 141 | optimizer.step() 142 | model = optimizer.get_model() 143 | self.save(model) # automatically save model, seperately for each parameter config 144 | 145 | ``` 146 | 147 | ### Compare trained models 148 | 149 | Now you can use that parameter to easily compare output from different models. Make sure you run the workflow with that parameter before you load task output (see Step #4). 150 | 151 | ```{python, eval = FALSE} 152 | model1 = TaskModelTrain().output().load() # use default num_epochs=1 153 | print_accuracy(model1) 154 | model2 = TaskModelTrain(num_epochs=10).output().load() 155 | print_accuracy(model2) 156 | 157 | ``` 158 | 159 | 160 | ### Inherit parameters 161 | 162 | Often you need to have a parameter cascade downstream through the workflow. If you write functional code, you have to keep repeating the parameter in each function. With d6tflow you can inherit parameters so the terminal task can pass the parameter to upstream tasks as needed. 163 | 164 | In the FB DLRM workflow, `TaskModelTrain` inherits parameters from `TaskGetTrainDataset`. This way you can run `TaskModelTrain(mini_batch_size=2)` and it will pass the parameter to upstream tasks ie `TaskGetTrainDataset` and all other tasks that depend on it. In the actual code, note the use of `self.clone(TaskName)` and `@d6tflow.clone_parent`. 165 | 166 | ```{python, echo=TRUE, eval = FALSE} 167 | 168 | class TaskGetTrainDataset(d6tflow.tasks.TaskPickle): 169 | mini_batch_size = luigi.FloatParameter(default = 1) 170 | # [...] 171 | 172 | @d6tflow.inherit(TaskGetTrainDataset) 173 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 174 | # no need to repeat parameters 175 | pass 176 | 177 | ``` 178 | 179 | ## Step 4: Run DAG to process data and train model 180 | 181 | To kick off data processing and model training, you run the DAG. You only need to run the terminal task which automatically runs all dependencies. Before actually running the DAG, you can preview what will be run. This is especially helpful if you have made any changes to code or data because it will only run the tasks that have changed not the full workflow. 
182 | 183 | ```{python, eval = FALSE} 184 | task = TaskModelTrain() # or task = TaskModelTrain(num_epochs=10) 185 | d6tflow.preview(task) 186 | d6tflow.run(task) 187 | 188 | ``` 189 | 190 | 191 | ## Step 5: Evaluate model performance 192 | 193 | Now that the workflow has run and all tasks are complete, you can load predictions and other model output to compare and visualize results. Because each task knows where its output is saved, you can directly load output from the task instead of having to remember the file paths or variable names. It also makes your code a lot more readable. 194 | 195 | ```{python, eval = FALSE} 196 | model1 = TaskModelTrain().output().load() 197 | print_accuracy(model1) 198 | 199 | ``` 200 | 201 | ### Compare models 202 | 203 | You can easily compare output from different models with different parameters. 204 | 205 | ```{python, eval = FALSE} 206 | model1 = TaskModelTrain().output().load() # use default num_epochs=1 207 | print_accuracy(model1) 208 | model2 = TaskModelTrain(num_epochs=10).output().load() 209 | print_accuracy(model2) 210 | 211 | ``` 212 | 213 | ### Keep iterating 214 | 215 | As you iterate, changing parameters, code and data, you will want to rerun tasks. d6tflow intelligently figures out which tasks need to be rerun, which makes iterating very efficient. If you have changed parameters, you don't need to do anything; it will know what to run automatically. If you have changed code or data, you have to mark the task as incomplete using `.invalidate()` and d6tflow will figure out the rest. 216 | 217 | In the FB DLRM workflow, say for example you changed training data or made changes to the training preprocessing. 218 | 219 | ```{python, eval = FALSE} 220 | 221 | TaskGetTrainDataset().invalidate() 222 | 223 | # or 224 | d6tflow.run(task, forced=TaskGetTrainDataset()) 225 | 226 | ``` 227 | 228 | ## Full source code 229 | 230 | All code is provided at https://github.com/d6tdev/dlrm. It is the same as https://github.com/facebookresearch/dlrm with d6tflow files added: 231 | 232 | * flow_run.py: run flow => run this file 233 | * flow_task.py: task code 234 | * flow_viz.py: show model output 235 | * flow_cfg.py: default parameters 236 | * dlrm_d6t_pytorch.py: dlrm_data_pytorch.py adapted for d6tflow 237 | 238 | Try it yourself! 239 | 240 | ## For your next project 241 | 242 | In this guide we showed how to build scalable deep learning workflows. We used an existing code base and showed how to turn linear deep learning code into d6tflow DAGs and the benefits of doing so. 243 | 244 | For new projects, you can start with a scalable project template from https://github.com/d6t/d6tflow-template. The structure is very similar: 245 | 246 | * run.py: run workflow 247 | * task.py: task code 248 | * cfg.py: manage parameters -------------------------------------------------------------------------------- /blogs/datasci-dags-airflow-meetup.md: -------------------------------------------------------------------------------- 1 | # How to use airflow-style DAGs to build highly effective data science workflows 2 | 3 | Airflow and Luigi are great for data engineering production workflows but not optimized for data science r&d workflows. We will be using the d6tflow open source python library to bring airflow-style DAGs to the data science research and development process. 4 | 5 | ## Data science workflows are DAGs 6 | 7 | Data science workflows typically look like this. 
8 | 9 | ![Sample Data Workflow](https://github.com/d6t/d6tflow/blob/master/docs/d6tflow-docs-graph.png?raw=true "Sample Data Workflow") 10 | 11 | This workflow is similar to data engineering workflows. It involves chaining together parameterized tasks which pass multiple inputs and outputs between each other. See [4 Reasons Why Your Machine Learning Code is Probably Bad](https://github.com/d6t/d6t-python/blob/master/blogs/reasons-why-bad-ml-code.rst) for why passing data between functions or hardcoding file/database names without explicitly defining task dependencies is NOT a good way of writing data science code. 12 | 13 | ```python 14 | 15 | # bad data science code 16 | def process_data(data, do_preprocess): 17 | data = do_stuff(data, do_preprocess) 18 | data.to_pickle('data.pkl') 19 | 20 | data = pd.read_csv('data.csv') 21 | process_data(data, True) 22 | df_train = pd.read_pickle('data.pkl') 23 | model = sklearn.svm.SVC() 24 | model.fit(df_train.iloc[:,:-1], df_train['y']) 25 | 26 | ``` 27 | 28 | ## R&D vs production data workflows 29 | 30 | Using airflow or luigi is a big step up from writing functional code for managing data workflows. But both libraries are designed to be used by data engineers in production settings where the focus is on: 31 | * making sure everything is running smoothly on time 32 | * scheduling and coordination 33 | * recovering from failures 34 | * data quality 35 | 36 | In contrast, the focus in the r&d workflow is on: 37 | * generating insights 38 | * prototyping speed 39 | * assessing predictive power with different models and parameters 40 | * visualizing output 41 | 42 | As a result, the r&d workflow: 43 | * is less well defined 44 | * involves trial and error 45 | * requires frequent resetting of tasks and output as models, parameters and data change 46 | * takes output from the data engineer 47 | 48 | ## Problems with airflow/luigi in R&D settings 49 | 50 | Since both libraries are optimized for data engineering production settings, the UX for a data science r&d setting is not great: 51 | 52 | * WET code for reading/writing data 53 | * Having to manually keep track of filenames or database table names where data is saved 54 | * Inconvenient to reset tasks as models, parameters and data change 55 | * Inconvenient to keep track of model results with different parameter settings 56 | 57 | Manually keeping track of filenames in complex data workflows... Not scalable. 
58 | 59 | ```python 60 | 61 | # vendor input 62 | cfg_fpath_cc_base = cfg_fpath_base + 'vendor/' 63 | cfg_fpath_cc_raw = cfg_fpath_cc_base + 'df_cc_raw.pkl' 64 | cfg_fpath_cc_raw_recent2 = cfg_fpath_cc_base + 'df_cc_raw_recent2.pkl' 65 | cfg_fpath_cc_yoy = cfg_fpath_cc_base + 'df_cc_yoy.pkl' 66 | cfg_fpath_cc_yoy_bbg = cfg_fpath_cc_base + 'df_cc_yoy_bbg.pkl' 67 | cfg_fpath_cc_yoy_fds = cfg_fpath_cc_base + 'df_cc_yoy_fds.pkl' 68 | cfg_fpath_cc_var_fds = cfg_fpath_cc_base + 'df_cc_var_fds.pkl' 69 | cfg_fpath_cc_yoy_recent2 = cfg_fpath_cc_base + 'df_cc_yoy_recent2.pkl' 70 | cfg_fpath_cc_actual = cfg_fpath_cc_base + 'df_cc_sales_actual.pkl' 71 | cfg_fpath_cc_monthly = cfg_fpath_cc_base + 'df_cc_monthly.pkl' 72 | cfg_fpath_cc_yoy_cs2 = 'data/processed/df_cc_yoy_cs2.pq' # consistent shopper data for new methodology from 2018 73 | 74 | # market 75 | cfg_fpath_market_attributes_px = cfg_fpath_base + '/market/df_market_px.pkl' 76 | cfg_fpath_market_consensus = cfg_fpath_base + '/market/df_market_consensus.pkl' 77 | cfg_fpath_market_attributes = cfg_fpath_base + '/market/df_market_attributes.pkl' 78 | cfg_fpath_market_attributes_latest = cfg_fpath_base + '/market/df_market_attributes_latest.pkl' 79 | cfg_fpath_market_announce = cfg_fpath_base + '/market/df_market_announce.pkl' 80 | cfg_fpath_market_attributes_latest_fds1 = cfg_fpath_base + '/market/df_market_attributes_latest_fds1.pkl' 81 | cfg_fpath_market_attributes_latest_fds2 = cfg_fpath_base + '/market/df_market_attributes_latest_fds2.pkl' 82 | ``` 83 | 84 | ## How d6tflow is different from airflow/luigi 85 | 86 | d6tflow is optimized for data science research and development workflows. Here are the benefits of using d6tflow in data science. 87 | 88 | Example workflow: 89 | ``` 90 | TaskGetData >> TaskProcess >> TaskTrain 91 | ``` 92 | 93 | ### Benefit: Tasks have input and output data 94 | 95 | Instead of having to manually load and save data, this is outsourced to the library. This scales better and reduces maintenance because the location of input/output data could change without having to rewrite code. It also makes it easier for the data engineer to hand off data to the data scientist. 96 | 97 | ```python 98 | class TaskProcess(d6tflow.tasks.TaskPqPandas): # define output format 99 | 100 | def requires(self): 101 | return TaskGetData() # define dependency 102 | 103 | def run(self): 104 | data = self.input().load() # load input data 105 | data = do_stuff(data) # process data 106 | self.save(data) # save output data 107 | ``` 108 | 109 | ### Benefit: Easily invalidate tasks 110 | 111 | Common invalidation scenarios are implemented. This increases prototyping speed as you change code and data during trial & error. 112 | 113 | ```python 114 | # force execution including downstream tasks 115 | d6tflow.run(TaskTrain(), forced=TaskGetData()) 116 | 117 | # reset single task 118 | TaskGetData().invalidate() 119 | 120 | # reset all downstream tasks 121 | d6tflow.invalidate_downstream(TaskGetData(), TaskTrain()) 122 | 123 | # reset all upstream tasks 124 | d6tflow.invalidate_upstream(TaskTrain()) 125 | 126 | ``` 127 | 128 | ### Benefit: Easily train models using different parameters 129 | 130 | You can intelligently rerun the workflow after changing a parameter. Parameters are passed from the target task to the relevant upstream tasks. Thus, you no longer have to manually keep track of which tasks to update, increasing prototyping speed and reducing errors. 
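For reference, the `do_preprocess` parameter that shows up in the preview below is declared on the tasks themselves. Here is a minimal sketch of what that might look like, adapted from the `TaskProcess` example above; `TaskGetData`, `do_stuff()` and the SVC training step are the same placeholders used elsewhere in this post (the preview calls the middle task `TaskPreprocess`), so treat it as illustrative rather than exact project code.

```python
import d6tflow
import luigi
import sklearn.svm

class TaskPreprocess(d6tflow.tasks.TaskPqPandas):
    do_preprocess = luigi.BoolParameter(default=True)  # flow parameter

    def requires(self):
        return TaskGetData()  # dependency does not depend on the parameter

    def run(self):
        data = self.input().load()
        if self.do_preprocess:
            data = do_stuff(data)  # placeholder preprocessing step
        self.save(data)

class TaskTrain(d6tflow.tasks.TaskPickle):
    do_preprocess = luigi.BoolParameter(default=True)

    def requires(self):
        # pass the parameter down to the upstream dependency
        return TaskPreprocess(do_preprocess=self.do_preprocess)

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:, :-1], df_train['y'])
        self.save(model)  # output is stored separately for each parameter value
```

With the parameter declared on both tasks, the preview below shows that flipping `do_preprocess` only invalidates the tasks that actually depend on it.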
131 | 132 | ```python 133 | d6tflow.preview(TaskTrain(do_preprocess=False)) 134 | 135 | ''' 136 | └─--[TaskTrain-{'do_preprocess': 'False'} (PENDING)] 137 | └─--[TaskPreprocess-{'do_preprocess': 'False'} (PENDING)] 138 | └─--[TaskGetData-{} (COMPLETE)] => this doesn't change and doesn't need to rerun 139 | ''' 140 | ``` 141 | 142 | ### Benefit: Easily compare models 143 | 144 | Different models that were trained with different parameters can be easily loaded and compared. 145 | 146 | ```python 147 | df_train1 = TaskPreprocess().output().load() 148 | model1 = TaskTrain().output().load() 149 | print(sklearn.metrics.accuracy_score(df_train1['y'],model1.predict(df_train1.iloc[:,:-1]))) 150 | 151 | df_train2 = TaskPreprocess(do_preprocess=False).output().load() 152 | model2 = TaskTrain(do_preprocess=False).output().load() 153 | print(sklearn.metrics.accuracy_score(df_train2['y'],model2.predict(df_train2.iloc[:,:-1]))) 154 | 155 | ``` 156 | 157 | ## d6tflow Quickstart 158 | 159 | Here is a full example of how to use d6tflow for an ML workflow: 160 | https://github.com/d6t/d6tflow#example-output 161 | 162 | ## Template for scalable ML projects 163 | 164 | A d6tflow code template for real-life projects is available at 165 | https://github.com/d6t/d6tflow-template 166 | 167 | * Multiple task inputs and outputs 168 | * Parameter inheritance 169 | * Modularized tasks, run and viz 170 | 171 | 172 | ## Accelerate data engineer to data scientist hand-off 173 | 174 | To quickly share workflow output files, you can use [d6tpipe](https://github.com/d6t/d6tpipe). See [Sharing Workflows and Outputs](https://d6tflow.readthedocs.io/en/latest/collaborate.html). 175 | 176 | ```python 177 | import d6tflow.pipes 178 | 179 | d6tflow.pipes.init(api, 'pipe-name') # save flow output 180 | pipe = d6tflow.pipes.get_pipe() 181 | pipe.pull() 182 | 183 | class Task2(d6tflow.tasks.TaskPqPandas): 184 | 185 | def requires(self): 186 | return Task1() # define dependency 187 | 188 | def run(self): 189 | data = self.input().load() # load data from data engineer 190 | 191 | ``` 192 | 193 | Alternatively you can save outputs in a database using [d6tflow premium](https://pipe.databolt.tech/gui/request-premium/). 194 | 195 | ```python 196 | d6tflow2.db.init('postgresql+psycopg2://usr:pwd@localhost/db', 'schema_name') 197 | 198 | class Task1(d6tflow2.tasks.TaskSQLPandas): 199 | 200 | def run(self): 201 | df = pd.DataFrame() 202 | self.save(df) 203 | 204 | 205 | ``` 206 | 207 | Finally, the data scientist can inherit from tasks the data engineer has written to quickly load source data. 
208 | 209 | ```python 210 | import tasks_factors # import tasks written by data engineer 211 | import utils 212 | 213 | class Task1(tasks_factors.Task1): 214 | external = True # rely on data engineer to run 215 | 216 | def run(self): 217 | data = self.input().load() # load data from data engineer 218 | ``` 219 | 220 | ## Bonus: centralize (local) data science files 221 | 222 | https://github.com/d6t/d6tpipe 223 | 224 | ## Bonus: keeping credentials safe 225 | 226 | https://d6tpipe.readthedocs.io/en/latest/advremotes.html#keeping-credentials-safe 227 | 228 | https://alexwlchan.net/2016/11/you-should-use-keyring/ -------------------------------------------------------------------------------- /blogs/datasci-projects-e2e.md: -------------------------------------------------------------------------------- 1 | # Guide to managing data science projects from start to finish 2 | 3 | ## Problem formulation: 4 | 5 | ### Business context / requirements 6 | 7 | * target users: who are the target users of information? 8 | * economic BUYER: who is the economic BUYER? 9 | 10 | * user flow: what is the user flow currently? 11 | * user pain points: pain points? how do they solve it today? 12 | * improved workflow: where will the model be used? how does the user interact with the model? 13 | * actionable insights: what practical value can the model provide to the user? 14 | 15 | * business impact goals: how does the model improve business outcomes? 16 | * critical bottleneck: what’s the critical problem that needs to be solved to have max impact? 17 | * business metric: what business metric are we trying to impact? 18 | * baseline: business metric status quo? 19 | * ideal outcome: what’s the ideal outcome here? what goal should be met? 20 | * success measure: how do we measure success? 21 | 22 | 23 | ### Design goals 24 | 25 | * goals: goals of the project from a user perspective 26 | * tradeoffs: ? 27 | * biz intel vs real-time decision making 28 | * accuracy: ? 29 | * interpretability: ? 30 | * speed: ? 31 | * online learning requirements? 32 | * deployment plan: where/how will it be used? 33 | * retraining: ? 34 | * evaluation: ? 35 | 36 | optional: 37 | * what should we avoid/be careful of? common pitfalls? 38 | * what are some potential bottlenecks/issues in solving this? 39 | * how would you deal with xyz? 40 | 41 | ### Target model output 42 | 43 | * model output: model should predict/produce what? 44 | * how relate to business metric? direct vs approx/indirect measure? 45 | * V ideal output: target model 46 | * V 1.0 output: early model 47 | * supporting output: what does a full prediction look like? incl meta data 48 | 49 | ### Model evaluation 50 | 51 | * model score: which metric reflects success measure? 52 | * loss function: how to optimize the model score? 53 | 54 | ### Roadmap/project plan 55 | 56 | * prototyping: build something the user can interact with 57 | * user feedback: 58 | * what does the output look like? 59 | * if we give you xyz does it work? 
60 | * early/easy wins: optimize impact ROI early on 61 | * roadmap: how to sequence to achieve impact quickly and expand over time 62 | 63 | 64 | ## Data pipeline 65 | 66 | * Best practices 67 | * use [d6tflow data DAGs](https://github.com/d6t/d6tflow) 68 | * modularize code, see [d6tflow project templates for highly effective data science workflows](https://github.com/d6t/d6tflow-template) 69 | * use unit tests 70 | * don’t use jupyter notebooks, except viz or UI prototype 71 | 72 | ### Data sources 73 | 74 | * Preferred data sources: What did we wish we had? 75 | * Actual data sources: What are the sources and how can we acquire? 76 | * Tearsheet available? 77 | * Data dictionary available? 78 | * How much labeled data do you have? 79 | 80 | ### Infrastructure requirements/constraints 81 | 82 | * data prep: how large is input data? 83 | * model training: type and complexity of model? 84 | * dev vs prod: prod bottlenecks? storage/compute/memory 85 | 86 | ### Data preprocessing 87 | 88 | * Data DAG: preparing data for analysis 89 | 90 | ### Exploratory data analysis 91 | 92 | * Look at the data! 93 | * Start with a representative sample 94 | * Summary visualizations 95 | * Distributions 96 | * Relationships 97 | * Stability over time 98 | * Categorical variables 99 | * Quirks: skews/non-normal, Outliers, Missing values, imbalanced data, non-stationarity, autocorrel, multicolinearity, heteroskedasticity 100 | * Biases: lookahead. 101 | * AutoML output 102 | * Hypothesis on what should/should not work: Which features are most likely to predict? Which features should/should not predict? 103 | 104 | See [d6t EDA templates](https://github.com/d6t/d6tflow-template-datasci) 105 | 106 | ### Feature engineering 107 | 108 | * Feature library: 109 | * Interaction features 110 | 111 | ### Feature preprocessing 112 | 113 | * Normalize: N(0,1), mean relative 114 | * Look-ahead bias: real-time decision making issues 115 | * Fix quirks 116 | * Other transforms 117 | * Dimension reduction 118 | * PCA 119 | * GBM feature encoding 120 | * Embeddings 121 | 122 | 123 | ## Model building 124 | 125 | * baseline models: 126 | * candidate models: 127 | * ability to meet design goals? 128 | * address tradeoffs: accuracy / interpretability / speed? 129 | * handle quirks? 130 | * scalability? 131 | * train / validation / test 132 | * feature selection 133 | * model training 134 | * parameters 135 | * weights: input, class 136 | * learnings rates, # trees 137 | * regularization: penalties. early stopping. 138 | * GBM: number of trees. tree depth 139 | * DL: dropout. 140 | * execution infrastructure: total training size? model size? 141 | * hyperparam tuning 142 | * ensembling/stacking 143 | 144 | ## Model Evaluation 145 | 146 | ### Evaluation metrics 147 | 148 | * Comparing models: 149 | * In-sample: baseline vs model 150 | * Out-sample: baseline vs model 151 | * Bias-variance trade-off 152 | * Visual inspection: sample. best/worst predictions. high influence. 
153 | * Overfitting assessment 154 | * Test lookahead bias 155 | * Stability of relationships 156 | * Stability of test errors 157 | 158 | ### Model interpretation 159 | 160 | * Model output 161 | * Feature importance 162 | * SHAP plots 163 | * Surrogate models 164 | 165 | ### User feedback 166 | 167 | * performance drivers vs intuition 168 | * single predictions for decision making 169 | 170 | ## Deployment 171 | 172 | * Data pipline best practices 173 | * Automated tests 174 | * Model speedup 175 | * Surrogate models 176 | * Fewer features: remove marginal features 177 | * Fewer trees: stop after achieved majority of gains 178 | * options: db vs API 179 | * integration with user-facing systems 180 | 181 | ### A / B testing 182 | 183 | ## Driving user adoption 184 | 185 | * teachins 186 | * getting started guides 187 | * push model output 188 | * keep top of mind 189 | * integrating into workflow 190 | 191 | -------------------------------------------------------------------------------- /blogs/design-ml-e2e.md: -------------------------------------------------------------------------------- 1 | 2 | # Guide to designing end-to-end machine learning systems 3 | 4 | ## Problem formulation: 5 | 6 | ### Business context / requirements 7 | 8 | * target users: who are the target users of information? 9 | * economic buyer: who pays for it? 10 | 11 | * user flow: what is the user flow currently? 12 | * user pain points: pain points? how do they solve it today? 13 | * improved workflow: where will the model be used? how does the user interact with the model? 14 | * actionable insights: what practical value can the model provide to the user? 15 | 16 | * business impact goals: how does the model improve business outcomes? 17 | * critical bottleneck: what’s the critical problem that needs to be solved to have max impact? 18 | * business metric: how do we quantify success? what business metric are we trying to influence? 19 | * baseline: business metric status quo? 20 | * ideal outcome: what’s the ideal outcome here? what goal should be met? 21 | * success measure: how do we measure success? 22 | 23 | 24 | ### Design goals 25 | 26 | * goals: goals of the project from a user perspective 27 | * tradeoffs: ? 28 | * biz intel vs real-time decision making 29 | * accuracy: ? 30 | * interpretability: ? 31 | * speed: ? 32 | * online learning requirements? 33 | * deployment plan: where/how will it be used? 34 | * retraining: ? 35 | * evaluation: ? 36 | 37 | optional: 38 | * what should we avoid/be careful of? common pitfalls? 39 | * what are some potential bottlenecks/issues in solving this? 40 | * how would you deal with xyz? 41 | 42 | ### Target model output 43 | 44 | * modeling task: prediction, classification, recommendation, clustering? 45 | * model output: model should predict/produce what? 46 | * how relate to business metric? direct vs approx/indirect measure? 47 | * V ideal output: target model 48 | * V 1.0 output: early model 49 | * supporting output: what does a full prediction look like? incl meta data 50 | 51 | ### Model evaluation 52 | 53 | * model score: which metric reflects success measure? 54 | * loss function: how to optimize the model score? 55 | 56 | ### Roadmap/project plan 57 | 58 | * prototyping: build something the user can interact with 59 | * user feedback: 60 | * what does the output look like? 61 | * if we give you xyz does it work? 
62 | * early/easy wins: optimize impact ROI early on 63 | * roadmap: how to sequence to achieve impact quickly and expand over time 64 | 65 | 66 | ## Data pipeline 67 | 68 | * Best practices 69 | * use [d6tflow data DAGs](https://github.com/d6t/d6tflow) 70 | * modularize code, see [d6tflow project templates for highly effective data science workflows](https://github.com/d6t/d6tflow-template) 71 | * use unit tests 72 | * don’t use jupyter notebooks, except viz or UI prototype 73 | 74 | ### Data sources 75 | 76 | * Preferred data sources: What did we wish we had? 77 | * Actual data sources: What are the sources and how can we acquire? 78 | * Tearsheet available? 79 | * Data dictionary available? 80 | * How much labeled data do you have? 81 | 82 | ### Infrastructure requirements/constraints 83 | 84 | * data prep: how large is input data? 85 | * model training: type and complexity of model? 86 | * dev vs prod: prod bottlenecks? storage/compute/memory 87 | 88 | ### Data preprocessing 89 | 90 | * Data DAG: preparing data for analysis 91 | * clean, clean, clean... 92 | * combine 93 | * reshape 94 | * fill NANs 95 | 96 | ## Model building 97 | 98 | ### Exploratory data analysis 99 | 100 | * Look at the data! 101 | * Start with a representative sample 102 | * Summary visualizations 103 | * Distributions 104 | * Relationships 105 | * Stability over time 106 | * Categorical variables 107 | * Quirks: skews/non-normal, Outliers, Missing values, imbalanced data, non-stationarity, autocorrel, multicolinearity, heteroskedasticity 108 | * Biases: lookahead. 109 | * AutoML output 110 | * Hypothesis on what should/should not work: Which features are most likely to predict? Which features should/should not predict? 111 | 112 | See [d6t EDA templates](https://github.com/d6t/d6tflow-template-datasci) 113 | 114 | ### Feature engineering 115 | 116 | * Feature library: 117 | * Interaction features 118 | 119 | 120 | ### Feature preprocessing 121 | 122 | * Missing values 123 | * Categorical variables 124 | * One-hot 125 | * Embeddings (# dimensions): when a categorical column has many possible values 126 | * Normalize: N(0,1), mean relative 127 | * Look-ahead bias: real-time decision making issues 128 | * N(0,1) across full dataset vs training/test separate 129 | * Fix quirks 130 | * Imbalance: set class weights, sub/over-sample 131 | * Other transforms 132 | * Dimension reduction 133 | * PCA 134 | * GBM feature encoding 135 | * Sparse vector transform 136 | 137 | ### Model selection and training 138 | 139 | * baseline models: 140 | * candidate models: linear, trees, SVM, (F)FM, neural net... 141 | * ability to meet design goals? 142 | * address tradeoffs: accuracy / interpretability / speed? 143 | * handle quirks? 144 | * scalability? 145 | * train / validation / test 146 | * feature selection 147 | * model training 148 | * parameters 149 | * weights: input, class 150 | * learnings rates, # trees 151 | * regularization: penalties. early stopping. 152 | * GBM: number of trees. tree depth 153 | * DL: dropout. 154 | * execution infrastructure: total training size? model size? 155 | * hyperparam tuning 156 | * ensembling/stacking 157 | 158 | ## Model Evaluation 159 | 160 | ### Evaluation metrics 161 | 162 | * Comparing models: 163 | * In-sample: baseline vs model 164 | * Out-sample: baseline vs model 165 | * Bias-variance trade-off 166 | * statistically significant uplift? 167 | * Visual inspection: sample. best/worst predictions. high influence. 
168 | * Overfitting assessment 169 | * Test lookahead bias 170 | * Stability of relationships 171 | * Stability of test errors 172 | 173 | ### Model interpretation 174 | 175 | * Model output 176 | * Feature importance 177 | * SHAP plots 178 | * Surrogate models 179 | 180 | ### User feedback 181 | 182 | * performance drivers vs intuition 183 | * single predictions for decision making 184 | 185 | ## Deployment 186 | 187 | * Data pipline best practices 188 | * Automated tests 189 | * Model speedup 190 | * Surrogate models 191 | * Fewer features: remove marginal features 192 | * Fewer trees: stop after achieved majority of gains 193 | * options: db vs API 194 | * integration with user-facing systems 195 | 196 | ### A / B testing 197 | 198 | * performance inline with test? 199 | * statistically significant uplift? 200 | 201 | ### Retraining 202 | 203 | * how often does model need to be recalibrated 204 | * store prior model point-in-time predictions 205 | 206 | ## Driving user adoption 207 | 208 | * teachins 209 | * getting started guides 210 | * push model output 211 | * keep top of mind 212 | * integrating into workflow 213 | 214 | -------------------------------------------------------------------------------- /blogs/effective-datasci-workflows.rst: -------------------------------------------------------------------------------- 1 | How to Build Highly Effective Data Science Workflows 2 | ============================================================ 3 | 4 | Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems: 5 | 6 | * it doesn't scale well as you add complexity 7 | * you have to manually track which functions were run with which parameters 8 | * you have to manually track where data is saved 9 | * it's difficult for others to read 10 | 11 | https://github.com/d6t/d6tflow is a free open-source library which makes it easy for you to build highly effective data science workflows. 12 | 13 | Does your data science code look like this? 14 | ------------------------------------------------------------ 15 | 16 | Don't do it! Read on how to make it better. 17 | 18 | .. code-block:: python 19 | 20 | import pandas as pd 21 | import sklearn.svm, sklearn.metrics 22 | 23 | def get_data(): 24 | data = download_data() 25 | data = clean_data(data) 26 | data.to_pickle('data.pkl') 27 | 28 | def preprocess(data): 29 | data = apply_function(data) 30 | return data 31 | 32 | # flow parameters 33 | reload_source = True 34 | do_preprocess = True 35 | 36 | # run workflow 37 | if reload_source: 38 | get_data() 39 | 40 | df_train = pd.read_pickle('data.pkl') 41 | if do_preprocess: 42 | df_train = preprocess(df_train) 43 | model = sklearn.svm.SVC() 44 | model.fit(df_train.iloc[:,:-1], df_train['y']) 45 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 46 | 47 | 48 | What to do about it? 49 | ------------------------------------------------------------ 50 | 51 | Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. 52 | 53 | So instead of writing a function that does: 54 | 55 | .. code-block:: python 56 | 57 | def process_data(data, parameter): 58 | 59 | if parameter: 60 | data = do_stuff(data) 61 | else: 62 | data = do_other_stuff(data) 63 | 64 | data.to_pickle('data.pkl') 65 | return data 66 | 67 | You are better of writing tasks that you can chain together as a DAG: 68 | 69 | .. 
code-block:: python 70 | 71 | class TaskProcess(d6tflow.tasks.TaskPqPandas): # define output format 72 | 73 | def requires(self): 74 | return TaskGetData() # define dependency 75 | 76 | def run(self): 77 | data = self.input().load() # load input data 78 | data = do_stuff(data) # process data 79 | self.save(data) # save output data 80 | 81 | The benefits of doings this are: 82 | 83 | * All tasks follow the same pattern no matter how complex your workflow gets 84 | * You have a scalable input ``requires()`` and processing function ``run()`` 85 | * You can quickly load and save data without having to hardcode filenames 86 | * If the input task is not complete it will automatically run 87 | * If input data or parameters change, the function will automatically rerun 88 | 89 | An example machine learning workflow 90 | ------------------------------------------------------------ 91 | 92 | Below is a stylized example of a machine learning flow which is expressed as a DAG. In the end you just need to run `TaskTrain()` and it will automatically know which dependencies to run. For a full example see https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md 93 | 94 | .. code-block:: python 95 | 96 | import pandas as pd 97 | import sklearn, sklearn.svm 98 | import d6tflow 99 | import luigi 100 | 101 | # define workflow 102 | class TaskGetData(d6tflow.tasks.TaskPqPandas): # save dataframe as parquet 103 | 104 | def run(self): 105 | data = download_data() 106 | data = clean_data(data) 107 | self.save(data) # quickly save dataframe 108 | 109 | class TaskPreprocess(d6tflow.tasks.TaskCachePandas): # save data in memory 110 | do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no 111 | 112 | def requires(self): 113 | return TaskGetData() # define dependency 114 | 115 | def run(self): 116 | df_train = self.input().load() # quickly load required data 117 | if self.do_preprocess: 118 | df_train = preprocess(df_train) 119 | self.save(df_train) 120 | 121 | class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle 122 | do_preprocess = luigi.BoolParameter(default=True) 123 | 124 | def requires(self): 125 | return TaskPreprocess(do_preprocess=self.do_preprocess) 126 | 127 | def run(self): 128 | df_train = self.input().load() 129 | model = sklearn.svm.SVC() 130 | model.fit(df_train.iloc[:,:-1], df_train['y']) 131 | self.save(model) 132 | 133 | # Check task dependencies and their execution status 134 | d6tflow.preview(TaskTrain()) 135 | 136 | ''' 137 | └─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)] 138 | └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)] 139 | └─--[TaskGetData-{} (PENDING)] 140 | ''' 141 | 142 | # Execute the model training task including dependencies 143 | d6tflow.run(TaskTrain()) 144 | 145 | ''' 146 | ===== Luigi Execution Summary ===== 147 | 148 | Scheduled 3 tasks of which: 149 | * 3 ran successfully: 150 | - 1 TaskGetData() 151 | - 1 TaskPreprocess(do_preprocess=True) 152 | - 1 TaskTrain(do_preprocess=True) 153 | ''' 154 | 155 | # Load task output to pandas dataframe and model object for model evaluation 156 | model = TaskTrain().output().load() 157 | df_train = TaskPreprocess().output().load() 158 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 159 | # 0.9733333333333334 160 | 161 | Conclusion 162 | ------------------------------------------------------------ 163 | 164 | Writing machine learning code as a linear series of functions likely creates many workflow problems. 
Because of the complex dependencies between different ML tasks, it is better to write them as a DAG. https://github.com/d6t/d6tflow makes this very easy. Alternatively you can use `luigi 165 | <https://github.com/spotify/luigi>`_ and `airflow 166 | <https://airflow.apache.org>`_ but they are more optimized for ETL than data science. 167 | -------------------------------------------------------------------------------- /blogs/images/d6tflow-filenames.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/d6tflow-filenames.png -------------------------------------------------------------------------------- /blogs/images/top10-stats-example2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/top10-stats-example2.png -------------------------------------------------------------------------------- /blogs/images/top10-stats-example3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/top10-stats-example3.png -------------------------------------------------------------------------------- /blogs/reasons-why-bad-ml-code.rst: -------------------------------------------------------------------------------- 1 | 4 Reasons Why Your Machine Learning Code is Probably Bad 2 | ============================================================ 3 | 4 | Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems: 5 | 6 | * it doesn't scale well as you add complexity 7 | * you have to manually keep track of which functions were run with which parameters as you iterate through your workflow 8 | * you have to manually keep track of where data is saved 9 | * it's difficult for others to read 10 | 11 | .. code-block:: python 12 | 13 | import pandas as pd 14 | import sklearn.svm, sklearn.metrics 15 | 16 | def get_data(): 17 | data = download_data() 18 | data.to_pickle('data.pkl') 19 | 20 | def preprocess(data): 21 | data = clean_data(data) 22 | return data 23 | 24 | # flow parameters 25 | do_preprocess = True 26 | 27 | # run workflow 28 | get_data() 29 | 30 | df_train = pd.read_pickle('data.pkl') 31 | if do_preprocess: 32 | df_train = preprocess(df_train) 33 | model = sklearn.svm.SVC() 34 | model.fit(df_train.iloc[:,:-1], df_train['y']) 35 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 36 | 37 | What to do about it? 38 | ------------------------------------------------------------ 39 | 40 | Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG. 41 | 42 | `d6tflow <https://github.com/d6t/d6tflow>`_ is a free open-source library which makes it easy for you to build highly effective data science workflows. 43 | 44 | Instead of writing a function that does: 45 | 46 | .. code-block:: python 47 | 48 | def process_data(df, parameter): 49 | df = do_stuff(df) 50 | df.to_pickle('data.pkl') 51 | return df 52 | 53 | dataraw = download_data() 54 | data = process_data(dataraw, True) 55 | 56 | You can write tasks that you can chain together as a DAG: 57 | 58 | .. 
code-block:: python 59 | 60 | class TaskGetData(d6tflow.tasks.TaskPqPandas): 61 | 62 | def run(self): 63 | data = download_data() 64 | self.save(data) # save output data 65 | 66 | @d6tflow.requires(TaskGetData) # define dependency 67 | class TaskProcess(d6tflow.tasks.TaskPqPandas): 68 | 69 | def run(self): 70 | data = self.input().load() # load input data 71 | data = do_stuff(data) # process data 72 | self.save(data) # save output data 73 | 74 | d6tflow.run(TaskProcess()) # execute task including dependencies 75 | data = TaskProcess().output().load() # load output data 76 | 77 | The benefits of doing this are: 78 | 79 | * All tasks follow the same pattern no matter how complex your workflow gets 80 | * You have a scalable input ``requires()`` and processing function ``run()`` 81 | * You can quickly load and save data without having to hardcode filenames 82 | * If the input task is not complete it will automatically run 83 | * If input data or parameters change, the function will automatically rerun 84 | * It’s much easier for others to read and understand the workflow 85 | 86 | An example machine learning DAG 87 | ------------------------------------------------------------ 88 | 89 | Below is a stylized example of a machine learning flow which is expressed as a DAG. In the end you just need to run `TaskTrain()` and it will automatically know which dependencies to run. For a full example see https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md 90 | 91 | .. code-block:: python 92 | 93 | import pandas as pd 94 | import sklearn, sklearn.svm, sklearn.linear_model 95 | import d6tflow 96 | import luigi 97 | 98 | # define workflow 99 | class TaskGetData(d6tflow.tasks.TaskPqPandas): # save dataframe as parquet 100 | 101 | def run(self): 102 | data = download_data() 103 | data = clean_data(data) 104 | self.save(data) # quickly save dataframe 105 | 106 | @d6tflow.requires(TaskGetData) # define dependency 107 | class TaskPreprocess(d6tflow.tasks.TaskCachePandas): # save data in memory 108 | do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no 109 | 110 | def run(self): 111 | df_train = self.input().load() # quickly load required data 112 | if self.do_preprocess: 113 | df_train = preprocess(df_train) 114 | self.save(df_train) 115 | 116 | @d6tflow.requires(TaskPreprocess) # define dependency 117 | class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle 118 | model = luigi.Parameter(default='ols') # model selection parameter ('ols' or 'svm') 119 | def run(self): 120 | df_train = self.input().load() 121 | if self.model=='ols': 122 | model = sklearn.linear_model.LogisticRegression() 123 | elif self.model=='svm': 124 | model = sklearn.svm.SVC() 125 | else: 126 | raise ValueError('invalid model selection') 127 | model.fit(df_train.drop('y', axis=1), df_train['y']) 128 | self.save(model) 129 | 130 | # Check task dependencies and their execution status 131 | d6tflow.preview(TaskTrain()) 132 | 133 | ''' 134 | └─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)] 135 | └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)] 136 | └─--[TaskGetData-{} (PENDING)] 137 | ''' 138 | 139 | # Execute the model training task including dependencies 140 | d6tflow.run(TaskTrain()) 141 | 142 | ''' 143 | ===== Luigi Execution Summary ===== 144 | 145 | Scheduled 3 tasks of which: 146 | * 3 ran successfully: 147 | - 1 TaskGetData() 148 | - 1 TaskPreprocess(do_preprocess=True) 149 | - 1 TaskTrain(do_preprocess=True) 150 | ''' 151 | 152 | # Load task output to pandas dataframe and model object for model evaluation 153 | model = TaskTrain().output().load() 154 | df_train = 
TaskPreprocess().output().load() 155 | print(model.score(df_train.drop('y', axis=1), df_train['y'])) 156 | # 0.9733333333333334 157 | 158 | Conclusion 159 | ------------------------------------------------------------ 160 | 161 | Writing machine learning code as a linear series of functions likely creates many workflow problems. Because of the complex dependencies between different ML tasks, it is better to write them as a DAG. https://github.com/d6t/d6tflow makes this very easy. Alternatively you can use `luigi 162 | <https://github.com/spotify/luigi>`_ and `airflow 163 | <https://airflow.apache.org>`_ but they are more optimized for ETL than data science. 164 | -------------------------------------------------------------------------------- /blogs/top10-mistakes-business.md: -------------------------------------------------------------------------------- 1 | Better at coding than the statisticians, better at stats than the coders. Not quite enough: you need commercial sense as well. 2 | 3 | ## 1. Not understanding the business objective 4 | 5 | ## Focusing on the data, not the user 6 | 7 | ## Overengineering instead of prototyping 8 | 9 | Counterintuitively, often the best way to get started analyzing data is by working on a representative sample of the data. That allows you to familiarize yourself with the data and build the data pipeline without waiting for data processing and model training. But data scientists seem not to like that - more data is better. 10 | 11 | 12 | ## 9. Cannot explain results 13 | 14 | You've crunched the data and kept optimizing results. The error is low, everything is great. You take it back to the person who asked you to do the analysis and s/he starts asking questions: what does this variable mean? Why is the coefficient like this? What about when xyz happens? You hadn't thought about those questions because you were busy building models instead of applying the output. You don't look so smart anymore... 15 | 16 | **Solution**: know the data, models and results inside out! And think like a user of the data, not just like the data monkey. 17 | 18 | ## 10. Not intuitively understanding the pros/cons of different models 19 | 20 | Again, ML libraries make it easy to just throw different models at a problem and see which model best minimizes errors. 21 | 22 | Example: We once built a model to understand human decisions. You could see in the graphs that decisions were very clustered, and indeed a tree model performed much better than a linear regression. It made intuitive sense because human decision making is more like a decision tree than a regression. 23 | 24 | **Solution**: understand how a model works. Why does model 2 reduce the error vs model 1? Not just mathematically but using economic intuition. 25 | -------------------------------------------------------------------------------- /blogs/top10-mistakes-coding.md: -------------------------------------------------------------------------------- 1 | # Top 10 Coding Mistakes Made by Data Scientists 2 | 3 | A data scientist is a "person who is better at statistics than any software engineer and better at software engineering than any statistician". Many data scientists have a statistics background and little experience with software engineering. I'm a senior data scientist ranked top 1% on Stackoverflow for python coding and work with a lot of (junior) data scientists. Here is my list of 10 common mistakes I frequently see. 4 | 5 | ## 1. Don't share data referenced in code 6 | 7 | Data science needs code AND data. So for someone else to be able to reproduce your results, they need to have access to the data. 
## 3. Mix data with code

Since data science code needs data, why not dump it in the same directory? And while you are at it, save images, reports and other junk there too. Yikes, what a mess!

```
├── data.csv
├── ingest.py
├── other-data.csv
├── output.png
├── report.html
└── run.py
```

**Solution**: Organize your directory into categories, like data, reports, code etc. See [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/#directory-structure) or [d6tflow project templates](https://github.com/d6t/d6tflow-template) and use tools mentioned in #1 to store and share data.

## 4. Git commit data with source code

Most people now version control their code (if you don't, that's another mistake!! See [git](https://git-scm.com/)). In an attempt to share data, it might be tempting to add data files to version control. That's ok for very small files but git is not optimized for data, especially large files.

```bash
git add data.csv
```

**Solution**: Use tools mentioned in #1 to store and share data. If you really want to version control data, see [d6tpipe](https://github.com/d6t/d6tpipe), [DVC](https://dvc.org/) and [Git Large File Storage](https://git-lfs.github.com/).

## 5. Write functions instead of DAGs

Enough about data, let's talk about the actual code! Since one of the first things you learn when you learn to code is functions, data science code is mostly organized as a series of functions that are run linearly. That causes several problems, see [4 Reasons Why Your Machine Learning Code is Probably Bad](https://github.com/d6t/d6t-python/blob/master/blogs/reasons-why-bad-ml-code.rst).

```python
import pandas as pd
import sklearn.svm

def process_data(data, parameter=None):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:,:-1], df_train['y'])
```

**Solution**: Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. Use [d6tflow](https://github.com/d6t/d6tflow) or [airflow](https://airflow.apache.org/).
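
As a rough sketch of the DAG alternative (reusing the d6tflow task pattern shown elsewhere in this repo; `do_stuff`, the filenames and the task names are placeholders), the same logic becomes two tasks with an explicit dependency:

```python
import d6tflow
import pandas as pd
import sklearn.svm

class TaskProcessData(d6tflow.tasks.TaskPqPandas):  # output stored as parquet

    def run(self):
        data = pd.read_csv('data.csv')
        data = do_stuff(data)  # placeholder for your processing logic
        self.save(data)

@d6tflow.requires(TaskProcessData)  # dependency is explicit, not implied by call order
class TaskTrainModel(d6tflow.tasks.TaskPickle):  # trained model stored as pickle

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:, :-1], df_train['y'])
        self.save(model)

d6tflow.run(TaskTrainModel())  # runs TaskProcessData first if its output is missing
```
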
## 6. Write for loops

Like functions, for loops are the first thing you learn when you learn to code. They are easy to understand, but they are slow and excessively wordy, and they typically indicate you are unaware of vectorized alternatives.

```python
import math

x = range(10)
avg = sum(x)/len(x); std = math.sqrt(sum((i-avg)**2 for i in x)/len(x));
zscore = [(i-avg)/std for i in x]
# should be: scipy.stats.zscore(x)

# or
groupavg = []
for i in df['g'].unique():
    dfg = df[df['g']==i]
    groupavg.append(dfg.mean())
# should be: df.groupby('g').mean()
```

**Solution**: [Numpy](http://www.numpy.org/), [scipy](https://www.scipy.org/) and [pandas](https://pandas.pydata.org/) have vectorized functions for most things that you think might require for loops.

## 7. Don't write unit tests

As data, parameters or user input change, your code might break, sometimes without you noticing. That can lead to bad output, and if someone makes decisions based on your output, bad data will lead to bad decisions!

**Solution**: Use `assert` statements to check for data quality. [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#testing-functions) has equality tests, [d6tstack](https://github.com/d6t/d6tstack) has checks for data ingestion and [d6tjoin](https://github.com/d6t/d6tjoin/blob/master/examples-prejoin.ipynb) for data joins. Example data checks:

```python
assert df['id'].unique().shape[0] == len(ids) # have data for all ids?
assert (df.isna().mean() < 0.9).all() # catch columns that are mostly missing
assert df.groupby(['g','date']).size().max() == 1 # no duplicate values/date?
assert d6tjoin.utils.PreJoin([df1,df2],['id','date']).is_all_matched() # all ids matched?
```

## 8. Don't document code

I get it, you're in a hurry to produce some analysis. You hack things together to get results to your client or boss. Then a week later they come back and say "can you change xyz" or "can you update this please". You look at your code and can't remember why you did what you did. And now imagine someone else has to run it.

```python
def some_complicated_function(data):
    data = data[data['column']!='wrong']
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    data = data[data['value']<0.9]
    return data
```

**Solution**: Take the extra time, even if it's after you've delivered the analysis, to document what you did. You will thank yourself, and others will thank you even more! You'll look like a pro!

## 9. Save data as csv or pickle

Back to data, it's DATA science after all. Just like functions and for loops, CSVs and pickle files are commonly used but they are actually not very good. CSVs don't include a schema so everyone has to parse numbers and dates again. Pickles solve that but only work in python and are not compressed. Neither is a good format for storing large datasets.

```python
def process_data(data, parameter=None):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')
```

**Solution**: Use [parquet](https://github.com/dask/fastparquet) or other binary data formats with data schemas, ideally ones that compress data. [d6tflow](https://github.com/d6t/d6tflow) automatically saves data output of tasks as parquet so you don't have to deal with it.
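
As a quick illustration (the filename is a placeholder and a parquet engine such as pyarrow or fastparquet needs to be installed), pandas can write and read parquet directly, keeping dtypes and compressing the data so the next reader doesn't have to re-parse numbers and dates:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=3),
                   'value': [1.0, 2.5, 3.2]})
df.to_parquet('data.pq', compression='gzip')  # schema and dtypes are stored with the data
df2 = pd.read_parquet('data.pq')              # dates come back as datetimes, not strings
```
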
## 10. Use jupyter notebooks

Let's conclude with a controversial one: jupyter notebooks are as common as CSVs. A lot of people use them. That doesn't make them good. Jupyter notebooks promote a lot of the bad software engineering habits mentioned above, notably:

1. You are tempted to dump all files into one directory
2. You write code that runs top-bottom instead of DAGs
3. You don't modularize your code
4. They are difficult to debug
5. Code and output get mixed in one file
6. They don't version control well

Notebooks feel easy to get started with but scale poorly.

**Solution**: Use [pycharm](https://www.jetbrains.com/pycharm/) and/or [spyder](https://www.spyder-ide.org/).
--------------------------------------------------------------------------------
/blogs/top10-mistakes-statistics.md:
--------------------------------------------------------------------------------
# Top 10 Statistics Mistakes Made by Data Scientists

A data scientist is a "person who is better at statistics than any software engineer and better at software engineering than any statistician". In [Top 10 Coding Mistakes Made by Data Scientists](https://github.com/d6t/d6t-python/blob/master/blogs/top10-mistakes-coding.md) we discussed how statisticians can become better coders. Here we discuss how coders can become better statisticians.

Detailed output and code for each of the examples is available on [github](http://tiny.cc/top10-mistakes-stats-code) and in an [interactive notebook](http://tiny.cc/top10-mistakes-stats-bind). The code uses the workflow management library [d6tflow](https://github.com/d6t/d6tflow) and data is shared with the dataset management library [d6tpipe](https://github.com/d6t/d6tpipe).

## 1. Not fully understand objective function

Data scientists want to build the "best" model. But beauty is in the eye of the beholder. If you don't know what the goal and objective function is and how it behaves, it is unlikely you will be able to build the "best" model. And for what it's worth, the objective may not even be a mathematical function but perhaps improving a business metric.

**Solution**: most kaggle winners spend a lot of time understanding the objective function and how the data and model relate to the objective function. If you are optimizing a business metric, map it to an appropriate mathematical objective function.

**Example**: F1 score is typically used to assess classification models. We once built a classification model whose success depended on the % of occurrences it got right. The F1 score was misleading because it suggested the model was correct ~60% of the time whereas in fact it was correct only 40% of the time.

```
f1 0.571 accuracy 0.4
```
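
The full code is linked above; as a minimal sketch of how such a gap can arise (assuming, purely for illustration, a degenerate classifier that always predicts the positive class on data with 40% positives), the F1 score looks respectable while accuracy tells the real story:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1]*4 + [0]*6)   # 40% of observations are the positive class
y_pred = np.ones_like(y_true)      # degenerate model: always predict positive

print(f1_score(y_true, y_pred))        # ~0.571
print(accuracy_score(y_true, y_pred))  # 0.4
```
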
## 2. Not have a hypothesis why something should work

Commonly data scientists want to build "models". They heard xgboost and random forests work best so let's use those. They read about deep learning, maybe that will improve results further. They throw models at the problem without having looked at the data and without having formed a hypothesis about which model is most likely to capture the features of the data. It makes explaining your work really hard too because you are just randomly throwing models at data.

**Solution**: look at the data! Understand its characteristics and form a hypothesis about which model is likely to best capture those characteristics.

**Example**: Without running any models, just by plotting this sample data you can already form a strong view that x1 is linearly related to y and x2 doesn't have much of a relationship with y.

![Example2](images/top10-stats-example2.png?raw=true "Example2")

## 3. Not looking at the data before interpreting results

Another problem with not looking at the data is that your results can be heavily driven by outliers or other artifacts. This is especially true for models that minimize sums of squared errors. Even without outliers, you can have problems with imbalanced datasets, clipped or missing values and all sorts of other weird artifacts of real-life data that you didn't see in the classroom.

**Solution**: it's so important it's worth repeating: look at the data! Understand how the nature of the data is impacting model results.

**Example**: with outliers, the x1 slope changed from 0.906 to -0.375!

![Example3](images/top10-stats-example3.png?raw=true "Example3")

## 4. Not having a naive baseline model

Modern ML libraries almost make it too easy... Just change a line of code and you can run a new model. And another. And another. Error metrics are decreasing, tweak parameters - great - error metrics are decreasing further... With all the model fanciness, you can forget the dumb way of forecasting data. And without that naive benchmark, you don't have a good absolute comparison for how good your models are; they may all be bad in absolute terms.

**Solution**: what's the dumbest way you can predict a value? Build a "model" using the last known value, the (rolling) mean or some constant, e.g. 0. Compare your model performance against a zero-intelligence forecast monkey!

**Example**: With this time series dataset, model1 looks better than model2, with MSE of 0.21 vs 0.45. But wait! By just taking the last known value, the MSE drops to 0.003!

```
ols CV mse 0.215
rf CV mse 0.428
last out-sample mse 0.003
```
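
A minimal sketch of such a zero-intelligence benchmark (the series here is synthetic; in practice you would use your own test set): predict each point with the previous observed value and see what MSE your model has to beat.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a real time series
y = pd.Series(np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.05, 200))

y_pred_naive = y.shift(1)  # "model": carry forward the last known value
mse_naive = ((y - y_pred_naive) ** 2).dropna().mean()
print(f'naive last-value MSE: {mse_naive:.4f}')
# any real model should beat this number before you get excited about its MSE
```
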
## 5. Incorrect out-sample testing

This is the one that could derail your career! The model you built looked great in R&D but performs horribly in production. The model you said would do wonders is causing really bad business outcomes, potentially costing the company $m+. It's so important that all the remaining mistakes, bar the last one, focus on it.

**Solution**: Make sure you've run your model in realistic out-sample conditions and understand when it will perform well and when it won't.

**Example**: In-sample the random forest does a lot better than linear regression with mse 0.04 vs ols mse 0.183, but out-sample it does a lot worse with mse 0.261 vs linear regression mse 0.187. The random forest overtrained and would not perform well live in production!

```
in-sample
rf mse 0.04 ols mse 0.183
out-sample
rf mse 0.261 ols mse 0.187
```

## 6. Incorrect out-sample testing: applying preprocessing to full dataset

You probably know that powerful ML models can overtrain. Overtraining means they perform well in-sample but badly out-sample. So you need to be aware of training data leaking into test data. If you are not careful, any time you do feature engineering or cross-validation, train data can creep into test data and inflate model performance.

**Solution**: make sure you have a true test set free of any leakage from the training set. Especially beware of any time-dependent relationships that could occur in production use.

**Example**: This happens a lot. Preprocessing is applied to the full dataset BEFORE it is split into train and test, meaning you do not have a true test set. Preprocessing needs to be fitted on the training set only, after the data has been split into train and test sets, for the test set to be a true test set. The MSE between the two methods (mixed out-sample CV mse 0.187 vs true out-sample CV mse 0.181) is not all that different in this case because the distributional properties of train and test are not that different, but that might not always be the case.

```
mixed out-sample CV mse 0.187 true out-sample CV mse 0.181
```

## 7. Incorrect out-sample testing: cross-sectional data & panel data

You were taught cross-validation is all you need. sklearn even provides you some nice convenience functions so you think you have checked all the boxes. But most cross-validation methods do random sampling so you might end up with training data in your test set, which inflates performance.

**Solution**: generate test data such that it accurately reflects data on which you would make predictions in live production use. Especially with time series and panel data you likely will have to generate custom cross-validation data or do roll-forward testing.

**Example**: here you have panel data for two different entities (e.g. companies) which are cross-sectionally highly correlated. If you randomly split the data, you make accurate predictions using data you did not actually have available at prediction time, overstating model performance. You think you avoided mistake #5 by using cross-validation and found the random forest performs a lot better than linear regression in cross-validation. But running a roll-forward out-sample test which prevents future data from leaking into the test set, it performs a lot worse again! (random forest MSE goes from 0.051 to 0.229, higher than linear regression!)

```
normal CV
ols 0.203 rf 0.051
true out-sample error
ols 0.166 rf 0.229
```
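
As a minimal sketch of roll-forward testing (using sklearn's `TimeSeriesSplit`; the data here is synthetic and `X`, `y` are assumed to be ordered by time), each fold trains only on observations that precede its test window:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# synthetic time-ordered data standing in for real features/target
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + np.random.normal(0, 1, 200)

errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(errors))  # roll-forward out-sample MSE, no future data in training
```
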
## 8. Not considering which data is available at point of decision

When you run a model in production, it gets fed with data that is available when you run the model. That data might be different from what you assumed would be available in training. For example, the data might be published with a delay, so by the time you run the model other inputs have changed and you are making predictions with wrong data, or your true y variable is incorrect.

**Solution**: do a rolling out-sample forward test. If I had used this model in production, what would my training data look like, i.e. what data do you have to make predictions? That's the training data you use to make a true out-sample production test. Furthermore, think about what result acting on the prediction would generate at the point of decision.

## 9. Subtle Overtraining

The more time you spend on a dataset, the more likely you are to overtrain it. You keep tinkering with features and optimizing model parameters. You used cross-validation so everything must be good.

**Solution**: After you have finished building the model, try to find another "version" of the datasets that can be a surrogate for a true out-sample dataset. If you are a manager, deliberately withhold data so that it does not get used for training.

**Example**: Applying the models that were trained on dataset 1 to dataset 2 shows the MSEs more than doubled. Are they still acceptable...? This is a judgement call but your results from #4 might help you decide.

```
first dataset
rf mse 0.261 ols mse 0.187
new dataset
rf mse 0.681 ols mse 0.495
```

## 10. "need more data" fallacy

Counterintuitively, often the best way to get started analyzing data is by working on a representative sample of the data. That allows you to familiarize yourself with the data and build the data pipeline without waiting for data processing and model training. But data scientists seem not to like that - more data is better.

**Solution**: start working with a small representative sample and see if you can get something useful out of it. Give it back to the end user, can they use it? Does it solve a real pain point? If not, the problem is likely not that you have too little data but your approach.
--------------------------------------------------------------------------------
/blogs/top5-mistakes-vendors.md:
--------------------------------------------------------------------------------
# 5 Mistakes Data Vendors Commonly Make

Avoid the common traps and keep your clients happy.

* **Reinvent the Wheel**: To deliver data to clients, you need to build APIs, ftp servers, S3 buckets and so on. And you need it all because your client needs are diverse.
* **Great Sales, Bad Delivery**: The sales and product team closed the deal and then it's up to the engineering team to deliver data. But they have limited capacity and it's taking longer than you think.
* **No Quickstart Instructions**: You presented a great story and case study but the client cannot easily verify it for themselves. They spin their wheels just trying to recreate what you did.
* **Not Cloud Ready**: Is ftp still the only thing you have to offer? Your clients are moving to the cloud and so should you.
* **No Usage Analytics**: You don't know which clients actually download and use your data, so you cannot measure engagement.

The solution to all this? [DataBolt Pipe](https://www.databolt.tech/index-pipe-vendors.html). We provide turnkey solutions to efficiently distribute data to your clients.

* **Turnkey Infrastructure**: Use managed infrastructure, no need to build and maintain separate APIs, AWS S3 buckets and ftp servers
* **Simple Web GUI**: Non-technical users from sales and product teams can distribute data without involving your engineering team
* **Usage Analytics**: Get download confirmations and daily usage statistics to measure engagement by client

--------------------------------------------------------------------------------
/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/overview.png
--------------------------------------------------------------------------------