├── .gitignore ├── LICENSE ├── README.md ├── blogs ├── blog-20190813-d6tflow-pytorch.html ├── blog-20190813-d6tflow-pytorch.rmd ├── blog-20200426-shapley-report.ipynb ├── blog-20200426-shapley.ipynb ├── datasci-dags-airflow-meetup.md ├── datasci-projects-e2e.md ├── design-ml-e2e.md ├── effective-datasci-workflows.rst ├── images │ ├── d6tflow-filenames.png │ ├── top10-stats-example2.png │ └── top10-stats-example3.png ├── reasons-why-bad-ml-code.rst ├── top10-mistakes-business.md ├── top10-mistakes-coding.md ├── top10-mistakes-statistics.md └── top5-mistakes-vendors.md └── overview.png /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *.csv 3 | *.pq 4 | *.json 5 | data/ 6 | tests/.creds.yml 7 | 8 | # Byte-compiled / optimized / DLL files 9 | __pycache__/ 10 | *.py[cod] 11 | *$py.class 12 | 13 | # C extensions 14 | *.so 15 | 16 | # Distribution / packaging 17 | .Python 18 | build/ 19 | develop-eggs/ 20 | dist/ 21 | downloads/ 22 | eggs/ 23 | .eggs/ 24 | lib/ 25 | lib64/ 26 | parts/ 27 | sdist/ 28 | var/ 29 | wheels/ 30 | share/python-wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | MANIFEST 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | 59 | # Translations 60 | *.mo 61 | *.pot 62 | 63 | # Django stuff: 64 | *.log 65 | local_settings.py 66 | db.sqlite3 67 | 68 | # Flask stuff: 69 | instance/ 70 | .webassets-cache 71 | 72 | # Scrapy stuff: 73 | .scrapy 74 | 75 | # Sphinx documentation 76 | docs/_build/ 77 | 78 | # PyBuilder 79 | target/ 80 | 81 | # Jupyter Notebook 82 | .ipynb_checkpoints 83 | 84 | # IPython 85 | profile_default/ 86 | ipython_config.py 87 | 88 | # pyenv 89 | .python-version 90 | 91 | # celery beat schedule file 92 | celerybeat-schedule 93 | 94 | # SageMath parsed files 95 | *.sage.py 96 | 97 | # Environments 98 | .env 99 | .venv 100 | env/ 101 | venv/ 102 | ENV/ 103 | env.bak/ 104 | venv.bak/ 105 | 106 | # Spyder project settings 107 | .spyderproject 108 | .spyproject 109 | 110 | # Rope project settings 111 | .ropeproject 112 | 113 | # mkdocs documentation 114 | /site 115 | 116 | # mypy 117 | .mypy_cache/ 118 | .dmypy.json 119 | dmypy.json 120 | 121 | # Pyre type checker 122 | .pyre/ 123 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 d6t 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Accelerate Data Science 2 | 3 | ## Databolt python libraries 4 | 5 | For data scientists and data engineers, DataBolt is a collection of python-based libraries and products to reduce the time it takes to get your data ready for analysis and collaborate with others. 6 | 7 | Majority of time in data science is spent on tedious tasks unrelated to data analysis. DataBolt simplifies those tasks so you can experience up to 10x productivity gains. 8 | 9 | ![Databolt Workflow](overview.png "Databolt Workflow") 10 | 11 | * **manage data workflows**: quickly build highly effective data science workflows 12 | * **push/pull data**: quickly get and share data files like code 13 | * **import data**: quickly ingest messy raw CSV and XLS files to pandas, SQL and more 14 | * **join data**: quickly combine multiple datasets using fuzzy joins 15 | 16 | The libraries are modularized so you can use them individually but they work well together to improve your entire data workflow. 17 | 18 | 19 | ## [Manage data workflows](https://github.com/d6t/d6tflow) 20 | 21 | Easily manage data workflows including complex dependencies and parameters. With d6tflow you can easily chain together complex data flows and intelligently execute them. You can quickly load input and output data for each task. It makes your workflow very clear and intuitive. 22 | 23 | ### What can it do? 24 | 25 | * Build a data workflow made up of tasks with dependencies and parameters 26 | * Intelligently rerun workflow after changing parameters, code or data 27 | * Quickly load task input and output data without manual work 28 | 29 | Learn more at [https://github.com/d6t/d6tflow](https://github.com/d6t/d6tflow) 30 | 31 | 32 | ## [Push/Pull Data](https://github.com/d6t/d6tpipe) 33 | 34 | d6tpipe is a python library which makes it easier to exchange data. It's like git for data! But better because you can include it in your data science code. 35 | 36 | ### What can it do? 37 | 38 | * Quickly create public and private remote file storage on AWS S3 and ftp 39 | * Push/pull data to/from remote file storage to sync files and share with others 40 | * Add schema information so data can be loaded quickly 41 | 42 | Learn more at [https://github.com/d6t/d6tpipe](https://github.com/d6t/d6tpipe) 43 | 44 | 45 | ## [Ingest Data](https://github.com/d6t/d6tstack) 46 | 47 | Quickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and Pandas. d6tstack solves many performance and other problems typically encountered when ingesting raw files. 48 | 49 | ### What can it do? 
50 | 51 | * Fast pd.to_sql() for postgres and mysql 52 | * Check and fix schema problems like added/missing/renamed columns 53 | * Load and process messy Excel files 54 | 55 | Learn more at [https://github.com/d6t/d6tstack](https://github.com/d6t/d6tstack) 56 | 57 | 58 | ## [Join Data](https://github.com/d6t/d6tjoin) 59 | 60 | Easily join different datasets without writing custom code using fuzzy matches. Does similarity joins on strings, dates and numbers. For example you can quickly join similar but not identical stock tickers, addresses, names and dates without manual processing. 61 | 62 | ### What can it do? 63 | 64 | * Identify and diagnose join problems 65 | * Best match fuzzy joins on strings and dates 66 | * Best match substring joins 67 | 68 | Learn more at [https://github.com/d6t/d6tjoin](https://github.com/d6t/d6tjoin) 69 | 70 | 71 | ## [Blog](http://blog.databolt.tech) 72 | 73 | We encourage you to join the Databolt blog to get updates and tips+tricks [http://blog.databolt.tech](http://blog.databolt.tech) 74 | 75 | 76 | ## [About](https://www.databolt.tech) 77 | 78 | [https://www.databolt.tech](https://www.databolt.tech) 79 | 80 | For questions or comments contact: support-at-databolt.tech 81 | -------------------------------------------------------------------------------- /blogs/blog-20190813-d6tflow-pytorch.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "5 Step Guide to Scalable Deep Learning Pipelines with d6tflow" 3 | output: html_document 4 | --- 5 | 6 | ```{r setup, include=FALSE} 7 | knitr::opts_chunk$set(echo = TRUE) 8 | library(knitr) 9 | library(reticulate) 10 | library(kableExtra) 11 | 12 | setwd("d:/dev/blogs-source/dlrm/") 13 | source_python("flow_tasks.py") 14 | 15 | ``` 16 | 17 | *How to turn a typical pytorch script into a scalable d6tflow DAG for faster research & development* 18 | 19 | # Introduction: Why bother? 20 | 21 | Building deep learning models typically involves complex data pipelines as well as a lot of trial and error, tweaking model architecture and parameters whose performance needs to be compared. It is often difficult to keep track of all the experiments, leading at best to confusion and at worst wrong conclusions. 22 | 23 | In [4 reasons why your ML code is bad](https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html) we explored how to organize ML code as DAG workflows to solve that problem. In this guide we will go through a practical case study on turning an existing pytorch script into a scalable deep learning pipeline with [d6tflow](https://github.com/d6t/d6tflow). The starting point is [a pytorch deep recommender model by Facebook](https://github.com/facebookresearch/dlrm) and we will go through the 5 steps of migrating the code into a scalable deep learning pipeline. The steps below are written in partial pseudo code to illustrate concepts, the full code is available also, see instructions at the end of the article. 24 | 25 | Lets get started! 26 | 27 | ## Step 1: Plan your DAG 28 | 29 | To plan your work and help others understand how your pipeline fits together, you want to start by thinking about the data flow, dependencies between tasks and task parameters. This helps you organize your workflow into logical components. You might want to draw a diagram such as this 30 | 31 | ![](https://github.com/d6t/d6tflow/raw/master/docs/d6tflow-docs-graph.png?raw=true) 32 | 33 | Below is the pytorch model training DAG for FB DLRM. 
It shows the training task `TaskModelTrain` with all its dependencies and how the dependencies relate to each other. If you write functional code it is difficult to see how your workflow fits together like this. 34 | 35 | ```{python} 36 | task = TaskModelTrain() 37 | print(d6tflow.preview(task, clip_params=True)) 38 | 39 | ``` 40 | 41 | ## Step 2: Write Tasks instead of functions 42 | 43 | Data science code is typically organized in functions, which leads to a lot of problems as explained in [4 reasons why your ML code is bad](https://www.kdnuggets.com/2019/02/4-reasons-machine-learning-code-probably-bad.html). Instead you want to write d6tflow tasks. The benefits are that you can: 44 | 45 | * chain tasks into a DAG so that required dependencies run automatically 46 | * easily load task input data from dependencies 47 | * easily save task output such as preprocessed data and trained models. That way you don't accidentally rerun long-running training tasks 48 | * parameterize tasks so they can be intelligently managed (see next step) 49 | * save output to [d6tpipe](https://github.com/d6t/d6tpipe) to separate data from code and easily share the data, see [Top 10 Coding Mistakes Made by Data Scientists](https://www.kdnuggets.com/2019/04/top-10-coding-mistakes-data-scientists.html) 50 | 51 | Here is what the before/after looks like for the FB DLRM code after you convert functional code into d6tflow tasks. 52 | 53 | Typical pytorch functional code that does not scale well: 54 | 55 | ```{python, echo=TRUE, eval = FALSE} 56 | # ***BEFORE*** 57 | # see dlrm_s_pytorch.py 58 | 59 | def train_model(): 60 | data = loadData() 61 | dlrm = DLRM_Net([...]) 62 | model = dlrm.train(data) 63 | torch.save(model, 'model.pickle') 64 | 65 | if __name__ == "__main__": 66 | 67 | parser.add_argument("--load-model") 68 | if load_model: 69 | model = torch.load('model.pickle') 70 | else: 71 | model = train_model() 72 | ``` 73 | 74 | 75 | Same logic written using scalable d6tflow tasks: 76 | 77 | ```{python, echo=TRUE, eval = FALSE} 78 | # ***AFTER*** 79 | # see flow_tasks.py 80 | 81 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 82 | 83 | def requires(self): # define dependencies 84 | return {'data': TaskPrepareData(), 'model': TaskBuildNetwork()} 85 | 86 | def run(self): 87 | data = self.input()['data'].load() # easily load input data 88 | dlrm = self.input()['model'].load() 89 | model = dlrm.train(data) 90 | self.save(model) # easily save trained model as pickle 91 | 92 | 93 | if __name__ == "__main__": 94 | if TaskModelTrain().complete(): # load output if task was run 95 | model = TaskModelTrain().output().load() 96 | 97 | ``` 98 | 99 | 100 | ## Step 3: Parameterize tasks 101 | 102 | To improve model performance, you will try different models, parameters and preprocessing settings. To keep track of all this, you can add parameters to tasks. That way you can: 103 | 104 | * keep track of which models have been trained with which parameters 105 | * intelligently rerun tasks as parameters change 106 | * help others understand where in the workflow parameters are introduced 107 | 108 | Below sets up the FB DLRM model training task with parameters. Note how you no longer have to manually specify where to save the trained model and data. 
109 | 110 | ```{python, echo=TRUE, eval = FALSE} 111 | # ***BEFORE*** 112 | # dlrm_s_pytorch.py 113 | 114 | if __name__ == "__main__": 115 | # define model parameters 116 | parser.add_argument("--learning-rate", type=float, default=0.01) 117 | parser.add_argument("--nepochs", type=int, default=1) 118 | # manually specify filename 119 | parser.add_argument("--save-model", type=str, default="") 120 | model = train_model() 121 | torch.save(model, args.save_model) 122 | 123 | # ***AFTER*** 124 | # see flow_tasks.py 125 | 126 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 127 | 128 | # define model parameters 129 | learning_rate = luigi.FloatParameter(default = 0.01) 130 | num_epochs = luigi.IntParameter(default = 1) 131 | # filename is determined automatically 132 | 133 | def run(self): 134 | data = self.input()['data'].load() 135 | dlrm = self.input()['model'].load() 136 | 137 | # use learning_rate param 138 | optimizer = torch.optim.SGD(dlrm.parameters(), lr=self.learning_rate) 139 | # use num_epochs param 140 | while k < self.num_epochs: 141 | optimizer.step() 142 | model = optimizer.get_model() 143 | self.save(model) # automatically save model, seperately for each parameter config 144 | 145 | ``` 146 | 147 | ### Compare trained models 148 | 149 | Now you can use that parameter to easily compare output from different models. Make sure you run the workflow with that parameter before you load task output (see Step #4). 150 | 151 | ```{python, eval = FALSE} 152 | model1 = TaskModelTrain().output().load() # use default num_epochs=1 153 | print_accuracy(model1) 154 | model2 = TaskModelTrain(num_epochs=10).output().load() 155 | print_accuracy(model2) 156 | 157 | ``` 158 | 159 | 160 | ### Inherit parameters 161 | 162 | Often you need to have a parameter cascade downstream through the workflow. If you write functional code, you have to keep repeating the parameter in each function. With d6tflow you can inherit parameters so the terminal task can pass the parameter to upstream tasks as needed. 163 | 164 | In the FB DLRM workflow, `TaskModelTrain` inherits parameters from `TaskGetTrainDataset`. This way you can run `TaskModelTrain(mini_batch_size=2)` and it will pass the parameter to upstream tasks ie `TaskGetTrainDataset` and all other tasks that depend on it. In the actual code, note the use of `self.clone(TaskName)` and `@d6tflow.clone_parent`. 165 | 166 | ```{python, echo=TRUE, eval = FALSE} 167 | 168 | class TaskGetTrainDataset(d6tflow.tasks.TaskPickle): 169 | mini_batch_size = luigi.FloatParameter(default = 1) 170 | # [...] 171 | 172 | @d6tflow.inherit(TaskGetTrainDataset) 173 | class TaskModelTrain(d6tflow.tasks.TaskPickle): 174 | # no need to repeat parameters 175 | pass 176 | 177 | ``` 178 | 179 | ## Step 4: Run DAG to process data and train model 180 | 181 | To kick off data processing and model training, you run the DAG. You only need to run the terminal task which automatically runs all dependencies. Before actually running the DAG, you can preview what will be run. This is especially helpful if you have made any changes to code or data because it will only run the tasks that have changed not the full workflow. 
182 | 183 | ```{python, eval = FALSE} 184 | task = TaskModelTrain() # or task = TaskModelTrain(num_epochs=10) 185 | d6tflow.preview(task) 186 | d6tflow.run(task) 187 | 188 | ``` 189 | 190 | 191 | ## Step 5: Evaluate model performance 192 | 193 | Now that the workflow has run and all tasks are complete, you can load predictions and other model output to compare and visualize results. Because each task knows where its output is saved, you can directly load output from the task instead of having to remember the file paths or variable names. It also makes your code a lot more readable. 194 | 195 | ```{python, eval = FALSE} 196 | model1 = TaskModelTrain().output().load() 197 | print_accuracy(model1) 198 | 199 | ``` 200 | 201 | ### Compare models 202 | 203 | You can easily compare output from different models with different parameters. 204 | 205 | ```{python, eval = FALSE} 206 | model1 = TaskModelTrain().output().load() # use default num_epochs=1 207 | print_accuracy(model1) 208 | model2 = TaskModelTrain(num_epochs=10).output().load() 209 | print_accuracy(model2) 210 | 211 | ``` 212 | 213 | ### Keep iterating 214 | 215 | As you iterate, changing parameters, code and data, you will want to rerun tasks. d6tflow intelligently figures out which tasks need to be rerun, which makes iterating very efficient. If you have changed parameters, you don't need to do anything; it will know what to run automatically. If you have changed code or data, you have to mark the task as incomplete using `.invalidate()` and d6tflow will figure out the rest. 216 | 217 | In the FB DLRM workflow, say for example you changed training data or made changes to the training preprocessing. 218 | 219 | ```{python, eval = FALSE} 220 | 221 | TaskGetTrainDataset().invalidate() 222 | 223 | # or 224 | d6tflow.run(task, forced=TaskGetTrainDataset()) 225 | 226 | ``` 227 | 228 | ## Full source code 229 | 230 | All code is provided at https://github.com/d6tdev/dlrm. It is the same as https://github.com/facebookresearch/dlrm with d6tflow files added: 231 | 232 | * flow_run.py: run flow => run this file 233 | * flow_task.py: task code 234 | * flow_viz.py: show model output 235 | * flow_cfg.py: default parameters 236 | * dlrm_d6t_pytorch.py: dlrm_data_pytorch.py adapted for d6tflow 237 | 238 | Try it yourself! 239 | 240 | ## For your next project 241 | 242 | In this guide we showed how to build scalable deep learning workflows. We used an existing code base and showed how to turn linear deep learning code into d6tflow DAGs and the benefits of doing so. 243 | 244 | For new projects, you can start with a scalable project template from https://github.com/d6t/d6tflow-template. The structure is very similar: 245 | 246 | * run.py: run workflow 247 | * task.py: task code 248 | * cfg.py: manage parameters -------------------------------------------------------------------------------- /blogs/datasci-dags-airflow-meetup.md: -------------------------------------------------------------------------------- 1 | # How to use airflow-style DAGs to build highly effective data science workflows 2 | 3 | Airflow and Luigi are great for data engineering production workflows but not optimized for data science r&d workflows. We will be using the d6tflow open source python library to bring airflow-style DAGs to the data science research and development process. 4 | 5 | ## Data science workflows are DAGs 6 | 7 | Data science workflows typically look like this. 
8 | 9 | ![Sample Data Workflow](https://github.com/d6t/d6tflow/blob/master/docs/d6tflow-docs-graph.png?raw=true "Sample Data Workflow") 10 | 11 | This workflow is similar to data engineering workflows. It involves chaining together parameterized tasks which pass multiple inputs and outputs between each other. See [4 Reasons Why Your Machine Learning Code is Probably Bad](https://github.com/d6t/d6t-python/blob/master/blogs/reasons-why-bad-ml-code.rst) for why passing data between functions or hardcoding file/database names without explicitly defining task dependencies is NOT a good way of writing data science code. 12 | 13 | ```python 14 | 15 | # bad data science code 16 | def process_data(data, do_preprocess): 17 | data = do_stuff(data, do_preprocess) 18 | data.to_pickle('data.pkl') 19 | 20 | data = pd.read_csv('data.csv') 21 | process_data(data, True) 22 | df_train = pd.read_pickle('data.pkl') 23 | model = sklearn.svm.SVC() 24 | model.fit(df_train.iloc[:,:-1], df_train['y']) 25 | 26 | ``` 27 | 28 | ## R&D vs production data workflows 29 | 30 | Using airflow or luigi is a big step up from writing functional code for managing data workflows. But both libraries are designed to be used by data engineers in production settings where the focus is on: 31 | * making sure everything is running smoothly on time 32 | * scheduling and coordination 33 | * recovering from failures 34 | * data quality 35 | 36 | In contrast, the focus in the r&d workflow is on: 37 | * generating insights 38 | * prototyping speed 39 | * assessing predictive power with different models and parameters 40 | * visualizing output 41 | 42 | As a result, the r&d workflow: 43 | * is less well defined 44 | * involves trial and error 45 | * requires frequent resetting of tasks and output as models, parameters and data change 46 | * takes output from the data engineer 47 | 48 | ## Problems with airflow/luigi in R&D settings 49 | 50 | Since both libraries are optimized for data engineering production settings, the UX for a data science r&d setting is not great: 51 | 52 | * WET code for reading/writing data 53 | * Having to manually keep track of filenames or database table names where data is saved 54 | * Inconvenient to reset tasks as models, parameters and data change 55 | * Inconvenient to keep track of model results with different parameter settings 56 | 57 | Manually keeping track of filenames in complex data workflows... Not scalable. 
58 | 59 | ```python 60 | 61 | # vendor input 62 | cfg_fpath_cc_base = cfg_fpath_base + 'vendor/' 63 | cfg_fpath_cc_raw = cfg_fpath_cc_base + 'df_cc_raw.pkl' 64 | cfg_fpath_cc_raw_recent2 = cfg_fpath_cc_base + 'df_cc_raw_recent2.pkl' 65 | cfg_fpath_cc_yoy = cfg_fpath_cc_base + 'df_cc_yoy.pkl' 66 | cfg_fpath_cc_yoy_bbg = cfg_fpath_cc_base + 'df_cc_yoy_bbg.pkl' 67 | cfg_fpath_cc_yoy_fds = cfg_fpath_cc_base + 'df_cc_yoy_fds.pkl' 68 | cfg_fpath_cc_var_fds = cfg_fpath_cc_base + 'df_cc_var_fds.pkl' 69 | cfg_fpath_cc_yoy_recent2 = cfg_fpath_cc_base + 'df_cc_yoy_recent2.pkl' 70 | cfg_fpath_cc_actual = cfg_fpath_cc_base + 'df_cc_sales_actual.pkl' 71 | cfg_fpath_cc_monthly = cfg_fpath_cc_base + 'df_cc_monthly.pkl' 72 | cfg_fpath_cc_yoy_cs2 = 'data/processed/df_cc_yoy_cs2.pq' # consistent shopper data for new methodology from 2018 73 | 74 | # market 75 | cfg_fpath_market_attributes_px = cfg_fpath_base + '/market/df_market_px.pkl' 76 | cfg_fpath_market_consensus = cfg_fpath_base + '/market/df_market_consensus.pkl' 77 | cfg_fpath_market_attributes = cfg_fpath_base + '/market/df_market_attributes.pkl' 78 | cfg_fpath_market_attributes_latest = cfg_fpath_base + '/market/df_market_attributes_latest.pkl' 79 | cfg_fpath_market_announce = cfg_fpath_base + '/market/df_market_announce.pkl' 80 | cfg_fpath_market_attributes_latest_fds1 = cfg_fpath_base + '/market/df_market_attributes_latest_fds1.pkl' 81 | cfg_fpath_market_attributes_latest_fds2 = cfg_fpath_base + '/market/df_market_attributes_latest_fds2.pkl' 82 | ``` 83 | 84 | ## How d6tflow is different from airflow/luigi 85 | 86 | d6tflow is optimized for data science research and development workflows. Here are the benefits of using d6tflow in data science. 87 | 88 | Example workflow: 89 | ``` 90 | TaskGetData >> TaskProcess >> TaskTrain 91 | ``` 92 | 93 | ### Benefit: Tasks have input and output data 94 | 95 | Instead of having to manually load and save data, this is outsourced to the library. This scales better and reduces maintenance because the location of input/output data could change without having to rewrite code. It also makes it easier for the data engineer to hand off data to the data scientist. 96 | 97 | ```python 98 | class TaskProcess(d6tflow.tasks.TaskPqPandas): # define output format 99 | 100 | def requires(self): 101 | return TaskGetData() # define dependency 102 | 103 | def run(self): 104 | data = self.input().load() # load input data 105 | data = do_stuff(data) # process data 106 | self.save(data) # save output data 107 | ``` 108 | 109 | ### Benefit: Easily invalidate tasks 110 | 111 | Common invalidation scenarios are implemented. This increases prototyping speed as you change code and data during trial & error. 112 | 113 | ```python 114 | # force execution including downstream tasks 115 | d6tflow.run(TaskTrain(), forced=TaskGetData()) 116 | 117 | # reset single task 118 | TaskGetData().invalidate() 119 | 120 | # reset all downstream tasks 121 | d6tflow.invalidate_downstream(TaskGetData(), TaskTrain()) 122 | 123 | # reset all upstream tasks 124 | d6tflow.invalidate_upstream(TaskTrain()) 125 | 126 | ``` 127 | 128 | ### Benefit: Easily train models using different parameters 129 | 130 | You can intelligently rerun the workflow after changing a parameter. Parameters are passed from the target task to the relevant upstream tasks. Thus, you no longer have to manually keep track of which tasks to update, increasing prototyping speed and reducing errors. 
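For reference, the `do_preprocess` parameter that shows up in the preview below is declared on the tasks themselves. Here is a minimal sketch of what that might look like, adapted from the `TaskProcess` example above; `TaskGetData`, `do_stuff()` and the SVC training step are the same placeholders used elsewhere in this post (the preview calls the middle task `TaskPreprocess`), so treat it as illustrative rather than exact project code.

```python
import d6tflow
import luigi
import sklearn.svm

class TaskPreprocess(d6tflow.tasks.TaskPqPandas):
    do_preprocess = luigi.BoolParameter(default=True)  # flow parameter

    def requires(self):
        return TaskGetData()  # dependency does not depend on the parameter

    def run(self):
        data = self.input().load()
        if self.do_preprocess:
            data = do_stuff(data)  # placeholder preprocessing step
        self.save(data)

class TaskTrain(d6tflow.tasks.TaskPickle):
    do_preprocess = luigi.BoolParameter(default=True)

    def requires(self):
        # pass the parameter down to the upstream dependency
        return TaskPreprocess(do_preprocess=self.do_preprocess)

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:, :-1], df_train['y'])
        self.save(model)  # output is stored separately for each parameter value
```

With the parameter declared on both tasks, the preview below shows that flipping `do_preprocess` only invalidates the tasks that actually depend on it.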
131 | 132 | ```python 133 | d6tflow.preview(TaskTrain(do_preprocess=False)) 134 | 135 | ''' 136 | └─--[TaskTrain-{'do_preprocess': 'False'} (PENDING)] 137 | └─--[TaskPreprocess-{'do_preprocess': 'False'} (PENDING)] 138 | └─--[TaskGetData-{} (COMPLETE)] => this doesn't change and doesn't need to rerun 139 | ''' 140 | ``` 141 | 142 | ### Benefit: Easily compare models 143 | 144 | Different models that were trained with different parameters can be easily loaded and compared. 145 | 146 | ```python 147 | df_train1 = TaskPreprocess().output().load() 148 | model1 = TaskTrain().output().load() 149 | print(sklearn.metrics.accuracy_score(df_train1['y'],model1.predict(df_train1.iloc[:,:-1]))) 150 | 151 | df_train2 = TaskPreprocess(do_preprocess=False).output().load() 152 | model2 = TaskTrain(do_preprocess=False).output().load() 153 | print(sklearn.metrics.accuracy_score(df_train2['y'],model2.predict(df_train2.iloc[:,:-1]))) 154 | 155 | ``` 156 | 157 | ## d6tflow Quickstart 158 | 159 | Here is a full example of how to use d6tflow for an ML workflow: 160 | https://github.com/d6t/d6tflow#example-output 161 | 162 | ## Template for scalable ML projects 163 | 164 | A d6tflow code template for real-life projects is available at 165 | https://github.com/d6t/d6tflow-template 166 | 167 | * Multiple task inputs and outputs 168 | * Parameter inheritance 169 | * Modularized tasks, run and viz 170 | 171 | 172 | ## Accelerate data engineer to data scientist hand-off 173 | 174 | To quickly share workflow output files, you can use [d6tpipe](https://github.com/d6t/d6tpipe). See [Sharing Workflows and Outputs](https://d6tflow.readthedocs.io/en/latest/collaborate.html). 175 | 176 | ```python 177 | import d6tflow.pipes 178 | 179 | d6tflow.pipes.init(api, 'pipe-name') # save flow output 180 | pipe = d6tflow.pipes.get_pipe() 181 | pipe.pull() 182 | 183 | class Task2(d6tflow.tasks.TaskPqPandas): 184 | 185 | def requires(self): 186 | return Task1() # define dependency 187 | 188 | def run(self): 189 | data = self.input().load() # load data from data engineer 190 | 191 | ``` 192 | 193 | Alternatively you can save outputs in a database using [d6tflow premium](https://pipe.databolt.tech/gui/request-premium/). 194 | 195 | ```python 196 | d6tflow2.db.init('postgresql+psycopg2://usr:pwd@localhost/db', 'schema_name') 197 | 198 | class Task1(d6tflow2.tasks.TaskSQLPandas): 199 | 200 | def run(self): 201 | df = pd.DataFrame() 202 | self.save(df) 203 | 204 | 205 | ``` 206 | 207 | Finally, the data scientist can inherit from tasks the data engineer has written to quickly load source data. 
208 | 209 | ```python 210 | import tasks_factors # import tasks written by data engineer 211 | import utils 212 | 213 | class Task1(tasks_factors.Task1): 214 | external = True # rely on data engineer to run 215 | 216 | def run(self): 217 | data = self.input().load() # load data from data engineer 218 | ``` 219 | 220 | ## Bonus: centralize (local) data science files 221 | 222 | https://github.com/d6t/d6tpipe 223 | 224 | ## Bonus: keeping credentials safe 225 | 226 | https://d6tpipe.readthedocs.io/en/latest/advremotes.html#keeping-credentials-safe 227 | 228 | https://alexwlchan.net/2016/11/you-should-use-keyring/ -------------------------------------------------------------------------------- /blogs/datasci-projects-e2e.md: -------------------------------------------------------------------------------- 1 | # Guide to managing data science projects from start to finish 2 | 3 | ## Problem formulation: 4 | 5 | ### Business context / requirements 6 | 7 | * target users: who are the target users of information? 8 | * economic BUYER: who is the economic BUYER? 9 | 10 | * user flow: what is the user flow currently? 11 | * user pain points: pain points? how do they solve it today? 12 | * improved workflow: where will the model be used? how does the user interact with the model? 13 | * actionable insights: what practical value can the model provide to the user? 14 | 15 | * business impact goals: how does the model improve business outcomes? 16 | * critical bottleneck: what’s the critical problem that needs to be solved to have max impact? 17 | * business metric: what business metric are we trying to impact? 18 | * baseline: business metric status quo? 19 | * ideal outcome: what’s the ideal outcome here? what goal should be met? 20 | * success measure: how do we measure success? 21 | 22 | 23 | ### Design goals 24 | 25 | * goals: goals of the project from a user perspective 26 | * tradeoffs: ? 27 | * biz intel vs real-time decision making 28 | * accuracy: ? 29 | * interpretability: ? 30 | * speed: ? 31 | * online learning requirements? 32 | * deployment plan: where/how will it be used? 33 | * retraining: ? 34 | * evaluation: ? 35 | 36 | optional: 37 | * what should we avoid/be careful of? common pitfalls? 38 | * what are some potential bottlenecks/issues in solving this? 39 | * how would you deal with xyz? 40 | 41 | ### Target model output 42 | 43 | * model output: model should predict/produce what? 44 | * how relate to business metric? direct vs approx/indirect measure? 45 | * V ideal output: target model 46 | * V 1.0 output: early model 47 | * supporting output: what does a full prediction look like? incl meta data 48 | 49 | ### Model evaluation 50 | 51 | * model score: which metric reflects success measure? 52 | * loss function: how to optimize the model score? 53 | 54 | ### Roadmap/project plan 55 | 56 | * prototyping: build something the user can interact with 57 | * user feedback: 58 | * what does the output look like? 59 | * if we give you xyz does it work? 
60 | * early/easy wins: optimize impact ROI early on 61 | * roadmap: how to sequence to achieve impact quickly and expand over time 62 | 63 | 64 | ## Data pipeline 65 | 66 | * Best practices 67 | * use [d6tflow data DAGs](https://github.com/d6t/d6tflow) 68 | * modularize code, see [d6tflow project templates for highly effective data science workflows](https://github.com/d6t/d6tflow-template) 69 | * use unit tests 70 | * don’t use jupyter notebooks, except viz or UI prototype 71 | 72 | ### Data sources 73 | 74 | * Preferred data sources: What did we wish we had? 75 | * Actual data sources: What are the sources and how can we acquire? 76 | * Tearsheet available? 77 | * Data dictionary available? 78 | * How much labeled data do you have? 79 | 80 | ### Infrastructure requirements/constraints 81 | 82 | * data prep: how large is input data? 83 | * model training: type and complexity of model? 84 | * dev vs prod: prod bottlenecks? storage/compute/memory 85 | 86 | ### Data preprocessing 87 | 88 | * Data DAG: preparing data for analysis 89 | 90 | ### Exploratory data analysis 91 | 92 | * Look at the data! 93 | * Start with a representative sample 94 | * Summary visualizations 95 | * Distributions 96 | * Relationships 97 | * Stability over time 98 | * Categorical variables 99 | * Quirks: skews/non-normal, Outliers, Missing values, imbalanced data, non-stationarity, autocorrel, multicolinearity, heteroskedasticity 100 | * Biases: lookahead. 101 | * AutoML output 102 | * Hypothesis on what should/should not work: Which features are most likely to predict? Which features should/should not predict? 103 | 104 | See [d6t EDA templates](https://github.com/d6t/d6tflow-template-datasci) 105 | 106 | ### Feature engineering 107 | 108 | * Feature library: 109 | * Interaction features 110 | 111 | ### Feature preprocessing 112 | 113 | * Normalize: N(0,1), mean relative 114 | * Look-ahead bias: real-time decision making issues 115 | * Fix quirks 116 | * Other transforms 117 | * Dimension reduction 118 | * PCA 119 | * GBM feature encoding 120 | * Embeddings 121 | 122 | 123 | ## Model building 124 | 125 | * baseline models: 126 | * candidate models: 127 | * ability to meet design goals? 128 | * address tradeoffs: accuracy / interpretability / speed? 129 | * handle quirks? 130 | * scalability? 131 | * train / validation / test 132 | * feature selection 133 | * model training 134 | * parameters 135 | * weights: input, class 136 | * learnings rates, # trees 137 | * regularization: penalties. early stopping. 138 | * GBM: number of trees. tree depth 139 | * DL: dropout. 140 | * execution infrastructure: total training size? model size? 141 | * hyperparam tuning 142 | * ensembling/stacking 143 | 144 | ## Model Evaluation 145 | 146 | ### Evaluation metrics 147 | 148 | * Comparing models: 149 | * In-sample: baseline vs model 150 | * Out-sample: baseline vs model 151 | * Bias-variance trade-off 152 | * Visual inspection: sample. best/worst predictions. high influence. 
153 | * Overfitting assessment 154 | * Test lookahead bias 155 | * Stability of relationships 156 | * Stability of test errors 157 | 158 | ### Model interpretation 159 | 160 | * Model output 161 | * Feature importance 162 | * SHAP plots 163 | * Surrogate models 164 | 165 | ### User feedback 166 | 167 | * performance drivers vs intuition 168 | * single predictions for decision making 169 | 170 | ## Deployment 171 | 172 | * Data pipline best practices 173 | * Automated tests 174 | * Model speedup 175 | * Surrogate models 176 | * Fewer features: remove marginal features 177 | * Fewer trees: stop after achieved majority of gains 178 | * options: db vs API 179 | * integration with user-facing systems 180 | 181 | ### A / B testing 182 | 183 | ## Driving user adoption 184 | 185 | * teachins 186 | * getting started guides 187 | * push model output 188 | * keep top of mind 189 | * integrating into workflow 190 | 191 | -------------------------------------------------------------------------------- /blogs/design-ml-e2e.md: -------------------------------------------------------------------------------- 1 | 2 | # Guide to designing end-to-end machine learning systems 3 | 4 | ## Problem formulation: 5 | 6 | ### Business context / requirements 7 | 8 | * target users: who are the target users of information? 9 | * economic buyer: who pays for it? 10 | 11 | * user flow: what is the user flow currently? 12 | * user pain points: pain points? how do they solve it today? 13 | * improved workflow: where will the model be used? how does the user interact with the model? 14 | * actionable insights: what practical value can the model provide to the user? 15 | 16 | * business impact goals: how does the model improve business outcomes? 17 | * critical bottleneck: what’s the critical problem that needs to be solved to have max impact? 18 | * business metric: how do we quantify success? what business metric are we trying to influence? 19 | * baseline: business metric status quo? 20 | * ideal outcome: what’s the ideal outcome here? what goal should be met? 21 | * success measure: how do we measure success? 22 | 23 | 24 | ### Design goals 25 | 26 | * goals: goals of the project from a user perspective 27 | * tradeoffs: ? 28 | * biz intel vs real-time decision making 29 | * accuracy: ? 30 | * interpretability: ? 31 | * speed: ? 32 | * online learning requirements? 33 | * deployment plan: where/how will it be used? 34 | * retraining: ? 35 | * evaluation: ? 36 | 37 | optional: 38 | * what should we avoid/be careful of? common pitfalls? 39 | * what are some potential bottlenecks/issues in solving this? 40 | * how would you deal with xyz? 41 | 42 | ### Target model output 43 | 44 | * modeling task: prediction, classification, recommendation, clustering? 45 | * model output: model should predict/produce what? 46 | * how relate to business metric? direct vs approx/indirect measure? 47 | * V ideal output: target model 48 | * V 1.0 output: early model 49 | * supporting output: what does a full prediction look like? incl meta data 50 | 51 | ### Model evaluation 52 | 53 | * model score: which metric reflects success measure? 54 | * loss function: how to optimize the model score? 55 | 56 | ### Roadmap/project plan 57 | 58 | * prototyping: build something the user can interact with 59 | * user feedback: 60 | * what does the output look like? 61 | * if we give you xyz does it work? 
62 | * early/easy wins: optimize impact ROI early on 63 | * roadmap: how to sequence to achieve impact quickly and expand over time 64 | 65 | 66 | ## Data pipeline 67 | 68 | * Best practices 69 | * use [d6tflow data DAGs](https://github.com/d6t/d6tflow) 70 | * modularize code, see [d6tflow project templates for highly effective data science workflows](https://github.com/d6t/d6tflow-template) 71 | * use unit tests 72 | * don’t use jupyter notebooks, except viz or UI prototype 73 | 74 | ### Data sources 75 | 76 | * Preferred data sources: What did we wish we had? 77 | * Actual data sources: What are the sources and how can we acquire? 78 | * Tearsheet available? 79 | * Data dictionary available? 80 | * How much labeled data do you have? 81 | 82 | ### Infrastructure requirements/constraints 83 | 84 | * data prep: how large is input data? 85 | * model training: type and complexity of model? 86 | * dev vs prod: prod bottlenecks? storage/compute/memory 87 | 88 | ### Data preprocessing 89 | 90 | * Data DAG: preparing data for analysis 91 | * clean, clean, clean... 92 | * combine 93 | * reshape 94 | * fill NANs 95 | 96 | ## Model building 97 | 98 | ### Exploratory data analysis 99 | 100 | * Look at the data! 101 | * Start with a representative sample 102 | * Summary visualizations 103 | * Distributions 104 | * Relationships 105 | * Stability over time 106 | * Categorical variables 107 | * Quirks: skews/non-normal, Outliers, Missing values, imbalanced data, non-stationarity, autocorrel, multicolinearity, heteroskedasticity 108 | * Biases: lookahead. 109 | * AutoML output 110 | * Hypothesis on what should/should not work: Which features are most likely to predict? Which features should/should not predict? 111 | 112 | See [d6t EDA templates](https://github.com/d6t/d6tflow-template-datasci) 113 | 114 | ### Feature engineering 115 | 116 | * Feature library: 117 | * Interaction features 118 | 119 | 120 | ### Feature preprocessing 121 | 122 | * Missing values 123 | * Categorical variables 124 | * One-hot 125 | * Embeddings (# dimensions): when a categorical column has many possible values 126 | * Normalize: N(0,1), mean relative 127 | * Look-ahead bias: real-time decision making issues 128 | * N(0,1) across full dataset vs training/test separate 129 | * Fix quirks 130 | * Imbalance: set class weights, sub/over-sample 131 | * Other transforms 132 | * Dimension reduction 133 | * PCA 134 | * GBM feature encoding 135 | * Sparse vector transform 136 | 137 | ### Model selection and training 138 | 139 | * baseline models: 140 | * candidate models: linear, trees, SVM, (F)FM, neural net... 141 | * ability to meet design goals? 142 | * address tradeoffs: accuracy / interpretability / speed? 143 | * handle quirks? 144 | * scalability? 145 | * train / validation / test 146 | * feature selection 147 | * model training 148 | * parameters 149 | * weights: input, class 150 | * learnings rates, # trees 151 | * regularization: penalties. early stopping. 152 | * GBM: number of trees. tree depth 153 | * DL: dropout. 154 | * execution infrastructure: total training size? model size? 155 | * hyperparam tuning 156 | * ensembling/stacking 157 | 158 | ## Model Evaluation 159 | 160 | ### Evaluation metrics 161 | 162 | * Comparing models: 163 | * In-sample: baseline vs model 164 | * Out-sample: baseline vs model 165 | * Bias-variance trade-off 166 | * statistically significant uplift? 167 | * Visual inspection: sample. best/worst predictions. high influence. 
168 | * Overfitting assessment 169 | * Test lookahead bias 170 | * Stability of relationships 171 | * Stability of test errors 172 | 173 | ### Model interpretation 174 | 175 | * Model output 176 | * Feature importance 177 | * SHAP plots 178 | * Surrogate models 179 | 180 | ### User feedback 181 | 182 | * performance drivers vs intuition 183 | * single predictions for decision making 184 | 185 | ## Deployment 186 | 187 | * Data pipline best practices 188 | * Automated tests 189 | * Model speedup 190 | * Surrogate models 191 | * Fewer features: remove marginal features 192 | * Fewer trees: stop after achieved majority of gains 193 | * options: db vs API 194 | * integration with user-facing systems 195 | 196 | ### A / B testing 197 | 198 | * performance inline with test? 199 | * statistically significant uplift? 200 | 201 | ### Retraining 202 | 203 | * how often does model need to be recalibrated 204 | * store prior model point-in-time predictions 205 | 206 | ## Driving user adoption 207 | 208 | * teachins 209 | * getting started guides 210 | * push model output 211 | * keep top of mind 212 | * integrating into workflow 213 | 214 | -------------------------------------------------------------------------------- /blogs/effective-datasci-workflows.rst: -------------------------------------------------------------------------------- 1 | How to Build Highly Effective Data Science Workflows 2 | ============================================================ 3 | 4 | Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems: 5 | 6 | * it doesn't scale well as you add complexity 7 | * you have to manually track which functions were run with which parameters 8 | * you have to manually track where data is saved 9 | * it's difficult for others to read 10 | 11 | https://github.com/d6t/d6tflow is a free open-source library which makes it easy for you to build highly effective data science workflows. 12 | 13 | Does your data science code look like this? 14 | ------------------------------------------------------------ 15 | 16 | Don't do it! Read on how to make it better. 17 | 18 | .. code-block:: python 19 | 20 | import pandas as pd 21 | import sklearn.svm, sklearn.metrics 22 | 23 | def get_data(): 24 | data = download_data() 25 | data = clean_data(data) 26 | data.to_pickle('data.pkl') 27 | 28 | def preprocess(data): 29 | data = apply_function(data) 30 | return data 31 | 32 | # flow parameters 33 | reload_source = True 34 | do_preprocess = True 35 | 36 | # run workflow 37 | if reload_source: 38 | get_data() 39 | 40 | df_train = pd.read_pickle('data.pkl') 41 | if do_preprocess: 42 | df_train = preprocess(df_train) 43 | model = sklearn.svm.SVC() 44 | model.fit(df_train.iloc[:,:-1], df_train['y']) 45 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 46 | 47 | 48 | What to do about it? 49 | ------------------------------------------------------------ 50 | 51 | Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. 52 | 53 | So instead of writing a function that does: 54 | 55 | .. code-block:: python 56 | 57 | def process_data(data, parameter): 58 | 59 | if parameter: 60 | data = do_stuff(data) 61 | else: 62 | data = do_other_stuff(data) 63 | 64 | data.to_pickle('data.pkl') 65 | return data 66 | 67 | You are better of writing tasks that you can chain together as a DAG: 68 | 69 | .. 
code-block:: python 70 | 71 | class TaskProcess(d6tflow.tasks.TaskPqPandas): # define output format 72 | 73 | def requires(self): 74 | return TaskGetData() # define dependency 75 | 76 | def run(self): 77 | data = self.input().load() # load input data 78 | data = do_stuff(data) # process data 79 | self.save(data) # save output data 80 | 81 | The benefits of doings this are: 82 | 83 | * All tasks follow the same pattern no matter how complex your workflow gets 84 | * You have a scalable input ``requires()`` and processing function ``run()`` 85 | * You can quickly load and save data without having to hardcode filenames 86 | * If the input task is not complete it will automatically run 87 | * If input data or parameters change, the function will automatically rerun 88 | 89 | An example machine learning workflow 90 | ------------------------------------------------------------ 91 | 92 | Below is a stylized example of a machine learning flow which is expressed as a DAG. In the end you just need to run `TaskTrain()` and it will automatically know which dependencies to run. For a full example see https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md 93 | 94 | .. code-block:: python 95 | 96 | import pandas as pd 97 | import sklearn, sklearn.svm 98 | import d6tflow 99 | import luigi 100 | 101 | # define workflow 102 | class TaskGetData(d6tflow.tasks.TaskPqPandas): # save dataframe as parquet 103 | 104 | def run(self): 105 | data = download_data() 106 | data = clean_data(data) 107 | self.save(data) # quickly save dataframe 108 | 109 | class TaskPreprocess(d6tflow.tasks.TaskCachePandas): # save data in memory 110 | do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no 111 | 112 | def requires(self): 113 | return TaskGetData() # define dependency 114 | 115 | def run(self): 116 | df_train = self.input().load() # quickly load required data 117 | if self.do_preprocess: 118 | df_train = preprocess(df_train) 119 | self.save(df_train) 120 | 121 | class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle 122 | do_preprocess = luigi.BoolParameter(default=True) 123 | 124 | def requires(self): 125 | return TaskPreprocess(do_preprocess=self.do_preprocess) 126 | 127 | def run(self): 128 | df_train = self.input().load() 129 | model = sklearn.svm.SVC() 130 | model.fit(df_train.iloc[:,:-1], df_train['y']) 131 | self.save(model) 132 | 133 | # Check task dependencies and their execution status 134 | d6tflow.preview(TaskTrain()) 135 | 136 | ''' 137 | └─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)] 138 | └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)] 139 | └─--[TaskGetData-{} (PENDING)] 140 | ''' 141 | 142 | # Execute the model training task including dependencies 143 | d6tflow.run(TaskTrain()) 144 | 145 | ''' 146 | ===== Luigi Execution Summary ===== 147 | 148 | Scheduled 3 tasks of which: 149 | * 3 ran successfully: 150 | - 1 TaskGetData() 151 | - 1 TaskPreprocess(do_preprocess=True) 152 | - 1 TaskTrain(do_preprocess=True) 153 | ''' 154 | 155 | # Load task output to pandas dataframe and model object for model evaluation 156 | model = TaskTrain().output().load() 157 | df_train = TaskPreprocess().output().load() 158 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 159 | # 0.9733333333333334 160 | 161 | Conclusion 162 | ------------------------------------------------------------ 163 | 164 | Writing machine learning code as a linear series of functions likely creates many workflow problems. 
Because of the complex dependencies between different ML tasks, it is better to write them as a DAG. https://github.com/d6t/d6tflow makes this very easy. Alternatively you can use `luigi 165 | <https://github.com/spotify/luigi>`_ and `airflow 166 | <https://airflow.apache.org>`_ but they are more optimized for ETL than data science. 167 | -------------------------------------------------------------------------------- /blogs/images/d6tflow-filenames.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/d6tflow-filenames.png -------------------------------------------------------------------------------- /blogs/images/top10-stats-example2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/top10-stats-example2.png -------------------------------------------------------------------------------- /blogs/images/top10-stats-example3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/blogs/images/top10-stats-example3.png -------------------------------------------------------------------------------- /blogs/reasons-why-bad-ml-code.rst: -------------------------------------------------------------------------------- 1 | 4 Reasons Why Your Machine Learning Code is Probably Bad 2 | ============================================================ 3 | 4 | Your current workflow probably chains several functions together like in the example below. While quick, it likely has many problems: 5 | 6 | * it doesn't scale well as you add complexity 7 | * you have to manually keep track of which functions were run with which parameters as you iterate through your workflow 8 | * you have to manually keep track of where data is saved 9 | * it's difficult for others to read 10 | 11 | .. code-block:: python 12 | 13 | import pandas as pd 14 | import sklearn.svm, sklearn.metrics 15 | 16 | def get_data(): 17 | data = download_data() 18 | data.to_pickle('data.pkl') 19 | 20 | def preprocess(data): 21 | data = clean_data(data) 22 | return data 23 | 24 | # flow parameters 25 | do_preprocess = True 26 | 27 | # run workflow 28 | get_data() 29 | 30 | df_train = pd.read_pickle('data.pkl') 31 | if do_preprocess: 32 | df_train = preprocess(df_train) 33 | model = sklearn.svm.SVC() 34 | model.fit(df_train.iloc[:,:-1], df_train['y']) 35 | print(sklearn.metrics.accuracy_score(df_train['y'],model.predict(df_train.iloc[:,:-1]))) 36 | 37 | What to do about it? 38 | ------------------------------------------------------------ 39 | 40 | Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. That is, your data science workflow should be a DAG. 41 | 42 | `d6tflow <https://github.com/d6t/d6tflow>`_ is a free open-source library which makes it easy for you to build highly effective data science workflows. 43 | 44 | Instead of writing a function that does: 45 | 46 | .. code-block:: python 47 | 48 | def process_data(df, parameter): 49 | df = do_stuff(df) 50 | df.to_pickle('data.pkl') 51 | return df 52 | 53 | dataraw = download_data() 54 | data = process_data(dataraw, True) 55 | 56 | You can write tasks that you can chain together as a DAG: 57 | 58 | .. 
code-block:: python 59 | 60 | class TaskGetData(d6tflow.tasks.TaskPqPandas): 61 | 62 | def run(self): 63 | data = download_data() 64 | self.save(data) # save output data 65 | 66 | @d6tflow.requires(TaskGetData) # define dependency 67 | class TaskProcess(d6tflow.tasks.TaskPqPandas): 68 | 69 | def run(self): 70 | data = self.input().load() # load input data 71 | data = do_stuff(data) # process data 72 | self.save(data) # save output data 73 | 74 | d6tflow.run(TaskProcess()) # execute task including dependencies 75 | data = TaskProcess().output().load() # load output data 76 | 77 | The benefits of doing this are: 78 | 79 | * All tasks follow the same pattern no matter how complex your workflow gets 80 | * You have a scalable input ``requires()`` and processing function ``run()`` 81 | * You can quickly load and save data without having to hardcode filenames 82 | * If the input task is not complete it will automatically run 83 | * If input data or parameters change, the function will automatically rerun 84 | * It’s much easier for others to read and understand the workflow 85 | 86 | An example machine learning DAG 87 | ------------------------------------------------------------ 88 | 89 | Below is a stylized example of a machine learning flow which is expressed as a DAG. In the end you just need to run `TaskTrain()` and it will automatically know which dependencies to run. For a full example see https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md 90 | 91 | .. code-block:: python 92 | 93 | import pandas as pd 94 | import sklearn, sklearn.svm, sklearn.linear_model 95 | import d6tflow 96 | import luigi 97 | 98 | # define workflow 99 | class TaskGetData(d6tflow.tasks.TaskPqPandas): # save dataframe as parquet 100 | 101 | def run(self): 102 | data = download_data() 103 | data = clean_data(data) 104 | self.save(data) # quickly save dataframe 105 | 106 | @d6tflow.requires(TaskGetData) # define dependency 107 | class TaskPreprocess(d6tflow.tasks.TaskCachePandas): # save data in memory 108 | do_preprocess = luigi.BoolParameter(default=True) # parameter for preprocessing yes/no 109 | 110 | def run(self): 111 | df_train = self.input().load() # quickly load required data 112 | if self.do_preprocess: 113 | df_train = preprocess(df_train) 114 | self.save(df_train) 115 | 116 | @d6tflow.requires(TaskPreprocess) # define dependency 117 | class TaskTrain(d6tflow.tasks.TaskPickle): # save output as pickle 118 | model = luigi.Parameter(default='ols') # model selection parameter ('ols' or 'svm') 119 | def run(self): 120 | df_train = self.input().load() 121 | if self.model=='ols': 122 | model = sklearn.linear_model.LogisticRegression() 123 | elif self.model=='svm': 124 | model = sklearn.svm.SVC() 125 | else: 126 | raise ValueError('invalid model selection') 127 | model.fit(df_train.drop('y', axis=1), df_train['y']) 128 | self.save(model) 129 | 130 | # Check task dependencies and their execution status 131 | d6tflow.preview(TaskTrain()) 132 | 133 | ''' 134 | └─--[TaskTrain-{'do_preprocess': 'True'} (PENDING)] 135 | └─--[TaskPreprocess-{'do_preprocess': 'True'} (PENDING)] 136 | └─--[TaskGetData-{} (PENDING)] 137 | ''' 138 | 139 | # Execute the model training task including dependencies 140 | d6tflow.run(TaskTrain()) 141 | 142 | ''' 143 | ===== Luigi Execution Summary ===== 144 | 145 | Scheduled 3 tasks of which: 146 | * 3 ran successfully: 147 | - 1 TaskGetData() 148 | - 1 TaskPreprocess(do_preprocess=True) 149 | - 1 TaskTrain(do_preprocess=True) 150 | ''' 151 | 152 | # Load task output to pandas dataframe and model object for model evaluation 153 | model = TaskTrain().output().load() 154 | df_train = 
TaskPreprocess().output().load() 155 | print(model.score(df_train.drop('y', axis=1), df_train['y'])) 156 | # 0.9733333333333334 157 | 158 | Conclusion 159 | ------------------------------------------------------------ 160 | 161 | Writing machine learning code as a linear series of functions likely creates many workflow problems. Because of the complex dependencies between different ML tasks, it is better to write them as a DAG. https://github.com/d6t/d6tflow makes this very easy. Alternatively you can use `luigi 162 | <https://github.com/spotify/luigi>`_ and `airflow 163 | <https://airflow.apache.org>`_ but they are more optimized for ETL than data science. 164 | -------------------------------------------------------------------------------- /blogs/top10-mistakes-business.md: -------------------------------------------------------------------------------- 1 | Better at coding than the statisticians, better at stats than the coders. Not quite enough: you need commercial sense as well. 2 | 3 | ## 1. Not understanding the business objective 4 | 5 | ## Focusing on the data, not the user 6 | 7 | ## Overengineering instead of prototyping 8 | 9 | Counterintuitively, often the best way to get started analyzing data is by working on a representative sample of the data. That allows you to familiarize yourself with the data and build the data pipeline without waiting for data processing and model training. But data scientists seem not to like that - more data is better. 10 | 11 | 12 | ## 9. Cannot explain results 13 | 14 | You've crunched the data and kept optimizing results. The error is low, everything is great. You take it back to the person who asked you to do the analysis and s/he starts asking questions: what does this variable mean? Why is the coefficient like this? What about when xyz happens? You hadn't thought about those questions because you were busy building models instead of applying the output. You don't look so smart anymore... 15 | 16 | **Solution**: know the data, models and results inside out! And think like a user of the data, not just like the data monkey. 17 | 18 | ## 10. Not intuitively understanding the pros/cons of different models 19 | 20 | Again, ML libraries make it easy to just throw different models at a problem and see which model best minimizes errors. 21 | 22 | Example: We once built a model to understand human decisions. You could see in the graphs that decisions were very clustered, and indeed a tree model performed much better than a linear regression. It made intuitive sense because human decision making is more like a decision tree than a regression. 23 | 24 | **Solution**: understand how a model works. Why does model 2 reduce the error vs model 1? Not just mathematically but using economic intuition. 25 | -------------------------------------------------------------------------------- /blogs/top10-mistakes-coding.md: -------------------------------------------------------------------------------- 1 | # Top 10 Coding Mistakes Made by Data Scientists 2 | 3 | A data scientist is a "person who is better at statistics than any software engineer and better at software engineering than any statistician". Many data scientists have a statistics background and little experience with software engineering. I'm a senior data scientist ranked top 1% on Stackoverflow for python coding and work with a lot of (junior) data scientists. Here is my list of 10 common mistakes I frequently see. 4 | 5 | ## 1. Don't share data referenced in code 6 | 7 | Data science needs code AND data. So for someone else to be able to reproduce your results, they need to have access to the data. 
## 3. Mix data with code

Since data science code needs data, why not dump it in the same directory? And while you are at it, save images, reports and other junk there too. Yikes, what a mess!

```
├── data.csv
├── ingest.py
├── other-data.csv
├── output.png
├── report.html
└── run.py
```

**Solution**: Organize your directory into categories, like data, reports, code etc. See [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/#directory-structure) or [d6tflow project templates](https://github.com/d6t/d6tflow-template) and use tools mentioned in #1 to store and share data.

## 4. Git commit data with source code

Most people now version control their code (if you don't, that's another mistake!! See [git](https://git-scm.com/)). In an attempt to share data, it might be tempting to add data files to version control. That's ok for very small files but git is not optimized for data, especially large files.

```bash
git add data.csv
```

**Solution**: Use tools mentioned in #1 to store and share data. If you really want to version control data, see [d6tpipe](https://github.com/d6t/d6tpipe), [DVC](https://dvc.org/) and [Git Large File Storage](https://git-lfs.github.com/).

## 5. Write functions instead of DAGs

Enough about data, let's talk about the actual code! Since one of the first things you learn when you learn to code is functions, data science code is mostly organized as a series of functions that are run linearly. That causes several problems, see [4 Reasons Why Your Machine Learning Code is Probably Bad](https://github.com/d6t/d6t-python/blob/master/blogs/reasons-why-bad-ml-code.rst).

```python
import pandas as pd
import sklearn.svm

def process_data(data, parameter=None):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')
model = sklearn.svm.SVC()
model.fit(df_train.iloc[:,:-1], df_train['y'])
```

**Solution**: Instead of linearly chaining functions, data science code is better written as a set of tasks with dependencies between them. Use [d6tflow](https://github.com/d6t/d6tflow) or [airflow](https://airflow.apache.org/).
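
As a rough sketch of the DAG alternative (reusing the d6tflow task pattern shown elsewhere in this repo; `do_stuff`, the filenames and the task names are placeholders), the same logic becomes two tasks with an explicit dependency:

```python
import d6tflow
import pandas as pd
import sklearn.svm

class TaskProcessData(d6tflow.tasks.TaskPqPandas):  # output stored as parquet

    def run(self):
        data = pd.read_csv('data.csv')
        data = do_stuff(data)  # placeholder for your processing logic
        self.save(data)

@d6tflow.requires(TaskProcessData)  # dependency is explicit, not implied by call order
class TaskTrainModel(d6tflow.tasks.TaskPickle):  # trained model stored as pickle

    def run(self):
        df_train = self.input().load()
        model = sklearn.svm.SVC()
        model.fit(df_train.iloc[:, :-1], df_train['y'])
        self.save(model)

d6tflow.run(TaskTrainModel())  # runs TaskProcessData first if its output is missing
```
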
## 6. Write for loops

Like functions, for loops are the first thing you learn when you learn to code. They are easy to understand, but they are slow and excessively wordy, and they typically indicate you are unaware of vectorized alternatives.

```python
import math

x = range(10)
avg = sum(x)/len(x); std = math.sqrt(sum((i-avg)**2 for i in x)/len(x));
zscore = [(i-avg)/std for i in x]
# should be: scipy.stats.zscore(x)

# or
groupavg = []
for i in df['g'].unique():
    dfg = df[df['g']==i]
    groupavg.append(dfg.mean())
# should be: df.groupby('g').mean()
```

**Solution**: [Numpy](http://www.numpy.org/), [scipy](https://www.scipy.org/) and [pandas](https://pandas.pydata.org/) have vectorized functions for most things that you think might require for loops.

## 7. Don't write unit tests

As data, parameters or user input change, your code might break, sometimes without you noticing. That can lead to bad output, and if someone makes decisions based on your output, bad data will lead to bad decisions!

**Solution**: Use `assert` statements to check for data quality. [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html#testing-functions) has equality tests, [d6tstack](https://github.com/d6t/d6tstack) has checks for data ingestion and [d6tjoin](https://github.com/d6t/d6tjoin/blob/master/examples-prejoin.ipynb) for data joins. Example data checks:

```python
assert df['id'].unique().shape[0] == len(ids) # have data for all ids?
assert (df.isna().mean() < 0.9).all() # catch columns that are mostly missing
assert df.groupby(['g','date']).size().max() == 1 # no duplicate values/date?
assert d6tjoin.utils.PreJoin([df1,df2],['id','date']).is_all_matched() # all ids matched?
```

## 8. Don't document code

I get it, you're in a hurry to produce some analysis. You hack things together to get results to your client or boss. Then a week later they come back and say "can you change xyz" or "can you update this please". You look at your code and can't remember why you did what you did. And now imagine someone else has to run it.

```python
def some_complicated_function(data):
    data = data[data['column']!='wrong']
    data = data.groupby('date').apply(lambda x: complicated_stuff(x))
    data = data[data['value']<0.9]
    return data
```

**Solution**: Take the extra time, even if it's after you've delivered the analysis, to document what you did. You will thank yourself, and others will thank you even more! You'll look like a pro!

## 9. Save data as csv or pickle

Back to data, it's DATA science after all. Just like functions and for loops, CSVs and pickle files are commonly used but they are actually not very good. CSVs don't include a schema so everyone has to parse numbers and dates again. Pickles solve that but only work in python and are not compressed. Neither is a good format for storing large datasets.

```python
def process_data(data, parameter=None):
    data = do_stuff(data)
    data.to_pickle('data.pkl')

data = pd.read_csv('data.csv')
process_data(data)
df_train = pd.read_pickle('data.pkl')
```

**Solution**: Use [parquet](https://github.com/dask/fastparquet) or other binary data formats with data schemas, ideally ones that compress data. [d6tflow](https://github.com/d6t/d6tflow) automatically saves data output of tasks as parquet so you don't have to deal with it.
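
As a quick illustration (the filename is a placeholder and a parquet engine such as pyarrow or fastparquet needs to be installed), pandas can write and read parquet directly, keeping dtypes and compressing the data so the next reader doesn't have to re-parse numbers and dates:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2019-01-01', periods=3),
                   'value': [1.0, 2.5, 3.2]})
df.to_parquet('data.pq', compression='gzip')  # schema and dtypes are stored with the data
df2 = pd.read_parquet('data.pq')              # dates come back as datetimes, not strings
```
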
## 10. Use jupyter notebooks

Let's conclude with a controversial one: jupyter notebooks are as common as CSVs. A lot of people use them. That doesn't make them good. Jupyter notebooks promote a lot of the bad software engineering habits mentioned above, notably:

1. You are tempted to dump all files into one directory
2. You write code that runs top-bottom instead of DAGs
3. You don't modularize your code
4. They are difficult to debug
5. Code and output get mixed in one file
6. They don't version control well

Notebooks feel easy to get started with but scale poorly.

**Solution**: Use [pycharm](https://www.jetbrains.com/pycharm/) and/or [spyder](https://www.spyder-ide.org/).
--------------------------------------------------------------------------------
/blogs/top10-mistakes-statistics.md:
--------------------------------------------------------------------------------
# Top 10 Statistics Mistakes Made by Data Scientists

A data scientist is a "person who is better at statistics than any software engineer and better at software engineering than any statistician". In [Top 10 Coding Mistakes Made by Data Scientists](https://github.com/d6t/d6t-python/blob/master/blogs/top10-mistakes-coding.md) we discussed how statisticians can become better coders. Here we discuss how coders can become better statisticians.

Detailed output and code for each of the examples is available on [github](http://tiny.cc/top10-mistakes-stats-code) and in an [interactive notebook](http://tiny.cc/top10-mistakes-stats-bind). The code uses the workflow management library [d6tflow](https://github.com/d6t/d6tflow) and data is shared with the dataset management library [d6tpipe](https://github.com/d6t/d6tpipe).

## 1. Not fully understand objective function

Data scientists want to build the "best" model. But beauty is in the eye of the beholder. If you don't know what the goal and objective function is and how it behaves, it is unlikely you will be able to build the "best" model. And for what it's worth, the objective may not even be a mathematical function but perhaps improving a business metric.

**Solution**: most kaggle winners spend a lot of time understanding the objective function and how the data and model relate to the objective function. If you are optimizing a business metric, map it to an appropriate mathematical objective function.

**Example**: F1 score is typically used to assess classification models. We once built a classification model whose success depended on the % of occurrences it got right. The F1 score was misleading because it suggested the model was correct ~60% of the time whereas in fact it was correct only 40% of the time.

```
f1 0.571 accuracy 0.4
```
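
The full code is linked above; as a minimal sketch of how such a gap can arise (assuming, purely for illustration, a degenerate classifier that always predicts the positive class on data with 40% positives), the F1 score looks respectable while accuracy tells the real story:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1]*4 + [0]*6)   # 40% of observations are the positive class
y_pred = np.ones_like(y_true)      # degenerate model: always predict positive

print(f1_score(y_true, y_pred))        # ~0.571
print(accuracy_score(y_true, y_pred))  # 0.4
```
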
## 2. Not have a hypothesis why something should work

Commonly data scientists want to build "models". They heard xgboost and random forests work best so let's use those. They read about deep learning, maybe that will improve results further. They throw models at the problem without having looked at the data and without having formed a hypothesis about which model is most likely to capture the features of the data. It makes explaining your work really hard too because you are just randomly throwing models at data.

**Solution**: look at the data! Understand its characteristics and form a hypothesis about which model is likely to best capture those characteristics.

**Example**: Without running any models, just by plotting this sample data you can already form a strong view that x1 is linearly related to y and x2 doesn't have much of a relationship with y.

![Example2](images/top10-stats-example2.png?raw=true "Example2")

## 3. Not looking at the data before interpreting results

Another problem with not looking at the data is that your results can be heavily driven by outliers or other artifacts. This is especially true for models that minimize sums of squared errors. Even without outliers, you can have problems with imbalanced datasets, clipped or missing values and all sorts of other weird artifacts of real-life data that you didn't see in the classroom.

**Solution**: it's so important it's worth repeating: look at the data! Understand how the nature of the data is impacting model results.

**Example**: with outliers, the x1 slope changed from 0.906 to -0.375!

![Example3](images/top10-stats-example3.png?raw=true "Example3")

## 4. Not having a naive baseline model

Modern ML libraries almost make it too easy... Just change a line of code and you can run a new model. And another. And another. Error metrics are decreasing, tweak parameters - great - error metrics are decreasing further... With all the model fanciness, you can forget the dumb way of forecasting data. And without that naive benchmark, you don't have a good absolute comparison for how good your models are; they may all be bad in absolute terms.

**Solution**: what's the dumbest way you can predict a value? Build a "model" using the last known value, the (rolling) mean or some constant, e.g. 0. Compare your model performance against a zero-intelligence forecast monkey!

**Example**: With this time series dataset, model1 looks better than model2, with MSE of 0.21 vs 0.45. But wait! By just taking the last known value, the MSE drops to 0.003!

```
ols CV mse 0.215
rf CV mse 0.428
last out-sample mse 0.003
```
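
A minimal sketch of such a zero-intelligence benchmark (the series here is synthetic; in practice you would use your own test set): predict each point with the previous observed value and see what MSE your model has to beat.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a real time series
y = pd.Series(np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.05, 200))

y_pred_naive = y.shift(1)  # "model": carry forward the last known value
mse_naive = ((y - y_pred_naive) ** 2).dropna().mean()
print(f'naive last-value MSE: {mse_naive:.4f}')
# any real model should beat this number before you get excited about its MSE
```
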
## 5. Incorrect out-sample testing

This is the one that could derail your career! The model you built looked great in R&D but performs horribly in production. The model you said would do wonders is causing really bad business outcomes, potentially costing the company $m+. It's so important that all the remaining mistakes, bar the last one, focus on it.

**Solution**: Make sure you've run your model in realistic out-sample conditions and understand when it will perform well and when it won't.

**Example**: In-sample the random forest does a lot better than linear regression with mse 0.04 vs ols mse 0.183, but out-sample it does a lot worse with mse 0.261 vs linear regression mse 0.187. The random forest overtrained and would not perform well live in production!

```
in-sample
rf mse 0.04 ols mse 0.183
out-sample
rf mse 0.261 ols mse 0.187
```

## 6. Incorrect out-sample testing: applying preprocessing to full dataset

You probably know that powerful ML models can overtrain. Overtraining means they perform well in-sample but badly out-sample. So you need to be aware of training data leaking into test data. If you are not careful, any time you do feature engineering or cross-validation, train data can creep into test data and inflate model performance.

**Solution**: make sure you have a true test set free of any leakage from the training set. Especially beware of any time-dependent relationships that could occur in production use.

**Example**: This happens a lot. Preprocessing is applied to the full dataset BEFORE it is split into train and test, meaning you do not have a true test set. Preprocessing needs to be fitted on the training set only, after the data has been split into train and test sets, for the test set to be a true test set. The MSE between the two methods (mixed out-sample CV mse 0.187 vs true out-sample CV mse 0.181) is not all that different in this case because the distributional properties of train and test are not that different, but that might not always be the case.

```
mixed out-sample CV mse 0.187 true out-sample CV mse 0.181
```

## 7. Incorrect out-sample testing: cross-sectional data & panel data

You were taught cross-validation is all you need. sklearn even provides you some nice convenience functions so you think you have checked all the boxes. But most cross-validation methods do random sampling so you might end up with training data in your test set, which inflates performance.

**Solution**: generate test data such that it accurately reflects data on which you would make predictions in live production use. Especially with time series and panel data you likely will have to generate custom cross-validation data or do roll-forward testing.

**Example**: here you have panel data for two different entities (e.g. companies) which are cross-sectionally highly correlated. If you randomly split the data, you make accurate predictions using data you did not actually have available at prediction time, overstating model performance. You think you avoided mistake #5 by using cross-validation and found the random forest performs a lot better than linear regression in cross-validation. But running a roll-forward out-sample test which prevents future data from leaking into the test set, it performs a lot worse again! (random forest MSE goes from 0.051 to 0.229, higher than linear regression!)

```
normal CV
ols 0.203 rf 0.051
true out-sample error
ols 0.166 rf 0.229
```
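
As a minimal sketch of roll-forward testing (using sklearn's `TimeSeriesSplit`; the data here is synthetic and `X`, `y` are assumed to be ordered by time), each fold trains only on observations that precede its test window:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# synthetic time-ordered data standing in for real features/target
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + np.random.normal(0, 1, 200)

errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(errors))  # roll-forward out-sample MSE, no future data in training
```
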
## 8. Not considering which data is available at point of decision

When you run a model in production, it gets fed with data that is available when you run the model. That data might be different from what you assumed would be available in training. For example, the data might be published with a delay, so by the time you run the model other inputs have changed and you are making predictions with wrong data, or your true y variable is incorrect.

**Solution**: do a rolling out-sample forward test. If I had used this model in production, what would my training data look like, i.e. what data do you have to make predictions? That's the training data you use to make a true out-sample production test. Furthermore, think about what result acting on the prediction would generate at the point of decision.

## 9. Subtle Overtraining

The more time you spend on a dataset, the more likely you are to overtrain it. You keep tinkering with features and optimizing model parameters. You used cross-validation so everything must be good.

**Solution**: After you have finished building the model, try to find another "version" of the datasets that can be a surrogate for a true out-sample dataset. If you are a manager, deliberately withhold data so that it does not get used for training.

**Example**: Applying the models that were trained on dataset 1 to dataset 2 shows the MSEs more than doubled. Are they still acceptable...? This is a judgement call but your results from #4 might help you decide.

```
first dataset
rf mse 0.261 ols mse 0.187
new dataset
rf mse 0.681 ols mse 0.495
```

## 10. "need more data" fallacy

Counterintuitively, often the best way to get started analyzing data is by working on a representative sample of the data. That allows you to familiarize yourself with the data and build the data pipeline without waiting for data processing and model training. But data scientists seem not to like that - more data is better.

**Solution**: start working with a small representative sample and see if you can get something useful out of it. Give it back to the end user, can they use it? Does it solve a real pain point? If not, the problem is likely not that you have too little data but your approach.
--------------------------------------------------------------------------------
/blogs/top5-mistakes-vendors.md:
--------------------------------------------------------------------------------
# 5 Mistakes Data Vendors Commonly Make

Avoid the common traps and keep your clients happy.

* **Reinvent the Wheel**: To deliver data to clients, you need to build APIs, ftp servers, S3 buckets and so on. And you need it all because your client needs are diverse.
* **Great Sales, Bad Delivery**: The sales and product team closed the deal and then it's up to the engineering team to deliver data. But they have limited capacity and it's taking longer than you think.
* **No Quickstart Instructions**: You presented a great story and case study but the client cannot easily verify it for themselves. They spin their wheels just trying to recreate what you did.
* **Not Cloud Ready**: Is ftp still the only thing you have to offer? Your clients are moving to the cloud and so should you.
* **No Usage Analytics**: You don't know which clients actually download and use your data, so you cannot measure engagement.

The solution to all this? [DataBolt Pipe](https://www.databolt.tech/index-pipe-vendors.html). We provide turnkey solutions to efficiently distribute data to your clients.

* **Turnkey Infrastructure**: Use managed infrastructure, no need to build and maintain separate APIs, AWS S3 buckets and ftp servers
* **Simple Web GUI**: Non-technical users from sales and product teams can distribute data without involving your engineering team
* **Usage Analytics**: Get download confirmations and daily usage statistics to measure engagement by client

--------------------------------------------------------------------------------
/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d6t/d6t-python/0ecc488a21375cb02f79014348cc6564fdb65999/overview.png
--------------------------------------------------------------------------------