├── .github ├── ISSUE_TEMPLATE │ ├── bug_report_template.yaml │ ├── doc_fix_template.yaml │ └── feature_request_template.yaml └── pull_request_template.md ├── .gitignore ├── LICENSE.txt ├── README.md ├── notebooks ├── databricks.py └── jupyter.ipynb ├── pipeline.yaml ├── profiles ├── databricks.yaml └── local.yaml ├── requirements.txt ├── requirements ├── lint-requirements.txt └── test-requirements.txt ├── steps ├── custom_metrics.py ├── ingest.py ├── split.py ├── train.py └── transform.py └── tests ├── __init__.py ├── train_test.py └── transform_test.py /.github/ISSUE_TEMPLATE/bug_report_template.yaml: -------------------------------------------------------------------------------- 1 | name: Bug Report 2 | description: Create a report to help us reproduce and correct the bug 3 | labels: 'bug' 4 | title: '[BUG]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=bug&template=bug_report_template.yaml&title=%5BBUG%5D) 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/doc_fix_template.yaml: -------------------------------------------------------------------------------- 1 | name: Documentation Fix 2 | description: Use this template for proposing documentation fixes/improvements. 3 | labels: 'area/docs' 4 | title: '[DOC-FIX]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=area%2Fdocs&template=doc_fix_template.yaml&title=%5BDOC-FIX%5D) 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request_template.yaml: -------------------------------------------------------------------------------- 1 | name: Feature Request 2 | description: Use this template for feature and enhancement proposals. 3 | labels: 'enhancement' 4 | title: '[FR]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=enhancement&template=feature_request_template.yaml&title=%5BFR%5D) 11 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## What changes are proposed in this pull request? 2 | 3 | (Please fill in changes proposed in this fix) 4 | 5 | ## How is this patch tested? 6 | 7 | (Details) 8 | 9 | We require adding a link to the [workflow](https://github.com/mlflow/mlflow/actions/workflows/pipeline-template.yml) for testing the [mlflow/mlp-regression-template](https://github.com/mlflow/mlp-regression-template) code path. 10 | 11 | Run the workflow with repository name: `mlflow/mlp-regression-template` and the branch`(Example::feature)` that you are working with. 12 | ## Release Notes 13 | 14 | ### Is this a user-facing change? 15 | 16 | - [ ] No. You can skip the rest of this section. 17 | - [ ] Yes. Give a description of this change to be included in the release notes for MLflow users. 18 | 19 | (Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.) 20 | 21 | ### What component(s), interfaces, languages, and integrations does this PR affect? 
22 | Components 23 | - [ ] `area/regression`: Sklearn regression pipeline example 24 | - [ ] `area/build`: Build and test infrastructure for MLflow pipeline example 25 | 26 | 31 | 32 | ### How should the PR be classified in the release notes? Choose one: 33 | 34 | - [ ] `rn/breaking-change` - The PR will be mentioned in the "Breaking Changes" section 35 | - [ ] `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section 36 | - [ ] `rn/feature` - A new user-facing feature worth mentioning in the release notes 37 | - [ ] `rn/bug-fix` - A user-facing bug fix worth mentioning in the release notes 38 | - [ ] `rn/documentation` - A user-facing documentation change worth mentioning in the release notes 39 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Mlflow 2 | mlruns/ 3 | mlartifacts/ 4 | outputs/ 5 | mlruns.db 6 | 7 | # Mac 8 | .DS_Store 9 | 10 | # Byte-compiled / optimized / DLL files 11 | __pycache__ 12 | *.py[cod] 13 | *$py.class 14 | 15 | # C extensions 16 | *.so 17 | 18 | # Distribution / packaging 19 | .Python 20 | build/ 21 | develop-eggs/ 22 | dist/ 23 | downloads/ 24 | eggs/ 25 | .eggs/ 26 | lib/ 27 | lib64/ 28 | parts/ 29 | sdist/ 30 | var/ 31 | wheels/ 32 | *.egg-info/ 33 | .installed.cfg 34 | *.egg 35 | MANIFEST 36 | node_modules 37 | 38 | # PyInstaller 39 | # Usually these files are written by a python script from a template 40 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 41 | *.manifest 42 | *.spec 43 | 44 | # Installer logs 45 | pip-log.txt 46 | pip-delete-this-directory.txt 47 | 48 | # Unit test / coverage reports 49 | htmlcov/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | 59 | # Sphinx documentation 60 | docs/_build/ 61 | 62 | # Jupyter Notebook 63 | .ipynb_checkpoints 64 | 65 | # Environments 66 | env 67 | env3 68 | .env 69 | .venv 70 | env/ 71 | venv/ 72 | ENV/ 73 | env.bak/ 74 | venv.bak/ 75 | .python-version 76 | 77 | # Editor files 78 | .*project 79 | *.swp 80 | *.swo 81 | *.idea 82 | *.vscode 83 | *.iml 84 | *~ 85 | 86 | # mkdocs documentation 87 | /site 88 | 89 | # mypy 90 | .mypy_cache/ 91 | 92 | # java targets 93 | target/ 94 | 95 | # R notebooks 96 | .Rproj.user 97 | example/tutorial/R/*.nb.html 98 | 99 | # travis_wait command logs 100 | travis_wait*.log 101 | 102 | # Pytorch logs 103 | lightning_logs 104 | 105 | # tox logs 106 | *.tox 107 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [DEPRECATED] MLflow Pipelines Regression Template 2 | **NOTE**: This repository is deprecated as of 2022/11/07, and will be removed soon. 3 | If you are using MLflow 2.0, 4 | please refer to [MLflow Recipes Regression Template](https://github.com/mlflow/recipes-regression-template) instead. 5 | 6 | The MLflow Regression Pipeline is an [MLflow Pipeline](https://mlflow.org/docs/latest/pipelines.html) for developing 7 | high-quality regression models. 8 | It is designed for developing models using scikit-learn and frameworks that integrate with scikit-learn, 9 | such as the `XGBRegressor` API from XGBoost. 10 | 11 | This repository is a template for developing production-ready regression models with the MLflow Regression Pipeline. 12 | It provides a pipeline structure for creating models as well as pointers to configurations and code files that should 13 | be filled in to produce a working pipeline. 14 | 15 | Code developed with this template should be run with [MLflow Pipelines](https://mlflow.org/docs/latest/pipelines.html). 16 | An example implementation of this template can be found in the [MLP Regression Example repo](https://github.com/mlflow/mlp-regression-example), 17 | which targets the NYC taxi dataset for its training problem. 18 | 19 | **Note**: [MLflow Pipelines](https://mlflow.org/docs/latest/pipelines.html) 20 | is an experimental feature in [MLflow](https://mlflow.org). 21 | If you observe any issues, 22 | please report them [here](https://github.com/mlflow/mlflow/issues). 23 | For suggestions on improvements, 24 | please file a discussion topic [here](https://github.com/mlflow/mlflow/discussions). 25 | Your contribution to MLflow Pipelines is greatly appreciated by the community! 26 | 27 | ## Key Features 28 | - Deterministic data splitting 29 | - Reproducible data transformations 30 | - Hyperparameter tuning support 31 | - Model registration for use in production 32 | - Starter code for ingest, split, transform and train steps 33 | - Cards containing step results, including dataset profiles, model leaderboard, performance plots and more 34 | 35 | ## Installation 36 | Follow the [MLflow Pipelines installation guide](https://mlflow.org/docs/latest/pipelines.html#installation). 37 | You may need to install additional libraries for extra features: 38 | - [Hyperopt](https://pypi.org/project/hyperopt/) is required for hyperparameter tuning. 39 | - [PySpark](https://pypi.org/project/pyspark/) is required for distributed training or to ingest Spark tables. 40 | - [Delta](https://pypi.org/project/delta-spark/) is required to ingest Delta tables. 41 | These libraries are available natively in the [Databricks Runtime for Machine Learning](https://docs.databricks.com/runtime/mlruntime.html). 42 | 43 | ## Get started 44 | After installing MLflow Pipelines, you can clone this repository to get started. 
Simply fill in the required values annotated by `FIXME::REQUIRED` comments in the [Pipeline configuration file](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) 45 | and in the appropriate profile configuration: [`local.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/local.yaml) 46 | (if running locally) or [`databricks.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/databricks.yaml) 47 | (if running on Databricks). 48 | 49 | The Pipeline will then be in a runnable state, and when run completely, will produce a trained model ready for batch 50 | scoring, along with cards containing detailed information about the results of each step. 51 | The model will also be registered to the MLflow Model Registry if it meets registration thresholds. 52 | To iterate and improve your model, follow the [MLflow Pipelines usage guide](https://mlflow.org/docs/latest/pipelines.html#usage). 53 | Note that iteration will likely involve filling in the optional `FIXME`s in the 54 | step code files with your own code, in addition to the configuration keys. 55 | 56 | ## Reference 57 | ![image](https://user-images.githubusercontent.com/66143562/195912433-f1e44829-dea5-4fb2-a034-b197c2cebf71.png) 58 | 59 | This is a visual overview of the MLflow Regression Pipeline's information flow. 60 | 61 | Model development consists of the following sequential steps: 62 | ``` 63 | ingest -> split -> transform -> train -> evaluate -> register 64 | ``` 65 | 66 | The batch scoring workflow consists of the following sequential steps: 67 | ``` 68 | ingest_scoring -> predict 69 | ``` 70 | A detailed reference for each step follows. 71 | 72 | * [Reference](#reference) 73 | + [Step artifacts](#step-artifacts) 74 | + [Ingest step](#ingest-step) 75 | - [Data](#data) 76 | + [Split step](#split-step) 77 | + [Transform step](#transform-step) 78 | + [Train step](#train-step) 79 | + [Evaluate step](#evaluate-step) 80 | + [Register step](#register-step) 81 | + [Batch scoring](#batch-scoring) 82 | - [Ingest Scoring step](#ingest-scoring-step) 83 | - [Predict step](#predict-step) 84 | + [MLflow Tracking / Model Registry configuration](#mlflow-tracking--model-registry-configuration) 85 | + [Metrics](#metrics) 86 | - [Built-in metrics](#built-in-metrics) 87 | - [Custom metrics](#custom-metrics) 88 | 89 | ### Step artifacts 90 | Each of the steps in the pipeline produces artifacts after completion. These artifacts consist of cards containing 91 | detailed execution information, as well as other step-specific information. 92 | The [`Pipeline.inspect()`](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#mlflow.pipelines.regression.v1.pipeline.RegressionPipeline.inspect) 93 | API is used to view step cards. The [`get_artifact`](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#mlflow.pipelines.regression.v1.pipeline.RegressionPipeline.get_artifact) 94 | API is used to load all other step artifacts by name. 95 | Per-step artifacts are further detailed in the following step references. 96 | 97 | ### Ingest step 98 | The ingest step resolves the dataset specified by the `data` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) 99 | and converts it to parquet format, leveraging the custom loader code specified in the `data` section if necessary. 100 | **Note**: If you make changes to the dataset referenced by the ingest step (e.g.
by adding new records or columns), 101 | you must manually re-run the ingest step in order to use the updated dataset in the pipeline. 102 | The ingest step does not automatically detect changes in the dataset. 103 | 104 | The custom loader function allows use of datasets in other formats, such as `csv`. 105 | The function should be defined in [`steps/ingest.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/ingest.py), 106 | and should accept two parameters: 107 | - `file_path`: `str`. Path to the dataset file. 108 | - `file_format`: `str`. The file format string, such as `"csv"`. 109 | 110 | It should return a Pandas DataFrame representing the content of the specified file. [`steps/ingest.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/ingest.py) contains an example placeholder function. 111 | 112 | #### Data 113 | The input dataset is specified by the `data` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) as follows: 114 |
115 | Full configuration reference 116 | 117 | - `location`: string. Required, unless `format` is `spark_sql`. 118 | Dataset locations on the local filesystem are supported, as 119 | well as HTTP(S) URLs and any other remote locations [resolvable by MLflow](https://mlflow.org/docs/latest/tracking.html#artifact-stores). 120 | One may specify multiple data locations by a list of locations as long as they have the same data format (see example below) 121 | Examples: 122 | ``` 123 | location: ./data/sample.parquet 124 | ``` 125 | ``` 126 | location: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet 127 | ``` 128 | ``` 129 | location: ["./data/sample.parquet", "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"] 130 | ``` 131 | - `format`: string. Required. 132 | One of `parquet`, `spark_sql` and `delta`. 133 | 134 | 135 | - `custom_loader_method`: string. Optional. 136 | Fully qualified name of the custom loader function. 137 | Example: 138 | ``` 139 | custom_loader_method: steps.ingest.load_file_as_dataframe 140 | ``` 141 | 142 | - `sql`: string. Required if format is `spark_sql`. 143 | Specifies a SparkSQL statement that identifies the dataset to use. 144 | 145 | 146 | - `version`: int. Optional. 147 | If the `delta` format is specified, use this to specify the Delta table version to read from. 148 | 149 | 150 | - `timestamp`: timestamp. Optional. 151 | If the `delta` format is specified, use this to specify the timestamp at which to read data. 152 |
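For illustration, a minimal custom loader sketch might look like the following. This is only a sketch assuming a CSV dataset; the function name matches the `custom_loader_method` example above, and the body should be adapted to your own data:

```python
import pandas as pd


def load_file_as_dataframe(file_path: str, file_format: str) -> pd.DataFrame:
    """Load the file at `file_path` into a Pandas DataFrame (illustrative sketch)."""
    if file_format == "csv":
        # Assumes a header row; adjust the read options to match your dataset.
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")
```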
153 | 154 | **Step artifacts** 155 | - `ingested_data`: The ingested data as a Pandas DataFrame. 156 | 157 | ### Split step 158 | 159 | The split step splits the ingested dataset produced by the ingest step into: 160 | - a training dataset for model training 161 | - a validation dataset for model performance evaluation & tuning, and 162 | - a test dataset for model performance evaluation. 163 | 164 | The fraction of records allocated to each dataset is defined by the `split_ratios` attribute of the `split` step 165 | definition in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml). 166 | The split step also preprocesses the datasets using logic defined in [`steps/split.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/split.py). 167 | Subsequent steps use these datasets to develop a model and measure its performance. 168 | 169 | The post-split method should be written in `steps/split.py` and should accept three parameters: 170 | - `train_df`: DataFrame. The unprocessed train dataset. 171 | - `validation_df`: DataFrame. The unprocessed validation dataset. 172 | - `test_df`: DataFrame. The unprocessed test dataset. 173 | 174 | It should return a triple representing the processed train, validation and test datasets. `steps/split.py` contains an example placeholder function. 175 | 176 | The split step is configured by the `steps.split` section in `pipeline.yaml` as follows: 177 |
178 | Full configuration reference 179 | 180 | - `split_ratios`: list. Optional. 181 | A YAML list specifying the ratios by which to split the dataset into training, validation and test sets. 182 | Example: 183 | ``` 184 | split_ratios: [0.75, 0.125, 0.125] # Defaults to this ratio if unspecified 185 | ``` 186 | - `post_split_filter_method`: string. Optional. 187 | Fully qualified name of the method to use to "post-process" the split datasets. 188 | This procedure is meant for removing/filtering records, or other cleaning processes. Arbitrary data transformations 189 | should be done in the transform step. 190 | Example: 191 | ``` 192 | post_split_filter_method: steps.split.process_splits 193 | ``` 194 |
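For illustration, a minimal post-split sketch following the three-DataFrame signature described above might look like this. The drop-missing-values logic is only an assumption, and the function name matches the `post_split_filter_method` example above:

```python
from typing import Tuple

from pandas import DataFrame


def process_splits(
    train_df: DataFrame, validation_df: DataFrame, test_df: DataFrame
) -> Tuple[DataFrame, DataFrame, DataFrame]:
    """Drop records with missing values from each split (placeholder filtering logic)."""
    return train_df.dropna(), validation_df.dropna(), test_df.dropna()
```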
195 | 196 | **Step artifacts**: 197 | - `training_data`: the training dataset as a Pandas DataFrame. 198 | - `validation_data`: the validation dataset as a Pandas DataFrame. 199 | - `test_data`: the test dataset as a Pandas DataFrame. 200 | 201 | ### Transform step 202 | 203 | The transform step uses the training dataset created by the split step to fit a transformer that performs the 204 | user-defined transformations. The transformer is then applied to the training dataset and the validation dataset, 205 | creating transformed datasets that are used by subsequent steps for estimator training and model performance evaluation. 206 | 207 | The user-defined transformation function is not required. If absent, an **identity transformer** will be used. 208 | The user-defined function should be written in 209 | [`steps/transform.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/transform.py), 210 | and should return an unfitted estimator that is sklearn-compatible; that is, the returned object should define 211 | `fit()` and `transform()` methods. `steps/transform.py` contains an example placeholder function. 212 | 213 | The transform step is configured by the `steps.transform` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 214 |
215 | Full configuration reference 216 | 217 | - `transformer_method`: string. Optional. 218 | Fully qualified name of the method that returns an `sklearn`-compatible transformer which applies feature 219 | transformation during model training and inference. If absent, an identity transformer will be used. 220 | Example: 221 | ``` 222 | transformer_method: steps.transform.transformer_fn 223 | ``` 224 | 225 |
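For illustration, a transformer function might simply return a standard scikit-learn preprocessor. This is a minimal sketch; real feature transformations are problem-specific:

```python
from sklearn.preprocessing import StandardScaler


def transformer_fn():
    """Return an unfitted transformer defining fit() and transform() (illustrative sketch)."""
    return StandardScaler()
```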
226 | 227 | **Step artifacts**: 228 | - `transformed_training_data`: transformed training dataset as a Pandas DataFrame. 229 | - `transformed_validation_data`: transformed validation dataset as a Pandas DataFrame. 230 | - `transformer`: the sklearn transformer. 231 | 232 | 233 | ### Train step 234 | The train step uses the transformed training dataset output from the transform step to fit a user-defined estimator. 235 | The estimator is then joined with the fitted transformer output from the transform step to create a model pipeline. 236 | Finally, this model pipeline is evaluated against the transformed training and validation datasets to compute performance metrics. 237 | 238 | Custom evaluation metrics are computed according to definitions in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py) 239 | and the `metrics` section of `pipeline.yaml`; see the [Custom Metrics](#custom-metrics) section for reference. 240 | 241 | The model pipeline and its associated parameters, performance metrics, and lineage information are logged to [MLflow Tracking](https://www.mlflow.org/docs/latest/tracking.html), producing an MLflow Run. 242 | 243 | The user-defined estimator function should be written in [`steps/train.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/train.py), 244 | and should return an unfitted estimator that is `sklearn`-compatible; that is, the returned object should define 245 | `fit()` and `predict()` methods. `steps/train.py` contains an example placeholder function. 246 | 247 | The train step is configured by the `steps.train` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 248 |
249 | Full configuration reference 250 | 251 | - `estimator_method`: string. Required. 252 | Fully qualified name of the method that returns an `sklearn`-compatible estimator used for model training. 253 | Example: 254 | ``` 255 | estimator_method: steps.train.estimator_fn 256 | ``` 257 | 258 | - Tuning configuration reference 259 | 260 | - `enabled`: boolean. Required. 261 | Indicates whether or not tuning is enabled. 262 | 263 | - `max_trials`: int. Required. 264 | Max tuning trials to run. 265 | 266 | - `algorithm`: string. Optional. 267 | The `hyperopt` search algorithm to use, such as `hyperopt.rand.suggest`. 268 | 269 | - `early_stop_fn`: string. Optional. 270 | Early stopping function to be passed to `hyperopt`. 271 | 272 | - `parallelism`: int. Optional. 273 | Number of workers to run `hyperopt` across. 274 | 275 | - `sample_fraction`: float. Optional. 276 | Sampling fraction in the range `(0, 1.0]` to indicate the amount of data used in tuning. 277 | 278 | - `parameters`: list. Required. 279 | `hyperopt` search space in YAML format. 280 | 281 | Example: 282 | ``` 283 | tuning: 284 | enabled: True 285 | algorithm: "hyperopt.rand.suggest" 286 | max_trials: 5 287 | early_stop_fn: "hyperopt.early_stop.no_progress_loss(10)" 288 | parallelism: 1 289 | sample_fraction: 0.5 290 | parameters: 291 | alpha: 292 | distribution: "uniform" 293 | low: 0.0 294 | high: 0.01 295 | penalty: 296 | values: ["l1", "l2", "elasticnet"] 297 | ``` 298 | 299 |
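For illustration, an estimator function compatible with the tuning example above (which searches over `alpha` and `penalty`) might return a scikit-learn `SGDRegressor`. This is only a sketch, and the exact signature should mirror the placeholder in `steps/train.py`:

```python
from sklearn.linear_model import SGDRegressor


def estimator_fn():
    """Return an unfitted estimator defining fit() and predict() (illustrative sketch)."""
    return SGDRegressor(alpha=0.001, penalty="l2", random_state=42)
```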
300 | 301 | **Step artifacts**: 302 | - `model`: the [MLflow Model](https://www.mlflow.org/docs/latest/models.html) pipeline created in the train step 303 | as a [PyFuncModel](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.PyFuncModel) instance. 304 | 305 | 306 | ### Evaluate step 307 | The evaluate step evaluates the model pipeline created by the train step on the test dataset output from the 308 | split step, computing performance metrics and model explanations. 309 | 310 | Performance metrics are compared against configured thresholds to produce a `model_validation_status`, which indicates 311 | whether or not a model is validated to be registered to the [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html) 312 | by the subsequent [register step](#register-step). 313 | These model performance thresholds are defined in the 314 | `validation_criteria` section of the `evaluate` step definition in `pipeline.yaml`. 315 | Custom evaluation metrics are computed according to definitions in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py) 316 | and the `metrics` section of `pipeline.yaml`; see the [custom metrics section](#custom-metrics) for reference. 317 | 318 | Model performance metrics and explanations are logged to the same MLflow Tracking Run used by the train step. 319 | 320 | The evaluate step is configured by the `steps.evaluate` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 321 |
322 | Full configuration reference 323 | 324 | - `validation_criteria`: list. Optional. 325 | A list of validation thresholds, each of which a trained model must meet in order to be eligible for 326 | registration in the [register step](#register-step). 327 | A definition for a validation threshold consists of a metric name 328 | (either a [built-in metric](#built-in-metrics) or a [custom metric](#custom-metrics)), and a threshold value. 329 | Example: 330 | ``` 331 | validation_criteria: 332 | - metric: root_mean_squared_error 333 | threshold: 10 334 | ``` 335 |
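Once the evaluate step has run, the computed metrics can be inspected from the associated MLflow run. A minimal sketch using the Pipeline API shown in the notebooks (the `run` artifact is described under the step artifacts below):

```python
from mlflow.pipelines import Pipeline

p = Pipeline(profile="local")
p.run("evaluate")

# The evaluate step logs its metrics to the same MLflow run used by the train step.
run = p.get_artifact("run")
print(run.data.metrics)
```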
336 | 337 | **Step artifacts**: 338 | - `run`: the MLflow Tracking Run containing the model pipeline, as well as the performance metrics created during 339 | the train and evaluate steps. 340 | 341 | 342 | ### Register step 343 | The register step checks the `model_validation_status` output of the preceding [evaluate step](#evaluate-step) and, 344 | if model validation was successful (if model_validation_status is `'VALIDATED'`), registers the model pipeline created 345 | by the train step to the MLflow Model Registry. If the `model_validation_status` does not indicate that the model 346 | passed validation checks (if model_validation_status is `'REJECTED'`), the model pipeline is **not** registered to the 347 | MLflow Model Registry. 348 | If the model pipeline is registered to the MLflow Model Registry, a `registered_model_version` is produced containing 349 | the model name and the model version. 350 | 351 | The register step is configured by the `steps.register` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 352 |
353 | Full configuration reference 354 | 355 | - `model_name`: string. Required. 356 | Specifies the name to use when registering the trained model to the model registry. 357 | 358 | 359 | - `allow_non_validated_model`: boolean. Required. 360 | Whether to allow registration of models that fail to meet performance thresholds. 361 | 362 |
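Once a model version has been registered, it can also be loaded outside the pipeline via the standard MLflow model URI scheme. A minimal sketch, where the model name and version are placeholders; substitute the configured `model_name` and the version produced by this step:

```python
import mlflow

# "my_regression_model" and version "1" are placeholders for the registered
# model name and version produced by the register step.
model = mlflow.pyfunc.load_model("models:/my_regression_model/1")
```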
363 | 364 | **Step artifacts**: 365 | - `registered_model_version`: the MLflow Model Registry [ModelVersion](https://mlflow.org/docs/latest/model-registry.html#concepts) 366 | registered in this step. 367 | 368 | 369 | ### Batch scoring 370 | After model training, the regression pipeline provides the capability to score new data with the 371 | trained model. 372 | 373 | #### Ingest Scoring step 374 | The ingest scoring step, defined in the `data_scoring` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), 375 | specifies the dataset used for batch scoring and has the same API as the [ingest step](#ingest-step). 376 | 377 | **Step artifacts**: 378 | - `ingested_scoring_data`: the ingested scoring data as a Pandas DataFrame. 379 | 380 | #### Predict step 381 | The predict step uses the model registered by the [register step](#register-step) to score the 382 | ingested dataset produced by the [ingest scoring step](#ingest-scoring-step) and writes the resulting 383 | dataset to the specified output format and location. To fix a specific model for use in the predict 384 | step, provide its model URI as the `model_uri` attribute of the `pipeline.yaml` predict step definition. 385 | 386 | The predict step is configured by the `steps.predict` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 387 |
388 | Full configuration reference 389 | 390 | - `output_format`: string. Required. 391 | Specifies the output format of the scored data from the predict step. One of `parquet`, `delta`, and 392 | `table`. The `parquet` format writes the scored data as parquet files under a specified path. The 393 | `delta` format writes the scored data as a delta table under a specified path. The `table` format 394 | writes the scored data as delta table and creates a metastore entry for this table with a specified name. 395 | 396 | 397 | - `output_location`: string. Required. 398 | For the `parquet` and `delta` output formats, this attribute specifies the output path for writing 399 | the scored data. In Databricks, this path will be written to be under [DBFS](https://docs.databricks.com/dbfs/index.html), 400 | e.g. the path `my/special/path` will be written under `/dbfs/my/special/path`. For the `table` output 401 | format, this attribute specifies the table name that is used to create the metastore entry for the 402 | written delta table. 403 | Example: 404 | ``` 405 | output_location: ./outputs/predictions 406 | ``` 407 | 408 | 409 | - `model_uri`: string. Optional. 410 | Specifies the URI of the model to use for batch scoring. If empty, the latest model version produced 411 | by the register step is used. If the register step was cleared, the latest version of the 412 | registered model specified by the `model_name` attribute of the `pipeline.yaml` [register step](#register-step) 413 | will be used. 414 | Example: 415 | ``` 416 | model_uri: models/model.pkl 417 | ``` 418 | 419 | 420 | - `result_type`: string. Optional. Defaults to `double`. 421 | Specifies the data type for predictions generated by the model. See the 422 | [MLflow spark_udf API docs](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf) 423 | for more information. 424 | 425 | 426 | - `save_mode`: string. Optional. Defaults to `default`. 427 | Specifies the save mode used by Spark for writing the scored data. See the 428 | [PySpark save modes documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes) 429 | for more information. 430 |
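A typical batch scoring invocation with the Pipeline API might look like the following sketch; the step names follow the batch scoring workflow described above, and the `scored_data` artifact is listed below:

```python
from mlflow.pipelines import Pipeline

p = Pipeline(profile="local")
p.run("ingest_scoring")
p.run("predict")

# Predictions are written to the configured output location and are also
# available as the `scored_data` step artifact.
scored = p.get_artifact("scored_data")
print(scored.head())
```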
431 | 432 | 433 | **Step artifacts**: 434 | - `scored_data`: the scored dataset, with model predictions under the `prediction` column, as a Pandas DataFrame. 435 | 436 | 437 | ### MLflow Tracking / Model Registry configuration 438 | MLflow can be configured to log runs to a specific tracking server. Tracking information is specified 439 | in the profile configuration files - [`profiles/local.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/local.yaml) 440 | if running locally and [`profiles/databricks.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/databricks.yaml) 441 | if running on Databricks. 442 | 443 | Configuring a tracking server is optional. If this configuration is absent, the default experiment will be used. 444 | 445 | Tracking information is configured with the `experiment` section in the profile configuration: 446 |
447 | Full configuration reference 448 | 449 | - `name`: string. Required, if configuring tracking. 450 | Name of the experiment to log MLflow runs to. 451 | 452 | 453 | - `tracking_uri`: string. Required, if configuring tracking. 454 | URI of the MLflow tracking server to log runs to. Alternatively, the `MLFLOW_TRACKING_URI` environment variable can be [set to point to a valid tracking server](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_tracking_uri). 455 | 456 | 457 | - `artifact_location`: string. Optional. 458 | URI of the location to log run artifacts to. 459 | 460 |
461 | 462 | To register trained models to the MLflow Model Registry, further configuration may be required. If unspecified, models will be logged to the same server as specified in the tracking URI. 463 | 464 | To register models to a different server, specify the desired server in the `model_registry` section in the profile configuration: 465 |
466 | Full configuration reference 467 | 468 | - `uri`: string. Required, if this section is present. 469 | URI of the model registry server to which to register trained models. 470 | 471 |
472 | 473 | ### Metrics 474 | Evaluation metrics calculate model performance against different datasets. The metrics defined in the pipeline 475 | will be calculated as part of the training and evaluation steps, and calculated values will be recorded in each 476 | step’s information card. 477 | 478 | This regression pipeline features a set of built-in metrics, and supports user-defined metrics as well. 479 | 480 | The **primary evaluation metric** is the one that will be used to select the best performing model in the MLflow UI as 481 | well as in the train and evaluation steps. This can be either a built-in metric or a custom metric (see below). 482 | Models are ranked by this primary metric. 483 | 484 | Metrics are configured under the `metrics` section of [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), according to the following specification: 485 |
486 | Full configuration reference 487 | 488 | - `primary`: string. Required. 489 | The name of the primary evaluation metric. 490 | 491 | 492 | - `custom`: list. Optional. 493 | A list of custom metric configurations. 494 | 495 |
496 | 497 | Note that each metric specifies a boolean value `greater_is_better`, which indicates whether a higher value for that 498 | metric is associated with better model performance. 499 | 500 | #### Built-in metrics 501 | The following metrics are built-in. Note that `greater_is_better = False` for all these metrics: 502 | 503 | - `mean_absolute_error` 504 | - `mean_squared_error` 505 | - `root_mean_squared_error` 506 | - `max_error` 507 | - `mean_absolute_percentage_error` 508 | 509 | #### Custom metrics 510 | Custom evaluation metrics define how trained models should be evaluated against custom criteria not captured by 511 | built-in `sklearn` evaluation metrics. 512 | 513 | Custom evaluation metric functions should be defined in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py). 514 | Each should accept two parameters: 515 | - `eval_df`: DataFrame. 516 | A Pandas DataFrame containing two columns: 517 | - `prediction`: Predictions produced by submitting input data to the model. 518 | - `target`: Corresponding target truth values. 519 | 520 | 521 | - `builtin_metrics`: `Dict[str, int]`. 522 | The built-in metrics calculated during model evaluation. Maps metric names to corresponding scalar values. 523 | 524 | The custom metric function should return a `Dict[str, int]`, mapping custom metric names to corresponding scalar metric values. 525 | 526 | Custom metrics are specified as a list under the `metrics.custom` key in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), specified as follows: 527 | - `name`: string. Required. 528 | Name of the custom metric. This will be the name by which you refer to this metric when including it in model evaluation or model training. 529 | 530 | 531 | - `function`: string. Required. Specifies the function this custom metric refers to. 532 | 533 | 534 | - `greater_is_better`: boolean. Required. Boolean indicating whether a higher metric value indicates better model 535 | performance. 536 | 537 | An example custom metric configuration is as follows: 538 | ``` 539 | custom: 540 | - name: weighted_mean_square_error 541 | function: steps.custom_metrics.get_custom_metrics 542 | greater_is_better: True 543 | ``` 544 | -------------------------------------------------------------------------------- /notebooks/databricks.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | 3 | # MAGIC %md 4 | # MAGIC # MLflow Regression Pipeline Databricks Notebook 5 | # MAGIC This notebook runs the MLflow Regression Pipeline on Databricks and inspects its results. 6 | # MAGIC 7 | # MAGIC For more information about the MLflow Regression Pipeline, including usage examples, 8 | # MAGIC see the [Regression Pipeline overview documentation](https://mlflow.org/docs/latest/pipelines.html#regression-pipeline) 9 | # MAGIC and the [Regression Pipeline API documentation](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#module-mlflow.pipelines.regression.v1.pipeline). 
10 | 11 | # COMMAND ---------- 12 | 13 | # MAGIC %pip install mlflow[pipelines] 14 | # MAGIC %pip install -r ../requirements.txt 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %md ### Create a new pipeline with "databricks" profile: 19 | 20 | # COMMAND ---------- 21 | 22 | from mlflow.pipelines import Pipeline 23 | 24 | p = Pipeline(profile="databricks") 25 | 26 | # COMMAND ---------- 27 | 28 | # MAGIC %md ### Inspect a newly created pipeline using a graphical representation: 29 | 30 | # COMMAND ---------- 31 | 32 | p.inspect() 33 | 34 | # COMMAND ---------- 35 | 36 | # MAGIC %md ### Ingest the dataset into the pipeline: 37 | 38 | # COMMAND ---------- 39 | 40 | p.run("ingest") 41 | 42 | # COMMAND ---------- 43 | 44 | # MAGIC %md ### Split the dataset in train, validation and test data profiles: 45 | 46 | # COMMAND ---------- 47 | 48 | p.run("split") 49 | 50 | # COMMAND ---------- 51 | 52 | training_data = p.get_artifact("training_data") 53 | training_data.describe() 54 | 55 | # COMMAND ---------- 56 | 57 | p.run("transform") 58 | 59 | # COMMAND ---------- 60 | 61 | # MAGIC %md ### Using training data profile, train the model: 62 | 63 | # COMMAND ---------- 64 | 65 | p.run("train") 66 | 67 | # COMMAND ---------- 68 | 69 | trained_model = p.get_artifact("model") 70 | print(trained_model) 71 | 72 | # COMMAND ---------- 73 | 74 | # MAGIC %md ### Evaluate the resulting model using validation data profile: 75 | 76 | # COMMAND ---------- 77 | 78 | 79 | p.run("evaluate") 80 | 81 | # COMMAND ---------- 82 | 83 | # MAGIC %md ### Register the trained model in the registry: 84 | 85 | # COMMAND ---------- 86 | 87 | p.run("register") 88 | -------------------------------------------------------------------------------- /notebooks/jupyter.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext autoreload\n", 10 | "%autoreload 2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# MLflow Regression Pipeline Notebook\n", 18 | "\n", 19 | "This notebook runs the MLflow Regression Pipeline on Databricks and inspects its results. For more information about the MLflow Regression Pipeline, including usage examples, see the [Regression Pipeline overview documentation](https://mlflow.org/docs/latest/pipelines.html#regression-pipeline) the [Regression Pipeline API documentation](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#module-mlflow.pipelines.regression.v1.pipeline)." 
20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "from mlflow.pipelines import Pipeline\n", 29 | "\n", 30 | "p = Pipeline(profile=\"local\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "p.inspect()" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "p.run(\"ingest\")" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "p.run(\"split\")" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "training_data = p.get_artifact(\"training_data\")\n", 67 | "training_data.describe()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "p.run(\"transform\")" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "p.run(\"train\")" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "trained_model = p.get_artifact(\"model\")\n", 95 | "print(trained_model)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "p.run(\"evaluate\")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "p.run(\"register\")" 114 | ] 115 | } 116 | ], 117 | "metadata": { 118 | "interpreter": { 119 | "hash": "c462df992c775797bd5d542b280333286dbcc2ffa1f781b674f30e76741ca83d" 120 | }, 121 | "kernelspec": { 122 | "display_name": "Python 3 (ipykernel)", 123 | "language": "python", 124 | "name": "python3" 125 | }, 126 | "language_info": { 127 | "codemirror_mode": { 128 | "name": "ipython", 129 | "version": 3 130 | }, 131 | "file_extension": ".py", 132 | "mimetype": "text/x-python", 133 | "name": "python", 134 | "nbconvert_exporter": "python", 135 | "pygments_lexer": "ipython3", 136 | "version": "3.9.12" 137 | } 138 | }, 139 | "nbformat": 4, 140 | "nbformat_minor": 4 141 | } 142 | -------------------------------------------------------------------------------- /pipeline.yaml: -------------------------------------------------------------------------------- 1 | # `pipeline.yaml` is the main configuration file for an MLflow Pipeline. 2 | # Required pipeline parameters should be defined in this file with either concrete values or 3 | # variables such as {{ INGEST_DATA_LOCATION }}. 4 | # 5 | # Variables must be dereferenced in a profile YAML file, located under `profiles/`. 6 | # See `profiles/local.yaml` for example usage. One may switch among profiles quickly by 7 | # providing a profile name such as `local` in the Pipeline object constructor: 8 | # `p = Pipeline(profile="local")` 9 | # 10 | # NOTE: All "FIXME::REQUIRED" fields in pipeline.yaml and profiles/*.yaml must be set correctly 11 | # to adapt this template to a specific regression problem. 
To find all required fields, 12 | # under the root directory of this pipeline, type on a unix-like command line: 13 | # $> grep "# FIXME::REQUIRED:" pipeline.yaml profiles/*.yaml 14 | # 15 | # NOTE: YAML does not support tabs for indentation. Please use spaces and ensure that all YAML 16 | # files are properly formatted. 17 | 18 | template: "regression/v1" 19 | # Specifies the dataset to use for model development 20 | data: 21 | # Dataset locations on the local filesystem are supported, as well as HTTP(S) URLs and 22 | # any other remote locations resolvable by MLflow, such as those listed in 23 | # https://mlflow.org/docs/latest/tracking.html#artifact-stores 24 | location: {{INGEST_DATA_LOCATION}} 25 | # Beyond `parquet` datasets, the `spark_sql` and `delta` formats are also natively supported for 26 | # use with Spark 27 | format: {{INGEST_DATA_FORMAT|default('parquet')}} 28 | # Datasets with other formats, including `csv`, can be used by implementing and 29 | # specifying a `custom_loader_method` 30 | custom_loader_method: steps.ingest.load_file_as_dataframe 31 | # If the `spark_sql` `format` is specified, 32 | # And if the table location format is path-like, use the following sql command for Spark to read 33 | # sub-columns from the table: 34 | # sql: SELECT col1, col2 FROM delta.`{{INGEST_DATA_LOCATION}}` 35 | # And if the table location format is table-like, use the following sql command for Spark to read 36 | # sub-columns from the table: 37 | # sql: SELECT col1, col2 FROM {{INGEST_DATA_LOCATION}} 38 | # If the `delta` `format` is specified, you can also configure the Delta table `version` to read 39 | # or the `timestamp` at which to read data 40 | # version: 2 41 | # timestamp: 2022-06-01T00:00:00.000Z 42 | # FIXME::OPTIONAL: Specify the dataset to use for batch scoring. All params serve the same function 43 | # as in `data` 44 | # data_scoring: 45 | # location: {{INGEST_SCORING_DATA_LOCATION}} 46 | # format: {{INGEST_SCORING_DATA_FORMAT|default('parquet')}} 47 | # custom_loader_method: steps.ingest.load_file_as_dataframe 48 | # sql: SELECT * FROM delta.`{{INGEST_SCORING_DATA_LOCATION}}` 49 | # 50 | # FIXME::REQUIRED: Specifies the target column name for model training and evaluation. 51 | # 52 | target_col: "" 53 | steps: 54 | split: 55 | # 56 | # FIXME::OPTIONAL: Adjust the train/validation/test split ratios below. 57 | # 58 | split_ratios: [0.75, 0.125, 0.125] 59 | # 60 | # FIXME::OPTIONAL: Specifies the method to use to "post-process" the split datasets. Note that 61 | # arbitrary transformations should go into the transform step. 62 | post_split_filter_method: steps.split.create_dataset_filter 63 | transform: 64 | # 65 | # FIXME::OPTIONAL: Specifies the method that defines an sklearn-compatible transformer, which 66 | # applies input feature transformation during model training and inference. 67 | transformer_method: steps.transform.transformer_fn 68 | train: 69 | using: estimator_spec 70 | # Specifies the method that defines the estimator type and parameters to use for model training 71 | estimator_method: steps.train.estimator_fn 72 | evaluate: 73 | # 74 | # FIXME::OPTIONAL: Sets performance thresholds that a trained model must meet in order to be 75 | # eligible for registration to the MLflow Model Registry. 76 | # 77 | # validation_criteria: 78 | # - metric: root_mean_squared_error 79 | # threshold: 10 80 | register: 81 | # 82 | # FIXME::REQUIRED: Specifies the name of the Registered Model to use when registering a trained 83 | # model to the MLflow Model Registry. 
84 |     # 85 |     model_name: "" 86 |     # Indicates whether or not a model that fails to meet performance thresholds should still 87 |     # be registered to the MLflow Model Registry 88 |     allow_non_validated_model: false 89 |   # FIXME::OPTIONAL: Configure the predict step for batch scoring. See README.md for full 90 |   # configuration reference. 91 |   # predict: 92 |   #   output_format: {{SCORED_OUTPUT_DATA_FORMAT|default('parquet')}} 93 |   #   output_location: {{SCORED_OUTPUT_DATA_LOCATION}} 94 |   #   model_uri: "models/model.pkl" 95 |   #   result_type: "double" 96 |   #   save_mode: "default" 97 | metrics: 98 |   # 99 |   # FIXME::REQUIRED: Sets the primary metric to use to evaluate model performance. This primary 100 |   # metric is used to select the best performing models in the MLflow UI as well as in the 101 |   # train and evaluate steps. 102 |   # Built-in metrics are: example_count, mean_absolute_error, mean_squared_error, 103 |   # root_mean_squared_error, sum_on_label, mean_on_label, r2_score, max_error, 104 |   # mean_absolute_percentage_error 105 |   primary: "" 106 |   # 107 |   # FIXME::OPTIONAL: Defines custom performance metrics to compute during model development. 108 |   # 109 |   # custom: 110 |   #   - name: "" 111 |   #     function: get_custom_metrics 112 |   #     greater_is_better: False 113 | 114 | -------------------------------------------------------------------------------- /profiles/databricks.yaml: -------------------------------------------------------------------------------- 1 | # 2 | # FIXME::REQUIRED: Set an MLflow experiment name to track pipeline executions and artifacts. On Databricks, an 3 | # experiment name must be a valid path in the workspace. 4 | # 5 | experiment: 6 |   name: "" 7 | # 8 | # FIXME::OPTIONAL: Set the registry server URI, useful if you have a registry server different 9 | # from the tracking server. First create a Databricks profile; see 10 | # https://github.com/databricks/databricks-cli#installation 11 | # model_registry: 12 | #   uri: "databricks://DATABRICKS_PROFILE_NAME" 13 | 14 | # FIXME::REQUIRED: Specify the training and evaluation data location. This is usually a DBFS 15 | # location ("dbfs:/...") or a SQL table ("SCHEMA.TABLE"). 16 | INGEST_DATA_LOCATION: "" 17 | # 18 | # FIXME::OPTIONAL: Specify the format of the training and evaluation dataset. Natively supported 19 | # formats are: parquet, spark_sql, delta. 20 | # INGEST_DATA_FORMAT: parquet 21 | # 22 | # FIXME::OPTIONAL: Specify the scoring data location. 23 | # INGEST_SCORING_DATA_LOCATION: "" 24 | # 25 | # FIXME::OPTIONAL: Specify the format of the scoring dataset. Natively supported formats are: 26 | # parquet, spark_sql, delta. 27 | # INGEST_SCORING_DATA_FORMAT: parquet 28 | # 29 | # FIXME::OPTIONAL: Specify the output location of the batch scoring predict step. 30 | # SCORED_OUTPUT_DATA_LOCATION: "" 31 | # 32 | # FIXME::OPTIONAL: Specify the format of the scored dataset. Natively supported formats are: 33 | # parquet, delta, table. 34 | # SCORED_OUTPUT_DATA_FORMAT: parquet 35 | -------------------------------------------------------------------------------- /profiles/local.yaml: -------------------------------------------------------------------------------- 1 | # 2 | # FIXME::REQUIRED: Set an MLflow experiment name to track pipeline executions and artifacts. 3 | # 4 | experiment: 5 |   name: "" 6 |   tracking_uri: "sqlite:///metadata/mlflow/mlruns.db" 7 |   artifact_location: "./metadata/mlflow/mlartifacts" 8 | # 9 | # FIXME::OPTIONAL: Set the registry server URI.
This property is especially useful if you have a 10 | # registry server that’s different from the tracking server. 11 | # model_registry: 12 | #   uri: "sqlite:///metadata/mlflow/registry.db" 13 | # 14 | # FIXME::REQUIRED: Specify the training and evaluation data location. 15 | INGEST_DATA_LOCATION: "" 16 | # 17 | # FIXME::OPTIONAL: Specify the format of the training and evaluation dataset. Natively supported 18 | # formats are: parquet, spark_sql, delta. 19 | # INGEST_DATA_FORMAT: parquet 20 | # 21 | # FIXME::OPTIONAL: Specify the scoring data location. 22 | # INGEST_SCORING_DATA_LOCATION: "" 23 | # 24 | # FIXME::OPTIONAL: Specify the format of the scoring dataset. Natively supported formats are: 25 | # parquet, spark_sql, delta. 26 | # INGEST_SCORING_DATA_FORMAT: parquet 27 | # 28 | # FIXME::OPTIONAL: Specify the output location of the batch scoring predict step. 29 | # SCORED_OUTPUT_DATA_LOCATION: "" 30 | # 31 | # FIXME::OPTIONAL: Specify the format of the scored dataset. Natively supported formats are: 32 | # parquet, delta, table. 33 | # SCORED_OUTPUT_DATA_FORMAT: parquet 34 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | mlflow[pipelines]>=1.27.0,<2.0 2 | pandas>=1.3 3 | scikit-learn>=1.1 4 | ipykernel>=6.12 5 | ipython>=7.32 6 | shap>=0.40 -------------------------------------------------------------------------------- /requirements/lint-requirements.txt: -------------------------------------------------------------------------------- 1 | pylint==2.11.1 2 | black==22.3.0 -------------------------------------------------------------------------------- /requirements/test-requirements.txt: -------------------------------------------------------------------------------- 1 | ## Test-only dependencies 2 | pytest -------------------------------------------------------------------------------- /steps/custom_metrics.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines custom metric functions that are invoked during the 'train' and 'evaluate' 3 | steps to provide model performance insights. Custom metric functions defined in this module are 4 | referenced in the ``metrics`` section of ``pipeline.yaml``, for example: 5 | 6 | .. code-block:: yaml 7 |     :caption: Example custom metrics definition in ``pipeline.yaml`` 8 | 9 |     metrics: 10 |       custom: 11 |         - name: weighted_mean_squared_error 12 |           function: weighted_mean_squared_error 13 |           greater_is_better: False 14 | """ 15 | from typing import Dict 16 | 17 | from pandas import DataFrame 18 | 19 | 20 | def get_custom_metrics( 21 |     eval_df: DataFrame, 22 |     builtin_metrics: Dict[str, float],  # pylint: disable=unused-argument 23 | ) -> Dict[str, float]: 24 |     """ 25 |     FIXME::OPTIONAL: provide a function docstring. 26 |     :param eval_df: A Pandas DataFrame containing the following columns: 27 |                     - ``"prediction"``: Predictions produced by submitting input data to the model. 28 |                     - ``"target"``: Ground truth values corresponding to the input data. 29 |     :param builtin_metrics: A dictionary containing the built-in metrics that are calculated 30 |                             automatically during model evaluation. The keys are the names of the 31 |                             metrics and the values are the scalar values of the metrics. For more 32 |                             information, see 33 |                             https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate.
34 |     :return: A single-entry dictionary containing the custom metrics. The key is the metric name 35 |              and the value is the scalar metric value. Note that custom metric functions can 36 |              return dictionaries with multiple metric entries as well. 37 |     """ 38 |     # FIXME::OPTIONAL: implement custom metrics calculation here. 39 | 40 |     raise NotImplementedError 41 | -------------------------------------------------------------------------------- /steps/ingest.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'ingest' step of the regression pipeline: 3 | 4 | - ``load_file_as_dataframe``: Defines customizable logic for parsing dataset formats that are not 5 |   natively parsed by MLflow Pipelines (i.e. formats other than Parquet, Delta, and Spark SQL). 6 | """ 7 | from pandas import DataFrame 8 | 9 | 10 | def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame: 11 |     """ 12 |     Load content from the specified dataset file as a Pandas DataFrame. 13 | 14 |     This method is used to load dataset types that are not natively managed by MLflow Pipelines 15 |     (datasets that are not in Parquet, Delta Table, or Spark SQL Table format). This method is 16 |     called once for each file in the dataset, and MLflow Pipelines automatically combines the 17 |     resulting DataFrames together. 18 | 19 |     :param file_path: The path to the dataset file. 20 |     :param file_format: The file format string, such as "csv". 21 |     :return: A Pandas DataFrame representing the content of the specified file. 22 |     """ 23 |     # FIXME::OPTIONAL: implement the handling of non-natively supported file_format. 24 | 25 |     raise NotImplementedError 26 | -------------------------------------------------------------------------------- /steps/split.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'split' step of the regression pipeline: 3 | 4 | - ``create_dataset_filter``: Defines customizable logic for filtering the training, validation, 5 |   and test datasets produced by the data splitting procedure. Note that arbitrary transformations 6 |   should go into the transform step. 7 | """ 8 | 9 | from pandas import DataFrame, Series 10 | 11 | 12 | def create_dataset_filter(dataset: DataFrame) -> Series: 13 |     """ 14 |     Select which rows of the split datasets to keep after additional filtering. This function will 15 |     be called on the training, validation, and test datasets. 16 | 17 |     :param dataset: The {train,validation,test} dataset produced by the data splitting procedure. 18 |     :return: A boolean Series in which True marks rows to keep and False marks rows to drop. 19 |     """ 20 |     # FIXME::OPTIONAL: implement post-split filtering on the dataframes, such as data cleaning. 21 | 22 |     return Series(True, index=dataset.index) 23 | -------------------------------------------------------------------------------- /steps/train.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'train' step of the regression pipeline: 3 | 4 | - ``estimator_fn``: Defines the customizable estimator type and parameters that are used 5 |   during training to produce a model pipeline. 6 | """ 7 | from typing import Any, Dict, Optional 8 | 9 | 10 | def estimator_fn(estimator_params: Optional[Dict[str, Any]] = None): 11 |     """ 12 |     Returns an *unfitted* estimator that defines ``fit()`` and ``predict()`` methods.
13 |     The estimator's input and output signatures should be compatible with scikit-learn 14 |     estimators. 15 |     """ 16 |     # 17 |     # FIXME::OPTIONAL: return a scikit-learn-compatible regression estimator with fine-tuned 18 |     #                  hyperparameters. 19 |     from sklearn.linear_model import SGDRegressor 20 | 21 |     return SGDRegressor(**(estimator_params or {})) 22 | -------------------------------------------------------------------------------- /steps/transform.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'transform' step of the regression pipeline: 3 | 4 | - ``transformer_fn``: Defines customizable logic for transforming input data before it is passed 5 |   to the estimator during model training and inference. 6 | """ 7 | 8 | def transformer_fn(): 9 |     """ 10 |     Returns an *unfitted* transformer that defines ``fit()`` and ``transform()`` methods. 11 |     The transformer's input and output signatures should be compatible with scikit-learn 12 |     transformers. 13 |     """ 14 |     # 15 |     # FIXME::OPTIONAL: return a scikit-learn-compatible transformer object. 16 |     # 17 |     # Identity feature transformation is applied when None is returned. 18 |     return None 19 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlflow/mlp-regression-template/e1b0b66d2342afbbaffb832d163681e9b67854a6/tests/__init__.py -------------------------------------------------------------------------------- /tests/train_test.py: -------------------------------------------------------------------------------- 1 | from steps.train import estimator_fn 2 | from sklearn.utils.estimator_checks import check_estimator 3 | 4 | 5 | def test_train_fn_returns_object_with_correct_spec(): 6 |     regressor = estimator_fn() 7 |     assert callable(getattr(regressor, "fit", None)) 8 |     assert callable(getattr(regressor, "predict", None)) 9 | 10 | 11 | def test_train_fn_passes_check_estimator(): 12 |     regressor = estimator_fn() 13 |     check_estimator(regressor) 14 | -------------------------------------------------------------------------------- /tests/transform_test.py: -------------------------------------------------------------------------------- 1 | from steps.transform import transformer_fn 2 | 3 | 4 | def test_transform_fn_returns_object_with_correct_spec(): 5 |     transformer = transformer_fn() 6 |     if transformer: 7 |         assert callable(getattr(transformer, "fit", None)) 8 |         assert callable(getattr(transformer, "transform", None)) 9 | --------------------------------------------------------------------------------
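The sketches below are illustrative only and are not files in the repository above; they show, under stated assumptions, how some of the FIXME placeholders might be filled in for a concrete dataset. First, one possible body for `load_file_as_dataframe` in `steps/ingest.py`, assuming a plain CSV file with a header row; the use of `pandas.read_csv` with default options is an assumption, not a requirement of MLflow Pipelines.

import pandas as pd
from pandas import DataFrame


def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame:
    # Illustrative sketch: parse a CSV dataset file into a Pandas DataFrame.
    if file_format == "csv":
        # Assumes the first row holds column headers and fields are comma-separated.
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")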
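Next, a hedged sketch of a custom metric matching the `weighted_mean_squared_error` example named in the `steps/custom_metrics.py` docstring; weighting the squared errors by the squared ground-truth values is an assumption chosen purely for illustration, and it presumes the targets are not all zero.

from typing import Dict

import numpy as np
from pandas import DataFrame


def weighted_mean_squared_error(
    eval_df: DataFrame,
    builtin_metrics: Dict[str, float],  # unused in this sketch
) -> Dict[str, float]:
    # Illustrative sketch: mean squared error weighted by the squared target value.
    squared_errors = (eval_df["prediction"] - eval_df["target"]) ** 2
    weights = eval_df["target"] ** 2  # assumption: errors on larger targets matter more
    return {"weighted_mean_squared_error": float(np.average(squared_errors, weights=weights))}

To wire such a metric in, the commented-out `metrics.custom` block in `pipeline.yaml` would reference the function name in its `function` field, mirroring the docstring example.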
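Finally, a sketch of a non-identity `transformer_fn` for `steps/transform.py`; standardizing every input feature with scikit-learn's `StandardScaler` is an assumption about the dataset (all-numeric features) rather than a template requirement.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def transformer_fn():
    # Illustrative sketch: an unfitted transformer exposing fit() and transform().
    # Wrapping the scaler in a one-step Pipeline keeps it easy to extend with more steps later.
    return Pipeline(steps=[("standardize", StandardScaler())])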