├── .github ├── ISSUE_TEMPLATE │ ├── bug_report_template.yaml │ ├── doc_fix_template.yaml │ └── feature_request_template.yaml └── pull_request_template.md ├── .gitignore ├── LICENSE.txt ├── README.md ├── notebooks ├── databricks.py └── jupyter.ipynb ├── pipeline.yaml ├── profiles ├── databricks.yaml └── local.yaml ├── requirements.txt ├── requirements ├── lint-requirements.txt └── test-requirements.txt ├── steps ├── custom_metrics.py ├── ingest.py ├── split.py ├── train.py └── transform.py └── tests ├── __init__.py ├── train_test.py └── transform_test.py /.github/ISSUE_TEMPLATE/bug_report_template.yaml: -------------------------------------------------------------------------------- 1 | name: Bug Report 2 | description: Create a report to help us reproduce and correct the bug 3 | labels: 'bug' 4 | title: '[BUG]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=bug&template=bug_report_template.yaml&title=%5BBUG%5D) 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/doc_fix_template.yaml: -------------------------------------------------------------------------------- 1 | name: Documentation Fix 2 | description: Use this template for proposing documentation fixes/improvements. 3 | labels: 'area/docs' 4 | title: '[DOC-FIX]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=area%2Fdocs&template=doc_fix_template.yaml&title=%5BDOC-FIX%5D) 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request_template.yaml: -------------------------------------------------------------------------------- 1 | name: Feature Request 2 | description: Use this template for feature and enhancement proposals. 3 | labels: 'enhancement' 4 | title: '[FR]' 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | We encourage submitting an issue directly to [MLFlow](https://github.com/mlflow/mlflow/issues/new?assignees=&labels=enhancement&template=feature_request_template.yaml&title=%5BFR%5D) 11 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | ## What changes are proposed in this pull request? 2 | 3 | (Please fill in changes proposed in this fix) 4 | 5 | ## How is this patch tested? 6 | 7 | (Details) 8 | 9 | We require adding a link to the [workflow](https://github.com/mlflow/mlflow/actions/workflows/pipeline-template.yml) for testing the [mlflow/mlp-regression-template](https://github.com/mlflow/mlp-regression-template) code path. 10 | 11 | Run the workflow with repository name: `mlflow/mlp-regression-template` and the branch`(Example::feature)` that you are working with. 12 | ## Release Notes 13 | 14 | ### Is this a user-facing change? 15 | 16 | - [ ] No. You can skip the rest of this section. 17 | - [ ] Yes. Give a description of this change to be included in the release notes for MLflow users. 18 | 19 | (Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.) 20 | 21 | ### What component(s), interfaces, languages, and integrations does this PR affect? 
22 | Components 23 | - [ ] `area/regression`: Sklearn regression pipeline example 24 | - [ ] `area/build`: Build and test infrastructure for MLflow pipeline example 25 | 26 | 31 | 32 | ### How should the PR be classified in the release notes? Choose one: 33 | 34 | - [ ] `rn/breaking-change` - The PR will be mentioned in the "Breaking Changes" section 35 | - [ ] `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section 36 | - [ ] `rn/feature` - A new user-facing feature worth mentioning in the release notes 37 | - [ ] `rn/bug-fix` - A user-facing bug fix worth mentioning in the release notes 38 | - [ ] `rn/documentation` - A user-facing documentation change worth mentioning in the release notes 39 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Mlflow 2 | mlruns/ 3 | mlartifacts/ 4 | outputs/ 5 | mlruns.db 6 | 7 | # Mac 8 | .DS_Store 9 | 10 | # Byte-compiled / optimized / DLL files 11 | __pycache__ 12 | *.py[cod] 13 | *$py.class 14 | 15 | # C extensions 16 | *.so 17 | 18 | # Distribution / packaging 19 | .Python 20 | build/ 21 | develop-eggs/ 22 | dist/ 23 | downloads/ 24 | eggs/ 25 | .eggs/ 26 | lib/ 27 | lib64/ 28 | parts/ 29 | sdist/ 30 | var/ 31 | wheels/ 32 | *.egg-info/ 33 | .installed.cfg 34 | *.egg 35 | MANIFEST 36 | node_modules 37 | 38 | # PyInstaller 39 | # Usually these files are written by a python script from a template 40 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 41 | *.manifest 42 | *.spec 43 | 44 | # Installer logs 45 | pip-log.txt 46 | pip-delete-this-directory.txt 47 | 48 | # Unit test / coverage reports 49 | htmlcov/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | .hypothesis/ 57 | .pytest_cache/ 58 | 59 | # Sphinx documentation 60 | docs/_build/ 61 | 62 | # Jupyter Notebook 63 | .ipynb_checkpoints 64 | 65 | # Environments 66 | env 67 | env3 68 | .env 69 | .venv 70 | env/ 71 | venv/ 72 | ENV/ 73 | env.bak/ 74 | venv.bak/ 75 | .python-version 76 | 77 | # Editor files 78 | .*project 79 | *.swp 80 | *.swo 81 | *.idea 82 | *.vscode 83 | *.iml 84 | *~ 85 | 86 | # mkdocs documentation 87 | /site 88 | 89 | # mypy 90 | .mypy_cache/ 91 | 92 | # java targets 93 | target/ 94 | 95 | # R notebooks 96 | .Rproj.user 97 | example/tutorial/R/*.nb.html 98 | 99 | # travis_wait command logs 100 | travis_wait*.log 101 | 102 | # Pytorch logs 103 | lightning_logs 104 | 105 | # tox logs 106 | *.tox 107 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # [DEPRECATED] MLflow Pipelines Regression Template 2 | **NOTE**: This repository is deprecated as of 2022/11/07, and will be removed soon. 3 | If you are using MLflow 2.0, 4 | please refer to [MLflow Recipes Regression Template](https://github.com/mlflow/recipes-regression-template) instead. 5 | 6 | The MLflow Regression Pipeline is an [MLflow Pipeline](https://mlflow.org/docs/latest/pipelines.html) for developing 7 | high-quality regression models. 8 | It is designed for developing models using scikit-learn and frameworks that integrate with scikit-learn, 9 | such as the `XGBRegressor` API from XGBoost. 10 | 11 | This repository is a template for developing production-ready regression models with the MLflow Regression Pipeline. 12 | It provides a pipeline structure for creating models as well as pointers to configurations and code files that should 13 | be filled in to produce a working pipeline. 14 | 15 | Code developed with this template should be run with [MLflow Pipelines](https://mlflow.org/docs/latest/pipelines.html). 16 | An example implementation of this template can be found in the [MLP Regression Example repo](https://github.com/mlflow/mlp-regression-example), 17 | which targets the NYC taxi dataset for its training problem. 18 | 19 | **Note**: [MLflow Pipelines](https://mlflow.org/docs/latest/pipelines.html) 20 | is an experimental feature in [MLflow](https://mlflow.org). 21 | If you observe any issues, 22 | please report them [here](https://github.com/mlflow/mlflow/issues). 23 | For suggestions on improvements, 24 | please file a discussion topic [here](https://github.com/mlflow/mlflow/discussions). 25 | Your contribution to MLflow Pipelines is greatly appreciated by the community! 26 | 27 | ## Key Features 28 | - Deterministic data splitting 29 | - Reproducible data transformations 30 | - Hyperparameter tuning support 31 | - Model registration for use in production 32 | - Starter code for ingest, split, transform and train steps 33 | - Cards containing step results, including dataset profiles, model leaderboard, performance plots and more 34 | 35 | ## Installation 36 | Follow the [MLflow Pipelines installation guide](https://mlflow.org/docs/latest/pipelines.html#installation). 37 | You may need to install additional libraries for extra features: 38 | - [Hyperopt](https://pypi.org/project/hyperopt/) is required for hyperparameter tuning. 39 | - [PySpark](https://pypi.org/project/pyspark/) is required for distributed training or to ingest Spark tables. 40 | - [Delta](https://pypi.org/project/delta-spark/) is required to ingest Delta tables. 41 | These libraries are available natively in the [Databricks Runtime for Machine Learning](https://docs.databricks.com/runtime/mlruntime.html). 42 | 43 | ## Get started 44 | After installing MLflow Pipelines, you can clone this repository to get started. 
Simply fill in the required values annotated by `FIXME::REQUIRED` comments in the [Pipeline configuration file](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) 45 | and in the appropriate profile configuration: [`local.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/local.yaml) 46 | (if running locally) or [`databricks.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/databricks.yaml) 47 | (if running on Databricks). 48 | 49 | The Pipeline will then be in a runnable state, and when run completely, will produce a trained model ready for batch 50 | scoring, along with cards containing detailed information about the results of each step. 51 | The model will also be registered to the MLflow Model Registry if it meets registration thresholds. 52 | To iterate and improve your model, follow the [MLflow Pipelines usage guide](https://mlflow.org/docs/latest/pipelines.html#usage). 53 | Note that iteration will likely involve filling in the optional `FIXME`s in the 54 | step code files with your own code, in addition to the configuration keys. 55 | 56 | ## Reference 57 | ![image](https://user-images.githubusercontent.com/66143562/195912433-f1e44829-dea5-4fb2-a034-b197c2cebf71.png) 58 | 59 | This is a visual overview of the MLflow Regression Pipeline's information flow. 60 | 61 | Model development consists of the following sequential steps: 62 | ``` 63 | ingest -> split -> transform -> train -> evaluate -> register 64 | ``` 65 | 66 | The batch scoring workflow consists of the following sequential steps: 67 | ``` 68 | ingest_scoring -> predict 69 | ``` 70 | A detailed reference for each step follows. 71 | 72 | * [Reference](#reference) 73 | + [Step artifacts](#step-artifacts) 74 | + [Ingest step](#ingest-step) 75 | - [Data](#data) 76 | + [Split step](#split-step) 77 | + [Transform step](#transform-step) 78 | + [Train step](#train-step) 79 | + [Evaluate step](#evaluate-step) 80 | + [Register step](#register-step) 81 | + [Batch scoring](#batch-scoring) 82 | - [Ingest Scoring step](#ingest-scoring-step) 83 | - [Predict step](#predict-step) 84 | + [MLflow Tracking / Model Registry configuration](#mlflow-tracking--model-registry-configuration) 85 | + [Metrics](#metrics) 86 | - [Built-in metrics](#built-in-metrics) 87 | - [Custom metrics](#custom-metrics) 88 | 89 | ### Step artifacts 90 | Each of the steps in the pipeline produces artifacts after completion. These artifacts consist of cards containing 91 | detailed execution information, as well as other step-specific information. 92 | The [`Pipeline.inspect()`](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#mlflow.pipelines.regression.v1.pipeline.RegressionPipeline.inspect) 93 | API is used to view step cards. The [`get_artifact`](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#mlflow.pipelines.regression.v1.pipeline.RegressionPipeline.get_artifact) 94 | API is used to load all other step artifacts by name. 95 | Per-step artifacts are further detailed in the following step references. 96 | 97 | ### Ingest step 98 | The ingest step resolves the dataset specified by the `data` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) 99 | and converts it to parquet format, leveraging the custom loader code specified in the `data` section if necessary. 100 | **Note**: If you make changes to the dataset referenced by the ingest step (e.g.
by adding new records or columns), 101 | you must manually re-run the ingest step in order to use the updated dataset in the pipeline. 102 | The ingest step does not automatically detect changes in the dataset. 103 | 104 | The custom loader function allows use of datasets in other formats, such as `csv`. 105 | The function should be defined in [`steps/ingest.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/ingest.py), 106 | and should accept two parameters: 107 | - `file_path`: `str`. Path to the dataset file. 108 | - `file_format`: `str`. The file format string, such as `"csv"`. 109 | 110 | It should return a Pandas DataFrame representing the content of the specified file. [`steps/ingest.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/ingest.py) contains an example placeholder function. 111 | 112 | #### Data 113 | The input dataset is specified by the `data` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml) as follows: 114 |
115 | Full configuration reference 116 | 117 | - `location`: string. Required, unless `format` is `spark_sql`. 118 | Dataset locations on the local filesystem are supported, as 119 | well as HTTP(S) URLs and any other remote locations [resolvable by MLflow](https://mlflow.org/docs/latest/tracking.html#artifact-stores). 120 | One may specify multiple data locations by a list of locations as long as they have the same data format (see example below) 121 | Examples: 122 | ``` 123 | location: ./data/sample.parquet 124 | ``` 125 | ``` 126 | location: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet 127 | ``` 128 | ``` 129 | location: ["./data/sample.parquet", "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet"] 130 | ``` 131 | - `format`: string. Required. 132 | One of `parquet`, `spark_sql` and `delta`. 133 | 134 | 135 | - `custom_loader_method`: string. Optional. 136 | Fully qualified name of the custom loader function. 137 | Example: 138 | ``` 139 | custom_loader_method: steps.ingest.load_file_as_dataframe 140 | ``` 141 | 142 | - `sql`: string. Required if format is `spark_sql`. 143 | Specifies a SparkSQL statement that identifies the dataset to use. 144 | 145 | 146 | - `version`: int. Optional. 147 | If the `delta` format is specified, use this to specify the Delta table version to read from. 148 | 149 | 150 | - `timestamp`: timestamp. Optional. 151 | If the `delta` format is specified, use this to specify the timestamp at which to read data. 152 |
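For illustration, a minimal custom loader sketch might look like the following. This is only a sketch assuming a CSV dataset; the function name matches the `custom_loader_method` example above, and the body should be adapted to your own data:

```python
import pandas as pd


def load_file_as_dataframe(file_path: str, file_format: str) -> pd.DataFrame:
    """Load the file at `file_path` into a Pandas DataFrame (illustrative sketch)."""
    if file_format == "csv":
        # Assumes a header row; adjust the read options to match your dataset.
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")
```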
153 | 154 | **Step artifacts** 155 | - `ingested_data`: The ingested data as a Pandas DataFrame. 156 | 157 | ### Split step 158 | 159 | The split step splits the ingested dataset produced by the ingest step into: 160 | - a training dataset for model training 161 | - a validation dataset for model performance evaluation & tuning, and 162 | - a test dataset for model performance evaluation. 163 | 164 | The fraction of records allocated to each dataset is defined by the `split_ratios` attribute of the `split` step 165 | definition in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml). 166 | The split step also preprocesses the datasets using logic defined in [`steps/split.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/split.py). 167 | Subsequent steps use these datasets to develop a model and measure its performance. 168 | 169 | The post-split method should be written in `steps/split.py` and should accept three parameters: 170 | - `train_df`: DataFrame. The unprocessed train dataset. 171 | - `validation_df`: DataFrame. The unprocessed validation dataset. 172 | - `test_df`: DataFrame. The unprocessed test dataset. 173 | 174 | It should return a triple representing the processed train, validation and test datasets. `steps/split.py` contains an example placeholder function. 175 | 176 | The split step is configured by the `steps.split` section in `pipeline.yaml` as follows: 177 |
178 | Full configuration reference 179 | 180 | - `split_ratios`: list. Optional. 181 | A YAML list specifying the ratios by which to split the dataset into training, validation and test sets. 182 | Example: 183 | ``` 184 | split_ratios: [0.75, 0.125, 0.125] # Defaults to this ratio if unspecified 185 | ``` 186 | - `post_split_filter_method`: string. Optional. 187 | Fully qualified name of the method to use to "post-process" the split datasets. 188 | This procedure is meant for removing/filtering records, or other cleaning processes. Arbitrary data transformations 189 | should be done in the transform step. 190 | Example: 191 | ``` 192 | post_split_filter_method: steps.split.process_splits 193 | ``` 194 |
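For illustration, a minimal post-split sketch following the three-DataFrame signature described above might look like this. The drop-missing-values logic is only an assumption, and the function name matches the `post_split_filter_method` example above:

```python
from typing import Tuple

from pandas import DataFrame


def process_splits(
    train_df: DataFrame, validation_df: DataFrame, test_df: DataFrame
) -> Tuple[DataFrame, DataFrame, DataFrame]:
    """Drop records with missing values from each split (placeholder filtering logic)."""
    return train_df.dropna(), validation_df.dropna(), test_df.dropna()
```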
195 | 196 | **Step artifacts**: 197 | - `training_data`: the training dataset as a Pandas DataFrame. 198 | - `validation_data`: the validation dataset as a Pandas DataFrame. 199 | - `test_data`: the test dataset as a Pandas DataFrame. 200 | 201 | ### Transform step 202 | 203 | The transform step uses the training dataset created by the split step to fit a transformer that performs the 204 | user-defined transformations. The transformer is then applied to the training dataset and the validation dataset, 205 | creating transformed datasets that are used by subsequent steps for estimator training and model performance evaluation. 206 | 207 | The user-defined transformation function is not required. If absent, an **identity transformer** will be used. 208 | The user-defined function should be written in 209 | [`steps/transform.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/transform.py), 210 | and should return an unfitted estimator that is sklearn-compatible; that is, the returned object should define 211 | `fit()` and `transform()` methods. `steps/transform.py` contains an example placeholder function. 212 | 213 | The transform step is configured by the `steps.transform` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 214 |
215 | Full configuration reference 216 | 217 | - `transformer_method`: string. Optional. 218 | Fully qualified name of the method that returns an `sklearn`-compatible transformer which applies feature 219 | transformation during model training and inference. If absent, an identity transformer will be used. 220 | Example: 221 | ``` 222 | transformer_method: steps.transform.transformer_fn 223 | ``` 224 | 225 |
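For illustration, a transformer function might simply return a standard scikit-learn preprocessor. This is a minimal sketch; real feature transformations are problem-specific:

```python
from sklearn.preprocessing import StandardScaler


def transformer_fn():
    """Return an unfitted transformer defining fit() and transform() (illustrative sketch)."""
    return StandardScaler()
```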
226 | 227 | **Step artifacts**: 228 | - `transformed_training_data`: transformed training dataset as a Pandas DataFrame. 229 | - `transformed_validation_data`: transformed validation dataset as a Pandas DataFrame. 230 | - `transformer`: the sklearn transformer. 231 | 232 | 233 | ### Train step 234 | The train step uses the transformed training dataset output from the transform step to fit a user-defined estimator. 235 | The estimator is then joined with the fitted transformer output from the transform step to create a model pipeline. 236 | Finally, this model pipeline is evaluated against the transformed training and validation datasets to compute performance metrics. 237 | 238 | Custom evaluation metrics are computed according to definitions in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py) 239 | and the `metrics` section of `pipeline.yaml`; see the [Custom Metrics](#custom-metrics) section for reference. 240 | 241 | The model pipeline and its associated parameters, performance metrics, and lineage information are logged to [MLflow Tracking](https://www.mlflow.org/docs/latest/tracking.html), producing an MLflow Run. 242 | 243 | The user-defined estimator function should be written in [`steps/train.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/train.py), 244 | and should return an unfitted estimator that is `sklearn`-compatible; that is, the returned object should define 245 | `fit()` and `predict()` methods. `steps/train.py` contains an example placeholder function. 246 | 247 | The train step is configured by the `steps.train` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 248 |
249 | Full configuration reference 250 | 251 | - `estimator_method`: string. Required. 252 | Fully qualified name of the method that returns an `sklearn`-compatible estimator used for model training. 253 | Example: 254 | ``` 255 | estimator_method: steps.train.estimator_fn 256 | ``` 257 | 258 | - Tuning configuration reference 259 | 260 | - `enabled`: boolean. Required. 261 | Indicates whether or not tuning is enabled. 262 | 263 | - `max_trials`: int. Required. 264 | Max tuning trials to run. 265 | 266 | - `algorithm`: string. Optional. 267 | The `hyperopt` search algorithm to use, such as `hyperopt.rand.suggest`. 268 | 269 | - `early_stop_fn`: string. Optional. 270 | Early stopping function to be passed to `hyperopt`. 271 | 272 | - `parallelism`: int. Optional. 273 | Number of workers to run `hyperopt` across. 274 | 275 | - `sample_fraction`: float. Optional. 276 | Sampling fraction in the range `(0, 1.0]` to indicate the amount of data used in tuning. 277 | 278 | - `parameters`: list. Required. 279 | `hyperopt` search space in YAML format. 280 | 281 | Example: 282 | ``` 283 | tuning: 284 | enabled: True 285 | algorithm: "hyperopt.rand.suggest" 286 | max_trials: 5 287 | early_stop_fn: "hyperopt.early_stop.no_progress_loss(10)" 288 | parallelism: 1 289 | sample_fraction: 0.5 290 | parameters: 291 | alpha: 292 | distribution: "uniform" 293 | low: 0.0 294 | high: 0.01 295 | penalty: 296 | values: ["l1", "l2", "elasticnet"] 297 | ``` 298 | 299 |
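For illustration, an estimator function compatible with the tuning example above (which searches over `alpha` and `penalty`) might return a scikit-learn `SGDRegressor`. This is only a sketch, and the exact signature should mirror the placeholder in `steps/train.py`:

```python
from sklearn.linear_model import SGDRegressor


def estimator_fn():
    """Return an unfitted estimator defining fit() and predict() (illustrative sketch)."""
    return SGDRegressor(alpha=0.001, penalty="l2", random_state=42)
```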
300 | 301 | **Step artifacts**: 302 | - `model`: the [MLflow Model](https://www.mlflow.org/docs/latest/models.html) pipeline created in the train step 303 | as a [PyFuncModel](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.PyFuncModel) instance. 304 | 305 | 306 | ### Evaluate step 307 | The evaluate step evaluates the model pipeline created by the train step on the test dataset output from the 308 | split step, computing performance metrics and model explanations. 309 | 310 | Performance metrics are compared against configured thresholds to produce a `model_validation_status`, which indicates 311 | whether or not a model is validated to be registered to the [MLflow Model Registry](https://www.mlflow.org/docs/latest/model-registry.html) 312 | by the subsequent [register step](#register-step). 313 | These model performance thresholds are defined in the 314 | `validation_criteria` section of the `evaluate` step definition in `pipeline.yaml`. 315 | Custom evaluation metrics are computed according to definitions in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py) 316 | and the `metrics` section of `pipeline.yaml`; see the [custom metrics section](#custom-metrics) for reference. 317 | 318 | Model performance metrics and explanations are logged to the same MLflow Tracking Run used by the train step. 319 | 320 | The evaluate step is configured by the `steps.evaluate` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 321 |
322 | Full configuration reference 323 | 324 | - `validation_criteria`: list. Optional. 325 | A list of validation thresholds, each of which a trained model must meet in order to be eligible for 326 | registration in the [register step](#register-step). 327 | A definition for a validation threshold consists of a metric name 328 | (either a [built-in metric](#built-in-metrics) or a [custom metric](#custom-metrics)), and a threshold value. 329 | Example: 330 | ``` 331 | validation_criteria: 332 | - metric: root_mean_squared_error 333 | threshold: 10 334 | ``` 335 |
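Once the evaluate step has run, the computed metrics can be inspected from the associated MLflow run. A minimal sketch using the Pipeline API shown in the notebooks (the `run` artifact is described under the step artifacts below):

```python
from mlflow.pipelines import Pipeline

p = Pipeline(profile="local")
p.run("evaluate")

# The evaluate step logs its metrics to the same MLflow run used by the train step.
run = p.get_artifact("run")
print(run.data.metrics)
```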
336 | 337 | **Step artifacts**: 338 | - `run`: the MLflow Tracking Run containing the model pipeline, as well as the performance metrics created during 339 | the train and evaluate steps. 340 | 341 | 342 | ### Register step 343 | The register step checks the `model_validation_status` output of the preceding [evaluate step](#evaluate-step) and, 344 | if model validation was successful (if model_validation_status is `'VALIDATED'`), registers the model pipeline created 345 | by the train step to the MLflow Model Registry. If the `model_validation_status` does not indicate that the model 346 | passed validation checks (if model_validation_status is `'REJECTED'`), the model pipeline is **not** registered to the 347 | MLflow Model Registry. 348 | If the model pipeline is registered to the MLflow Model Registry, a `registered_model_version` is produced containing 349 | the model name and the model version. 350 | 351 | The register step is configured by the `steps.register` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 352 |
353 | Full configuration reference 354 | 355 | - `model_name`: string. Required. 356 | Specifies the name to use when registering the trained model to the model registry. 357 | 358 | 359 | - `allow_non_validated_model`: boolean. Required. 360 | Whether to allow registration of models that fail to meet performance thresholds. 361 | 362 |
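Once a model version has been registered, it can also be loaded outside the pipeline via the standard MLflow model URI scheme. A minimal sketch, where the model name and version are placeholders; substitute the configured `model_name` and the version produced by this step:

```python
import mlflow

# "my_regression_model" and version "1" are placeholders for the registered
# model name and version produced by the register step.
model = mlflow.pyfunc.load_model("models:/my_regression_model/1")
```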
363 | 364 | **Step artifacts**: 365 | - `registered_model_version`: the MLflow Model Registry [ModelVersion](https://mlflow.org/docs/latest/model-registry.html#concepts) 366 | registered in this step. 367 | 368 | 369 | ### Batch scoring 370 | After model training, the regression pipeline provides the capability to score new data with the 371 | trained model. 372 | 373 | #### Ingest Scoring step 374 | The ingest scoring step, defined in the `data_scoring` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), 375 | specifies the dataset used for batch scoring and has the same API as the [ingest step](#ingest-step). 376 | 377 | **Step artifacts**: 378 | - `ingested_scoring_data`: the ingested scoring data as a Pandas DataFrame. 379 | 380 | #### Predict step 381 | The predict step uses the model registered by the [register step](#register-step) to score the 382 | ingested dataset produced by the [ingest scoring step](#ingest-scoring-step) and writes the resulting 383 | dataset to the specified output format and location. To fix a specific model for use in the predict 384 | step, provide its model URI as the `model_uri` attribute of the `pipeline.yaml` predict step definition. 385 | 386 | The predict step is configured by the `steps.predict` section in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml): 387 |
388 | Full configuration reference 389 | 390 | - `output_format`: string. Required. 391 | Specifies the output format of the scored data from the predict step. One of `parquet`, `delta`, and 392 | `table`. The `parquet` format writes the scored data as parquet files under a specified path. The 393 | `delta` format writes the scored data as a delta table under a specified path. The `table` format 394 | writes the scored data as delta table and creates a metastore entry for this table with a specified name. 395 | 396 | 397 | - `output_location`: string. Required. 398 | For the `parquet` and `delta` output formats, this attribute specifies the output path for writing 399 | the scored data. In Databricks, this path will be written to be under [DBFS](https://docs.databricks.com/dbfs/index.html), 400 | e.g. the path `my/special/path` will be written under `/dbfs/my/special/path`. For the `table` output 401 | format, this attribute specifies the table name that is used to create the metastore entry for the 402 | written delta table. 403 | Example: 404 | ``` 405 | output_location: ./outputs/predictions 406 | ``` 407 | 408 | 409 | - `model_uri`: string. Optional. 410 | Specifies the URI of the model to use for batch scoring. If empty, the latest model version produced 411 | by the register step is used. If the register step was cleared, the latest version of the 412 | registered model specified by the `model_name` attribute of the `pipeline.yaml` [register step](#register-step) 413 | will be used. 414 | Example: 415 | ``` 416 | model_uri: models/model.pkl 417 | ``` 418 | 419 | 420 | - `result_type`: string. Optional. Defaults to `double`. 421 | Specifies the data type for predictions generated by the model. See the 422 | [MLflow spark_udf API docs](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html#mlflow.pyfunc.spark_udf) 423 | for more information. 424 | 425 | 426 | - `save_mode`: string. Optional. Defaults to `default`. 427 | Specifies the save mode used by Spark for writing the scored data. See the 428 | [PySpark save modes documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes) 429 | for more information. 430 |
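A typical batch scoring invocation with the Pipeline API might look like the following sketch; the step names follow the batch scoring workflow described above, and the `scored_data` artifact is listed below:

```python
from mlflow.pipelines import Pipeline

p = Pipeline(profile="local")
p.run("ingest_scoring")
p.run("predict")

# Predictions are written to the configured output location and are also
# available as the `scored_data` step artifact.
scored = p.get_artifact("scored_data")
print(scored.head())
```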
431 | 432 | 433 | **Step artifacts**: 434 | - `scored_data`: the scored dataset, with model predictions under the `prediction` column, as a Pandas DataFrame. 435 | 436 | 437 | ### MLflow Tracking / Model Registry configuration 438 | MLflow can be configured to log runs to a specific tracking server. Tracking information is specified 439 | in the profile configuration files - [`profiles/local.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/local.yaml) 440 | if running locally and [`profiles/databricks.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/profiles/databricks.yaml) 441 | if running on Databricks. 442 | 443 | Configuring a tracking server is optional. If this configuration is absent, the default experiment will be used. 444 | 445 | Tracking information is configured with the `experiment` section in the profile configuration: 446 |
447 | Full configuration reference 448 | 449 | - `name`: string. Required, if configuring tracking. 450 | Name of the experiment to log MLflow runs to. 451 | 452 | 453 | - `tracking_uri`: string. Required, if configuring tracking. 454 | URI of the MLflow tracking server to log runs to. Alternatively, the `MLFLOW_TRACKING_URI` environment variable can be [set to point to a valid tracking server](https://www.mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_tracking_uri). 455 | 456 | 457 | - `artifact_location`: string. Optional. 458 | URI of the location to log run artifacts to. 459 | 460 |
461 | 462 | To register trained models to the MLflow Model Registry, further configuration may be required. If unspecified, models will be logged to the same server as specified in the tracking URI. 463 | 464 | To register models to a different server, specify the desired server in the `model_registry` section in the profile configuration: 465 |
466 | Full configuration reference 467 | 468 | - `uri`: string. Required, if this section is present. 469 | URI of the model registry server to which to register trained models. 470 | 471 |
472 | 473 | ### Metrics 474 | Evaluation metrics calculate model performance against different datasets. The metrics defined in the pipeline 475 | will be calculated as part of the training and evaluation steps, and calculated values will be recorded in each 476 | step’s information card. 477 | 478 | This regression pipeline features a set of built-in metrics, and supports user-defined metrics as well. 479 | 480 | The **primary evaluation metric** is the one that will be used to select the best performing model in the MLflow UI as 481 | well as in the train and evaluation steps. This can be either a built-in metric or a custom metric (see below). 482 | Models are ranked by this primary metric. 483 | 484 | Metrics are configured under the `metrics` section of [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), according to the following specification: 485 |
486 | Full configuration reference 487 | 488 | - `primary`: string. Required. 489 | The name of the primary evaluation metric. 490 | 491 | 492 | - `custom`: list. Optional. 493 | A list of custom metric configurations. 494 | 495 |
496 | 497 | Note that each metric specifies a boolean value `greater_is_better`, which indicates whether a higher value for that 498 | metric is associated with better model performance. 499 | 500 | #### Built-in metrics 501 | The following metrics are built-in. Note that `greater_is_better = False` for all these metrics: 502 | 503 | - `mean_absolute_error` 504 | - `mean_squared_error` 505 | - `root_mean_squared_error` 506 | - `max_error` 507 | - `mean_absolute_percentage_error` 508 | 509 | #### Custom metrics 510 | Custom evaluation metrics define how trained models should be evaluated against custom criteria not captured by 511 | built-in `sklearn` evaluation metrics. 512 | 513 | Custom evaluation metric functions should be defined in [`steps/custom_metrics.py`](https://github.com/mlflow/mlp-regression-template/blob/main/steps/custom_metrics.py). 514 | Each should accept two parameters: 515 | - `eval_df`: DataFrame. 516 | A Pandas DataFrame containing two columns: 517 | - `prediction`: Predictions produced by submitting input data to the model. 518 | - `target`: Corresponding target truth values. 519 | 520 | 521 | - `builtin_metrics`: `Dict[str, int]`. 522 | The built-in metrics calculated during model evaluation. Maps metric names to corresponding scalar values. 523 | 524 | The custom metric function should return a `Dict[str, int]`, mapping custom metric names to corresponding scalar metric values. 525 | 526 | Custom metrics are specified as a list under the `metrics.custom` key in [`pipeline.yaml`](https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml), specified as follows: 527 | - `name`: string. Required. 528 | Name of the custom metric. This will be the name by which you refer to this metric when including it in model evaluation or model training. 529 | 530 | 531 | - `function`: string. Required. Specifies the function this custom metric refers to. 532 | 533 | 534 | - `greater_is_better`: boolean. Required. Boolean indicating whether a higher metric value indicates better model 535 | performance. 536 | 537 | An example custom metric configuration is as follows: 538 | ``` 539 | custom: 540 | - name: weighted_mean_square_error 541 | function: steps.custom_metrics.get_custom_metrics 542 | greater_is_better: True 543 | ``` 544 | -------------------------------------------------------------------------------- /notebooks/databricks.py: -------------------------------------------------------------------------------- 1 | # Databricks notebook source 2 | 3 | # MAGIC %md 4 | # MAGIC # MLflow Regression Pipeline Databricks Notebook 5 | # MAGIC This notebook runs the MLflow Regression Pipeline on Databricks and inspects its results. 6 | # MAGIC 7 | # MAGIC For more information about the MLflow Regression Pipeline, including usage examples, 8 | # MAGIC see the [Regression Pipeline overview documentation](https://mlflow.org/docs/latest/pipelines.html#regression-pipeline) 9 | # MAGIC and the [Regression Pipeline API documentation](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#module-mlflow.pipelines.regression.v1.pipeline). 
10 | 11 | # COMMAND ---------- 12 | 13 | # MAGIC %pip install mlflow[pipelines] 14 | # MAGIC %pip install -r ../requirements.txt 15 | 16 | # COMMAND ---------- 17 | 18 | # MAGIC %md ### Create a new pipeline with "databricks" profile: 19 | 20 | # COMMAND ---------- 21 | 22 | from mlflow.pipelines import Pipeline 23 | 24 | p = Pipeline(profile="databricks") 25 | 26 | # COMMAND ---------- 27 | 28 | # MAGIC %md ### Inspect a newly created pipeline using a graphical representation: 29 | 30 | # COMMAND ---------- 31 | 32 | p.inspect() 33 | 34 | # COMMAND ---------- 35 | 36 | # MAGIC %md ### Ingest the dataset into the pipeline: 37 | 38 | # COMMAND ---------- 39 | 40 | p.run("ingest") 41 | 42 | # COMMAND ---------- 43 | 44 | # MAGIC %md ### Split the dataset in train, validation and test data profiles: 45 | 46 | # COMMAND ---------- 47 | 48 | p.run("split") 49 | 50 | # COMMAND ---------- 51 | 52 | training_data = p.get_artifact("training_data") 53 | training_data.describe() 54 | 55 | # COMMAND ---------- 56 | 57 | p.run("transform") 58 | 59 | # COMMAND ---------- 60 | 61 | # MAGIC %md ### Using training data profile, train the model: 62 | 63 | # COMMAND ---------- 64 | 65 | p.run("train") 66 | 67 | # COMMAND ---------- 68 | 69 | trained_model = p.get_artifact("model") 70 | print(trained_model) 71 | 72 | # COMMAND ---------- 73 | 74 | # MAGIC %md ### Evaluate the resulting model using validation data profile: 75 | 76 | # COMMAND ---------- 77 | 78 | 79 | p.run("evaluate") 80 | 81 | # COMMAND ---------- 82 | 83 | # MAGIC %md ### Register the trained model in the registry: 84 | 85 | # COMMAND ---------- 86 | 87 | p.run("register") 88 | -------------------------------------------------------------------------------- /notebooks/jupyter.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%load_ext autoreload\n", 10 | "%autoreload 2" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# MLflow Regression Pipeline Notebook\n", 18 | "\n", 19 | "This notebook runs the MLflow Regression Pipeline on Databricks and inspects its results. For more information about the MLflow Regression Pipeline, including usage examples, see the [Regression Pipeline overview documentation](https://mlflow.org/docs/latest/pipelines.html#regression-pipeline) the [Regression Pipeline API documentation](https://mlflow.org/docs/latest/python_api/mlflow.pipelines.html#module-mlflow.pipelines.regression.v1.pipeline)." 
20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": null, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "from mlflow.pipelines import Pipeline\n", 29 | "\n", 30 | "p = Pipeline(profile=\"local\")" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "p.inspect()" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "p.run(\"ingest\")" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "p.run(\"split\")" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "training_data = p.get_artifact(\"training_data\")\n", 67 | "training_data.describe()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "p.run(\"transform\")" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "p.run(\"train\")" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "trained_model = p.get_artifact(\"model\")\n", 95 | "print(trained_model)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "p.run(\"evaluate\")" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "p.run(\"register\")" 114 | ] 115 | } 116 | ], 117 | "metadata": { 118 | "interpreter": { 119 | "hash": "c462df992c775797bd5d542b280333286dbcc2ffa1f781b674f30e76741ca83d" 120 | }, 121 | "kernelspec": { 122 | "display_name": "Python 3 (ipykernel)", 123 | "language": "python", 124 | "name": "python3" 125 | }, 126 | "language_info": { 127 | "codemirror_mode": { 128 | "name": "ipython", 129 | "version": 3 130 | }, 131 | "file_extension": ".py", 132 | "mimetype": "text/x-python", 133 | "name": "python", 134 | "nbconvert_exporter": "python", 135 | "pygments_lexer": "ipython3", 136 | "version": "3.9.12" 137 | } 138 | }, 139 | "nbformat": 4, 140 | "nbformat_minor": 4 141 | } 142 | -------------------------------------------------------------------------------- /pipeline.yaml: -------------------------------------------------------------------------------- 1 | # `pipeline.yaml` is the main configuration file for an MLflow Pipeline. 2 | # Required pipeline parameters should be defined in this file with either concrete values or 3 | # variables such as {{ INGEST_DATA_LOCATION }}. 4 | # 5 | # Variables must be dereferenced in a profile YAML file, located under `profiles/`. 6 | # See `profiles/local.yaml` for example usage. One may switch among profiles quickly by 7 | # providing a profile name such as `local` in the Pipeline object constructor: 8 | # `p = Pipeline(profile="local")` 9 | # 10 | # NOTE: All "FIXME::REQUIRED" fields in pipeline.yaml and profiles/*.yaml must be set correctly 11 | # to adapt this template to a specific regression problem. 
To find all required fields, 12 | # under the root directory of this pipeline, type on a unix-like command line: 13 | # $> grep "# FIXME::REQUIRED:" pipeline.yaml profiles/*.yaml 14 | # 15 | # NOTE: YAML does not support tabs for indentation. Please use spaces and ensure that all YAML 16 | # files are properly formatted. 17 | 18 | template: "regression/v1" 19 | # Specifies the dataset to use for model development 20 | data: 21 | # Dataset locations on the local filesystem are supported, as well as HTTP(S) URLs and 22 | # any other remote locations resolvable by MLflow, such as those listed in 23 | # https://mlflow.org/docs/latest/tracking.html#artifact-stores 24 | location: {{INGEST_DATA_LOCATION}} 25 | # Beyond `parquet` datasets, the `spark_sql` and `delta` formats are also natively supported for 26 | # use with Spark 27 | format: {{INGEST_DATA_FORMAT|default('parquet')}} 28 | # Datasets with other formats, including `csv`, can be used by implementing and 29 | # specifying a `custom_loader_method` 30 | custom_loader_method: steps.ingest.load_file_as_dataframe 31 | # If the `spark_sql` `format` is specified, 32 | # And if the table location format is path-like, use the following sql command for Spark to read 33 | # sub-columns from the table: 34 | # sql: SELECT col1, col2 FROM delta.`{{INGEST_DATA_LOCATION}}` 35 | # And if the table location format is table-like, use the following sql command for Spark to read 36 | # sub-columns from the table: 37 | # sql: SELECT col1, col2 FROM {{INGEST_DATA_LOCATION}} 38 | # If the `delta` `format` is specified, you can also configure the Delta table `version` to read 39 | # or the `timestamp` at which to read data 40 | # version: 2 41 | # timestamp: 2022-06-01T00:00:00.000Z 42 | # FIXME::OPTIONAL: Specify the dataset to use for batch scoring. All params serve the same function 43 | # as in `data` 44 | # data_scoring: 45 | # location: {{INGEST_SCORING_DATA_LOCATION}} 46 | # format: {{INGEST_SCORING_DATA_FORMAT|default('parquet')}} 47 | # custom_loader_method: steps.ingest.load_file_as_dataframe 48 | # sql: SELECT * FROM delta.`{{INGEST_SCORING_DATA_LOCATION}}` 49 | # 50 | # FIXME::REQUIRED: Specifies the target column name for model training and evaluation. 51 | # 52 | target_col: "" 53 | steps: 54 | split: 55 | # 56 | # FIXME::OPTIONAL: Adjust the train/validation/test split ratios below. 57 | # 58 | split_ratios: [0.75, 0.125, 0.125] 59 | # 60 | # FIXME::OPTIONAL: Specifies the method to use to "post-process" the split datasets. Note that 61 | # arbitrary transformations should go into the transform step. 62 | post_split_filter_method: steps.split.create_dataset_filter 63 | transform: 64 | # 65 | # FIXME::OPTIONAL: Specifies the method that defines an sklearn-compatible transformer, which 66 | # applies input feature transformation during model training and inference. 67 | transformer_method: steps.transform.transformer_fn 68 | train: 69 | using: estimator_spec 70 | # Specifies the method that defines the estimator type and parameters to use for model training 71 | estimator_method: steps.train.estimator_fn 72 | evaluate: 73 | # 74 | # FIXME::OPTIONAL: Sets performance thresholds that a trained model must meet in order to be 75 | # eligible for registration to the MLflow Model Registry. 76 | # 77 | # validation_criteria: 78 | # - metric: root_mean_squared_error 79 | # threshold: 10 80 | register: 81 | # 82 | # FIXME::REQUIRED: Specifies the name of the Registered Model to use when registering a trained 83 | # model to the MLflow Model Registry. 
84 |     # 85 |     model_name: "" 86 |     # Indicates whether or not a model that fails to meet performance thresholds should still 87 |     # be registered to the MLflow Model Registry 88 |     allow_non_validated_model: false 89 |   # FIXME::OPTIONAL: Configure the predict step for batch scoring. See README.md for full 90 |   # configuration reference. 91 |   # predict: 92 |   #   output_format: {{SCORED_OUTPUT_DATA_FORMAT|default('parquet')}} 93 |   #   output_location: {{SCORED_OUTPUT_DATA_LOCATION}} 94 |   #   model_uri: "models/model.pkl" 95 |   #   result_type: "double" 96 |   #   save_mode: "default" 97 | metrics: 98 |   # 99 |   # FIXME::REQUIRED: Sets the primary metric to use to evaluate model performance. This primary 100 |   # metric is used to select the best performing models in the MLflow UI as well as in the 101 |   # train and evaluate steps. 102 |   # Built-in metrics are: example_count, mean_absolute_error, mean_squared_error, 103 |   # root_mean_squared_error, sum_on_label, mean_on_label, r2_score, max_error, 104 |   # mean_absolute_percentage_error 105 |   primary: "" 106 |   # 107 |   # FIXME::OPTIONAL: Defines custom performance metrics to compute during model development. 108 |   # 109 |   # custom: 110 |   #   - name: "" 111 |   #     function: get_custom_metrics 112 |   #     greater_is_better: False 113 | 114 | -------------------------------------------------------------------------------- /profiles/databricks.yaml: -------------------------------------------------------------------------------- 1 | # 2 | # FIXME::REQUIRED: Set an MLflow experiment name to track pipeline executions and artifacts. On Databricks, an 3 | # experiment name must be a valid path in the workspace. 4 | # 5 | experiment: 6 |   name: "" 7 | # 8 | # FIXME::OPTIONAL: Set the registry server URI, useful if you have a registry server different 9 | # from the tracking server. First create a Databricks profile; see 10 | # https://github.com/databricks/databricks-cli#installation 11 | # model_registry: 12 | #   uri: "databricks://DATABRICKS_PROFILE_NAME" 13 | 14 | # FIXME::REQUIRED: Specify the training and evaluation data location. This is usually a DBFS 15 | # location ("dbfs:/...") or a SQL table ("SCHEMA.TABLE"). 16 | INGEST_DATA_LOCATION: "" 17 | # 18 | # FIXME::OPTIONAL: Specify the format of the training and evaluation dataset. Natively supported 19 | # formats are: parquet, spark_sql, delta. 20 | # INGEST_DATA_FORMAT: parquet 21 | # 22 | # FIXME::OPTIONAL: Specify the scoring data location. 23 | # INGEST_SCORING_DATA_LOCATION: "" 24 | # 25 | # FIXME::OPTIONAL: Specify the format of the scoring dataset. Natively supported formats are: 26 | # parquet, spark_sql, delta. 27 | # INGEST_SCORING_DATA_FORMAT: parquet 28 | # 29 | # FIXME::OPTIONAL: Specify the output location of the batch scoring predict step. 30 | # SCORED_OUTPUT_DATA_LOCATION: "" 31 | # 32 | # FIXME::OPTIONAL: Specify the format of the scored dataset. Natively supported formats are: 33 | # parquet, delta, table. 34 | # SCORED_OUTPUT_DATA_FORMAT: parquet 35 | -------------------------------------------------------------------------------- /profiles/local.yaml: -------------------------------------------------------------------------------- 1 | # 2 | # FIXME::REQUIRED: Set an MLflow experiment name to track pipeline executions and artifacts. 3 | # 4 | experiment: 5 |   name: "" 6 |   tracking_uri: "sqlite:///metadata/mlflow/mlruns.db" 7 |   artifact_location: "./metadata/mlflow/mlartifacts" 8 | # 9 | # FIXME::OPTIONAL: Set the registry server URI.
This property is especially useful if you have a 10 | # registry server that’s different from the tracking server. 11 | # model_registry: 12 | #   uri: "sqlite:///metadata/mlflow/registry.db" 13 | # 14 | # FIXME::REQUIRED: Specify the training and evaluation data location. 15 | INGEST_DATA_LOCATION: "" 16 | # 17 | # FIXME::OPTIONAL: Specify the format of the training and evaluation dataset. Natively supported 18 | # formats are: parquet, spark_sql, delta. 19 | # INGEST_DATA_FORMAT: parquet 20 | # 21 | # FIXME::OPTIONAL: Specify the scoring data location. 22 | # INGEST_SCORING_DATA_LOCATION: "" 23 | # 24 | # FIXME::OPTIONAL: Specify the format of the scoring dataset. Natively supported formats are: 25 | # parquet, spark_sql, delta. 26 | # INGEST_SCORING_DATA_FORMAT: parquet 27 | # 28 | # FIXME::OPTIONAL: Specify the output location of the batch scoring predict step. 29 | # SCORED_OUTPUT_DATA_LOCATION: "" 30 | # 31 | # FIXME::OPTIONAL: Specify the format of the scored dataset. Natively supported formats are: 32 | # parquet, delta, table. 33 | # SCORED_OUTPUT_DATA_FORMAT: parquet 34 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | mlflow[pipelines]>=1.27.0,<2.0 2 | pandas>=1.3 3 | scikit-learn>=1.1 4 | ipykernel>=6.12 5 | ipython>=7.32 6 | shap>=0.40 -------------------------------------------------------------------------------- /requirements/lint-requirements.txt: -------------------------------------------------------------------------------- 1 | pylint==2.11.1 2 | black==22.3.0 -------------------------------------------------------------------------------- /requirements/test-requirements.txt: -------------------------------------------------------------------------------- 1 | ## Test-only dependencies 2 | pytest -------------------------------------------------------------------------------- /steps/custom_metrics.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines custom metric functions that are invoked during the 'train' and 'evaluate' 3 | steps to provide model performance insights. Custom metric functions defined in this module are 4 | referenced in the ``metrics`` section of ``pipeline.yaml``, for example: 5 | 6 | .. code-block:: yaml 7 |     :caption: Example custom metrics definition in ``pipeline.yaml`` 8 | 9 |     metrics: 10 |       custom: 11 |         - name: weighted_mean_squared_error 12 |           function: weighted_mean_squared_error 13 |           greater_is_better: False 14 | """ 15 | from typing import Dict 16 | 17 | from pandas import DataFrame 18 | 19 | 20 | def get_custom_metrics( 21 |     eval_df: DataFrame, 22 |     builtin_metrics: Dict[str, float],  # pylint: disable=unused-argument 23 | ) -> Dict[str, float]: 24 |     """ 25 |     FIXME::OPTIONAL: provide a function docstring. 26 |     :param eval_df: A Pandas DataFrame containing the following columns: 27 |                     - ``"prediction"``: Predictions produced by submitting input data to the model. 28 |                     - ``"target"``: Ground truth values corresponding to the input data. 29 |     :param builtin_metrics: A dictionary containing the built-in metrics that are calculated 30 |                             automatically during model evaluation. The keys are the names of the 31 |                             metrics and the values are the scalar values of the metrics. For more 32 |                             information, see 33 |                             https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.evaluate.
34 |     :return: A single-entry dictionary containing the custom metrics. The key is the metric name 35 |              and the value is the scalar metric value. Note that custom metric functions can 36 |              return dictionaries with multiple metric entries as well. 37 |     """ 38 |     # FIXME::OPTIONAL: implement custom metrics calculation here. 39 | 40 |     raise NotImplementedError 41 | -------------------------------------------------------------------------------- /steps/ingest.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'ingest' step of the regression pipeline: 3 | 4 | - ``load_file_as_dataframe``: Defines customizable logic for parsing dataset formats that are not 5 |   natively parsed by MLflow Pipelines (i.e. formats other than Parquet, Delta, and Spark SQL). 6 | """ 7 | from pandas import DataFrame 8 | 9 | 10 | def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame: 11 |     """ 12 |     Load content from the specified dataset file as a Pandas DataFrame. 13 | 14 |     This method is used to load dataset types that are not natively managed by MLflow Pipelines 15 |     (datasets that are not in Parquet, Delta Table, or Spark SQL Table format). This method is 16 |     called once for each file in the dataset, and MLflow Pipelines automatically combines the 17 |     resulting DataFrames together. 18 | 19 |     :param file_path: The path to the dataset file. 20 |     :param file_format: The file format string, such as "csv". 21 |     :return: A Pandas DataFrame representing the content of the specified file. 22 |     """ 23 |     # FIXME::OPTIONAL: implement the handling of non-natively supported file_format. 24 | 25 |     raise NotImplementedError 26 | -------------------------------------------------------------------------------- /steps/split.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'split' step of the regression pipeline: 3 | 4 | - ``create_dataset_filter``: Defines customizable logic for filtering the training, validation, 5 |   and test datasets produced by the data splitting procedure. Note that arbitrary transformations 6 |   should go into the transform step. 7 | """ 8 | 9 | from pandas import DataFrame, Series 10 | 11 | 12 | def create_dataset_filter(dataset: DataFrame) -> Series: 13 |     """ 14 |     Select which rows of the split datasets to keep after additional filtering. This function will 15 |     be called on the training, validation, and test datasets. 16 | 17 |     :param dataset: The {train,validation,test} dataset produced by the data splitting procedure. 18 |     :return: A boolean Series in which True marks rows to keep and False marks rows to drop. 19 |     """ 20 |     # FIXME::OPTIONAL: implement post-split filtering on the dataframes, such as data cleaning. 21 | 22 |     return Series(True, index=dataset.index) 23 | -------------------------------------------------------------------------------- /steps/train.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'train' step of the regression pipeline: 3 | 4 | - ``estimator_fn``: Defines the customizable estimator type and parameters that are used 5 |   during training to produce a model pipeline. 6 | """ 7 | from typing import Any, Dict, Optional 8 | 9 | 10 | def estimator_fn(estimator_params: Optional[Dict[str, Any]] = None): 11 |     """ 12 |     Returns an *unfitted* estimator that defines ``fit()`` and ``predict()`` methods.
13 |     The estimator's input and output signatures should be compatible with scikit-learn 14 |     estimators. 15 |     """ 16 |     # 17 |     # FIXME::OPTIONAL: return a scikit-learn-compatible regression estimator with fine-tuned 18 |     #                  hyperparameters. 19 |     from sklearn.linear_model import SGDRegressor 20 | 21 |     return SGDRegressor(**(estimator_params or {})) 22 | -------------------------------------------------------------------------------- /steps/transform.py: -------------------------------------------------------------------------------- 1 | """ 2 | This module defines the following routines used by the 'transform' step of the regression pipeline: 3 | 4 | - ``transformer_fn``: Defines customizable logic for transforming input data before it is passed 5 |   to the estimator during model training and inference. 6 | """ 7 | 8 | def transformer_fn(): 9 |     """ 10 |     Returns an *unfitted* transformer that defines ``fit()`` and ``transform()`` methods. 11 |     The transformer's input and output signatures should be compatible with scikit-learn 12 |     transformers. 13 |     """ 14 |     # 15 |     # FIXME::OPTIONAL: return a scikit-learn-compatible transformer object. 16 |     # 17 |     # Identity feature transformation is applied when None is returned. 18 |     return None 19 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlflow/mlp-regression-template/e1b0b66d2342afbbaffb832d163681e9b67854a6/tests/__init__.py -------------------------------------------------------------------------------- /tests/train_test.py: -------------------------------------------------------------------------------- 1 | from steps.train import estimator_fn 2 | from sklearn.utils.estimator_checks import check_estimator 3 | 4 | 5 | def test_train_fn_returns_object_with_correct_spec(): 6 |     regressor = estimator_fn() 7 |     assert callable(getattr(regressor, "fit", None)) 8 |     assert callable(getattr(regressor, "predict", None)) 9 | 10 | 11 | def test_train_fn_passes_check_estimator(): 12 |     regressor = estimator_fn() 13 |     check_estimator(regressor) 14 | -------------------------------------------------------------------------------- /tests/transform_test.py: -------------------------------------------------------------------------------- 1 | from steps.transform import transformer_fn 2 | 3 | 4 | def test_transform_fn_returns_object_with_correct_spec(): 5 |     transformer = transformer_fn() 6 |     if transformer: 7 |         assert callable(getattr(transformer, "fit", None)) 8 |         assert callable(getattr(transformer, "transform", None)) 9 | --------------------------------------------------------------------------------
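The sketches below are illustrative only and are not files in the repository above; they show, under stated assumptions, how some of the FIXME placeholders might be filled in for a concrete dataset. First, one possible body for `load_file_as_dataframe` in `steps/ingest.py`, assuming a plain CSV file with a header row; the use of `pandas.read_csv` with default options is an assumption, not a requirement of MLflow Pipelines.

import pandas as pd
from pandas import DataFrame


def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame:
    # Illustrative sketch: parse a CSV dataset file into a Pandas DataFrame.
    if file_format == "csv":
        # Assumes the first row holds column headers and fields are comma-separated.
        return pd.read_csv(file_path)
    raise NotImplementedError(f"Unsupported file format: {file_format}")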
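Next, a hedged sketch of a custom metric matching the `weighted_mean_squared_error` example named in the `steps/custom_metrics.py` docstring; weighting the squared errors by the squared ground-truth values is an assumption chosen purely for illustration, and it presumes the targets are not all zero.

from typing import Dict

import numpy as np
from pandas import DataFrame


def weighted_mean_squared_error(
    eval_df: DataFrame,
    builtin_metrics: Dict[str, float],  # unused in this sketch
) -> Dict[str, float]:
    # Illustrative sketch: mean squared error weighted by the squared target value.
    squared_errors = (eval_df["prediction"] - eval_df["target"]) ** 2
    weights = eval_df["target"] ** 2  # assumption: errors on larger targets matter more
    return {"weighted_mean_squared_error": float(np.average(squared_errors, weights=weights))}

To wire such a metric in, the commented-out `metrics.custom` block in `pipeline.yaml` would reference the function name in its `function` field, mirroring the docstring example.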
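Finally, a sketch of a non-identity `transformer_fn` for `steps/transform.py`; standardizing every input feature with scikit-learn's `StandardScaler` is an assumption about the dataset (all-numeric features) rather than a template requirement.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def transformer_fn():
    # Illustrative sketch: an unfitted transformer exposing fit() and transform().
    # Wrapping the scaler in a one-step Pipeline keeps it easy to extend with more steps later.
    return Pipeline(steps=[("standardize", StandardScaler())])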