├── LICENSE ├── README.md └── roadmap ├── 2020 ├── MLOpsRoadmap2020.md └── images │ ├── README.md │ ├── key.svg │ ├── mlops-horizontal-color.png │ ├── mlops-horizontal-color.svg │ └── solutions.svg ├── 2021 ├── MLOpsRoadmap2021.md └── images │ ├── README.md │ ├── key.svg │ ├── mlops-horizontal-color.png │ ├── mlops-horizontal-color.svg │ └── solutions.svg ├── 2022 ├── MLOpsRoadmap2022.md └── images │ ├── README.md │ ├── key.svg │ ├── mlops-horizontal-color.png │ ├── mlops-horizontal-color.svg │ └── solutions.svg ├── 2024 ├── MLOpsRoadmap2024.md └── images │ ├── README.md │ ├── key.svg │ ├── mlops-horizontal-color.png │ ├── mlops-horizontal-color.svg │ └── solutions.svg └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # sig-mlops 2 | 3 | CDF Special Interest Group - *MLOps* 4 | 5 | - Wiki: https://github.com/cdfoundation/sig-mlops/wiki 6 | - Mail list: https://lists.cd.foundation/g/sig-mlops 7 | - Email list: sig-mlops@lists.cd.foundation 8 | - Agenda and Meeting minutes: http://bit.ly/mlops-sig 9 | - [Slack community](https://join.slack.com/t/cdeliveryfdn/shared_invite/zt-nwc0jjd0-G65oEpv5ynFfPD5oOX5Ogg) be sure to join the `#sig-mlops` channel 10 | 11 | The draft 2024 document is now available for PRs. Please feel free to discuss changes in the Slack channel. 
12 | 13 | 29 | -------------------------------------------------------------------------------- /roadmap/2020/MLOpsRoadmap2020.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # MLOps Roadmap 2020 4 | 5 | > **Warning** 6 | > This document is now outdated. Please see [the 2021 edition](../2021/MLOpsRoadmap2021.md) 7 | 8 | ## About this document 9 | 10 | This document sets out the current state of MLOps and provides a five year roadmap for future customer needs which is intended to support pre-competitive collaboration across the industry with a view to improving the overall state of MLOps as a capability for all. 11 | 12 | It is intended that this document be iteratively refined by group consensus within the MLOps SIG across a series of regular meetings and then published annually whilst relevant. 13 | 14 |
15 | 16 | **Acknowledgements** 17 | 18 | Current active contributors to the MLOps SIG Roadmap: 19 | 20 | Terry Cox, Bootstrap Ltd 21 | 22 | Michael Neale, CloudBees 23 | 24 | Kara de la Marck, CloudBees 25 | 26 | Ian Hellström, D2IQ 27 | 28 | Almog Baku, Rimoto 29 | 30 | 31 | 32 | 33 | *Initial publication - Sept 25th, 2020* 34 | 35 |
36 | 37 | # Introduction 38 | 39 | ## Current State of MLOps 40 | 41 | 42 | 43 | ### What is MLOps? 44 | 45 | MLOps could be narrowly defined as "the ability to apply DevOps principles to Machine Learning applications"; however, as we shall see shortly, this narrow definition misses the true value of MLOps to the customer. Instead, we define MLOps as "the extension of the DevOps methodology to include Machine Learning and Data Science assets as first class citizens within the DevOps ecology". 46 | 47 | MLOps should be viewed as a practice for consistently managing the ML aspects of products in a way that is unified with all of the other technical and non-technical elements necessary to successfully commercialise those products with maximum potential for viability in the marketplace. This includes DataOps, too, as machine learning without complete, consistent, semantically valid, correct, timely, and unbiased data is problematic or leads to flawed solutions that can exacerbate built-in biases. 48 | 49 | 50 | > MLOps is not to be confused with "AIOps". AIOps often means an application of AI technologies to Ops data, with sometimes unclear aims for gaining insights. These terms are still evolving, but for the purposes of this document we do not mean AIOps in the latter usage. Some organisations are more comfortable with the designation 'AI' rather than 'Machine Learning', so it is to be expected that MLOps may be referred to as AIOps in those domains; however, the reverse is not true, as use of the term 'AIOps' may not refer to the MLOps methodology. 51 | 52 | 53 | ### What is MLOps not? 54 | 55 | It sometimes helps to consider what anti-patterns exist around a concept in order to better understand it. 56 | 57 | For example, MLOps is not "putting Jupyter Notebooks into production environments". 
58 | 59 | RAD tools like Jupyter Notebooks can be extremely useful, both in classroom environments and in exploring problem spaces to understand potential approaches to mathematical problems. However, like all Rapid Application Development tools, they achieve the rapid element of their name by trading off other key non-functional requirements like maintainability, testability and scalability. 60 | 61 | In the next section, we will discuss the key drivers for MLOps and expand upon the requirements for a true DevOps approach to managing ML assets. At this point in the development of the practice, it perhaps helps to understand that much of ML and AI research and development activity has been driven by Data Science rather than Computer Science teams. This specialisation has enabled great leaps in the ML field but at the same time means that a significant proportion of ML practitioners have never been exposed to the lessons of the past seventy years of managing software assets in commercial environments. 62 | 63 | As we shall see, this can result in large conceptual gaps between what is involved in creating a viable proof of concept of a trained ML model on a Data Scientist's laptop versus what it subsequently takes to be able to safely transition that asset into a commercial product in production environments. It is therefore not unfair to describe the current state of MLOps in 2020 as still on the early path towards maturity, and to consider that much of the early challenge for adoption will be one of education and communication rather than purely technical refinements to tooling. 64 | 65 | Compounding this challenge, Machine Learning solutions tend to be decision-making systems rather than just data processing systems and thus will be required to be held accountable to much higher standards than those applied to the best quality software delivery projects. 
The bar for quality and governance processes is therefore very high, in many cases representing legal compliance processes mandated by regional legislation. 66 | 67 | As these decision-making solutions increasingly displace human decision-makers in commerce and government, we encounter a new class of governance problems, collectively known as ‘Responsible AI’. These introduce a range of challenges around complex issues such as ethics, fairness and bias in ML models and often fall under government regulation requiring interpretability and explainability of models, often to a standard higher than that applied to human staff. 68 | 69 | As a result, it will be necessary for MLOps to develop in a manner that facilitates complex and sensitive governance processes that are of a standard appropriate to what is expected to become a highly regulated area. 70 | 71 |
72 | 73 | ## Drivers 74 | 75 | 76 | 77 | ### General DevOps drivers applied to MLOps 78 | 79 | * Optimising the process of taking ML features into production by reducing Lead Time 80 | 81 | * Optimising the feedback loop between production and development for ML assets 82 | 83 | * Supporting the problem-solving cycle of experimentation and feedback for ML applications 84 | 85 | * Unifying the release cycle for ML and conventional assets 86 | 87 | * Enabling automated testing of ML assets 88 | 89 | * Applying Agile principles to ML projects 90 | 91 | * Supporting ML assets as first class citizens within CI/CD systems 92 | 93 | * Enabling shift-left on Security to include ML assets 94 | 95 | * Improving quality through standardisation across conventional and ML assets 96 | 97 | * Applying Static Analysis, Dynamic Analysis, Dependency Scanning and Integrity Checking to ML assets 98 | 99 | * Reducing Mean Time To Restore for ML applications 100 | 101 | * Reducing Change Fail Percentage for ML applications 102 | 103 | * Managing technical debt across ML assets 104 | 105 | * Enabling whole-of-life cost savings at a product level 106 | 107 | * Reducing overheads of IT management through economies of scale 108 | 109 | * Facilitating the re-use of ML approaches through template or ‘quickstart’ projects 110 | 111 | * Managing risk by aligning ML deliveries to appropriate governance processes 112 | 113 | 114 | 115 | ### Drivers unique to Machine Learning solutions that represent MLOps requirements 116 | 117 | * Mitigating the risks associated with producing decision-making products 118 | 119 | * Providing specific testing support for detecting ML-specific errors 120 | 121 | * Verifying that algorithms make decisions aligned to customer values 122 | 123 | * Tracking long-term stability of ML asset performance 124 | 125 | * Incorporating ethical governance into management of ML assets 126 | 127 | * Enabling automated bias detection testing 128 | 129 | * Ensuring explainability of decisions 130 | 131 | * Ensuring fairness in decisions 132 | 133 | * Providing transparency and auditability of ML assets 134 | 135 | * Allowing for provability of decision-making algorithms 136 | 137 | * Ensuring long-term supportability of long-lived ML assets 138 | 139 | * Managing ML-specific security risks such as adversarial attacks 140 | 141 | * Providing clear action paths to suspend or revert failed ML assets in production environments 142 | 143 | * Managing economic risks associated with deployment of key decision-making systems on customer infrastructure 144 | 145 | * Leveraging ML techniques to improve MLOps solutions 146 | 147 | * Mitigating liability risk and accountability through appropriate due-diligence over ML assets 148 | 149 | * Facilitating compliance verification against regional AI legislation 150 | 151 | * Enabling fast development life cycles by offering out-of-the-box (correlated) sampling of relevant data sets 152 | 153 | * Enabling online (streaming) and offline (batch) model predictions 154 | 155 | * Accelerating enterprise-wide development of models through ready-to-reuse features (i.e. feature stores) 156 | 157 |
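Several of these ML-specific drivers lend themselves to automation within a pipeline. As a minimal, illustrative sketch (the metric, threshold and data below are hypothetical, and no particular ML library is assumed), an automated bias detection test might compare positive-prediction rates across protected groups:

```python
from collections import defaultdict

def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two
    groups; 0.0 means all groups receive positive outcomes at equal rates."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(bool(pred))
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)

# Hypothetical model outputs: group "a" is approved 75% of the time, "b" only 25%
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
assert demographic_parity_gap(preds, groups) == 0.5  # would fail e.g. a 0.1 release threshold
```

In an MLOps pipeline a check of this kind would run automatically against a held-out evaluation set, blocking release of a candidate model in the same way a failing unit test blocks a conventional build.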
158 | 159 | ## Vision 160 | 161 | 162 | 163 | An optimal MLOps experience is one where Machine Learning assets are treated consistently with all other software assets within a CI/CD environment. Machine Learning models can be deployed alongside the services that wrap them and the services that consume them as part of a unified release process. 164 | 165 | MLOps must be language-agnostic. Training scripts for ML models can be written in a wide range of different programming languages, so a focus on Python is artificially limiting to widespread adoption. 166 | 167 | MLOps must be framework-agnostic. There are a plethora of different Machine Learning frameworks commonly in use today, and it is to be expected that these will continue to proliferate over time. It must be possible to use any desired framework within an MLOps context and to combine multiple frameworks in any given deployment. 168 | 169 | MLOps must be platform- and infrastructure-agnostic. Adoption of MLOps is predicated upon being able to operate using this approach within the constraints of previously defined corporate infrastructure. It should be presumed unlikely that MLOps alone is a sufficient driver to motivate fundamental change to infrastructure. 170 | 171 | It is very important that MLOps should make no over-simplified assumptions with respect to hardware. To achieve currently known customer requirements in a range of ML fields, it will be necessary to achieve performance gains of at least three orders of magnitude. This can be expected to lead to significant change with respect to the hardware utilised to train and execute models. 172 | 173 | We make the following assertions: 174 | 175 | - Models may expect to be **trained** on various combinations of CPU, GPU, TPU, dedicated ASICs or custom neuromorphic silicon, with new hardware entering the market regularly. 
176 | - Models may expect to be **executed** on various combinations of CPU, GPU, TPU, dedicated ASICs or custom neuromorphic silicon, again with new hardware entering the market regularly. 177 | - It **must not** be assumed that models will be executed on the same hardware as they are trained upon. 178 | 179 | It is desirable to have the ability to train once but run anywhere; that is, models are trained on specific hardware to minimise training time and optimised for different target devices based on cost or performance considerations. However, this aspiration may need to be tempered with a degree of pragmatism. The ability to train on a given hardware platform requires, at minimum, dedicated driver support, and in practice usually also necessitates extensive library / framework support. The ability to execute on a given hardware platform may involve having to significantly optimise resource usage to lower product cost, or may require the production of dedicated silicon specific to a single task. 180 | 181 | Whilst there will undoubtedly be common scenarios, for example in Cloud deployment, where CPU -> CPU or GPU -> GPU is sufficient to meet requirements, MLOps must allow for a broad range of potential cross-compilation scenarios. 182 | 183 | MLOps implementations should follow a ‘convention over configuration’ pattern, seeking to minimise the level of build-specific assets that need to be maintained on top of the desired solution. Where configuration is necessary, it should be synthesised where possible and operate on the principle of always creating working examples that can be modified incrementally by customers to meet their needs. 184 | 185 | The use of MLOps should itself teach best known methods of applying MLOps. It should be recognised that many customers will be experts in the field of Data Science but may have had relatively little exposure to DevOps or other SDLC principles. 
To minimise the learning curve, MLOps defaults should always align to best practice in production environments rather than ‘quick-and-dirty’ as might be the case for expert users wishing to trial an instance of the tooling in a test bench environment. 186 | 187 |
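The "train once, run anywhere" aspiration depends on a framework-neutral intermediate representation of the model. In practice this role is played by standard formats such as ONNX; the miniature JSON format below is purely hypothetical (including the invented "demo-ir/1" name) and exists only to illustrate the decoupling between the training environment and the execution environment:

```python
import json

def export_model(weights, bias, path):
    """Serialise a trained linear model to a framework-neutral file
    ("demo-ir/1" is an invented format name for this sketch only)."""
    with open(path, "w") as f:
        json.dump({"format": "demo-ir/1", "weights": weights, "bias": bias}, f)

def load_and_predict(path, features):
    """Run inference from the exported file with no dependency on the
    framework, language or hardware used during training."""
    with open(path) as f:
        model = json.load(f)
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]

export_model([0.5, -0.25], 1.0, "model.json")
assert load_and_predict("model.json", [2.0, 4.0]) == 1.0  # 0.5*2 - 0.25*4 + 1
```

The same exported artefact could then be optimised, quantised or compiled separately for each target device, which is precisely the cross-compilation concern described in this section.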
188 | 189 | # Scope 190 | 191 | The focus of this roadmap is upon aspects relating to extending DevOps principles to the area of Machine Learning and does not discuss basic features of DevOps or CI/CD approaches that are common across both domains. The intention is for MLOps to decorate DevOps rather than differentiate. 192 | 193 |
194 | 195 | # Summary and Updates 196 | 197 | 198 | 199 | This is the first edition of the MLOps Roadmap and sets out the initial challenges for the approach with suggested potential solutions. 200 | 201 |
202 | 203 | # Challenges 204 | 205 | In this section, we identify challenges in the near and far term that constrain the customer from successfully adopting and benefiting from MLOps principles: 206 | 207 | 208 | 209 | 210 | 213 | 214 | 215 | 216 | 218 | 219 | 220 | 221 | 227 | 228 | 229 | 230 | 238 | 239 | 240 | 241 | 247 | 248 | 249 | 250 | 252 | 253 | 254 | 255 | 258 | 259 | 260 | 261 | 263 | 264 | 265 | 266 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 279 | 280 | 281 | 282 | 286 | 287 | 288 | 289 | 292 | 293 | 294 | 295 | 303 | 304 | 305 | 306 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 |
Educating data science teams regarding the risks of trying to use Jupyter Notebooks in production- Communicating the SDLC and the lessons of DevOps in a way that is comfortable for the audience
211 | - Highlighting the dangers of using RAD tools in production
212 | - Demonstrate simple alternatives that are easy to use as part of an MLOps process
Treat ML assets as first class citizens in DevOps processes- Extend CI/CD tools to support ML assets
217 | - Integrate ML assets with version control systems
Providing mechanisms by which training/testing/validation data sets, training scripts, models, experiments, and service wrappers may all be versioned appropriately- All assets under version control
222 | - ML assets include sets of data
223 | - Data volumes may be large
224 | - Data may be sensitive
225 | - Data ownership may be complex
226 | - Multiple platforms may be involved
Providing mechanisms by which changes to training/testing/validation data sets, training scripts, models, experiments, and service wrappers may all be auditable across their full lifecycle- Models may be retrained based upon new training data
231 | - Models may be retrained based upon new training approaches
232 | - Models may be self-learning
233 | - Models may degrade over time
234 | - Models may be deployed in new applications
235 | - Models may be subject to attack and require revision
236 | - Incidents may occur that require root cause analysis and change
237 | - Corporate or government compliance may require audit or investigation
Treating training/testing/validation data sets as managed assets under a MLOps workflow- Content of training/testing/validation data sets is essential to audit process, root cause analysis
242 | - Data is not typically treated as a managed asset under conventional CI/CD
243 | - Data may reside across multiple systems
244 | - Data may only be able to reside in restricted jurisdictions
245 | - Data storage may not be immutable
246 | - Data ownership may be a factor
Managing the security of data in the MLOps process with particular focus upon the increased risk associated with aggregated data sets used for training or batch processing- Aggregated data carries additional risk and represents a higher value target
251 | - Migration of data into Cloud environments, particularly for training, may be problematic
Implications of privacy, GDPR and the right to be forgotten upon training sets and deployed models- The right to use data for training purposes may require audit
256 | - Legal compliance may require removing data from audited training sets, which in turn implies the need to retrain and redeploy models built from that data.
257 | - The right to be forgotten or the right to opt out of certain data tracking may require per-user reset of model predictions
Methods for wrapping trained models as deployable services in scenarios where data scientists training the models may not be experienced software developers with a background in service-oriented designOperational use of a model brings new architectural requirements that may sit outside the domain of expertise of the data scientists who created it.

262 | Model developers may not be software developers and therefore experience challenges implementing APIs around their models and integrating within solutions.
Approaches for enabling all Machine Learning frameworks to be used within the scope of MLOps, regardless of language or platformCommon mistake to assume ML = Python

267 | Many commonly used frameworks such as PyTorch and TensorFlow exist and can be expected to continue to proliferate.

268 | MLOps must not be opinionated about frameworks or languages.
Approaches for enabling MLOps to support a broad range of target platforms, including but not limited to CPU, GPU, TPU, custom ASICs and neuromorphic silicon.Choice of training platform and operational platform may be varied and could be different for separate models in a single project
Methods for ensuring efficient use of hardware in both training and operational scenarios- Hardware accelerators are expensive and difficult to virtualise
277 | - Cadence of training activities impacts cost of ownership
278 | - Elastic scaling of models against demand in operation is challenging when based upon hardware acceleration
Approaches for applying MLOps to very large scale problems at petabyte scale and beyond- Problems associated with moving and refreshing large training sets
283 | - Problems associated with distributing training loads across hardware accelerators
284 | - Problems with speed of distributing training data to correct hardware accelerators
285 | - Problems of provisioning / releasing large pools of hardware resources
Providing appropriate pipeline tools to manage MLOps workflows transparently as part of existing DevOps solutions- Integration of ML assets with existing CD/CD solutions
- Extending Cloud-native build tools to support allocation of ML assets, data and hardware during training builds
- Hardware pool management
**Testing ML assets appropriately**

- Conventional Unit / Integration / BDD / UAT testing
- Adversarial testing
- Bias detection
- Fairness testing
- Ethics testing
- Interpretability
- Stress testing
- Security testing
**Governance processes for managing the release cycle of MLOps assets, including Responsible AI principles**

- Managing release of new training sets to data science team
- Establishing thresholds for acceptable models
- Monitoring model performance (and drift) over time to feed into thresholds for retraining and deployments
- Managing competitive training of model variants against each other in dev environments
- Managing release of preferred models into staging environments for integration and UAT
- Managing release of specific model versions into production environments for specific clients/deployments
- Managing root cause analysis for incident investigation
- Observability / interpretability
- Explainability and compliance
**Management of shared dependencies between training and operational phases**

A number of ML approaches require reusable resources that are applied both during training and during the pre-processing of data being passed to operational models. It is necessary to be able to synchronise these assets across the lifecycle of the model, e.g. preprocessing, validation, word embeddings etc.
**Abstraction for models**

Stored models are currently often little more than serialised objects. To decouple training languages, platforms and hardware from operational languages, platforms and hardware, it is necessary to have broadly supported, standard intermediate storage formats for models that can be used reliably to decouple training and operational phases.
**Longevity of ML assets**

Decision-making systems can be expected to require very long effective operating lifetimes. It will be necessary in some scenarios to be able to refer to instances of models across significant spans of time, so forward and backward compatibility issues, storage formats and the migration of long-running transactions all need to be considered.
**Managing and tracking trade-offs**

Solutions including ML components will be required to manage trade-offs between multiple factors, for example striking a balance between model accuracy and customer privacy, or between explainability and the risk of revealing data about individuals in the data set. It may be necessary to provide intrinsic metrics to help customers balance these equations in production. It should also be anticipated that customers will need to be able to safely A/B test different scenarios to measure their impact upon this balance.
**Escalation of data categories**

As a side effect of applying governance processes to check for fairness and bias within models, it may become necessary to hold special category data providing information about race, religion or belief, sexual orientation, disability, pregnancy, or gender reassignment in order to detect such biases. As a result, there will be an escalation in data sensitivity and in the legal constraints that apply to the solution.
**Intrinsic protection of models**

Models are vulnerable to certain common classes of attack, such as:
- Black box attempts to reverse engineer them
- Model Inversion attacks attempting to extract data about individuals
- Membership Inference attacks attempting to verify the presence of individuals in data sets
- Adversarial attacks using tailored data to manipulate outcomes

It should be anticipated that there will be a need for generic protections against these classes of challenge across all deployed models.
**Emergency cut out**

As models are typically trained with production/runtime data, and act on production data, there can be cases where undesirable behaviour of a recently deployed model change only becomes apparent in a production environment. One example is a chat bot that uses inappropriate language in reaction to certain interactions and training sets. There is a need to be able to quickly cut out a model, or roll back immediately to an earlier version, should this happen.
**Online learning**

There are systems in use that employ online learning, where a model or similar evolves in near real time with the data flowing in (and there is not necessarily a deploy stage). Such systems may also modify themselves at runtime without any sort of rebuilding or human approval. This will require further research and observation of emerging practices and use cases of this approach to machine learning.
# Technology Requirements

In this section, we capture specific technology requirements to enable progress against the challenges identified above:
**Educating data science teams regarding the risks of trying to use Jupyter Notebooks in production**

Jupyter Notebooks are the tool we use to educate Data Scientists, as they can easily be used to explore ad hoc ML problems incrementally on a laptop. Unfortunately, when all you have is a hammer, everything tends to look like a nail.

We see Jupyter Notebooks featuring in production scenarios, not because they are the best tool for the job, but because they are the tool we taught people to use, and because we didn't teach about any of the problems inherent in using that tool.

This approach persists because of an ongoing gap in the tool chain.

- Improved technology solutions are required that enable Data Scientists to easily run experiments at scale on elastic Cloud resources in a consistent, audited and repeatable way.
- These solutions should integrate with existing Integrated Development Environments, Source Code Control and Quality Management tools.
- Solutions should integrate with CI/CD environments so as to facilitate Data Scientists setting up multiple variations on a training approach and launching these as parallel, audited activities on Cloud infrastructure.
- Working with the training of models and the consumption of trained models should be a seamless experience within a single toolchain.
- Tooling should introduce new Data Scientists to the core concepts of software asset management, quality management and automated testing.
- Tooling should enable appropriate governance processes around the release of ML assets into team environments.
**Treat ML assets as first class citizens in DevOps processes**

ML assets include, but are not limited to, training sets, training configurations (hyper-parameters), scripts and models. Some of these assets can be large; some are more like source code assets. We should be able to track changes to all of these assets and see how the changes relate to each other. Reporting on popular DevOps metrics such as "Change Failure Rate" and "Mean Cycle Time" should be possible if this is done correctly. Whilst some of these assets are large, they are not without precedent in the DevOps world: people have been handling changes to database configurations, and the copying and backup of data, for some time, so the same should apply to all ML assets. In the following sections we explore how this may be implemented.
**Providing mechanisms by which training sets, training scripts and service wrappers may all be versioned appropriately**

Training sets are prepared data sets which are extracted from production data. They may contain PII or sensitive information, so making the data available to developers in a way analogous to source code may be problematic. Training scripts are smaller artefacts but are sometimes coupled to the training sets. All scripts should be versioned in a way that connects them to the training sets they are associated with. Scripts and training sets are also coupled to metadata, such as instructions as to how training sets are split up for testing and validation, so that a model can be reproduced, ideally in a deterministic manner, if required.
**Providing mechanisms by which changes to training sets, training scripts and service wrappers may all be auditable across their full lifecycle**

It is an essential requirement that all changes to ML assets must result in a robust audit trail capable of meeting forensic standards of analysis. It must be possible to work backwards from a given deployment of assets at a specified time, tracing all changes to this set of assets and the originators of each change. Tooling supporting decision-making systems in applications working with sensitive personal data or life-threatening situations must support non-repudiation and immutable audit records.

It should be anticipated that customers will be operating in environments subject to legal or regulatory compliance requirements that will vary by industry and jurisdiction and that may require varying standards of audit transparency, including requirements for third party audit.
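As a toy illustration of the immutable, non-repudiable audit records described above, the sketch below chains each change record to its predecessor by hash, so tampering with any earlier entry is detectable. All names here are hypothetical; a production system would also need cryptographic signatures and durable storage.

```python
import hashlib
import json

class AuditLog:
    """Append-only audit trail: each record's digest covers the record
    content and the previous record's digest, forming a hash chain."""

    def __init__(self):
        self.records = []

    def append(self, actor, action, asset, version):
        prev = self.records[-1]["digest"] if self.records else "0" * 64
        record = {"actor": actor, "action": action,
                  "asset": asset, "version": version, "prev": prev}
        # Digest is computed over the previous digest plus the record body.
        record["digest"] = hashlib.sha256(
            (prev + json.dumps(record, sort_keys=True)).encode()
        ).hexdigest()
        self.records.append(record)
        return record["digest"]

    def verify(self):
        """Walk the chain from the start; any altered record breaks it."""
        prev = "0" * 64
        for r in self.records:
            body = {k: v for k, v in r.items() if k != "digest"}
            expected = hashlib.sha256(
                (prev + json.dumps(body, sort_keys=True)).encode()
            ).hexdigest()
            if r["digest"] != expected or r["prev"] != prev:
                return False
            prev = r["digest"]
        return True
```

This is only the tamper-evidence half of the requirement; true non-repudiation would additionally bind each record to the originator's identity.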
**Treating training sets as managed assets under a MLOps workflow**

One of the difficult challenges for MLOps tooling is to be able to treat data as a managed asset in an MLOps workflow. Typically, it should be expected that traditional source code control techniques are inapplicable to data assets, which may be very large and reside in environments that are not amenable to tracking changes in the same manner as source code. New meta-data techniques must be created to effectively record and manage aggregates of data that represent specific versions of training, testing or validation sets. Given the variety of data storage mechanisms in common usage, this will likely necessitate pluggable extensions to tooling.

It must be possible to define a specific set of data which can be introduced into MLOps tooling for a given training run to produce a model version, and to retrospectively inspect the set of data that was used to create a known model version in the past. Multiple model instances may share one data set and subsequent iterations of a model may be required to be regression tested against specific data set instances. It should be recognised that data sets are long-lived assets that may have compliance implications but which may also be required to be edited in response to data protection requests, invalidating all models based upon a set.
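One way to pin a specific set of data without copying it is a metadata manifest of content hashes, as in this minimal sketch. The manifest format is invented for illustration; real tooling would hash storage objects rather than in-memory records.

```python
import hashlib
import json

def manifest_for(records):
    """Build a metadata manifest that pins an exact training-set version
    without duplicating the underlying data: each record is identified by
    a content hash, and the sorted list of hashes yields a stable version
    id, independent of record order."""
    entries = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    version_id = hashlib.sha256("".join(entries).encode()).hexdigest()[:12]
    return {"version": version_id, "records": entries}
```

Because the version id derives purely from content, two teams extracting the same records arrive at the same version, and editing any record (e.g. for a data protection request) visibly invalidates it.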
**Managing the security of data in the MLOps process with particular focus upon the increased risk associated with aggregated data sets used for training or batch processing**

The vast majority of MLOps use-cases can be expected to involve mission-critical applications, sensitive personal data and large aggregated data sets which all represent high value targets with high impact from a security breach. As a result, MLOps tooling must adopt a 'secure-by-design' position rather than assuming that customers will harden their deployments as part of their responsibilities.

Solutions must not default to insecure configurations for convenience, nor should they provide user-facing options that invalidate system security as a side effect of adding functionality.
**Implications of privacy, GDPR and the right to be forgotten upon training sets and deployed models**

- Tooling should provide mechanisms for auditing individual permissions to use source data as part of training sets, with the assumption that permission may be withdrawn at any time

- Tooling should provide mechanisms to invalidate the status of deployed models where permissions to use have been revoked

- Tooling may optionally provide mechanisms to automatically retrain and revalidate models on the basis of revoked permissions

- Tooling should facilitate user-specific exceptions to model invocation rules where this is necessary to enable the right to be forgotten or to support the right to opt out of certain types of data tracking
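A hedged sketch of how tooling might flag models for invalidation when a data subject revokes permission, assuming lineage metadata linking each model version to the subjects whose records were in its training set. All structures here are hypothetical.

```python
def models_invalidated_by(revoked_subject, model_lineage):
    """model_lineage maps a model version to the set of data-subject ids
    present in its training set. Returns the model versions that must be
    retrained or retired once the subject revokes permission."""
    return sorted(
        version
        for version, subjects in model_lineage.items()
        if revoked_subject in subjects
    )
```

The hard part in practice is maintaining the lineage mapping itself; this sketch only shows why it must exist.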
**Methods for wrapping trained models as deployable services in scenarios where data scientists training the models may not be experienced software developers with a background in service-oriented design**

Models that have been trained and which have passed acceptance testing need to be deployed as part of a broader application. This might take the form of elastically scalable Cloud services within a distributed web application, or of an embedded code/data bundle within a mobile application or other physical device. In some cases, it may be expected that models need to be translated to an alternate form prior to deployment, perhaps as components of a dedicated FPGA or ASIC in a hardware solution.

- MLOps tooling should integrate and simplify the deployment of models into customer applications, according to the architecture specified by the customer.

- MLOps tooling should not force a specific style of deployment for models, such as a dedicated, central 'model server'.

- Tooling should assume that model execution in Cloud environments must be able to scale elastically.

- Tooling should allow for careful management of the execution cost of models in Cloud environments, to mitigate the risk of unexpected proliferation of consumption of expensive compute resources.

- Tooling should provide mechanisms for out-of-the-box deployment of models against common architectures, with the assumption that customers may not be expert service developers.

- Tooling should provide automated governance processes to manage the release of models into production environments in a controlled manner.

- Tooling should provide the facility to upgrade and roll back deployed models across environments.

- It should be assumed that models represent reusable assets that may be deployed in the form of multiple instances at differing point release versions across many independent production environments.

- It should be assumed that more than one point release version of a model may be deployed concurrently in order to support phased upgrades of system functionality across customer environments.
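The upgrade and roll-back requirements above can be sketched as a tiny registry that records, per environment, the current and previous model version. This is a deliberate simplification: real tooling would persist this state and audit every transition.

```python
class ModelRegistry:
    """Tracks which model version each environment runs, keeping the
    previous version so a deployment can be rolled back immediately.
    Different environments may run different versions concurrently."""

    def __init__(self):
        self.deployed = {}   # environment -> current version
        self.previous = {}   # environment -> prior version, for rollback

    def deploy(self, env, version):
        self.previous[env] = self.deployed.get(env)
        self.deployed[env] = version

    def rollback(self, env):
        if self.previous.get(env) is None:
            raise RuntimeError(f"no earlier version recorded for {env}")
        self.deployed[env] = self.previous[env]
        self.previous[env] = None
```

Keeping only one prior version is enough for an emergency rollback; a fuller history would live in the audit trail.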
**Approaches for enabling all Machine Learning frameworks to be used within the scope of MLOps, regardless of language or platform**

MLOps is a methodology that must be applicable in all environments, using any programming language or framework. Implementations of MLOps tooling may be opinionated about the approach to the methodology but must be agnostic to the underlying technologies used to implement the associated models and services.

It should be possible to use MLOps tooling to deploy solution components utilising different languages or frameworks, using loosely-coupled principles to provide compatibility layers.
**Approaches for enabling MLOps to support a broad range of target platforms, including but not limited to CPU, GPU, TPU, custom ASICs and neuromorphic silicon**

MLOps should be considered as a cross-compilation problem, where the architectures of the source and target platforms may differ. In trivial cases, models may be trained on, say, CPU or GPU and deployed to execute on the same CPU or GPU architecture; however, other scenarios already exist and should be expected to become increasingly likely in the future. These may include training on GPU / TPU and executing on CPU or, in edge devices, training on any architecture and then translating the models into physical logic that can be implemented at very low cost / size / power directly on FPGA or ASIC devices.

This implies the need for architecture-independent intermediate formats to facilitate cross-deployment or cross-compilation onto target platforms.
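As a toy illustration of an architecture-independent intermediate format, the sketch below describes a trained linear model as JSON, so the consumer needs no knowledge of the training environment. The `toy-ir/1` format is invented for illustration; real systems might use standard formats such as ONNX.

```python
import json

def export_model(weights, bias):
    """Describe a trained linear model in a platform-neutral JSON form,
    rather than as a language-specific serialised object."""
    return json.dumps({
        "format": "toy-ir/1",   # hypothetical format tag, versioned
        "op": "linear",
        "weights": weights,
        "bias": bias,
    })

def load_and_predict(ir, features):
    """Any runtime that understands the format can execute the model,
    regardless of what trained it."""
    model = json.loads(ir)
    if model["format"] != "toy-ir/1":
        raise ValueError("unsupported intermediate format")
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
```

The versioned format tag is what makes forward and backward compatibility tractable as the set of supported operations grows.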
**Methods for ensuring efficient use of hardware in both training and operational scenarios**

ML training and inferencing are typically very processing-intensive operations that can be expected to be accelerated by the use of dedicated hardware. Training is essentially an intermittent, ad hoc process that may run for hours to days to complete a single training run, demanding full utilisation of large scale compute resources for this period and then releasing that demand outside active training runs. Similarly, inferencing on a model may be processing intensive during a given execution. Elastic scaling of hardware resources will be essential for minimising cost of ownership; however, the dedicated nature of existing accelerator cards currently makes it hard to scale these elastically in today's Cloud infrastructure.

Additionally, some accelerators provide the ability to subdivide processing resource into smaller allocation units to allow for efficient allocation of smaller work units within high capacity infrastructure.

It will be necessary to extend the capabilities of existing Cloud platforms to permit more efficient utilisation of expensive compute resources whilst managing overall demand across multiple customers in a way that mitigates security and privacy concerns.
**Approaches for applying MLOps to very large scale problems at petabyte scale and beyond**

As of 2020, large ML data sets are considered to start at around 50TB and very large data sets may derive from petabytes of source data, especially in visual applications such as autonomous vehicle control. At these scales, it becomes necessary to spread ML workloads across thousands of GPU instances in order to keep overall training times within acceptable elapsed time windows (less than a week per run).

Individual GPUs are currently able to process in the order of 1-10GB of data per second but only have around 40GB of local RAM. An individual server can be expected to have around 1TB of conventional RAM and around 15TB of local high speed storage as cache for around 8 GPUs, so may be able to transfer data between these and the compute units at high hundreds of GB/s with upstream connections to network storage running at low hundreds of GB/s.

Efficient workflows rely upon being able to reduce problems into smaller sub-units with constrained data requirements or systems start to become I/O bound. MLOps tooling for large problems must be able to efficiently decompose training and inferencing workloads into individual operations and data sets that can be effectively distributed as parallel activities across a homogeneous infrastructure with a supercomputing-style architecture. This can be expected to exceed the capabilities of conventional Cloud computing infrastructure and require dedicated hardware components and architecture, so any MLOps tooling must have appropriate awareness of the target architecture in order to optimise deployments.

At this scale, it is not feasible to create multiple physical copies of petabytes of training data due to storage capacity constraints and limitations of data transfer rates, so strategies for versioning sets of data with metadata against an incrementally growing pool will be necessary.
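Decomposing a workload into sub-units with constrained data requirements can be sketched, in miniature, as deterministic hash-based sharding of record keys across workers, so each worker stages only the subset it will process. This is a simplification of real data-distribution strategies, which must also account for locality and load balance.

```python
import hashlib

def shard_for(record_key, n_workers):
    """Deterministically map a record key to a worker index, so the
    assignment is reproducible across runs without any coordination."""
    digest = int(hashlib.sha256(record_key.encode()).hexdigest(), 16)
    return digest % n_workers

def plan(record_keys, n_workers):
    """Partition the full key set into per-worker shards."""
    shards = {worker: [] for worker in range(n_workers)}
    for key in record_keys:
        shards[shard_for(key, n_workers)].append(key)
    return shards
```

Determinism matters here for the same reason as data-set versioning: a training run can be reproduced, and its shard plan audited, from metadata alone.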
**Providing appropriate pipeline tools to manage MLOps workflows transparently as part of existing DevOps solutions**

Existing projects using DevOps practices will typically have automated pipelines and delivery. Ideally, MLOps solutions should extend these rather than replace them. In some cases it may be necessary for an ML model to have its own tooling and pipeline, but that should be the exception, as there is typically a non-trivial amount of source code that goes along with the training and the model, for scripts and endpoints, as covered previously.
**Testing ML assets appropriately**

ML assets should be considered at least at the same level as traditional source code in terms of testing (unit, integration, end to end, acceptance etc.). Metrics like coverage may still apply to scripts and service endpoints, if not to the model itself (as it isn't derived from source code). Furthermore, deployed models are typically used in decision-making capacities where the stakes are higher, or where there are potential governance, compliance or bias risks. This implies that testing will need to cover far more than source code, and must actively show, in a way suitable for a variety of stakeholders, how the model was tested (for example, whether socioeconomic bias testing was included). Test results should be presented in an accessible fashion so that they are available for more than just developers to audit.
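As one small example of testing beyond conventional coverage, a bias check can be expressed as an ordinary assertion in a test suite. The sketch below computes a demographic parity gap, one of several possible fairness metrics, chosen here purely for illustration.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-outcome rate between any two
    groups. A governance gate can fail the build when this exceeds an
    agreed threshold, making fairness testing part of the pipeline."""
    counts = {}  # group -> (positives, total)
    for pred, group in zip(predictions, groups):
        pos, total = counts.get(group, (0, 0))
        counts[group] = (pos + (1 if pred else 0), total + 1)
    rates = [pos / total for pos, total in counts.values()]
    return max(rates) - min(rates)
```

A suite of such metric-based assertions, with their thresholds recorded alongside the model, is one way to make the testing evidence legible to non-developer stakeholders.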
**Governance processes for managing the release cycle of MLOps assets, including Responsible AI principles**

MLOps as a process extends governance requirements into areas beyond those typically considered as part of conventional DevOps practices. It is necessary to be able to extend auditing and traceability of MLOps assets all the way back into the data that is chosen for the purposes of training models in the first instance. MLOps tooling will need to provide mechanisms for managing the release of new training sets for the purposes of training, with the consideration that many data scientists may be working on a given model, and more than one model may be trained against a specific instance of a training data set. Customers must have the capability to retain a history of prior versions of training data and be able to recreate specific environments as the basis of new avenues of investigation, remedial work or root cause analysis, potentially under forensic conditions.

The training process is predicated upon the idea of setting predefined success criteria for a given training session. Tooling should make it easy for data science teams to clearly and expressively declare success criteria against which an automated training execution will be evaluated, such that no manual intervention is required to determine when a particular training run meets the standard required for promotion to a staging environment.

Tooling should also offer the ability to manage competitive training of model variants against each other as part of automated batch training activities. This could involve predefined sets of hyper-parameters to test in parallel training executions, use of smart hyper-parameter tuning libraries as part of training scripts, or evolutionary approaches to iteratively creating model variants.

It should be possible to promote preferred candidate models into staging environments for integration and acceptance testing using a defined set of automated governance criteria and providing a full audit trail that can be retained across the lifetime of the ML asset being created.

Tooling should permit the selective promotion of specific model versions into target production environments with the assumption that customers may need to manage multiple live versions of any given model in production for multiple client environments. Again, this requires a persistent audit trail for each deployment.

It should be assumed that the decision-making nature of ML-based products will require that any incident or defect in a production environment may result in the need for a formal investigation or root-cause analysis, potentially as part of a compliance audit or litigation case. Tooling should facilitate the ability to easily walk backwards through audit trail pathways to establish the full state of all assets associated with a given deployment and all governance decisions associated. This should be implemented in such a way as to be easily initiated by non-technical staff and provide best efforts at non-repudiation and tamper protection of any potential evidence.

Models themselves can be expected to be required to be constructed with a level of conformance to observability, interpretability and explainability standards, which may be defined in legislation in some cases. This is a fundamentally hard problem and one which will have a significant impact upon the practical viability of deploying ML solutions in some fields. Typically, these concerns have an inverse relationship with security and privacy requirements so it is important that tooling consider the balance of these trade-offs when providing capabilities in these areas.
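The predeclared success criteria described above might be expressed as data and evaluated automatically, as in this hypothetical sketch. Returning both the decision and the per-criterion reasons keeps the promotion outcome auditable.

```python
def promote(metrics, criteria):
    """Evaluate a trained model's metrics against predeclared success
    criteria. criteria maps a metric name to (op, threshold), where op is
    ">=" or "<=". Returns (decision, reasons) so that no manual judgement
    is needed and the result can be written to the audit trail."""
    reasons = []
    for name, (op, threshold) in criteria.items():
        value = metrics.get(name)
        if value is None:
            passed = False  # a missing metric always blocks promotion
        elif op == ">=":
            passed = value >= threshold
        else:
            passed = value <= threshold
        reasons.append({"metric": name, "value": value,
                        "rule": f"{op} {threshold}", "passed": passed})
    return all(r["passed"] for r in reasons), reasons
```

Criteria declared this way can be versioned alongside the training scripts, so the bar a model had to clear is itself part of its lineage.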
**Management of shared dependencies between training and operational phases**

Appropriate separation of concerns should be facilitated in the implementation of training scripts so that aspects associated with the preprocessing of data are handled independently from the core training activities.

Tooling should provide clean mechanisms for efficiently and safely deploying code associated with preprocessing activities to both training and service deployment environments. It is critical that preprocessing algorithms are kept consistent between these environments at all times and a change in one must trigger a change in the other or raise an alert status capable of blocking an accidental release of mismatched libraries.

It should be considered that the target environment for deployment may differ from that of training, so it may be necessary to support implementations of preprocessing functions in different languages or architectures. Under such circumstances, tooling should provide robust mechanisms for ensuring an ongoing link between the implementations of these algorithms, preferably with a single, unified form of testing to prevent divergence.
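A minimal sketch of one way to keep preprocessing consistent between training and serving: fingerprint the preprocessing specification at training time and block a release when the serving side no longer matches. The spec format and metadata keys are invented for illustration.

```python
import hashlib
import json

def fingerprint(spec):
    """Content hash of a preprocessing specification (ordered steps and
    their parameters), recorded alongside the trained model."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def assert_consistent(model_metadata, serving_spec):
    """Release gate: fail loudly when serving preprocessing has diverged
    from what the model was trained with."""
    if model_metadata["preprocessing"] != fingerprint(serving_spec):
        raise RuntimeError("preprocessing diverged between training and serving")
```

Because the fingerprint covers parameters as well as step order, even a silent change to something like a tokeniser length limit blocks the release.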
**Abstraction layer for models**

It is unsafe to assume that models will be deployed into environments that closely match those in which the model was originally trained. As a result, the use of object serialisation for model description is only viable in a narrow range of potential use cases. Typically, we need to be able to use a platform-independent intermediate format to describe models, so that this can act as a normalised form for data exchange between training and operational environments.

This form must be machine readable and structured in such a way that it is extensible to support a wide range of evolving ML techniques and can easily have new marshalling and unmarshalling components added as new source and target environments are adopted. The normalised form should be structured such that the training environment has no need to know about the target operational environment in advance.
**Longevity of ML assets**

With traditional software assets such as binaries, there is some expectation of backwards compatibility (in the case of, for example, Java, this compatibility of binaries has spanned decades). ML assets such as serialised models will need to have some reasonable backwards compatibility. Versioning of the artefacts used to serve the model is also important (but that is typically a better-known challenge). In the case of backwards-breaking changes, ML assets such as training scripts, tests and data sets will need to be accessed to reproduce a model for the new target runtime.
**Managing and tracking trade-offs**

ML solutions always involve trade-offs between competing concerns. A typical example might be the trade-off between model accuracy, model explainability and data privacy. Viewed as the points of a triangle, we can select for a model that fits some point within the triangle, where its distance from each vertex represents proximity to that ideal case. Since we train models by discovery, we can only make changes to our training data sets and hyper-parameters and then evaluate the properties of any resulting model by testing against these desired properties. Because of this, it is important that MLOps tooling provides capabilities to automate much of the heavy lifting associated with managing trade-offs in general. This might take the form of automated training of variations upon a basic model which are subsequently tested against a panel of selection criteria, evaluated and presented to customers in such a way as to make their relevant properties easily interpretable.
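The triangle analogy above can be made concrete as a weighted distance from an ideal point, as in this illustrative sketch for ranking candidate model variants. The axes and weights are hypothetical; a customer would choose their own.

```python
import math

def rank_variants(variants, weights):
    """Rank model variants by weighted Euclidean distance from the ideal
    point (a score of 1.0 on every axis, e.g. accuracy, explainability,
    privacy). Lower distance means a better fit to the chosen balance."""
    def distance(scores):
        return math.sqrt(sum(
            weights[axis] * (1.0 - scores[axis]) ** 2
            for axis in weights
        ))
    return sorted(variants, key=lambda v: distance(v["scores"]))
```

Presenting the ranked list with the per-axis scores, rather than just a winner, is what makes the trade-off interpretable to the customer.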
**Escalation of data categories**

To obtain an accurate model, or to prevent the production of models with undesirable biases, it may be necessary to store data of a very sensitive nature or in legally protected categories. This data may be used to vet a model pre-release, or for training; in either case it will likely be persisted along with training scripts. This will require strong data protections (as with any database of a sensitive nature) and auditability of access. MLOps systems are likely to be targets of attack to obtain this data, meaning stronger protections than those required for just source code will be needed.

It is anticipated that regulatory requirements intended to reduce the impact of bias or fairness issues will have unintended consequences relating to the sensitivity of other data that must be collected to fulfil these requirements, creating additional privacy risks.
**Intrinsic protection of models**

Model inferencing will have to embrace modern application security techniques to protect the model against these kinds of attacks. Inferencing might be protected through restriction of access (tokens), rate-limiting, and monitoring of incoming traffic. In addition, as part of the integration test phase, there is a requirement to test the model against adversarial attacks (both common attacks and domain-specific attacks) in a sandboxed environment.

It must also be recognised that Python, whilst convenient as a language for expressing ML concepts, is an interpreted scripting language that is intrinsically insecure in production environments since any ad-hoc Python source that can be injected into a Python environment can be executed without constraint, even if shell access is disabled. The long term use of Python to build mission-critical ML models should be discouraged in favour of more secure-by-design options.
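Rate-limiting, one of the protections mentioned above, can be sketched as a per-client token bucket placed in front of the inference endpoint, slowing down black-box extraction and membership-inference probing. This is a simplification; production systems would combine it with authentication and traffic monitoring.

```python
class TokenBucket:
    """Token-bucket rate limiter for one client of an inference endpoint.
    Each request spends a token; tokens refill at a fixed rate, capping
    the sustained query rate an attacker can achieve."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        """Return True if a request at time `now` (seconds) may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the clock in explicitly (rather than calling `time.time()` inside) keeps the limiter deterministic and easy to test.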
**Emergency cut out**

As a model may need to be cut out abruptly, this may need to be done at the service wrapper level as an emergency measure. Rolling back to a previous version of the model and service wrapper is desirable, but only if it is fast enough for safety reasons. At the very least, the ability to deny service within a span of minutes in cases of misbehaviour is required. This cut out needs to be human-triggered at least, and possibly triggered via a live health check in the service wrapper. It is common in traditional service deployments to have health and liveness checks; a similar mechanism should exist for deployed models, where health includes acceptable behaviour.
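A minimal sketch of a service wrapper with an emergency cut-out, assuming a human-triggered switch and an optional safe fallback. All names here are hypothetical.

```python
class ServiceWrapper:
    """Wraps a deployed model with an emergency cut-out: an operator (or
    a failing behavioural health check) can disable the model instantly,
    without waiting for a redeploy, and optionally divert traffic to a
    known-safe fallback."""

    def __init__(self, model, fallback=None):
        self.model = model          # callable: features -> prediction
        self.fallback = fallback    # optional known-safe callable
        self.enabled = True

    def cut_out(self):
        """Immediately stop serving predictions from the model."""
        self.enabled = False

    def predict(self, features):
        if not self.enabled:
            if self.fallback is not None:
                return self.fallback(features)
            raise RuntimeError("model disabled by emergency cut-out")
        return self.model(features)
```

Because the switch lives in the wrapper rather than the deployment pipeline, denial of service takes effect on the next request, well inside the span-of-minutes requirement.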
**Online learning**

Self-learning ML techniques require that models are trained live in production environments, often continuously. This places the behaviour of such models outside the governance constraints of current MLOps platforms and introduces potentially unconstrained risk that must be managed in end user code. Further consideration must be given to identifying ways in which MLOps capabilities can be extended into this space in order to provide easier access to best known methods for mitigating the risk of degrading quality.
# Potential Solutions

This section addresses a timeline of potential solutions to assist in understanding the relative maturity of each area of concern and to highlight where significant gaps exist:

**Key:**
# Cross-Cutting Concerns

The following areas represent aspects of MLOps that overlap with issues covered by other SIGs and indicate where we must act to ensure alignment of work on these elements:

* CI/CD tooling across DevOps and MLOps scenarios

* Pipeline management

* Common testing aspects such as code coverage, code quality, licence tracking, CVE checking etc.

* Data quality

* Security
539 | 540 | # Conclusions and Recommendations 541 | 542 | 543 | 544 | As of 2020 we believe that, whilst some techniques exist for treating ML assets and practices as first-class, they are not evenly applied, and should be widely adopted, practised and continuously improved. 545 | 546 | Throughout 2020 there have been interesting developments in the application of AI and ML, often negative, as in the cases of knee-jerk regulation, bias and negative press. Many of these could be addressed or improved (and in some cases prevented) by applying the MLOps practices discussed in this document. Some techniques require more research; however, many already exist in some form and merely require continuous improvement in the future. The one area in which we do not have a clear line of sight is Online Learning (including but not limited to scenarios like model-less reinforcement learning), and more work must be done to understand how we can further governance goals in this context. 547 | 548 | We recommend that development continue in the next year toward improved auditability of changes to training data, scripts and service wrappers across their full lifecycle. As AI-powered systems mature there will be a greater requirement to treat models like other software assets throughout their lifecycle. Training data must become a managed asset like any other input to software development. 549 | 550 | We anticipate that the security of data used for training, and the tension between the issues of machine learning, privacy, GDPR and the "right to be forgotten", will remain ongoing challenges until well into 2022. 551 | 552 |
553 | 554 | # Glossary 555 | 556 | Short definitions of terms used that aren't explained inline. 557 | 558 | * Training: the act of combining data with (hyper-)parameters to yield a model 559 | * Model: a deployable (usually binary) artefact that is the result of training. It can be used at runtime, for example to make predictions 560 | * Hyper-parameter: a parameter used during training of a model, typically set by a data scientist/human 561 | * Parameter: a configuration variable set after the training phase (usually part of a model) 562 | * Endpoint: typically a model is deployed to an endpoint, which may be an HTTPS service that serves up predictions (models may also be deployed to devices and other places) 563 | * Training Pipeline: all the steps needed to prepare a model 564 | * Training set: a set of example data used for training a model 565 | 566 | 567 | 568 | 569 | 570 | -------------------------------------------------------------------------------- /roadmap/2020/images/README.md: -------------------------------------------------------------------------------- 1 | # Supporting images for MLOps Roadmap. 2 | 3 | GitHub Markdown has limited support for SVG so **please do not try editing these with an image editing package!** 4 | 5 | They are simple enough to modify using a basic text editor.
-------------------------------------------------------------------------------- /roadmap/2020/images/key.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Key 13 | 14 | 15 | 16 | 17 | 18 | To Be Determined 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Research Required 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | Development Underway 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | Qualification / Pre-Production 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | Continuous Improvement 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /roadmap/2020/images/mlops-horizontal-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cdfoundation/sig-mlops/3ca07a452bd12310df84202286714cf2048ea0f1/roadmap/2020/images/mlops-horizontal-color.png -------------------------------------------------------------------------------- /roadmap/2020/images/mlops-horizontal-color.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 13 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 41 | 42 | 43 | 44 | 45 | 59 | 60 | 62 | 67 | 73 | 77 | 81 | 85 | 90 | 93 | 97 | 101 | 105 | 107 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /roadmap/2020/images/solutions.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Potential Solutions 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | Solution 23 | 2020 24 | 2021 25 | 2022 26 | 2023 27 | 2024 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | Educating data science teams regarding the risks of trying 40 | to use Jupyter Notebooks in production 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | Treat ML assets as first class citizens in 
DevOps processes 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | Providing mechanisms by which training sets, training scripts 65 | and service wrappers may all be versioned appropriately 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | Providing mechanisms by which changes to training sets, 78 | training scripts and service wrappers may all be auditable 79 | across their full lifecycle 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | Treating training sets as managed assets under a MLOps 92 | workflow 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | Managing the security of data in the MLOps process with 106 | particular focus upon the increased risk associated with 107 | aggregated data sets used for training or batch processing 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | Implications of privacy, GDPR and the right to be forgotten 120 | upon training sets and deployed models 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | Methods for wrapping trained models as deployable services 134 | in scenarios where data scientists training the models may not 135 | be experienced software developers 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | Approaches for enabling all Machine Learning frameworks to 148 | be used within the scope of MLOps, regardless of language or 149 | platform 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | Approaches for enabling MLOps to support a broad range of 162 | target platforms, including but not limited to CPU, GPU, TPU, 163 | custom ASICs and neuromorphic silicon 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | Methods for ensuring efficient use of hardware in both training 176 | and operational scenarios 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | Approaches for applying MLOps to very large scale problems 190 | at petabyte 
scale and beyond 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | Providing appropriate pipeline tools to manage MLOps 204 | workflows transparently as part of existing DevOps solutions 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | Testing ML assets appropriately 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | Governance processes for managing the release cycle of 232 | MLOps assets, including Responsible AI principles 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | Management of shared dependencies between training and 246 | operational phases 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | Abstraction layer for models 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | Longevity of ML assets 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | Managing and tracking trade-offs 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | Escalation of data categories 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | Intrinsic protection of models 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | Emergency cut out 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | Online learning 344 | 345 | 346 | 347 | 348 | 349 | 350 | -------------------------------------------------------------------------------- /roadmap/2021/images/README.md: -------------------------------------------------------------------------------- 1 | # Supporting images for MLOps Roadmap. 2 | 3 | GitHub Markdown has limited support for SVG so **please do not try editing these with an image editing package!** 4 | 5 | They are simple enough to modify using a basic text editor. 
-------------------------------------------------------------------------------- /roadmap/2021/images/key.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Key 13 | 14 | 15 | 16 | 17 | 18 | To Be Determined 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Research Required 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | Development Underway 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | Qualification / Pre-Production 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | Continuous Improvement 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /roadmap/2021/images/mlops-horizontal-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cdfoundation/sig-mlops/3ca07a452bd12310df84202286714cf2048ea0f1/roadmap/2021/images/mlops-horizontal-color.png -------------------------------------------------------------------------------- /roadmap/2021/images/mlops-horizontal-color.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 13 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 41 | 42 | 43 | 44 | 45 | 59 | 60 | 62 | 67 | 73 | 77 | 81 | 85 | 90 | 93 | 97 | 101 | 105 | 107 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /roadmap/2021/images/solutions.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Potential Solutions 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | Solution 23 | 2021 24 | 2022 25 | 2023 26 | 2024 27 | 2025 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | Educating data science teams regarding the risks of trying 40 | to use Jupyter Notebooks in production 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | Treat ML assets as first class citizens in 
DevOps processes 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | Providing mechanisms by which training sets, training scripts 65 | and service wrappers may all be versioned appropriately 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | Providing mechanisms by which changes to training sets, 78 | training scripts and service wrappers may all be auditable 79 | across their full lifecycle 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | Treating training sets as managed assets under a MLOps 92 | workflow 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | Managing the security of data in the MLOps process with 106 | particular focus upon the increased risk associated with 107 | aggregated data sets used for training or batch processing 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | Implications of privacy, GDPR and the right to be forgotten 120 | upon training sets and deployed models 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | Methods for wrapping trained models as deployable services 134 | in scenarios where data scientists training the models may not 135 | be experienced software developers 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | Approaches for enabling all Machine Learning frameworks to 148 | be used within the scope of MLOps, regardless of language or 149 | platform 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | Approaches for enabling MLOps to support a broad range of 162 | target platforms, including but not limited to CPU, GPU, TPU, 163 | custom ASICs and neuromorphic silicon 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | Methods for ensuring efficient use of hardware in both training 176 | and operational scenarios 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | Approaches for applying MLOps to very large scale problems 190 | at petabyte 
scale and beyond 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | Providing appropriate pipeline tools to manage MLOps 204 | workflows transparently as part of existing DevOps solutions 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | Testing ML assets appropriately 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | Governance processes for managing the release cycle of 232 | MLOps assets, including Responsible AI principles 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | Management of shared dependencies between training and 246 | operational phases 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | Abstraction layer for models 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | Longevity of ML assets 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | Managing and tracking trade-offs 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | Escalation of data categories 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | Intrinsic protection of models 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | Emergency cut out 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | Online learning 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | Prioritising training activities 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | Guardrail metrics 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | Government regulation of AI 386 | 387 | 388 | 389 | 390 | 391 | 392 | -------------------------------------------------------------------------------- /roadmap/2022/images/README.md: 
-------------------------------------------------------------------------------- 1 | # Supporting images for MLOps Roadmap. 2 | 3 | GitHub Markdown has limited support for SVG so **please do not try editing these with an image editing package!** 4 | 5 | They are simple enough to modify using a basic text editor. -------------------------------------------------------------------------------- /roadmap/2022/images/key.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Key 13 | 14 | 15 | 16 | 17 | 18 | To Be Determined 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | Research Required 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | Development Underway 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | Qualification / Pre-Production 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | Continuous Improvement 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /roadmap/2022/images/mlops-horizontal-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cdfoundation/sig-mlops/3ca07a452bd12310df84202286714cf2048ea0f1/roadmap/2022/images/mlops-horizontal-color.png -------------------------------------------------------------------------------- /roadmap/2022/images/mlops-horizontal-color.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 13 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 41 | 42 | 43 | 44 | 45 | 59 | 60 | 62 | 67 | 73 | 77 | 81 | 85 | 90 | 93 | 97 | 101 | 105 | 107 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /roadmap/2022/images/solutions.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Potential Solutions 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | Solution 23 | 
2022 24 | 2023 25 | 2024 26 | 2025 27 | 2026 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | Educating data science teams regarding the risks of trying 40 | to use Jupyter Notebooks in production 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | Treat ML assets as first class citizens in DevOps processes 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | Providing mechanisms by which training sets, training scripts 65 | and service wrappers may all be versioned appropriately 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | Providing mechanisms by which changes to training sets, 78 | training scripts and service wrappers may all be auditable 79 | across their full lifecycle 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | Treating training sets as managed assets under a MLOps 92 | workflow 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | Managing the security of data in the MLOps process with 106 | particular focus upon the increased risk associated with 107 | aggregated data sets used for training or batch processing 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | Implications of privacy, GDPR and the right to be forgotten 120 | upon training sets and deployed models 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | Methods for wrapping trained models as deployable services 134 | in scenarios where data scientists training the models may not 135 | be experienced software developers 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | Approaches for enabling all Machine Learning frameworks to 148 | be used within the scope of MLOps, regardless of language or 149 | platform 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | Approaches for enabling MLOps to support a broad range of 162 | target platforms, including but not limited to CPU, GPU, TPU, 163 | custom ASICs and neuromorphic silicon 164 | 165 | 
166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | Methods for ensuring efficient use of hardware in both training 176 | and operational scenarios 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | Approaches for applying MLOps to very large scale problems 190 | at petabyte scale and beyond 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | Providing appropriate pipeline tools to manage MLOps 204 | workflows transparently as part of existing DevOps solutions 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | Testing ML assets appropriately 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | Governance processes for managing the release cycle of 232 | MLOps assets, including Responsible AI principles 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | Management of shared dependencies between training and 246 | operational phases 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | Abstraction layer for models 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | Longevity of ML assets 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | Managing and tracking trade-offs 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | Escalation of data categories 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | Intrinsic protection of models 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | Emergency cut out 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | Online learning 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | Prioritising training activities 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | Guardrail metrics 372 | 373 | 374 | 375 | 376 | 
377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | Government regulation of AI 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | Understanding ML models as a part of the broader product(s) 400 | in which they reside (rather than as independent products) 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | Educating data science practitioners on the approaches and 414 | best practices used in product development while educating 415 | product development teams on the requirements of ML 416 | 417 | 418 | 419 | -------------------------------------------------------------------------------- /roadmap/2024/images/README.md: -------------------------------------------------------------------------------- 1 | # Supporting images for MLOps Roadmap. 2 | 3 | GitHub Markdown has limited support for SVG so **please do not try editing these with an image editing package!** 4 | 5 | They are simple enough to modify using a basic text editor. 
-------------------------------------------------------------------------------- /roadmap/2024/images/key.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Key 14 | 15 | 16 | 17 | 18 | 19 | To Be Determined 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | Research Required 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | Development Underway 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | Qualification / Pre-Production 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | Continuous Improvement 52 | 53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /roadmap/2024/images/mlops-horizontal-color.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cdfoundation/sig-mlops/3ca07a452bd12310df84202286714cf2048ea0f1/roadmap/2024/images/mlops-horizontal-color.png -------------------------------------------------------------------------------- /roadmap/2024/images/mlops-horizontal-color.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 13 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 41 | 42 | 43 | 44 | 45 | 59 | 60 | 62 | 67 | 73 | 77 | 81 | 85 | 90 | 93 | 97 | 101 | 105 | 107 | 112 | 113 | 114 | -------------------------------------------------------------------------------- /roadmap/2024/images/solutions.svg: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Potential Solutions 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | Solution 24 | 2025 25 | 2026 26 | 2027 27 | 2028 28 | 2029 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | Educating data science teams regarding the risks of trying 41 | to use Jupyter Notebooks in production 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | Treat ML assets as first class 
citizens in DevOps processes 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | Providing mechanisms by which training sets, training scripts 66 | and service wrappers may all be versioned appropriately 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | Providing mechanisms by which changes to training sets, 79 | training scripts and service wrappers may all be auditable 80 | across their full lifecycle 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | Treating training sets as managed assets under a MLOps 93 | workflow 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | Managing the security of data in the MLOps process with 107 | particular focus upon the increased risk associated with 108 | aggregated data sets used for training or batch processing 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | Implications of privacy, GDPR and the right to be forgotten 121 | upon training sets and deployed models 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | Methods for wrapping trained models as deployable services 135 | in scenarios where data scientists training the models may not 136 | be experienced software developers 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | Approaches for enabling all Machine Learning frameworks to 149 | be used within the scope of MLOps, regardless of language or 150 | platform 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | Approaches for enabling MLOps to support a broad range of 163 | target platforms, including but not limited to CPU, GPU, TPU, 164 | custom ASICs and neuromorphic silicon 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | Methods for ensuring efficient use of hardware in both training 177 | and operational scenarios 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | Approaches for applying MLOps to very large scale problems 191 | 
at petabyte scale and beyond 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | Providing appropriate pipeline tools to manage MLOps 205 | workflows transparently as part of existing DevOps solutions 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | Testing ML assets appropriately 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | Governance processes for managing the release cycle of 233 | MLOps assets, including Responsible AI principles 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | Management of shared dependencies between training and 247 | operational phases 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | Abstraction layer for models 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | Longevity of ML assets 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | Managing and tracking trade-offs 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | Escalation of data categories 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | Intrinsic protection of models 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | Emergency cut out 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | Online learning 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | Prioritising training activities 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | Guardrail metrics 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | Government regulation of AI 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | Understanding ML models as a part of the broader product(s) 401 | in which they reside (rather than as independent products) 402 | 
403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | Educating data science practitioners on the approaches and 415 | best practices used in product development while educating 416 | product development teams on the requirements of ML 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | Vulnerability to Supply Chain attacks 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | Understanding the chain of responsibility associated with 443 | ML assets 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | Managing ethical and legal issues in the supply chain 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | Loss of control over IP 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | Risk of overspend on Cloud resources 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | Energy usage and sustainability 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | Availability of critical hardware 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | Risks of feedback loops 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | Mistaking plausibility for reasoning 541 | 542 | 543 | 544 | 545 | 546 | -------------------------------------------------------------------------------- /roadmap/README.md: -------------------------------------------------------------------------------- 1 | # mlops-roadmap 2 | 3 | CDF Special Interest Group - *MLOps Roadmap* 4 | 5 | This is the home of the MLOps Roadmap working group. --------------------------------------------------------------------------------