├── .gitignore ├── .gitmodules ├── 1807.05351.pdf ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── ML Schema Core Specification.pdf ├── README.md ├── _prior_art ├── README.md ├── kubeflow │ ├── Artifact.proto │ ├── ArtifactConnection.proto │ ├── Data.proto │ ├── Executable.proto │ ├── Framework.proto │ ├── Metrics.proto │ ├── Model.proto │ ├── Project.proto │ ├── README.md │ ├── Run.proto │ ├── TimeRange.proto │ └── UUID.proto ├── mlflow │ └── README.md ├── modeldb │ └── README.md ├── pachyderm │ └── README.md └── seldon │ └── README.md ├── common └── object.md ├── data ├── artifact.md ├── datapath.md ├── dataset.md ├── dataset.yml ├── datastore.md └── readme.md ├── docs ├── archive │ └── README-old.md └── assets │ └── logos │ ├── mlspec_logo.png │ └── mlspec_logo_light.png ├── experiment_tracking ├── README.md ├── experiment_example.yml └── run.md ├── logging_proto ├── README.md └── inferenceLog.yml ├── metadata_file ├── README.md └── metadata.yaml ├── model_packaging ├── README.md ├── data.yaml ├── model.yaml ├── model_example.yml ├── model_onnx_conversion.yaml ├── model_packaging.md ├── model_packaging.yaml ├── model_scoring.yaml ├── model_serving.yaml └── model_training.yaml ├── monitoring_proto ├── README.md └── inferenceRequest.yml └── pipelines ├── module.md ├── pipeline.md └── pipeline.yml /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "_prior_art/_prior_art/tf_metadata"] 2 | path = _prior_art/_prior_art/tf_metadata 3 | url = https://github.com/tensorflow/metadata/ 4 | -------------------------------------------------------------------------------- /1807.05351.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/1807.05351.pdf -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | We have adopted the [Contributor Covenant](https://www.contributor-covenant.org/version/2/0/code_of_conduct/) as our Code of Conduct. 4 | 5 | Please read the full text at [https://www.contributor-covenant.org/version/2/0/code_of_conduct/](https://www.contributor-covenant.org/version/2/0/code_of_conduct/). 6 | 7 | For any questions or concerns, please contact the project maintainers. 8 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to ML Spec 2 | 3 | Thank you for your interest in contributing to ML Spec! We welcome contributions from the community to help improve and advance the project. This document outlines the guidelines and best practices for contributing to ML Spec. 4 | 5 | ## Code of Conduct 6 | 7 | We have adopted the [Contributor Covenant](https://www.contributor-covenant.org/version/2/0/code_of_conduct/) as our Code of Conduct. By participating in this project, you agree to abide by its terms. Please read the [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) file for more information. 
8 | 9 | ## Ways to Contribute 10 | 11 | There are several ways you can contribute to ML Spec: 12 | 13 | - Reporting bugs and issues 14 | - Suggesting enhancements and new features 15 | - Improving documentation 16 | - Submitting pull requests with bug fixes or new features 17 | - Providing feedback and suggestions for improvement 18 | 19 | ## Getting Started 20 | 21 | 1. Fork the ML Spec repository on GitHub. 22 | 2. Clone your forked repository to your local machine. 23 | 3. Create a new branch for your contribution. 24 | 4. Make your changes and commit them with descriptive messages. 25 | 5. Push your changes to your forked repository. 26 | 6. Submit a pull request to the main ML Spec repository. 27 | 28 | ## Pull Request Guidelines 29 | 30 | When submitting a pull request, please ensure the following: 31 | 32 | - Provide a clear and descriptive title. 33 | - Include a detailed description of the changes made and the problem they solve. 34 | - Reference any relevant issues or pull requests. 35 | - Ensure your code follows the project's coding conventions and style guide. 36 | - Include tests for any new functionality or bug fixes. 37 | - Update the documentation if necessary. 38 | 39 | ## Issue Reporting 40 | 41 | If you encounter a bug or have a feature request, please submit an issue on the GitHub issue tracker. When reporting an issue, please provide the following information: 42 | 43 | - A clear and descriptive title. 44 | - A detailed description of the issue or feature request. 45 | - Steps to reproduce the issue (if applicable). 46 | - Any relevant error messages or logs. 47 | - Your operating system and version. 48 | - Any other relevant information. 49 | 50 | ## Communication 51 | 52 | If you have any questions, suggestions, or need clarification, you can reach out to the maintainers and the community through the following channels: 53 | 54 | - GitHub Issues: For bug reports, feature requests, and general discussions related to the project. 55 | - ML Spec Mailing List: For general discussions, announcements, and community engagement. 56 | - ML Spec Slack Channel: For real-time communication and collaboration with the community. 57 | 58 | ## License 59 | 60 | By contributing to ML Spec, you agree that your contributions will be licensed under the [Apache License 2.0](https://github.com/mlspec/MLSpec/blob/master/LICENSE). 61 | 62 | Thank you for your contributions and helping to make ML Spec better! 63 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /ML Schema Core Specification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/ML Schema Core Specification.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![MLSpec Logo](./docs/assets/logos/mlspec_logo.png#gh-light-mode-only) 2 | ![MLSpec Logo](./docs/assets/logos/mlspec_logo_light.png#gh-dark-mode-only) 3 | 4 | MLSpec is an open source framework for defining and verifying machine learning (ML) workflows. The project provides a standardized schema and libraries to specify and validate the various stages of an ML pipeline, from data preprocessing to model training, evaluation, and deployment. 5 | 6 | ## Table of Contents 7 | 8 | - [Background](#background) 9 | - [Foundational Work](#foundational-work) 10 | - [Existing Multi-Stage ML Workflows](#existing-multi-stage-ml-workflows) 11 | - [Specification](#specification) 12 | - [MLSpec Standards](#mlspec-standards) 13 | - [End-to-End Complete Lifecycle](#end-to-end-complete-lifecycle) 14 | - [Repository Structure](#repository-structure) 15 | - [Vision and Direction](#vision-and-direction) 16 | - [Enhancing Model Interpretability and Trust](#enhancing-model-interpretability-and-trust) 17 | - [Roadmap](#roadmap) 18 | - [Contributing](#contributing) 19 | - [Code of Conduct](#code-of-conduct) 20 | - [Acknowledgments](#acknowledgments) 21 | [Help Shape The Future of MLSpec!](#help-shape-the-future-of-mlspec) 22 | 23 | ## Background 24 | 25 | The field of machine learning has seen significant advancements in recent years, with the development of various frameworks, tools, and platforms to support the ML lifecycle. However, the lack of standardization and interoperability among these tools has led to challenges in reproducing, sharing, and governing ML workflows across different environments and organizations. 26 | 27 | ### Foundational Work 28 | 29 | MLSpec builds upon the ideas and concepts from the foundational work in the field of ML workflow specification and verification. Some notable contributions include: 30 | 31 | - [Predictive Model Markup Language (PMML)](https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language) 32 | - [Portable Format for Analytics (PFA)](https://dmg.org/pfa/) 33 | - [ML Metadata](https://www.tensorflow.org/tfx/guide/mlmd) 34 | 35 | These projects have paved the way for standardizing ML model serialization and metadata, but they primarily focus on individual models rather than end-to-end workflows. 36 | 37 | MLSpec also draws inspiration from the MLSchema project, which has made significant contributions to the field of ML workflow specification. 
Due to the instability of the original MLSchema website, we have mirrored some of their key resources here: 38 | 39 | - [MLSchema Paper](https://github.com/mlspec/MLSpec/blob/master/1807.05351.pdf) 40 | - [MLSchema Core Specification](https://github.com/mlspec/MLSpec/blob/master/ML%20Schema%20Core%20Specification.pdf) 41 | 42 | These resources provide valuable insights into the design principles and approaches behind ML workflow specification and have influenced the development of MLSpec. 43 | 44 | ### Existing Multi-Stage ML Workflows 45 | 46 | Several prominent companies and organizations have developed their own multi-stage ML workflow solutions to address the challenges of managing end-to-end machine learning pipelines. These projects have focused on combining ML and batch processing capabilities to create robust and scalable workflows. 47 | 48 | Some notable examples of these multi-stage ML workflow solutions include: 49 | 50 | - [Facebook’s FBLearner Flow](https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/): FBLearner Flow is Facebook's internal ML platform that enables engineers and data scientists to build, train, and deploy ML models at scale. It provides a unified interface for managing the entire ML lifecycle, from data preparation to model serving. 51 | 52 | - [Google's TFX:](https://dl.acm.org/citation.cfm?id=3098021) TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. It provides a suite of components and libraries for building, training, and serving ML models, with a focus on scalability and reproducibility. 53 | 54 | - [Kubeflow Pipelines](https://cloud.google.com/blog/products/ai-machine-learning/getting-started-kubeflow-pipelines): Kubeflow Pipelines is an open-source platform for building and deploying portable, scalable ML workflows on Kubernetes. It allows users to define and orchestrate complex ML pipelines using a declarative approach. 55 | 56 | - [Microsoft Azure ML Pipelines](https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2): Azure ML Pipelines is a service within the Azure Machine Learning platform that enables the creation and management of end-to-end ML workflows. It provides a visual interface and SDK for building, scheduling, and monitoring ML pipelines. 57 | 58 | - [Netflix Meson](https://netflixtechblog.com/meson-workflow-orchestration-for-netflix-recommendations-fc932625c1d9): Meson is Netflix's internal platform for building and managing ML workflows. It provides a unified framework for data preparation, model training, and deployment, with a focus on scalability and ease of use. 59 | 60 | - [Spotify's Luigi](https://github.com/spotify/luigi): Luigi is an open-source Python library developed by Spotify for building and scheduling complex pipelines of batch jobs. While not specifically designed for ML workflows, it has been widely adopted in the ML community for managing data processing and model training pipelines. 61 | 62 | - [Uber's Michelangelo](https://www.uber.com/blog/michelangelo-machine-learning-platform/): Michelangelo is Uber's ML platform that powers a wide range of services, from fraud detection to customer support. It provides end-to-end functionality for building, training, and deploying ML models at scale. 63 | 64 | From studying these projects and their associated papers, we have identified a set of common steps that encompass the end-to-end machine learning workflow. 
These steps include data ingestion, data preparation, feature engineering, model training, model evaluation, model deployment, and monitoring. 65 | 66 | MLSpec aims to provide a standardized specification and framework for defining and executing these multi-stage ML workflows, drawing inspiration from the successful approaches and best practices established by these industry leaders. 67 | 68 | # Specification 69 | 70 | ## MLSpec Standards 71 | 72 | MLSpec aims to establish a set of standards for various components involved in an end-to-end machine learning workflow. By defining these standards, MLSpec seeks to promote interoperability, reproducibility, and best practices across different ML tools and platforms. 73 | 74 | The standards cover the following key areas: 75 | 76 | 1. **Workflow Orchestration**: MLSpec will define a standard set of endpoints that each step in an ML workflow should expose. These endpoints may include: 77 | - `/ok`: An endpoint to check the health and readiness of the step. 78 | - `/varz`: An endpoint to retrieve various runtime variables and configurations. 79 | - `/metrics`: An endpoint to expose performance metrics and monitoring data. 80 | - Additional endpoints for step-specific functionality and control. 81 | 82 | By standardizing these endpoints, MLSpec enables consistent monitoring, control, and integration of ML workflow steps across different orchestration platforms. A minimal sketch of such a step appears after this list. 83 | 84 | 2. **Model Management**: MLSpec will provide guidelines and standards for versioning, packaging, and deploying ML models. This may include: 85 | - Model serialization formats and conventions. 86 | - Metadata schemas for capturing model provenance, hyperparameters, and performance metrics. 87 | - APIs for model serving and inference. 88 | - Best practices for model versioning and lineage tracking. 89 | 90 | Standardizing model management practices promotes model reproducibility, interpretability, and maintainability. 91 | 92 | 3. **Logging**: MLSpec will define a standard logging format for capturing relevant information about each inference request. This logging format will align with the NCSA (National Center for Supercomputing Applications) Common Log Format, which includes fields such as: 93 | - Timestamp 94 | - Request ID 95 | - User ID 96 | - Model version 97 | - Input data 98 | - Output predictions 99 | - Latency 100 | - Additional metadata 101 | 102 | Adopting a standardized logging format enables consistent monitoring, debugging, and analysis of ML system behavior. 103 | 104 | 4. **Data Validation and Quality**: MLSpec will provide guidelines and standards for validating and ensuring the quality of input data at various stages of the ML workflow. This may include: 105 | - Data schema validation. 106 | - Data quality checks (e.g., missing values, outliers, data drift). 107 | - Data versioning and lineage tracking. 108 | - Integration with data validation frameworks and tools. 109 | 110 | Ensuring data quality and validation helps maintain the integrity and reliability of ML workflows. 111 | 112 | 5. **Experiment Tracking**: MLSpec will define standards for tracking and managing ML experiments, including: 113 | - Experiment metadata schemas. 114 | - APIs for logging and querying experiment runs. 115 | - Best practices for organizing and versioning experiment artifacts. 116 | - Integration with popular experiment tracking frameworks. 117 | 118 | Standardizing experiment tracking enables reproducibility, comparison, and analysis of ML experiments.
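As a concrete illustration of the workflow-orchestration endpoints above, the following is a minimal sketch of a single pipeline step exposing `/ok`, `/varz`, and `/metrics` over HTTP. Only the endpoint names come from the standard described in item 1; the step name, variables, and JSON payload shapes shown here are illustrative assumptions, not part of the specification.

```python
# Minimal sketch of an MLSpec-style workflow step, using only the Python
# standard library. Endpoint names (/ok, /varz, /metrics) follow the text
# above; payload contents are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical runtime state for a single pipeline step.
STEP_VARS = {"step_name": "data-validation", "model_version": "0.3.1"}
STEP_METRICS = {"requests_total": 0, "last_latency_ms": None}


class StepHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/ok":          # health and readiness of the step
            self._reply(200, {"status": "ok"})
        elif self.path == "/varz":      # runtime variables and configuration
            self._reply(200, STEP_VARS)
        elif self.path == "/metrics":   # performance and monitoring data
            self._reply(200, STEP_METRICS)
        else:
            self._reply(404, {"error": "unknown endpoint"})


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StepHandler).serve_forever()
```

With a step wired up this way, any orchestrator can probe it uniformly (e.g., `curl http://localhost:8080/ok`), regardless of which ML framework the step itself uses.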
119 | 120 | By establishing standards across these critical components of the ML workflow, MLSpec aims to foster a more robust, interoperable, and governed ecosystem for machine learning development and deployment. 121 | 122 | ## End-to-End Complete Lifecycle 123 | 124 | With MLSpec, we believe that every stage of an ML lifecycle requires some form of metadata management. We have identified the following steps as critical components of a complete ML lifecycle: 125 | 126 | 1. **Codify Objectives**: Detail the model outputs, possible errors, and minimum success criteria for launching in code. Use a simple DSL that can be used to verify success/failure programmatically for automated deployment. 127 | 128 | 2. **Data Ingestion**: Specify the tools/connectors (e.g., ODBC, Spark, HDFS, CSV, etc.) used for pulling in data, along with the queries used (including signed datasets), sharding strategies, and any labeling or synthetic data generation/simulation techniques. 129 | 130 | 3. **Data Analysis**: Provide a set of descriptive statistics on the included features and configurable slices of the data. Identify outliers. 131 | 132 | 4. **Data Transformation**: Document the data conversions and feature wrangling techniques (e.g., feature to-integer mappings) used, as well as any outliers that were programmatically eliminated. 133 | 134 | 5. **Data Validation**: Apply validation to the data based on a versioned, succinct description of the expected properties of the data. Use schemas to prevent bad behavior, such as training on deprecated data. Provide mechanisms to generate the first version of the schema (e.g., `select * from foo limit 30`) that can be used to drive other platform components, such as automatic feature-engineering or data-analysis tools. 135 | 136 | 6. **Data Splitting (including partitioning)**: Record how the data is split into training, validation, hold back, and debugging sets, along with the results of validation for statistics of each set. Use metadata to detect leakage of training data into testing data and/or overfitting. 137 | 138 | 7. **Model Training/Tuning**: Capture metadata about how the model is packaged and the distribution strategy, hyperparameters searched and results of the search, results of any conversions to other model serving formats (e.g., TF -> ONNX), and techniques used to quantize/compress/prune the model and the results. 139 | 140 | 8. **Model Evaluation/Validation**: Record the results of evaluation and validation of models to ensure they meet the original codified objectives before serving them to users. Compute metrics on slices of data, both for improving performance and avoiding bias (e.g., gender A gets significantly better results than gender B). Document the source of data used for validation. 141 | 142 | 9. **Test**: Record the results of final confirmation for the model on the hold back data set. This MUST BE A SEPARATE STEP FROM #8. Document the source of data used for the final test. 143 | 144 | 10. **Model Packaging**: Capture metadata about the model package, including additional security constraints, monitoring agents, signing, etc. Provide descriptions of the necessary infrastructure (e.g., P100s, 16 GB of RAM, etc.). 145 | 146 | 11. **Serving**: Record the results of rolling the model out to production. 147 | 148 | 12. 
**Monitoring**: Provide live queryable metadata that enables liveness checking and ML-specific signals that need action, such as significant deviation from previous model performance or degradation of the model performance over time. Include a rollback strategy (e.g., if this model is failing, use model `last-year.last-month.pkl`). 149 | 150 | 13. **Logging**: Generate an NCSA-style record per inference request, including a cryptographically secure record of the version of the pipeline (including features) and data used to train. 151 | 152 | By capturing and managing metadata at each stage of the ML lifecycle, MLSpec aims to provide a comprehensive and standardized approach to ensuring the reproducibility, interpretability, and governance of end-to-end machine learning workflows. 153 | 154 | ## Repository Structure 155 | 156 | - [common](./common) 157 | 158 | - [object](./common/object.md) 159 | 160 | General notes applicable to multiple objects in the system. How they are identified and named, basic operations, etc. 161 | 162 | - [data](./data) 163 | 164 | - [datastore](./data/datastore.md) 165 | 166 | Data stores 167 | 168 | - [datapath](./data/datapath.md) 169 | 170 | Data references 171 | 172 | - [artifact](./data/artifact.md) 173 | 174 | Data produced by runs 175 | 176 | - [dataset](./data/dataset.md) 177 | 178 | Named and versioned data in storage 179 | 180 | - [pipelines](./pipelines) 181 | 182 | - [pipeline](./pipelines/pipeline.md) 183 | 184 | DAG for executing computation on data and training and deploying models 185 | 186 | - [module](./pipelines/module.md) 187 | 188 | Reusable definition of a computation, including its script, expected inputs, outputs, etc. 189 | 190 | - [experiment_tracking](./experiment_tracking) 191 | 192 | - [run](./experiment_tracking/run.md) 193 | 194 | Tracked execution of a pipeline or single script on compute 195 | 196 | - [model_packaging](./model_packaging) 197 | 198 | - [models](./model_packaging/README.md) 199 | 200 | Trained models 201 | 202 | - logging_proto 203 | 204 | - monitoring_proto 205 | 206 | - [metadata_file](./metadata_file) 207 | 208 | - [metadata](./metadata_file/metadata.yaml) 209 | 210 | The metadata file used to recreate the ML workflow 211 | 212 | ## Vision and Direction 213 | 214 | Our vision for MLSpec is to establish it as a robust and widely adopted framework for defining, standardizing, and verifying complex ML workflows. We aim to: 215 | 216 | 1. **Enhance Framework Support**: Extend MLSpec to support the latest ML frameworks and libraries, such as PyTorch, XGBoost, and LightGBM, enabling seamless integration with cutting-edge techniques and architectures. 217 | 218 | 2. **Accommodate Complex Workflows**: Expand the MLSpec schema to accommodate intricate, multi-stage ML pipelines, including data preprocessing, feature engineering, model training, evaluation, and deployment. 219 | 220 | 3. **Integrate with MLOps and AutoML**: Align MLSpec with modern MLOps practices and AutoML frameworks, enabling streamlined workflow management and automation. 221 | 222 | 4. **Improve Governance and Compliance**: Introduce a methodology for recording attestations of workflow execution in accordance with the schema to support governance and compliance requirements. 223 | 224 | 5. **Foster Community Engagement**: Revitalize the project's community by improving documentation, providing clear contributing guidelines, and actively engaging with users and contributors. 225 | 226 | 6.
**Integrate with Workflow Orchestration Tools**: Provide seamless integration with popular workflow management platforms, such as Apache Airflow, Flyte, and Prefect, allowing users to leverage MLSpec for defining and verifying ML workflows within their existing orchestration pipelines. 227 | 228 | ## Enhancing Model Interpretability and Trust 229 | 230 | One of the key challenges in the adoption and deployment of machine learning models is their interpretability and the trust users place in them. Many ML models are often considered "black boxes," making it difficult for users to understand how they arrive at their predictions or decisions. This lack of transparency can lead to a lack of trust and hesitation in relying on these models, especially in critical domains such as healthcare, finance, and legal systems. 231 | 232 | MLSpec aims to address this challenge by providing a framework for building interpretable and transparent ML workflows. By standardizing the end-to-end ML lifecycle and promoting best practices for model development, evaluation, and deployment, MLSpec enables users to gain insights into the behavior and decision-making process of ML models. 233 | 234 | Some of the ways MLSpec promotes model interpretability and trust include: 235 | 236 | 1. **Standardized Model Metadata**: MLSpec defines a standard schema for capturing and storing metadata about ML models, including information about their architecture, training data, hyperparameters, and performance metrics. This metadata provides a clear and comprehensive view of the model's characteristics and behavior. 237 | 238 | 2. **Model Evaluation and Validation**: MLSpec emphasizes rigorous model evaluation and validation practices to ensure that models meet the desired performance criteria and are free from biases or unintended consequences. By standardizing evaluation metrics and techniques, MLSpec enables users to assess the reliability and trustworthiness of models objectively. 239 | 240 | 3. **Model Explainability Techniques**: MLSpec encourages the use of model explainability techniques, such as feature importance analysis, partial dependence plots, and counterfactual explanations, to provide insights into how models make predictions. These techniques help users understand the factors influencing model decisions and identify potential issues or biases. 241 | 242 | 4. **Governance and Auditing**: MLSpec includes mechanisms for model governance and auditing, allowing organizations to track and monitor the lifecycle of ML models. This includes capturing information about model lineage, versioning, and approvals, ensuring that models adhere to regulatory requirements and ethical standards. 243 | 244 | By focusing on model interpretability and trust, MLSpec aims to foster the responsible development and deployment of ML models. It provides the necessary tools and guidelines to build transparent and accountable ML workflows, enabling users to have confidence in the models they use and the decisions they make. 245 | 246 | Join us in our mission to create a more interpretable and trustworthy ML ecosystem with MLSpec! 247 | 248 | ## Roadmap 249 | 250 | Our short-term goals (next 3-6 months) include: 251 | 252 | - [ ] Refactor the library codebase to improve maintainability and extensibility. 253 | - [ ] Add support for PyTorch and XGBoost frameworks. 254 | - [ ] Enhance the schema to accommodate complex, multi-stage workflows. 255 | - [ ] Implement workflow attestation and digital signing capabilities. 
256 | - [ ] Overhaul the documentation and contributing guidelines. 257 | - [ ] Develop an Apache Airflow operator/plugin for integrating MLSpec-defined workflows. 258 | 259 | Our medium-term goals (6-12 months) include: 260 | 261 | - [ ] Integrate with popular MLOps platforms and AutoML frameworks. 262 | - [ ] Develop tools and dashboards for governance and compliance reporting. 263 | - [ ] Collaborate with the Apache Airflow, Kubeflow, Prefect, Argo Workflows, MLflow, and Flyte communities to develop seamless integrations for defining and verifying ML workflows within their respective orchestration platforms. 264 | - [ ] Establish industry partnerships and collaborations to promote adoption. 265 | - [ ] Foster an active community of contributors and users. 266 | 267 | We welcome contributions from the community to refine and expand this roadmap. 268 | 269 | ## Contributing 270 | 271 | We are excited to have you on board! Please refer to the [CONTRIBUTING.md](CONTRIBUTING.md) file for detailed guidelines on how to get involved, whether it's by reporting issues, submitting pull requests, or participating in discussions. 272 | 273 | ## Code of Conduct 274 | 275 | To ensure a welcoming and inclusive environment, we have adopted the [Contributor Covenant Code of Conduct](CODE_OF_CONDUCT.md). Please review and adhere to these guidelines. 276 | 277 | ## Acknowledgments 278 | 279 | We would like to express our gratitude to David Aronchick (@aronchick) and the authors and contributors of MLSpec for their pioneering work on this project. Their efforts have laid the foundation for what we aim to achieve. 280 | 281 | ## Help Shape The Future of MLSpec! 282 | 283 | We invite you to be a part of the MLSpec journey. Try out the new developments, provide feedback, and contribute your ideas and code. Together, we can shape the future of standardized and verifiable ML workflows. 284 | 285 | For discussions and updates, stay tuned for our upcoming mailing list and social accounts. 286 | 287 | We believe MLSpec has the potential to become a foundational tool for building reliable and governed ML pipelines, and we look forward to working with the community to realize this vision. 288 | -------------------------------------------------------------------------------- /_prior_art/README.md: -------------------------------------------------------------------------------- 1 | **Example Repo** 2 | 3 | Examples of metadata used by other projects. 4 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Artifact.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information.
26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/ArtifactConnection.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 
15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Data.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 
6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Executable.proto: 
-------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 
| google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Framework.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 
5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Metrics.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID 
projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Model.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 
| string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/Project.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 
26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/_prior_art/kubeflow/README.md -------------------------------------------------------------------------------- /_prior_art/kubeflow/Run.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 
6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/TimeRange.proto: 
-------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 | string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 
| google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/kubeflow/UUID.proto: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2019 The Kubeflow Authors. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | */ 16 | 17 | syntax = "proto3"; 18 | 19 | package protobuf; 20 | 21 | import "google/protobuf/timestamp.proto"; 22 | import "google/protobuf/struct.proto"; 23 | 24 | // Instead of have project_uuid, run_uuid and etc, we have uuid with type 25 | // information. 26 | message UUID { 27 | string value = 1; 28 | Type type = 2; 29 | enum Type { 30 | UNKNOWN = 0; 31 | PROJECT = 1; 32 | RUN = 2; 33 | ARTIFACT_CONNECTION = 3; 34 | DATA_METADATA = 4; 35 | MODEL_METADATA = 5; 36 | EXECUTABLE_METADATA = 6; 37 | METRICS = 7; 38 | } 39 | } 40 | 41 | message Project { 42 | UUID id = 1; 43 | string name = 2; 44 | string description = 3; 45 | repeated UUID runs = 4; 46 | repeated UUID artifacts = 5; 47 | map annotations = 6; 48 | } 49 | 50 | message Run { 51 | UUID id = 1; 52 | string name = 2; 53 | string description = 3; 54 | UUID project = 4; 55 | repeated UUID artifacts = 5; 56 | map annotations = 6; 57 | } 58 | 59 | message ArtifactConnection { 60 | UUID id = 1; 61 | UUID first_artifact = 2; 62 | UUID second_artifact = 3; 63 | UUID run = 4; 64 | UUID project = 5; 65 | } 66 | 67 | message ArtifactMetadata { 68 | oneof metadata { 69 | DataMetadata data_metadata = 1; 70 | ExecutableMetadata executable_metadata = 2; 71 | ModelMetadata model_metadata = 3; 72 | } 73 | } 74 | 75 | message DataMetadata { 76 | UUID id = 1; 77 | string name = 2; 78 | string description = 3; 79 | string source = 4; 80 | string query = 5; 81 | string version = 6; 82 | google.protobuf.Timestamp ingestTime = 7; 83 | TimeRange timerange = 8; 84 | repeated UUID runs = 9; 85 | repeated UUID projects = 10; 86 | map annotations = 11; 87 | repeated UUID jobs = 12; 88 | } 89 | 90 | message ModelMetadata { 91 | UUID id = 1; 92 | string name = 2; 93 | string description = 3; 94 | string kind = 4; 95 | string version = 5; 96 | repeated string tags = 15; 97 | map hyperparameters = 6; 98 | Framework framework = 7; 99 | string storage_location = 8; 100 | google.protobuf.Timestamp create_ts = 14; 101 | repeated UUID metrics_ids = 9; 102 | UUID run = 10; 103 | UUID project = 11; 104 | map annotations = 12; 105 | repeated UUID jobs = 13; 106 | } 107 | 108 | message ExecutableMetadata { 109 | UUID id = 1; 110 | string name = 2; 111 | string description = 3; 112 | string repository = 4; 113 | string version = 5; 114 | repeated string tags = 6; 115 | google.protobuf.Timestamp create_ts = 7; 116 | repeated UUID runs = 8; 117 | repeated UUID projects = 9; 118 | map annotations = 10; 119 | repeated UUID jobs = 11; 120 | } 121 | 122 | message Metrics { 123 | UUID id = 1; 124 | UUID model = 2; 125 | UUID data = 3; 126 | UUID job = 4; 127 | string kind = 5; 128 
| string description = 8; 129 | map values = 6; 130 | map annotations = 7; 131 | } 132 | 133 | message Framework { 134 | string name = 1; 135 | string version = 2; 136 | } 137 | 138 | message TimeRange { 139 | google.protobuf.Timestamp start = 1; 140 | google.protobuf.Timestamp end = 2; 141 | } 142 | -------------------------------------------------------------------------------- /_prior_art/mlflow/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/_prior_art/mlflow/README.md -------------------------------------------------------------------------------- /_prior_art/modeldb/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/_prior_art/modeldb/README.md -------------------------------------------------------------------------------- /_prior_art/pachyderm/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/_prior_art/pachyderm/README.md -------------------------------------------------------------------------------- /_prior_art/seldon/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/_prior_art/seldon/README.md -------------------------------------------------------------------------------- /common/object.md: -------------------------------------------------------------------------------- 1 | # Object 2 | 3 | All system-managed objects generally have the following attributes and actions. 4 | 5 | ### Attributes 6 | 7 | - **Id** 8 | 9 | [*String, Required*] Id is a system-generated string, immutable and uniquely identifying an object. The system must ensure its uniqueness within a given workspace over the lifetime of the workspace. Usually [UUIDs (GUIDs)](https://en.wikipedia.org/wiki/Universally_unique_identifier) are used for this. 10 | 11 | Example: `123e4567-e89b-12d3-a456-426655440000` 12 | 13 | The maximum length is 40 characters. 14 | 15 | Note: 40 characters is sufficient to represent either UUIDs or SHA-1 hashes as human-readable strings. 16 | 17 | - **Name** 18 | 19 | [*String, Required*] 20 | 21 | Name is a string assigned by the user, uniquely identifying an object. It's mutable, and can be changed by the user as long as it stays unique within a specific scope at any moment in time among non-archived objects. The scope is usually the workspace, but in some cases uniqueness is enforced within a smaller scope; for example, artifact names are unique only within a run. 22 | 23 | Additional criteria: 24 | - It should consist of alphanumeric characters, hyphens (-), and underscores (_). 25 | - It should start with an alphanumeric character. 26 | - It should not contain spaces or any other special characters. 27 | - It should be case-insensitive for uniqueness checks. 28 | - The maximum length is 255 characters. 29 | 30 | Examples of valid names: 31 | - `my-dataset` 32 | - `model_v1` 33 | - `preprocessing_pipeline_2` 34 | 35 | - **Status** 36 | 37 | [*Enum, Required*] 38 | 39 | Allowed values: Active, Deprecated, Archived 40 | 41 | Not all statuses are applicable to every object type. 42 | 43 | - An *Active* object is visible in default lists, can be retrieved by name or id, and can be used in the system. 44 | - A *Deprecated* object is not visible in default lists, can be retrieved by name or id, and can be used in the system. The user might be warned that this object is deprecated and advised to use some other object instead. 45 | - An *Archived* object is not visible in default lists, can't be retrieved by name, and can't be used to execute new jobs. It can be retrieved by Id, and most of its metadata remains readable for the purpose of exploring the history of past executions. 46 | 47 | - **Description** 48 | 49 | [*String, Optional*]. 50 | 51 | An arbitrary text description for the given object. 52 | 53 | The maximum length is 32K characters. 54 | 55 | 56 | 57 | ### Actions 58 | 59 | - Create 60 | 61 | - Get 62 | 63 | - Archive 64 | 65 | The object is not deleted. Its name can be reused for some other object. 66 | 67 | - Rename 68 | 69 | 70 | 71 | ## Deletion 72 | 73 | Hard deletion is unsupported for system-managed objects. Objects can be archived to free up their name and hide them from default lists or the UX. However, objects are preserved so that users can later explore the history of their ML work, and so that other objects referencing the given object do not break. 74 |
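To make the attributes above concrete, here is a minimal, hypothetical sketch of what a system-managed object record could look like when serialized as YAML. The field names simply mirror the attributes described in object.md above; the values and the YAML layout itself are illustrative assumptions, not part of the spec.

```yaml
# Hypothetical serialization of a system-managed object (illustrative only)
id: 123e4567-e89b-12d3-a456-426655440000   # system generated, immutable, unique within the workspace
name: my-dataset                           # user assigned, unique among non-archived objects in its scope
status: Active                             # one of: Active, Deprecated, Archived
description: >
  Customer churn training data, refreshed weekly.
```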
-------------------------------------------------------------------------------- /data/artifact.md: -------------------------------------------------------------------------------- 1 | # Artifact 2 | 3 | *Artifacts* are pieces of data produced by runs. 4 | 5 | Artifacts have the following properties: 6 | 7 | - Id 8 | 9 | System-assigned id, unique within the workspace 10 | 11 | - Name 12 | 13 | User-assigned name, unique within the context of the run which produced it 14 | 15 | - Created By Run Id 16 | 17 | The [Run](run.md) which produced this artifact 18 | 19 | - Data Path 20 | 21 | Reference to the data in storage (includes Data Store, relative path, etc.) 22 | 23 | An Artifact can be promoted to a [DataSet](dataset.md). 24 | 25 | -------------------------------------------------------------------------------- /data/datapath.md: -------------------------------------------------------------------------------- 1 | # DataPath 2 | 3 | ### DataPath 4 | 5 | A DataPath is used to refer to something stored in a DataStore. A DataPath always contains a reference to a DataStore (which the user usually gives by name, and the system usually tracks by Id), plus DataStore-type-specific properties. In most cases this is a relative path in the given DataStore, but it can sometimes be more complex, such as a SQL table name or a SQL query used to retrieve data from a SQL database. 6 | 7 | Example DataPath: 8 | 9 | ```json 10 | { 11 | "DataStore": "MyCloudStorageContainer", 12 | "RelativePath": "/data/summerproject/1.txt" 13 | } 14 | ``` 15 | 16 | DataPaths don't have names, Ids, or other metadata. For those, see Artifacts and DataSets. 17 | 18 |
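The example above covers only the relative-path case. For a DataPath that points at a SQL-backed DataStore, a sketch might look like the following; the `SqlQuery` property name is an assumption made for illustration, since the spec only says that the properties beyond the DataStore reference are specific to the DataStore type.

```json
{
  "DataStore": "MyCustomerSqlDatabase",
  "SqlQuery": "SELECT id, age, income FROM customers WHERE signup_year >= 2018"
}
```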
-------------------------------------------------------------------------------- /data/dataset.md: -------------------------------------------------------------------------------- 1 | # Dataset 2 | 3 | Datasets represent named data stored within a Datastore. Conceptually, Datasets enable data to be used effectively in various scenarios by abstracting away the underlying storage (such as formatting and encoding) and access complexities (such as reading and transforming it into a meaningful form). 4 | 5 | # Dataset Version 6 | 7 | Datasets are versioned based on their definition (the location of the data and the steps to transform it). A Dataset definition can be a simple Datapath (a file path relative to a Datastore) or a complex Dataflow (DataPrep dataflow JSON). 8 | 9 | Any change to the definition necessitates the creation of a new version. The versioning concept here is the same as a source code commit in VSO or similar systems. 10 | 11 | # Archiving a Version 12 | 13 | Individual Dataset versions can be Archived when a version is not supposed to be used for any reason (such as the underlying data no longer being available). When an Archived Dataset version is used in a Pipeline, execution will be blocked with an error. No further actions can be performed on archived Dataset versions, but the references will be kept intact. 14 | 15 | # Deprecating a Version 16 | 17 | Individual Dataset versions can be deprecated when their usage is no longer recommended and a replacement is available. When a deprecated Dataset version is used in a Pipeline, a warning message is returned but execution is not blocked. 18 | 19 | # Reactivate a Version 20 | 21 | Dataset versions can be reactivated to be used again. 22 | 23 | # Profile 24 | 25 | The Profile of a Dataset version is the schema and various statistical measures of the underlying data at a point in time. The Profile of a Dataset version becomes stale when the underlying data changes. 26 | 27 | Note: At this point, Profile freshness can only be calculated for file-based Datastores (Azure Block Blob / File Blob / ADLS) 28 | 29 | # Dataset Checkpoint 30 | 31 | A checkpoint is a combination of a Profile and an optional materialized copy of the data itself, tied to a specific dataset version. 32 | 33 | When data gets materialized, a new profile is generated for the materialized data and used instead of the current profile, even though the current profile is still valid. This ensures the freshness of the Profile while the data is being materialized. 34 | 35 | Also, when a checkpoint with materialized data is used in execution/training, the materialized data gets used instead of the live data. 36 | 37 | # Dataset Views 38 | 39 | Dataset Views are simple user-defined filters on a Dataset version that select a subset of rows. Each Dataset View has one Dataflow filter expression. Refer to Dataflow expressions for more details. A sketch of a hypothetical view definition appears after the example dataset below. 40 | 41 | E.g.: Weather Dataset version 1 can point to all data between 2015 and 2018; the views would be Weather_2015, Weather_2016, and so on. 42 | -------------------------------------------------------------------------------- /data/dataset.yml: -------------------------------------------------------------------------------- 1 | mlSpecVersion: 0.0.1 2 | id: 592f0c1c-72ae-4236-9202-7e6aff1954f4 3 | location: https://internal.contoso.com/datasets/90d47e48-9105-49d0-a159-5c029f97ecd0 4 | description: Error Logs 2019-08-22 5 | dateGenerated: 2019-08-22 6 | isDeprecated: false 7 | isReadOnly: true 8 | isVisible: true 9 | name: foo 10 | preview: ref 11 | profile: ref 12 | state: active 13 | tags: 14 | - key: X 15 | value: '100' 16 | version: 1.0.1 17 | versionedBy: BobSmith 18 | versionedOn: 2019-08-21 19 | versionNotes: None 20 |
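Building on the Dataset Views description above, a hypothetical view definition might be serialized along the following lines. Every field name here is an illustrative assumption; the spec only requires that a view identifies the Dataset version it filters and carries a single Dataflow filter expression.

```yaml
# Hypothetical Dataset View sketch (field names are illustrative)
name: Weather_2016
dataset: weather
datasetVersion: 1
filterExpression: "year == 2016"   # exactly one Dataflow filter expression per view
```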
-------------------------------------------------------------------------------- /data/datastore.md: -------------------------------------------------------------------------------- 1 | # Datastore 2 | 3 | ### Datastore 4 | 5 | Datastores are objects which indicate the location of stored data. Examples of Datastores include a SQL database, a cloud storage account such as an Azure Blob container or AWS S3 bucket, or the DBFS file system in a Databricks cluster. A Datastore can be either a pointer to the data, or used to store various pieces of data organized into files, folders, SQL tables, etc., depending on its type. 6 | 7 | The Datastore object shouldn't be used for referencing a specific file or folder; it's intended to be a bigger container. DataPaths are used to refer to specific data (like a file or folder) in a Datastore. A Datastore can be imagined as a "place", and a DataPath as a "thing" stored in that "place". 8 | 9 | A Datastore has a user-assigned *name* and *type*, and a system-generated *id*. Depending on its type, a Datastore can have more properties. For example, a SQL Database Datastore can have properties such as a connection string and database name. 10 | 11 | The name is unique and assigned by the user; at any moment in time only a single Datastore can have a given name. The user can rename a Datastore. The Id is also unique, but immutable and system-generated. Usually the *Id* is a GUID. 12 | 13 | A Datastore can be archived. When it's archived, its name is freed up and can be used for another Datastore. 14 | -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | # Data tracking 2 | 3 | Data tracking is built on several abstractions: 4 | 5 | - [DataStore](datastore.md) 6 | 7 | A place for storing data. Could be a storage account, SQL database, HDFS filesystem, etc. 8 | 9 | - [DataPath](datapath.md) 10 | 11 | A thing stored in a DataStore. Always has a reference to a DataStore and some way to specify a location within it, for example a relative path. Could be a file, folder, table in a SQL database, etc. 12 | 13 | - [Artifact](artifact.md) 14 | 15 | A piece of data produced by a run; has an identifier (and can be stored in a registry of artifacts) and a DataPath. 16 | 17 | - [DataSet](dataset.md) 18 | 19 | Named and versioned data, with rich metadata (profile, schema, snapshots, etc.). Uses a DataPath to point to the actual data. -------------------------------------------------------------------------------- /docs/archive/README-old.md: -------------------------------------------------------------------------------- 1 | # MLSpec 2 | A project to standardize the intercomponent schemas for a multi-stage ML Pipeline 3 | 4 | # Prior Art 5 | Many of these concepts are inspired by the previous work from the [MLSchema project](http://ml-schema.github.io/documentation/ML%20Schema.html) (Mirrored here due to site instability) - 6 | - Paper - https://github.com/mlspec/MLSpec/blob/master/1807.05351.pdf 7 | - Website - https://github.com/mlspec/MLSpec/blob/master/ML%20Schema%20Core%20Specification.pdf 8 | 9 | # Background 10 | The machine learning industry has embraced the concept of cloud-native architectures, made up of multiple component parts loosely coupled together. One of the issues with this approach, however, has been that while the steps of a machine learning pipeline have been fairly well articulated in a wide variety of publications, the specifications for how to wire these steps together remain highly varied, which makes it difficult to build any standard tools that might simplify or formalize machine learning operations. 11 | 12 | This project is about establishing community-driven standards that automated tooling can consume and output. Ideally, this enables the next opportunity around standardized ML software engineering practices.
13 | 14 | # Existing Multi-Stage ML Workflows 15 | The below provide inspiration as projects which focus on ML and batch solutions: 16 | 17 | - [Facebook’s FBLearner Flow](https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/) 18 | - [Google’s TFX Paper](https://dl.acm.org/citation.cfm?id=3098021) 19 | - [Kubeflow Pipelines](https://cloud.google.com/blog/products/ai-machine-learning/getting-started-kubeflow-pipelines) 20 | - [Microsoft Azure ML Pipelines](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) 21 | - [Netflix Meson](https://medium.com/netflix-techblog/meson-workflow-orchestration-for-netflix-recommendations-fc932625c1d9) 22 | - [Spotify’s Luigi](https://github.com/spotify/luigi) 23 | - [Uber’s Michelangelo](https://eng.uber.com/michelangelo/) [[paper](http://proceedings.mlr.press/v67/li17a/li17a.pdf)] 24 | 25 | From these papers, we feel the following steps summarize all the steps in an end-to-end machine learning workflow. 26 | 27 | # Proposed standards 28 | We propose a standard around the following components. 29 | 30 | - Workflow orchestration – what are the standard endpoints that each step in an ML workflow require (e.g. /ok, /varz, /metrics, etc) 31 | - Model - … 32 | - Logging – what is the NCSA standard log for each inference request? 33 | - Other… 34 | 35 | # End-to-End Complete Lifecycle 36 | We feel that over time, every stage of an ML lifecycle will need some form of metadata management. The below represent a collection of these steps: 37 | 38 | 1. *Codify Objectives* - Detail the model outputs, possible errors and minimum success for launching in code; a simple DSL that can be used to verify success/failure programmatically for automated deployment 39 | 2. *Data Ingestion* - What tools/connectors (e.g. ODBC, Spark, HDFS, CSV, etc) were used for pulling in data; what queries were used (including signed datasets); sharding strategies; May include labelling or synthetic data generation/simulation. 40 | 3. *Data Analysis* - Set of descriptive statistics on the included features and configurable slices of the data. Identification of outliers. 41 | 4. *Data Transformation* - What data conversions and feature wrangling (e.g. feature to-integer mappings) were used; what outliers were programmatically eliminated 42 | 5. *Data Validation* - What validation was applied to the data based on a versioned, succinct description of the expected properties of the data; schema can also be used to prevent bad behavior, such as training on deprecated data; mechanisms to generate the first version of the schema (e.g. select * from foo limit 30) that can be used to drive other platform components, e.g., automatic feature-engineering or data-analysis tools. 43 | 6. *Data Splitting (including partitioning)* - How the data is split into training, validation, hold back & debugging sets and records and gets results of validation for statistics of each set; metadata here may be be used to detect leakage of training data into testing data and/or overfit 44 | 7. *Model Training/Tuning* - Metadata about how the model is packaged and the distribution strategy; hyperparameters searched and results of the search; results of any conversions to other model serving format (e.g. TF -> ONNX); techniques used to quantize/compress/prune model and the results 45 | 8. 
*Model Evaluation/Validation* - Result of evaluation and validation of model to ensure they meet original codified objectives before serving them to users; computation of metrics on slices of data, both for improving performance and avoiding bias (e.g. gender A gets significantly better results than gender B); source of data used for validation 46 | 9. *Test* - Results of final confirmation for model on the hold back data set; MUST BE SEPARATE STEP FROM #8; source of data used for final test 47 | 10. *Model Packaging* - Metadata about model package; includes adding additional security constraints, monitoring agents, signing, etc.; descriptions of the necessary infrastructure (e.g. P100s, 16 GB of RAM, etc) 48 | 11. *Serving* - Results of rolling model out to production 49 | 12. *Monitoring* - Live queryable metadata that provides liveness checking and ML-specific signals that need action, such as significant deviation from previous model performance or degradation of the model performance over time; ideally includes rollback strategy (e.g. if this model is failing, use model last-year.last-month.pkl) 50 | 13. *Logging* - NCSA-style record per inference request, including a cryptographically secure record of the version of the pipeline (including features) and data used to train. 51 | 52 | # Table of contents for MLSpec repo 53 | 54 | - [common](./common) 55 | 56 | - [object](./common/object.md) 57 | 58 | General notes applicable to multiple objects in the system. How they are identified and named, basic operations, etc. 59 | 60 | - [data](./data) 61 | 62 | - [datastore](datastore.md) 63 | 64 | Data storages 65 | 66 | - [datapath](./data/datapath.md) 67 | 68 | Data references 69 | 70 | - [artifact](artifact.md) 71 | 72 | Data produced by runs 73 | 74 | - [dataset](./data/dataset.md) 75 | 76 | Named and versioned data in storage 77 | 78 | - [pipelines](./pipelines) 79 | 80 | - [pipeline](pipeline.md) 81 | 82 | DAG for executing computation on data and training and deploying models 83 | 84 | - [module](module.md) 85 | 86 | Reusable definition of computation, includes script, set of expected inputs, outputs, etc. 
87 | 88 | - [experiment_tracking](./experiment_tracking) 89 | 90 | - [run](./experiment_tracking/run.md) 91 | 92 | Tracked execution of pipeline or single script on compute 93 | 94 | - [model_packaging](./model_packaging) 95 | 96 | - [models](./model_packaging/README.md) 97 | 98 | Trained models 99 | 100 | - logging_proto 101 | 102 | - monitoring_proto 103 | 104 | - [metadata_file](./metadata_file) 105 | 106 | - [metadata](./metadata_file/metadata.yaml) 107 | 108 | The metadata file used to recreate the ML workflow 109 | -------------------------------------------------------------------------------- /docs/assets/logos/mlspec_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/docs/assets/logos/mlspec_logo.png -------------------------------------------------------------------------------- /docs/assets/logos/mlspec_logo_light.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/docs/assets/logos/mlspec_logo_light.png -------------------------------------------------------------------------------- /experiment_tracking/README.md: -------------------------------------------------------------------------------- 1 | # Experiment Tracking 2 | 3 | Experiment tracking is handled by capturing [runs](run.md), their metadata and relationships to other objects. 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------- /experiment_tracking/experiment_example.yml: -------------------------------------------------------------------------------- 1 | mlSpecVersion: 0.0.1 2 | id: 592f0c1c-72ae-4236-9202-7e6aff1954f4 3 | project_id: 5999460a-86e4-4b3e-b48d-b4c73b9d73b0 4 | experiment_id: 3eb00c1e-e06a-4a50-b911-54e6fd4d5746 5 | name: 'test experiment run' 6 | submittedDate: '2019-01-01T000000' 7 | submitterId: 'GUID' 8 | hyperparameters: 9 | - key: C 10 | value: '100' 11 | - key: solver 12 | value: lbfgs 13 | - key: max_iter 14 | value: '1000' 15 | artifacts: 16 | key: model 17 | path: output/census_logreg_simple.gz 18 | artifact_type: MODEL 19 | datasets: 20 | key: input_data 21 | path: data/credit-default.csv 22 | artifact_type: DATA 23 | metrics: 24 | key: accuracy 25 | value: '0.7787333333333334' 26 | value_type: NUMBER 27 | tags: 28 | - key: origin 29 | value: ENUS 30 | properties: 31 | - key: systemId 32 | value: foo 33 | -------------------------------------------------------------------------------- /experiment_tracking/run.md: -------------------------------------------------------------------------------- 1 | # Run 2 | 3 | Run is a single execution of some 'job', such as script, pipeline, etc. Once executed, it can't execute again. 4 | 5 | Runs have various metadata, metrics, artifacts, etc. associated with them. Also runs have references to other runs and other objects in the system 6 | 7 | ### Relationships 8 | 9 | Runs can have parent-child relationships with other runs. These relationships help establish a hierarchical structure and capture dependencies between runs. 10 | 11 | - Parent Run: A run can have a parent run, indicating that it is a part of a larger workflow or a subsequent run in a series of related runs. The parent run acts as a container or grouping mechanism for its child runs. 12 | - Child Runs: A run can have multiple child runs, representing sub-tasks or sub-components of the parent run. 
Child runs are typically spawned by the parent run and can inherit certain properties or configurations from the parent. 13 | The parent-child relationships can be captured using the following fields: 14 | 15 | - parent_run_id: The ID of the parent run, if applicable. This field establishes the link between a child run and its parent run. 16 | - child_run_ids: An array of IDs representing the child runs associated with the current run. This field allows for tracking the child runs spawned by the current run. 17 | By capturing these relationships, you can create a hierarchical structure of runs, enabling better organization, traceability, and understanding of the dependencies between different parts of the ML workflow. 18 | 19 | ## Tags 20 | 21 | Tags are key-value pairs that provide additional metadata and annotations for a run. They allow for flexible categorization, filtering, and querying of runs based on specific attributes or characteristics. 22 | 23 | * Tags can be used to label runs with meaningful information, such as the purpose of the run, the algorithm used, the dataset version, or any other relevant contextual information. 24 | * Each tag consists of a key and a value, both represented as strings. 25 | * Multiple tags can be associated with a single run, allowing for rich metadata capture. 26 | * Tags can be used to group related runs, filter runs based on specific criteria, or provide additional context for analysis and interpretation. 27 | 28 | Example tags: 29 | 30 | ``` 31 | - key: "experiment_type" 32 | value: "hyperparameter_tuning" 33 | - key: "dataset_version" 34 | value: "v2.0" 35 | - key: "algorithm" 36 | value: "random_forest" 37 | ``` 38 | 39 | ## Metrics 40 | 41 | Metrics are quantitative measurements or outcomes associated with a run. They capture the performance, accuracy, or other relevant numerical values that are recorded during the execution of a run. 42 | 43 | * Metrics provide a way to track and compare the performance of different runs or experiments. 44 | * Each metric consists of a key, which represents the name or identifier of the metric, and a value, which is the numerical value associated with the metric. 45 | * Metrics can be of different types, such as scalar values (e.g., accuracy, loss), arrays or vectors (e.g., precision-recall curve), or even dictionaries or JSON objects for more complex metrics. 46 | * Metrics can be captured at different points during the run, such as at regular intervals or at the end of the run. 47 | 48 | Example metrics: 49 | 50 | ``` 51 | metrics: 52 | - key: "accuracy" 53 | value: 0.87 54 | value_type: "scalar" 55 | - key: "precision_recall_curve" 56 | value: [0.92, 0.88, 0.85, 0.80] 57 | value_type: "array" 58 | - key: "confusion_matrix" 59 | value: { 60 | "true_positive": 100, 61 | "true_negative": 200, 62 | "false_positive": 20, 63 | "false_negative": 30 64 | } 65 | value_type: "object" 66 | ``` 67 | 68 | ## Artifacts 69 | 70 | Artifacts are data produced by a run. A single run can produce multiple artifacts. Artifacts are always stored as blobs identified by DataPaths - those could be files, folders, SQL tables, etc. Because Artifacts always point to data using DataPaths, they always represent data stored in Datastores. Different artifacts for a given run can be stored in different Datastores. Some artifacts can be published as (or promoted to) DataSets. 71 | 72 | See [Artifact](artifact.md) for more information. 73 | 74 | ### Logs 75 | 76 | Logs are represented as artifacts. 
A run can have several log objects - the usual ones are stdout, stderr, the driver log, etc. They are used to capture traces of the executing job. Depending on the implementation, logs can be raw text, HTML, or some other text format. 77 | 78 | -------------------------------------------------------------------------------- /logging_proto/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mlspec/MLSpec/c4fe68b0d4d62d61ff56e434b54af383e09abf27/logging_proto/README.md -------------------------------------------------------------------------------- /logging_proto/inferenceLog.yml: -------------------------------------------------------------------------------- 1 | applicationInfo: 2 | - applicationId 3 | - applicationEndpoint 4 | requestInfo: 5 | - requestId 6 | - requestTimestamp 7 | - requestLatency 8 | modelInfo: 9 | models: 10 | type: array 11 | items: 12 | type: string 13 | -------------------------------------------------------------------------------- /metadata_file/README.md: -------------------------------------------------------------------------------- 1 | # [WIP] Metadata_file 2 | This folder contains a metadata file showing what we think the metadata for an ML job should look like. We hope to recreate the ML job based on this metadata file. We started the first version of the metadata file based on the Kubeflow MNIST example, so for now we focus only on this one particular example. This metadata file will be expanded to adapt to more scenarios in the ML world, and we are happy to hear your advice. 3 | 4 | Feel free to submit an issue or PR if you have any questions or advice. 5 | 6 | 7 | # [WIP] Abstractions 8 | There are 6 sections in the metadata description: 9 | 10 | 1. *framework* - the name and version of the framework; optionally also contains the runtime and other supporting files 11 | 2. *model* - the model used for this job 12 | 3. *dataset* - dataset acquisition, storage, loading and other dataset processing 13 | 4. *data_process* - different data processing functions are applied based on different scenarios. Right now only some basic MNIST image classification functions are included 14 | 5. *model_architecture* - includes input, output and other model architecture definitions 15 | 6. *training_params* - parameters for training; includes lr, loss, batch_size, etc. 16 | 17 | 18 | -------------------------------------------------------------------------------- /metadata_file/metadata.yaml: -------------------------------------------------------------------------------- 1 | framework: 2 | - name: tensorflow 3 | - version: 1.12.0 4 | - runtime: python2.7 5 | - requirement(optional): 6 | - numpy: 1.14.0 7 | - grpc: 0.3.post19 8 | 9 | model: 10 | - name: image-recognition 11 | - version: 1.0 12 | - source : http://... or test.py 13 | - creator: Xiyuan_Wang 14 | - time: 2019-02-26 15 | - type: Enum(Keras, Graph...) 16 | 17 | dataset: 18 | - name: mnist 19 | - version: 1.0 20 | - source: http://... or dataset.zip 21 | 22 | # data_process differs in various ways; this is particularly for the mnist example 23 | # we suggest implementing this with a base + plugin structure.
Base contains the most basic keyword definition like 24 | # the dataset_split while plugin contains functions in different scenarios like image recognition, voice 25 | # recognition and so on 26 | data_process: 27 | - data_load: 28 | - data_split: 29 | - padding: 30 | - truncating: 31 | - type: enum(Image, NLP, voice) 32 | - key1: value1 33 | - key2: value2 34 | 35 | # the model architecture combines with 'model' if there already exists the model file, or this section may 36 | # be useful with the Dynamic Computation Graphs? 37 | model_architecture: 38 | - input: 39 | - fully_connected_layer: 40 | - output: 41 | - dropout: 42 | - embedding: 43 | - batch_normalization: 44 | 45 | 46 | training_params: 47 | - learning_rate: 48 | - loss: 49 | - batch_size: 50 | - epoch: 51 | - optimizer: 52 | - xxx 53 | - yyy 54 | - train_op: 55 | -------------------------------------------------------------------------------- /model_packaging/README.md: -------------------------------------------------------------------------------- 1 | # Model Packaging 2 | Models may come from a variety of places: 3 | - Forked from a baseline experiment with sources tracked in a Git repository 4 | - Developed in a corporate environment's compliant experimentation system 5 | - Shared as part of a larger Model Ensemble. 6 | 7 | Models may need to be consumed by several different upstream inferencing stacks. 8 | We should be able to track rich metadata around where models came from, what they are capable of, and where they are running. 9 | 10 | # Why is this metadata important? 11 | You are a bank being sued for discrimination in how you chose loan recipients via an ML model. 12 | Prove that the model you used was (a) not tampered with (b) had an audit trail for training and (c) the data set was unbiased. 13 | 14 | # Key Requirements 15 | - Track paper trail requirements for outputs produced in training pipelines. 16 | - Enable easy registration and updates of models across a variety of storage formats. 17 | - Provide a fluent command line experience / set of REST APIs for registering models. 18 | - Express model schema and service schema (how do I use the model?) 19 | - Support a model policy store 20 | - Simplify operationalization of a model 21 | - Track model lineage 22 | - Which code / data / experiment produced a model 23 | - Pipelines used to produce a model 24 | - Other models used in the inference process for a model (composites) 25 | - At minimum, capture the following: 26 | - code 27 | - data 28 | - config (conda env, base docker image, … ) 29 | - Track metrics associated with a model 30 | - Make it easy to compare metrics across different versions of a model 31 | - Expose available deployment paths for inference 32 | - What can we convert it to? Where can it be deployed to? 33 | - New code or new model as a trigger for Deployment 34 | - Change management integrated into the pipeline directly 35 | - Track breaking changes on models through semantic versioning 36 | -------------------------------------------------------------------------------- /model_packaging/data.yaml: -------------------------------------------------------------------------------- 1 | # data (Optional) 2 | # source_id: (Optional) Extension file id regarding the data source. 3 | # domain: (Optional) Metadata about the data domain. 
4 | # website: (Optional) Links to the data description 5 | # license: (Optional) Data license 6 | 7 | data: 8 | source_id: IMDB-WIKI 9 | domain: "Image" 10 | website: "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/" 11 | license: "Apache 2.0" 12 | -------------------------------------------------------------------------------- /model_packaging/model.yaml: -------------------------------------------------------------------------------- 1 | # name: (Required) name of this model file 2 | # description: (Optional) description of this model file 3 | # author: (Required for trainable) 4 | # name: (Required for trainable) name of this training job's author 5 | # email: (Required for trainable) email of this training job's author 6 | # framework: (Required) 7 | # name: (Required) ML/DL framework format that the model is stored as. 8 | # version: (Optional) Framework version used for this model 9 | # runtimes: (Required for trainable) 10 | # name: (Required for trainable) programming language for the model runtime 11 | # version: (Required for trainable) programming language version for the model runtime 12 | # license: (Optional) License for this model. 13 | # domain: (Optional) Domain metadata for this model. 14 | # purpose: (Optional) Purpose of this model, e.g. binary_classification 15 | # binary_classification - Binary classification 16 | # multiclass_classification - Multiclass classification 17 | # regression_prediction - Regression – Prediction 18 | # regression_recognition - Recognition – Detection 19 | # website: (Optional) Links that explain this model in more details 20 | # labels: (Optional list) labels and tags for this model 21 | # - url: (Optional) Link to the ML/DL model page. 22 | # - pipeline_uuids: (Optional) Linkage with a list of execuable pipelines. 23 | 24 | name: Facial Age estimator Model 25 | model_identifier: facial-age-estimator 26 | description: Sample Model trained to classify the age of the human face. 27 | author: 28 | name: DL Developer 29 | email: "me@ibm.com" 30 | framework: 31 | name: "tensorflow" 32 | version: "1.13.1" 33 | runtimes: 34 | name: python 35 | version: "3.5" 36 | 37 | license: "Apache 2.0" 38 | domain: "Facial Recognition" 39 | purpose: 40 | website: "https://developer.ibm.com/exchanges/models/all/max-facial-age-estimator" 41 | labels: 42 | - url: 43 | - pipeline_uuids: ["abcd1234"] 44 | -------------------------------------------------------------------------------- /model_packaging/model_example.yml: -------------------------------------------------------------------------------- 1 | specVersion: 0.0.1 2 | # We need semantic versioning to indicate breaking changes 3 | 4 | runtimeArtifacts: 5 | # Collection of files (asset) required to instantiate the model 6 | 7 | example: http://github.com/example 8 | # Per artifact, we track 9 | # - location is URI which can point to a variety of stores (ADLS / Git / Blob store / ACR path / …) 10 | # - (optional) relativePath (if it needs to be laid out specially when turning the model into a service) 11 | 12 | framework: 13 | name: tensorflow 14 | # (scikit / pytorch / sparkml / onnx / MLnet / etc.) 
15 | version: 1.7 16 | # Provides specific optimizers / value add for known flavors of models 17 | # Version of model flavor (tf 1.7 / 1.8) 18 | custom: 19 | # Also provides a “custom” option 20 | 21 | accelerator: V100 22 | 23 | time_created: 2019-01-12T22:53:18+00:00 24 | # when model was created 25 | 26 | created_by: sam@rockwell.com 27 | purpose: binary_classification 28 | # binary_classification - Binary classification 29 | # multiclass_classification - Multiclass classification 30 | # regression_prediction - Regression – Prediction 31 | # regression_recognition - Recognition – Detection 32 | 33 | clustering: xxx 34 | dimensionality_reduction: xxx 35 | # Clustering and dimensionality reduction 36 | 37 | custom: 38 | # Open extensible field by model type 39 | # Composition - Describes the collection of models used for Composite Model 40 | origin: 41 | # Origin (collection of artifacts) 42 | codeAssets: xxx 43 | dataAssets: xxx 44 | created_by: xxx 45 | pipeline: xxx 46 | logs: xxx 47 | expiry: xxx 48 | 49 | metadata: 50 | # Optional section 51 | schema: 52 | #Specify the features required for scoring with the model (input, type, shape) 53 | 54 | link_to_dataset: xxx 55 | # Linkage to model dataset / profile 56 | 57 | features: xxx 58 | # Captures model features 59 | 60 | service_schema: xxx 61 | # Service schema (generate Swagger client) 62 | output_schema: xxx 63 | # Output format, in contract (with same type of validation check as the inputs) (could explore compatible version ranges, etc.) 64 | metrics: 65 | key: value 66 | latency: 0.01 67 | # key/value pairs with user defined metrics (from training) 68 | # Model performance metrics (latency to inference) 69 | sampleInputs: http://uri/inputs.txt 70 | tags: 71 | # Arbitrary key / value pairs 72 | 73 | key: value 74 | # Used to express anything not specified in the loose typing above -------------------------------------------------------------------------------- /model_packaging/model_onnx_conversion.yaml: -------------------------------------------------------------------------------- 1 | # convert: (Optional) 2 | # onnx_convertable: (Optional) Enable convertion to ONNX format. 3 | # The model needs to be either trainable or servable. Default: False 4 | # model_source: (Required for onnx_convertable) Model binary path that needs the format conversion. 5 | # data_store: (Required) datastore for the model source 6 | # initial_model: 7 | # bucket: (Required) Bucket that has the model source 8 | # path: (Required) Bucket path that has the model source 9 | # url: (Optional) Link to the model 10 | # onnx_model: 11 | # bucket:(Required) Bucket to store the onnx model 12 | # path: (Required) Bucket path to store the onnx model 13 | # url: (Optional) Link to the converted model 14 | # tf_inputs: (Required for TensorFlow model) Input placeholder and shapes of the model. 15 | # tf_outputs: (Required for TensorFlow model) Output placeholders of the model. 
16 | # tf_rtol: (Optional) Relative tolerance for TensorFlow 17 | # tf_atol: (Optional) Absolute tolerance for TensorFlow 18 | 19 | convert: 20 | onnx_convertable: true 21 | model_source: 22 | initial_model: 23 | data_store: age_datastore 24 | bucket: facial-age-estimator 25 | path: 2.0/assets/model.pt 26 | url: "" 27 | initial_model_local: 28 | path: /local/1.0/assets/ 29 | onnx_converted_model: 30 | onnx_model: 31 | data_store: age_datastore 32 | bucket: facial-age-estimator 33 | path: 3.0/assets/model.onnx 34 | url: "" 35 | onnx_model_local: 36 | path: /local/1.0/assets/ 37 | tf_inputs: 38 | "X:0": [1] 39 | tf_outputs: 40 | - pred:0 41 | tf_rtol: 0 42 | tf_atol: 0 43 | data_stores: 44 | - name: age_datastore 45 | type: s3 46 | connection: 47 | endpoint: https://s3-api.us-geo.objectstorage.softlayer.net 48 | access_key_id: xxxxxxxxxx 49 | secret_access_key: xxxxxxxxxxxxx 50 | 51 | # data_stores_file_paths: (Optional) - 52 | # - name: (Required) name of the data_store_file_path 53 | # key: value 54 | 55 | data_store_file_paths: 56 | - name: scoring_file_paths 57 | feature_file: 2.0/assets/features.csv 58 | input_schema_file: 2.0/assets/input_schema.json 59 | output_schema_file: 2.0/assets/output_schema.json 60 | sample_inputs_file: 2.0/assets/scoring_inputs.json 61 | 62 | # container_stores: (Optional) 63 | # - name: (Required) name of the container_store 64 | # connection: 65 | # container_registry: (Required) container registry for this container_store 66 | # container_registry_token: (Required if container registry is private) container registry token 67 | 68 | container_stores: 69 | - name: container_store 70 | connection: 71 | container_registry: docker.io 72 | container_registry_token: "" 73 | 74 | -------------------------------------------------------------------------------- /model_packaging/model_packaging.md: -------------------------------------------------------------------------------- 1 | # Model Packaging YAML file 2 | 3 | The yaml file described here allows you to 'register' a model with an AI/ML platform. The goal is to define the overall Model metadata in a standard way so that you can bring a model into the workflow system at any point in its lifecycle, if you so desire. For example, someone might just want to train a model, while someone else might have trained a model elsewhere and would just like to serve it. If they enter during the training phase, then some parts of the serving template should be filled in automatically. Someone else might only need to convert their model to ONNX format and deploy it. 4 | 5 | Even for serving, there are different ways to describe your Model depending on how you have packaged it (as a container, or split into pre-processing, prediction and post-processing, which in turn differs across model types). 6 | 7 | The training section is the most evolved and can handle multiple use cases. For serving, this starts with a sample for a container-based Model, but we hope to evolve it in the future. 8 | 9 | ## General Model Metadata 10 | 11 | ``` 12 | name: (Required) name of this model file 13 | description: (Optional) description of this model file 14 | author: (Required for trainable) 15 | name: (Required for trainable) name of this training job's author 16 | email: (Required for trainable) email of this training job's author 17 | framework: (Required) 18 | name: (Required) ML/DL framework format that the model is stored as.
19 | version: (Optional) Framework version used for this model 20 | runtimes: (Required for trainable) 21 | name: (Required for trainable) programming language for the model runtime 22 | version: (Required for trainable) programming language version for the model runtime 23 | labels: (Optional list) labels and tags for this model 24 | - url: (Optional) Link to the ML/DL model page. 25 | - pipeline_uuids: (Optional) Linkage with a list of execuable pipelines. 26 | license: (Optional) License for this model. 27 | domain: (Optional) Domain metadata for this model. 28 | purpose: (Optional) Purpose of this model, e.g. binary_classification 29 | website: (Optional) Links that explain this model in more details 30 | ``` 31 | ## Information Required for Model Training 32 | 33 | ``` 34 | train: (optional) 35 | trainable: (optional) Indicate the model is trainable. Default: False 36 | tested_platforms(optional list): platform on which this model can trained (current options: wml, ffdl, kubeflow) 37 | model_source: (Required for trainable) 38 | initial_model: (Required for trainable) 39 | data_store: (Required) datastore for the model code source 40 | bucket: (Required) Bucket that has the model code source 41 | path: (Required) Bucket path that has the model code source 42 | url: (Optional) Link to the model 43 | initial_model_local: (Optional) 44 | path: (Optional) Initial model code in the user local machine 45 | model_training_results: (Required for trainable) 46 | trained_model: (Required for trainable) 47 | data_store: (Required) datastore for the training result source 48 | bucket: (Required) Bucket that has the training result source 49 | path: (Required) Bucket path that has the training result source 50 | url: (Optional) Link to the model 51 | trained_model_local: (Optional) 52 | path: (Optional) Path to pull trained model in the user local machine 53 | data_source: (Optional) 54 | training_data: (Required for trainable) 55 | data_store: (Required) datastore for the model data source 56 | bucket: (Required) Bucket that has the model data source 57 | path: (Required) Bucket path that has the model data source 58 | url: (Optional) Link to the model 59 | training_data_local: (Optional) 60 | path: (Optional) Initial data files in the user local machine 61 | mount_type: (Required) object storage mount type 62 | evaluation_metrics: (optional) Define the metrics for the training job. 63 | type: (Required) evaluation_metrics type 64 | in: (Required) Path to store the evaluation_metrics 65 | training_container_image: (Optional) 66 | container_image_url: (Optional) Custom training container image url 67 | container_store: (Optional) container_store for the custom training image 68 | execution: (Required for trainable) 69 | command: (Required) Entrypoint commands to execute model code 70 | name: (Required) T-shirt size for training on Watson Machine Learning 71 | nodes: (Required) Number of nodes needed for this training job. Default: 1 72 | training_params: (Optional) list of hyperparameters for the training model 73 | - (optional) list of key(param name):value(param value) 74 | ``` 75 | 76 | ## Information required for Model Serving 77 | 78 | ``` 79 | serve: (Optional) 80 | servable: (Optional) Indicate the model is servable without training. 
Default: False 81 | tested_platforms (optional list): platform on which this model can served (current options: kubernetes, knative, seldon, wml, kfserving) 82 | model_source: (Optional) - (Required if servable is true) 83 | servable_model: (Required for s3 or url type) 84 | data_store: (Required for s3 type) datastore for the model source 85 | bucket: (Required for s3 type) Bucket that has the model source 86 | path: (Required for s3 type) Source path to the model 87 | url: (Required for url type) Source URL for the model 88 | servable_model_local: (Optional) 89 | path: (Optional) Servable model path in the user local machine 90 | serving_container_image: (Required for container type) 91 | container_image_url: (Required for container type) Container image to serve the model. 92 | container_store: (Optional) container_store name 93 | ``` 94 | 95 | ## Information required for Model Scoring 96 | 97 | ``` 98 | score: (Optional) 99 | scorable: (Optional) Indicate the model is scorable. Default: False 100 | model_feature_schema_source: (Required if scorable is true) 101 | scorable_model: (Required for s3 or url type) 102 | data_store: (Required for s3 type) datastore for the model source 103 | bucket: (Required for s3 type) Bucket that has the model source 104 | data_store_file_paths: (Required for s3 type) Source path to the model schema, features and test files 105 | url: (Required for url type) Source URL for the model 106 | secorable_model_local: (Required if local) 107 | path: (Optional) Servable model path in the user local machine 108 | metrics: Metrics for scoring 109 | ``` 110 | 111 | ## Data Metedata 112 | 113 | ``` 114 | data (Optional) 115 | source_id: (Optional) Extension file id regarding the data source. 116 | domain: (Optional) Metadata about the data domain. 117 | website: (Optional) Links to the data description 118 | license: (Optional) Data license 119 | ``` 120 | 121 | ## Data Location 122 | 123 | ``` 124 | data_stores: (Optional) - (Required for trainable) 125 | - name: (Required) name of the data_stores 126 | connection: 127 | endpoing: (Required) Object Storage endpoint URL or public Object Storage key link. 128 | access_key_id: (Required) Object Storage access_key_id 129 | secret_access_key: (Required) Object secret_access_key 130 | ``` 131 | 132 | ## File paths for Data Location 133 | 134 | ``` 135 | data_stores_file_paths: (Optional) - 136 | - name: (Required) name of the data_store_file_path 137 | key: value 138 | ``` 139 | 140 | ## Process - Mixin steps like training post process, serving pre process can be added 141 | 142 | ``` 143 | process: (Optional) 144 | - name: (Required) Script Process name. Can mix any kind of process here 145 | params: (Optional) Free flowing list of key:value paisrs 146 | staging_dir: (Optional) Staging directory within the local machine 147 | trained_model_path: (Optional) trained model path within the object storage bucket 148 | ``` 149 | ## Location for Docker container registry 150 | 151 | ``` 152 | container_stores: (Optional) 153 | - name: (Required) name of the container_store 154 | connection: 155 | container_registry: (Required) container registry for this container_store 156 | container_registry_token: (Required if container registry is private) container registry token 157 | ``` 158 | 159 | ## Data required for Model conversion to ONNX format 160 | 161 | ``` 162 | convert: (Optional) 163 | onnx_convertable: (Optional) Enable convertion to ONNX format. 164 | The model needs to be either trainable or servable. 
Default: False 165 | model_source: (Required for onnx_convertable) Model binary path that needs the format conversion. 166 | data_store: (Required) datastore for the model source 167 | initial_model: 168 | bucket: (Required) Bucket that has the model source 169 | path: (Required) Bucket path that has the model source 170 | url: (Optional) Link to the model 171 | onnx_model: 172 | bucket:(Required) Bucket to store the onnx model 173 | path: (Required) Bucket path to store the onnx model 174 | url: (Optional) Link to the converted model 175 | tf_inputs: (Required for TensorFlow model) Input placeholder and shapes of the model. 176 | tf_outputs: (Required for TensorFlow model) Output placeholders of the model. 177 | tf_rtol: (Optional) Relative tolerance for TensorFlow 178 | tf_atol: (Optional) Absolute tolerance for TensorFlow 179 | ``` 180 | -------------------------------------------------------------------------------- /model_packaging/model_packaging.yaml: -------------------------------------------------------------------------------- 1 | # name: (Required) name of this model file 2 | # description: (Optional) description of this model file 3 | # author: (Required for trainable) 4 | # name: (Required for trainable) name of this training job's author 5 | # email: (Required for trainable) email of this training job's author 6 | # framework: (Required) 7 | # name: (Required) ML/DL framework format that the model is stored as. 8 | # version: (Optional) Framework version used for this model 9 | # runtimes: (Required for trainable) 10 | # name: (Required for trainable) programming language for the model runtime 11 | # version: (Required for trainable) programming language version for the model runtime 12 | 13 | name: Facial Age estimator Model 14 | model_identifier: facial-age-estimator 15 | description: Sample Model trained to classify the age of the human face. 16 | author: 17 | name: DL Developer 18 | email: "me@ibm.com" 19 | framework: 20 | name: "tensorflow" 21 | version: "1.13.1" 22 | runtimes: 23 | name: python 24 | version: "3.5" 25 | 26 | # labels: (Optional list) labels and tags for this model 27 | # - url: (Optional) Link to the ML/DL model page. 28 | # - pipeline_uuids: (Optional) Linkage with a list of execuable pipelines. 29 | # license: (Optional) License for this model. 30 | # domain: (Optional) Domain metadata for this model. 31 | # purpose: (Optional) Purpose of this model, e.g. binary_classification 32 | # binary_classification - Binary classification 33 | # multiclass_classification - Multiclass classification 34 | # regression_prediction - Regression – Prediction 35 | # regression_recognition - Recognition – Detection 36 | # website: (Optional) Links that explain this model in more details 37 | 38 | license: "Apache 2.0" 39 | domain: "Facial Recognition" 40 | purpose: 41 | website: "https://developer.ibm.com/exchanges/models/all/max-facial-age-estimator" 42 | labels: 43 | - url: 44 | - pipeline_uuids: ["abcd1234"] 45 | 46 | # train: (optional) 47 | # trainable: (optional) Indicate the model is trainable. 
Default: False 48 | # tested_platforms(optional list): platform on which this model can trained (current options: wml, ffdl, kubeflow) 49 | # model_source: (Required for trainable) 50 | # initial_model: (Required for trainable) 51 | # data_store: (Required) datastore for the model code source 52 | # bucket: (Required) Bucket that has the model code source 53 | # path: (Required) Bucket path that has the model code source 54 | # url: (Optional) Link to the model 55 | # initial_model_local: (Optional) 56 | # path: (Optional) Initial model code in the user local machine 57 | # model_training_results: (Required for trainable) 58 | # trained_model: (Required for trainable) 59 | # data_store: (Required) datastore for the training result source 60 | # bucket: (Required) Bucket that has the training result source 61 | # path: (Required) Bucket path that has the training result source 62 | # url: (Optional) Link to the model 63 | # trained_model_local: (Optional) 64 | # path: (Optional) Path to pull trained model in the user local machine 65 | # data_source: (Optional) 66 | # training_data: (Required for trainable) 67 | # data_store: (Required) datastore for the model data source 68 | # bucket: (Required) Bucket that has the model data source 69 | # path: (Required) Bucket path that has the model data source 70 | # url: (Optional) Link to the model 71 | # training_data_local: (Optional) 72 | # path: (Optional) Initial data files in the user local machine 73 | # mount_type: (Optional) object storage mount type 74 | # evaluation_metrics: (optional) Define the metrics for the training job. 75 | # type: (Required) evaluation_metrics type 76 | # in: (Required) Path to store the evaluation_metrics 77 | # training_container_image: (Optional) 78 | # container_image_url: (Optional) Custom training container image url 79 | # container_store: (Optional) container_store for the custom training image 80 | # execution: (Required for trainable) 81 | # command: (Required) Entrypoint commands to execute model code 82 | # name: (Required) T-shirt size for training on Watson Machine Learning 83 | # nodes: (Required) Number of nodes needed for this training job. 
Default: 1 84 | # training_params: (Optional) list of hyperparameters for the training model 85 | # - (optional) list of key(param name):value(param value) 86 | train: 87 | trainable: true 88 | tested_platforms: 89 | - wml 90 | - ffdl 91 | model_source: 92 | initial_model: 93 | data_store: age_datastore 94 | bucket: facial-age-estimator 95 | path: 1.0/assets/ 96 | url: "" 97 | initial_model_local: 98 | path: /local/1.0/assets/ 99 | model_training_results: 100 | trained_model: 101 | data_store: age_datastore 102 | bucket: facial-age-estimator 103 | path: 1.0/assets/ 104 | url: "" 105 | trained_model_local: 106 | path: /local/1.0/assets/ 107 | data_source: 108 | training_data: 109 | data_store: age_datastore 110 | bucket: facial-age-estimator 111 | path: 1.0/assets/ 112 | training_data_url: 113 | training_data_local: 114 | path: /local/1.0/assets/ 115 | mount_type: mount_cos 116 | evaluation_metrics: 117 | type: tensorboard 118 | in: "$JOB_STATE_DIR/logs/tb/test" 119 | training_container_image: 120 | container_image_url: tensorflow/tensorflow:latest-gpu-py3 121 | container_store: container_store 122 | execution: 123 | command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz 124 | --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz 125 | --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 20000 126 | compute_configuration: 127 | name: k80 128 | nodes: 1 129 | training_params: 130 | - learning_rate: 131 | - loss: 132 | - batch_size: 133 | - epoch: 134 | - optimizer: 135 | - xxx 136 | - yyy 137 | - train_op: 138 | 139 | # serve: (Optional) 140 | # servable: (Optional) Indicate the model is servable. Default: False 141 | # tested_platforms (optional list): platform on which this model can served (current options: kubernetes, knative, seldon, wml, kfserving) 142 | # model_source: (Optional) - (Required if servable is true) 143 | # servable_model: (Required for s3 or url type) 144 | # data_store: (Required for s3 type) datastore for the model source 145 | # bucket: (Required for s3 type) Bucket that has the model source 146 | # path: (Required for s3 type) Source path to the model 147 | # url: (Required for url type) Source URL for the model 148 | # servable_model_local: (Optional) 149 | # path: (Optional) Servable model path in the user local machine 150 | # serving_container_image: (Required for container type) 151 | # container_image_url: (Required for container type) Container image to serve the model. 152 | # container_store: (Optional) container_store name 153 | 154 | serve: 155 | servable: true 156 | tested_platforms: 157 | - kubernetes 158 | - knative 159 | model_source: 160 | servable_model: 161 | data_store: age_datastore 162 | bucket: facial-age-estimator 163 | path: 2.0/assets/ 164 | url: "" 165 | servable_model_local: 166 | path: /local/1.0/assets/ 167 | url: "" 168 | scorable_model_local: 169 | path: /local/1.0/assets/ 170 | serving_container_image: 171 | container_image_url: "codait/max-facial-age-estimator:latest" 172 | container_store: container_store 173 | 174 | # score: (Optional) 175 | # scorable: (Optional) Indicate the model is scorable. 
Default: False 176 | # model_feature_schema_source: (Required if scorable is true) 177 | # scorable_model: (Required for s3 or url type) 178 | # data_store: (Required for s3 type) datastore for the model source 179 | # bucket: (Required for s3 type) Bucket that has the model source 180 | # data_store_file_paths: (Required for s3 type) Source path to the model schema, features and test files 181 | # url: (Required for url type) Source URL for the model 182 | # secorable_model_local: (Required if local) 183 | # path: (Optional) Servable model path in the user local machine 184 | # metrics: Metrics for scoring 185 | 186 | score: 187 | scorable: true 188 | model_features_schema_source: 189 | scorable_model: 190 | data_store: festure_schema_datastore 191 | bucket: feature_schema_bucket 192 | data_store_file_paths: scoring_file_paths 193 | scorable_model_local: 194 | data_store_file_paths: scoring_file_paths 195 | metrics: 196 | key: value 197 | latency: 0.01 198 | params: 199 | key: value 200 | 201 | # data (Optional) 202 | # source_id: (Optional) Extension file id regarding the data source. 203 | # domain: (Optional) Metadata about the data domain. 204 | # website: (Optional) Links to the data description 205 | # license: (Optional) Data license 206 | 207 | data: 208 | source_id: IMDB-WIKI 209 | domain: "Image" 210 | website: "https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/" 211 | license: "Apache 2.0" 212 | 213 | # process: (Optional) 214 | # - name: (Required) Script Process name. Can mix any kind of process here 215 | # params: (Optional) Free flowing list of key:value paisrs 216 | # staging_dir: (Optional) Staging directory within the local machine 217 | # trained_model_path: (Optional) trained model path within the object storage bucket 218 | 219 | process: 220 | - name: training_post_process 221 | params: 222 | key: value 223 | staging_dir: training_output/ 224 | trained_model_path: 225 | 226 | # data_stores: (Optional) - (Required for trainable) 227 | # - name: (Required) name of the data_stores 228 | # connection: 229 | # endpoing: (Required) Object Storage endpoint URL or public Object Storage key link. 230 | # access_key_id: (Required) Object Storage access_key_id 231 | # secret_access_key: (Required) Object secret_access_key 232 | 233 | data_stores: 234 | - name: age_datastore 235 | type: s3 236 | connection: 237 | endpoint: https://s3-api.us-geo.objectstorage.softlayer.net 238 | access_key_id: xxxxxxxxxx 239 | secret_access_key: xxxxxxxxxxxxx 240 | 241 | # data_stores_file_paths: (Optional) - 242 | # - name: (Required) name of the data_store_file_path 243 | # key: value 244 | 245 | data_store_file_paths: 246 | - name: scoring_file_paths 247 | feature_file: 2.0/assets/features.csv 248 | input_schema_file: 2.0/assets/input_schema.json 249 | output_schema_file: 2.0/assets/output_schema.json 250 | sample_inputs_file: 2.0/assets/scoring_inputs.json 251 | 252 | # container_stores: (Optional) 253 | # - name: (Required) name of the container_store 254 | # connection: 255 | # container_registry: (Required) container registry for this container_store 256 | # container_registry_token: (Required if container registry is private) container registry token 257 | 258 | container_stores: 259 | - name: container_store 260 | connection: 261 | container_registry: docker.io 262 | container_registry_token: "" 263 | 264 | # convert: (Optional) 265 | # onnx_convertable: (Optional) Enable convertion to ONNX format. 266 | # The model needs to be either trainable or servable. 
Default: False 267 | # model_source: (Required for onnx_convertable) Model binary path that needs the format conversion. 268 | # data_store: (Required) datastore for the model source 269 | # initial_model: 270 | # bucket: (Required) Bucket that has the model source 271 | # path: (Required) Bucket path that has the model source 272 | # url: (Optional) Link to the model 273 | # onnx_model: 274 | # bucket:(Required) Bucket to store the onnx model 275 | # path: (Required) Bucket path to store the onnx model 276 | # url: (Optional) Link to the converted model 277 | # tf_inputs: (Required for TensorFlow model) Input placeholder and shapes of the model. 278 | # tf_outputs: (Required for TensorFlow model) Output placeholders of the model. 279 | # tf_rtol: (Optional) Relative tolerance for TensorFlow 280 | # tf_atol: (Optional) Absolute tolerance for TensorFlow 281 | 282 | convert: 283 | onnx_convertable: true 284 | model_source: 285 | initial_model: 286 | data_store: age_datastore 287 | bucket: facial-age-estimator 288 | path: 2.0/assests/model.pt 289 | url: "" 290 | initial_model_local: 291 | path: /local/1.0/assets/ 292 | onnx_converted_model: 293 | onnx_model: 294 | data_store: age_datastore 295 | bucket: facial-age-estimator 296 | path: 3.0/assets/model.onnx 297 | url: "" 298 | onnx_model_local: 299 | path: /local/1.0/assets/ 300 | tf_inputs: 301 | "X:0": [1] 302 | tf_outputs: 303 | - pred:0 304 | tf_rtol: 0 305 | tf_atol: 0 306 | -------------------------------------------------------------------------------- /model_packaging/model_scoring.yaml: -------------------------------------------------------------------------------- 1 | 2 | # score: (Optional) 3 | # scorable: (Optional) Indicate the model is scorable. Default: False 4 | # model_feature_schema_source: (Required if scorable is true) 5 | # scorable_model: (Required for s3 or url type) 6 | # data_store: (Required for s3 type) datastore for the model source 7 | # bucket: (Required for s3 type) Bucket that has the model source 8 | # data_store_file_paths: (Required for s3 type) Source path to the model schema, features and test files 9 | # url: (Required for url type) Source URL for the model 10 | # secorable_model_local: (Required if local) 11 | # path: (Optional) Servable model path in the user local machine 12 | # metrics: Metrics for scoring 13 | 14 | score: 15 | scorable: true 16 | model_features_schema_source: 17 | scorable_model: 18 | data_store: festure_schema_datastore 19 | bucket: feature_schema_bucket 20 | data_store_file_paths: scoring_file_paths 21 | scorable_model_local: 22 | data_store_file_paths: scoring_file_paths 23 | metrics: 24 | key: value 25 | latency: 0.01 26 | params: 27 | key: value 28 | 29 | 30 | # process: (Optional) 31 | # - name: (Required) Script Process name. Can mix any kind of process here 32 | # params: (Optional) Free flowing list of key:value paisrs 33 | # staging_dir: (Optional) Staging directory within the local machine 34 | # trained_model_path: (Optional) trained model path within the object storage bucket 35 | 36 | process: 37 | - name: scoring_pre_process 38 | params: 39 | key: value 40 | staging_dir: serving_output/ 41 | 42 | # data_stores: (Optional) - (Required for trainable) 43 | # - name: (Required) name of the data_stores 44 | # connection: 45 | # endpoing: (Required) Object Storage endpoint URL or public Object Storage key link. 
46 | # access_key_id: (Required) Object Storage access_key_id 47 | # secret_access_key: (Required) Object secret_access_key 48 | 49 | data_stores: 50 | - name: age_datastore 51 | type: s3 52 | connection: 53 | endpoint: https://s3-api.us-geo.objectstorage.softlayer.net 54 | access_key_id: xxxxxxxxxx 55 | secret_access_key: xxxxxxxxxxxxx 56 | 57 | # data_stores_file_paths: (Optional) - 58 | # - name: (Required) name of the data_store_file_path 59 | # key: value 60 | 61 | data_store_file_paths: 62 | - name: scoring_file_paths 63 | feature_file: 2.0/assets/features.csv 64 | input_schema_file: 2.0/assets/input_schema.json 65 | output_schema_file: 2.0/assets/output_schema.json 66 | sample_inputs_file: 2.0/assets/scoring_inputs.json 67 | 68 | # container_stores: (Optional) 69 | # - name: (Required) name of the container_store 70 | # connection: 71 | # container_registry: (Required) container registry for this container_store 72 | # container_registry_token: (Required if container registry is private) container registry token 73 | 74 | container_stores: 75 | - name: container_store 76 | connection: 77 | container_registry: docker.io 78 | container_registry_token: "" -------------------------------------------------------------------------------- /model_packaging/model_serving.yaml: -------------------------------------------------------------------------------- 1 | # serve: (Optional) 2 | # servable: (Optional) Indicate the model is servable. Default: False 3 | # tested_platforms (optional list): platform on which this model can served (current options: kubernetes, knative, seldon, wml, kfserving) 4 | # model_source: (Optional) - (Required if servable is true) 5 | # servable_model: (Required for s3 or url type) 6 | # data_store: (Required for s3 type) datastore for the model source 7 | # bucket: (Required for s3 type) Bucket that has the model source 8 | # path: (Required for s3 type) Source path to the model 9 | # url: (Required for url type) Source URL for the model 10 | # servable_model_local: (Optional) 11 | # path: (Optional) Servable model path in the user local machine 12 | # serving_container_image: (Required for container type) 13 | # container_image_url: (Required for container type) Container image to serve the model. 14 | # container_store: (Optional) container_store name 15 | 16 | serve: 17 | servable: true 18 | tested_platforms: 19 | - kubernetes 20 | - knative 21 | model_source: 22 | servable_model: 23 | data_store: age_datastore 24 | bucket: facial-age-estimator 25 | path: 2.0/assets/ 26 | url: "" 27 | servable_model_local: 28 | path: /local/1.0/assets/ 29 | url: "" 30 | scorable_model_local: 31 | path: /local/1.0/assets/ 32 | serving_container_image: 33 | container_image_url: "codait/max-facial-age-estimator:latest" 34 | container_store: container_store 35 | 36 | # process: (Optional) 37 | # - name: (Required) Script Process name. Can mix any kind of process here 38 | # params: (Optional) Free flowing list of key:value paisrs 39 | # staging_dir: (Optional) Staging directory within the local machine 40 | # trained_model_path: (Optional) trained model path within the object storage bucket 41 | 42 | process: 43 | - name: serving_pre_process 44 | params: 45 | key: value 46 | staging_dir: training_output/ 47 | trained_model_path: 48 | 49 | # data_stores: (Optional) - (Required for trainable) 50 | # - name: (Required) name of the data_stores 51 | # connection: 52 | # endpoing: (Required) Object Storage endpoint URL or public Object Storage key link. 
53 | # access_key_id: (Required) Object Storage access_key_id 54 | # secret_access_key: (Required) Object secret_access_key 55 | 56 | data_stores: 57 | - name: age_datastore 58 | type: s3 59 | connection: 60 | endpoint: https://s3-api.us-geo.objectstorage.softlayer.net 61 | access_key_id: xxxxxxxxxx 62 | secret_access_key: xxxxxxxxxxxxx 63 | 64 | # data_stores_file_paths: (Optional) - 65 | # - name: (Required) name of the data_store_file_path 66 | # key: value 67 | 68 | data_store_file_paths: 69 | - name: serving_file_paths 70 | feature_file: 2.0/assets/features.csv 71 | input_schema_file: 2.0/assets/input_schema.json 72 | output_schema_file: 2.0/assets/output_schema.json 73 | sample_inputs_file: 2.0/assets/scoring_inputs.json 74 | 75 | # container_stores: (Optional) 76 | # - name: (Required) name of the container_store 77 | # connection: 78 | # container_registry: (Required) container registry for this container_store 79 | # container_registry_token: (Required if container registry is private) container registry token 80 | 81 | container_stores: 82 | - name: container_store 83 | connection: 84 | container_registry: docker.io 85 | container_registry_token: "" 86 | -------------------------------------------------------------------------------- /model_packaging/model_training.yaml: -------------------------------------------------------------------------------- 1 | # train: (optional) 2 | # trainable: (optional) Indicate the model is trainable. Default: False 3 | # tested_platforms(optional list): platform on which this model can trained (current options: wml, ffdl, kubeflow) 4 | # model_source: (Required for trainable) 5 | # initial_model: (Required for trainable) 6 | # data_store: (Required) datastore for the model code source 7 | # bucket: (Required) Bucket that has the model code source 8 | # path: (Required) Bucket path that has the model code source 9 | # url: (Optional) Link to the model 10 | # initial_model_local: (Optional) 11 | # path: (Optional) Initial model code in the user local machine 12 | # model_training_results: (Required for trainable) 13 | # trained_model: (Required for trainable) 14 | # data_store: (Required) datastore for the training result source 15 | # bucket: (Required) Bucket that has the training result source 16 | # path: (Required) Bucket path that has the training result source 17 | # url: (Optional) Link to the model 18 | # trained_model_local: (Optional) 19 | # path: (Optional) Path to pull trained model in the user local machine 20 | # data_source: (Optional) 21 | # training_data: (Required for trainable) 22 | # data_store: (Required) datastore for the model data source 23 | # bucket: (Required) Bucket that has the model data source 24 | # path: (Required) Bucket path that has the model data source 25 | # url: (Optional) Link to the model 26 | # training_data_local: (Optional) 27 | # path: (Optional) Initial data files in the user local machine 28 | # mount_type: (Optional) object storage mount type 29 | # evaluation_metrics: (optional) Define the metrics for the training job. 
30 | # type: (Required) evaluation_metrics type 31 | # in: (Required) Path to store the evaluation_metrics 32 | # training_container_image: (Optional) 33 | # container_image_url: (Optional) Custom training container image url 34 | # container_store: (Optional) container_store for the custom training image 35 | # execution: (Required for trainable) 36 | # command: (Required) Entrypoint commands to execute model code 37 | # name: (Required) T-shirt size for training on Watson Machine Learning 38 | # nodes: (Required) Number of nodes needed for this training job. Default: 1 39 | # training_params: (Optional) list of hyperparameters for the training model 40 | # - (optional) list of key(param name):value(param value) 41 | train: 42 | trainable: true 43 | tested_platforms: 44 | - wml 45 | - ffdl 46 | model_source: 47 | initial_model: 48 | data_store: age_datastore 49 | bucket: facial-age-estimator 50 | path: 1.0/assets/ 51 | url: "" 52 | initial_model_local: 53 | path: /local/1.0/assets/ 54 | model_training_results: 55 | trained_model: 56 | data_store: age_datastore 57 | bucket: facial-age-estimator 58 | path: 1.0/assets/ 59 | url: "" 60 | trained_model_local: 61 | path: /local/1.0/assets/ 62 | data_source: 63 | training_data: 64 | data_store: age_datastore 65 | bucket: facial-age-estimator 66 | path: 1.0/assets/ 67 | training_data_url: 68 | training_data_local: 69 | path: /local/1.0/assets/ 70 | mount_type: mount_cos 71 | evaluation_metrics: 72 | type: tensorboard 73 | in: "$JOB_STATE_DIR/logs/tb/test" 74 | training_container_image: 75 | container_image_url: tensorflow/tensorflow:latest-gpu-py3 76 | container_store: container_store 77 | execution: 78 | command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz 79 | --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz 80 | --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 20000 81 | compute_configuration: 82 | name: k80 83 | nodes: 1 84 | training_params: 85 | - learning_rate: 86 | - loss: 87 | - batch_size: 88 | - epoch: 89 | - optimizer: 90 | - xxx 91 | - yyy 92 | - train_op: 93 | 94 | # process: (Optional) 95 | # - name: (Required) Script Process name. Can mix any kind of process here 96 | # params: (Optional) Free flowing list of key:value paisrs 97 | # staging_dir: (Optional) Staging directory within the local machine 98 | # trained_model_path: (Optional) trained model path within the object storage bucket 99 | 100 | process: 101 | - name: training_post_process 102 | params: 103 | key: value 104 | staging_dir: training_output/ 105 | trained_model_path: 106 | 107 | # data_stores: (Required for trainable) 108 | # - name: (Required) name of the data_stores 109 | # connection: 110 | # endpoing: (Required) Object Storage endpoint URL or public Object Storage key link. 111 | # access_key_id: (Required) Object Storage access_key_id 112 | # secret_access_key: (Required) Object secret_access_key 113 | 114 | data_stores: 115 | - name: age_datastore 116 | type: s3 117 | connection: 118 | endpoint: https://s3-api.us-geo.objectstorage.softlayer.net 119 | access_key_id: xxxxxxxxxx 120 | secret_access_key: xxxxxxxxxxxxx 121 | 122 | # data_stores_file_paths: (Optional. To be used if there are multiple files, else the path field can be used as above) - 123 | # - name: (Optional) name of any additional training file paths to assign. 
Should be referenced from a datastore if needed 124 | # key: value 125 | 126 | data_store_file_paths: 127 | - name: training_file_paths 128 | training_file_one: 2.0/assets/training_file_one 129 | training_file_two: 2.0/assets/training_file_two 130 | 131 | 132 | # container_stores: (Optional) 133 | # - name: (Required) name of the container_store 134 | # connection: 135 | # container_registry: (Required) container registry for this container_store 136 | # container_registry_token: (Required if container registry is private) container registry token 137 | 138 | container_stores: 139 | - name: container_store 140 | connection: 141 | container_registry: docker.io 142 | container_registry_token: "" 143 | 144 | -------------------------------------------------------------------------------- /monitoring_proto/README.md: -------------------------------------------------------------------------------- 1 | # Inference Requests & Predictions 2 | We will need to capture the following key attributes of an inference request to analyze drift & perform other upstream operations 3 | (such as labeling, feedback aggregation): 4 | 5 | ## Inference requests 6 | - applicationId 7 | - requestId # correlation ID 8 | - requestTimestamp 9 | - inputType 10 | - inputFeatures (dictionary, string:string) 11 | (for images, example features include dimensions, number of channels, dimension ordering, path to image file) 12 | - groundTruthLabel 13 | 14 | ## Inference predictions 15 | - applicationId 16 | - requestId 17 | - modelId (name/version) 18 | - inferenceServiceId 19 | - requestTimestamp 20 | - inferenceLatency 21 | - inputFeatures (dictionary, string:string) 22 | - prediction 23 | - predictionExplanation (dictionary, string:float) 24 | - predictionFeedback (float, 0:1, how useful was it) 25 | - feedbackActions (list, string, what did the user do after the prediction came back) 26 | 27 | Note that for ensemble cases (where requests are running through an inference pipeline) several of these may be generated. 28 | Do we log isFinal in ensemble case? 29 | -------------------------------------------------------------------------------- /monitoring_proto/inferenceRequest.yml: -------------------------------------------------------------------------------- 1 | applicationInfo: 2 | - applicationId 3 | - applicationEndpoint 4 | requestInfo: 5 | - requestId 6 | - requestTimestamp 7 | - requestLatencyMs 8 | resultInfo: 9 | - resultCode 10 | - resultSizeInBytes 11 | inputs: 12 | featureNames: 13 | type: array 14 | items: 15 | type: string 16 | featureValues: 17 | type: array 18 | items: 19 | type: string 20 | prediction: 21 | - predictionMimeType 22 | - predictionData 23 | -------------------------------------------------------------------------------- /pipelines/module.md: -------------------------------------------------------------------------------- 1 | # Modules 2 | 3 | Module represents a unit of computation, defining script which will run on compute target, and describing its interface. Module interface describes inputs, outputs, parameter definitions, but doesn't bind them to specific values or data. Module has snapshot associated with it, capturing script, binaries and other files necessary to execute on compute target. 4 | 5 | Module is container of ModuleVersions. Users can publish new versions, deprecate them, and otherwise manage. 
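To make the interface idea concrete, here is a minimal, illustrative sketch of what a module definition might contain. This is not a normative schema - the field names below are hypothetical (the entity examples further down deliberately leave `inputs`, `outputs` and `params` as TBD):

```yaml
# Illustrative only: a possible module interface definition (hypothetical fields)
name: 'My training module'
snapshot: <snapshot id or source>      # script, binaries and other files needed on the compute target
inputs:
  - name: training_data                # bound to actual data only when the module is used as a step
    type: DataPath
outputs:
  - name: trained_model
    type: DataPath
params:
  - name: learning_rate
    type: float
    default: 0.001
```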
6 | 7 | ## Publishing a new module version 8 | 9 | There are multiple ways to publish a new ModuleVersion: 10 | 11 | - Using the UX, CLI, SDK, or REST API 12 | - The snapshot can originate from many sources: a git commit, a folder, a VSTS artifact, an existing snapshot, or a docker image 13 | 14 | Publishing is scoped to a workspace. 15 | 16 | ## Consuming a module 17 | 18 | In most cases users will reference/consume a Module, not a ModuleVersion. The system is responsible for resolving Module -> ModuleVersion at a meaningful time: 19 | 20 | - When publishing a pipeline, if the user used a Module in their graph, we should preserve the Module reference and resolve the ModuleVersion only during submission 21 | 22 | - The challenge here is that the interface is defined for a ModuleVersion, not a Module. We can offer interface checks while authoring the graph, but we can't enforce them - the actual module version might have a different interface, so interface enforcement can only happen when the pipeline run is triggered 23 | 24 | - When submitting a pipeline for execution, the Module is resolved to a ModuleVersion immediately at graph submission 25 | 26 | - The user can always use a ModuleVersion directly in either of the above scenarios, in which case no binding needs to be performed (both referencing styles are sketched below, before the Entities section) 27 | 28 | Generally, if the user doesn't care much about versioning, they simply publish and consume the Module and always work with the latest version. The underlying infrastructure tracks the actually executed instances as ModuleVersions, and the user can always fall back to binding a specific version. 29 | 30 | ## Mutability 31 | 32 | A Module has both mutable and immutable metadata. Most of its metadata is mutable. 33 | 34 | Mutable: name, description, versions, status 35 | 36 | Immutable: id, creation time 37 | 38 | 39 | 40 | A ModuleVersion has both mutable and immutable metadata. Most of its metadata is immutable. 41 | 42 | Immutable: id, snapshot id, inputs, outputs, parameters, creation time, status, etc. Generally, any field that influences execution is immutable. 43 | 44 | Mutable: description 45 | 46 | ## Naming and identifiers 47 | 48 | A Module has a user-assigned name, which must be unique within the workspace. The Module ID is a system-generated unique identifier. The Module name can be changed by the user (renaming the module) and is also "freed up" from the user's namespace when the module is archived. The Module ID is unique and immutable.
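To make the by-name vs. pinned-version distinction concrete, a step in a pipeline graph might reference a module either way. This is a sketch only - `moduleName` mirrors the convention used in [pipeline.yml](pipeline.yml), while `moduleVersionId` is a hypothetical field for pinning an exact version:

```yaml
steps:
  train_latest:
    moduleName: 'My training module'       # floating reference; resolved to the latest ModuleVersion at submission
    computeName: 'cpu'
  train_pinned:
    moduleVersionId: 'b27ce40e-4ddd-4ef5-8f7b-095621f29a03'   # binds this exact ModuleVersion; no resolution needed
    computeName: 'cpu'
```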
49 | 50 | 51 | ## Entities 52 | 53 | - Module 54 | 55 | ```json 56 | { 57 | "id": "266c582a-a92b-478f-93b3-66a89b755bbf", 58 | "name" : "My training module", 59 | "description" : "I dreamed epic algo and coded a new boosted decision trees algo, just need to make it work now", 60 | "status" : "active", 61 | "createdUtc" : "2019/01/01 12:12:32Z", 62 | "versions" : [ 63 | { 64 | "version" : "1", 65 | "id" : "b27ce40e-4ddd-4ef5-8f7b-095621f29a03" 66 | }, 67 | { 68 | "version" : "2", 69 | "id": "db47be6b-8e73-403c-9280-0e30a8dc80d3" 70 | } 71 | ] 72 | } 73 | ``` 74 | 75 | 76 | 77 | - ModuleVersion 78 | 79 | ```json 80 | { 81 | 82 | "id" : "b27ce40e-4ddd-4ef5-8f7b-095621f29a03", 83 | "module" : "266c582a-a92b-478f-93b3-66a89b755bbf", 84 | "description" : "final attempt3", 85 | "version" : "1", 86 | "status" : "deprecated", 87 | "snapshot_id" : "blah", 88 | "createdUtc" : "2019/01/01 12:12:32Z", 89 | "inputs": TBD, 90 | "outputs" : TBD, 91 | "params" : TBD 92 | 93 | } 94 | ``` 95 | 96 | -------------------------------------------------------------------------------- /pipelines/pipeline.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Pipeline 2 | Machine learning pipelines are optimized around a data scientist's workflow and focus on making the model training process as efficient as possible. 3 | 4 | ## Graph 5 | 6 | The core abstraction for Machine Learning Pipelines is a DAG, with every vertex being either 'data' or a 'step' (a 'computation unit'); edges are data or control dependencies between vertices. 7 | 8 | #### Vertices 9 | A 'data' vertex can be: 10 | - [DataPath](../data/datapath.md) 11 | - [DataSet](../data/dataset.md) 12 | 13 | A data vertex has a single output port. 14 | 15 | A 'step' vertex is always an instance of a [Module](module.md), with bound inputs, outputs and parameters. A Module can be one of two types: *primitive* (a single script which can be executed on compute), and *complex* (a subgraph - a DAG itself, defined as above). This makes the graph a nested, recursive structure. 16 | 17 | A step vertex can have many inputs (or none), and many outputs (or none). 18 | 19 | #### Edges 20 | 21 | There are two types of edges: 22 | - Data dependency. Such edges connect a data vertex or step output to an input of another step. At execution time, the input of that step will be set to a specific instance of data 23 | 24 | - Control dependency. Such edges connect a step output to an input of another step. This edge type is not applicable to data vertices, and there is no data flow involved. It purely represents that one step must be executed after another 25 | 26 | Edges always connect inputs and outputs of vertices. Edges not associated with inputs and outputs are not allowed. 27 | 28 | ## Pipelines, Pipeline Drafts and Pipeline Runs 29 | 30 | Graphs fall into two major classes: 31 | - a blueprint, which defines what should be executed and how, and 32 | - a particular executing instance 33 | While both classes can look similar, they are not exactly the same. An executing instance naturally has properties like status, start time, logs, etc. A blueprint would instead define the topology of the graph, parameter types and acceptable values, etc. 34 | 35 | A mutable blueprint is called a *Pipeline Draft*, and an immutable, preserved blueprint is called a *Pipeline*. An executing instance is called a *PipelineRun*. 36 | 37 | A user can interactively construct a *Pipeline Draft*. This client-side object defines a DAG and can be parametrized.
It is mutable during authoring, and can then be submitted for execution or published as a *Pipeline* for future executions. When it's submitted for execution, a *PipelineRun* is created and orchestrated by the system. There are many ways to create *Pipeline Drafts* - a *Pipeline Draft* can be defined by a yaml file on disk, built as an in-memory object in python, created in UI interfaces, etc. 38 | 39 | A *Pipeline* is a preserved, immutable blueprint, which defines a parametrized DAG and can be "instantiated" multiple times. Every time it's instantiated, it produces a PipelineRun which is then orchestrated by the system. 40 | 41 | A *PipelineRun* is what is actually executed by the system. While PipelineDraft or Pipeline graphs contain Steps, a PipelineRun graph contains StepRuns - executing instances of Steps. 42 | 43 | Mapping summary: 44 | 45 | | Blueprint | Executing Instance | 46 | | ------------- | ------------------ | 47 | | Step | StepRun | 48 | | PipelineDraft | PipelineRun | 49 | | Pipeline | PipelineRun | 50 | 51 | Transformation summary: 52 | 53 | PipelineDraft.submit() -> PipelineRun 54 | 55 | Pipeline.submit() -> PipelineRun 56 | PipelineRun.clone() -> PipelineDraft 57 | Pipeline.clone() -> PipelineDraft 58 | 59 | # Steps in a Pipeline 60 | A Step is always an instance of some [Module](module.md). Modules can be thought of as reusable packages, and can be used many times in the same pipeline or in other pipelines. For convenience, we allow users to construct steps directly from scripts and binaries, but the system is responsible for publishing the module and instantiating it in such cases. 61 | 62 | #### Inputs and outputs 63 | 64 | All data inputs and outputs are stored in [DataStores](../data/datastore.md). A single step can have inputs and outputs stored in different DataStores: both multiple data stores of the same type and data stores of different types are allowed. 65 | 66 | A step must have every non-optional input connected through an edge to another vertex. An input can be connected to only a single other vertex. 67 | 68 | A step may have outputs **not** connected to other steps. An output can be connected to multiple other vertices. A single output can be connected to multiple inputs of a single vertex as well. 69 | 70 | #### Parameters 71 | 72 | Steps have all parameters set to specific values, either explicitly or implicitly (through default values defined for the corresponding module). 73 | 74 | ## Parametrization 75 | 76 | Pipeline Drafts and Pipelines can be parametrized with graph-level parameters. Graph-level parameters have names, types (string, float, etc.), default values and constraints defined. When a PipelineRun is created from either a Pipeline Draft or a Pipeline, those parameters are set to actual values. Graph parameter names are unique within the graph. 77 | 78 | ## Pipeline Run 79 | 80 | When a graph is orchestrated, its execution is captured by a *Pipeline Run*. 81 | 82 | Nodes in the executing graph are *Pipeline Step Runs*. Each step run's output is a uniquely identified [Artifact](../data/artifact.md). When a step run finishes and a downstream step can be executed, the orchestrator will ensure that the corresponding artifacts are passed as inputs to the next step run. 83 | 84 | The Pipeline Run and each Step Run are runs - each has a run id, logs, metrics and other metadata associated with it. See [Run](../experiment_tracking/run.md) for more details. A non-normative sketch of what such a record might capture is shown below.
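As a rough illustration only, a PipelineRun record for the example pipeline draft below might carry something like the following. All field names here are hypothetical - the authoritative shape of a run is defined by the [Run](../experiment_tracking/run.md) spec:

```yaml
# Illustrative only: a hypothetical PipelineRun record
pipeline_run:
  run_id: <system-generated run id>
  pipeline: 'myTrainingPipeline'         # the blueprint this run was instantiated from
  status: Running
  parameters:                            # graph-level parameters bound to actual values
    data_file: 'data/Posts.xml'
  step_runs:
    - step: preprocess
      status: Completed
      outputs:
        - artifact: <artifact id>        # uniquely identified Artifact, passed to downstream step runs
    - step: train
      status: Running
```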
85 | 86 | # A YML specification for a pipeline draft 87 | See [pipeline.yml](pipeline.yml) for example 88 | -------------------------------------------------------------------------------- /pipelines/pipeline.yml: -------------------------------------------------------------------------------- 1 | name: 'myTrainingPipeline' 2 | data: 3 | pipelineData1 4 | steps: 5 | preprocess: 6 | moduleName: 'BatchStep' 7 | computeName: 'cpu' 8 | cmd: 'unzip data/Posts.xml.zip -d data/' 9 | deps: 10 | path: data/Posts.xml.zip 11 | outs: 12 | - cache: true 13 | path: data/Posts.xml 14 | train: 15 | moduleName: 'PythonScriptStep' 16 | computeName: 'cpu' 17 | cmd: 'python train.py' 18 | deps: 19 | path: data/Posts.xml.zip 20 | outs: 21 | path: data/Posts.xml 22 | validate: 23 | moduleName: 'PythonScriptStep' 24 | computeName: 'cpu' 25 | parameters: 26 | data_file: 'path' 27 | cmd: 'python validate.py {data_file}' 28 | 29 | pipeline: [preprocess,train,validate] 30 | --------------------------------------------------------------------------------