├── .gitignore
├── 05-experiment-tracking
│   ├── 06-MLflow-and-DVC-project.md
│   ├── 03-basic-MLflow-installation.md
│   ├── 04-basic-MLflow-on-Kubernetes.md
│   ├── 05-MLflow-prod-setup.md
│   ├── 01-what-is-experiment-tracking.md
│   └── 02-what-is-MLflow.md
├── 03-role-of-mlops
│   ├── 01-introduction.md
│   ├── 04-ml-engineers-without-mlops.md
│   ├── 05-how-mlops-engineers-help-ml-engineers.md
│   ├── 03-how-mlops-help-datascientists.md
│   └── 02-data-scientists-without-mlops.md
├── 06-fundamentals-of-model-deployment.md
│   ├── 03-project-for-deployment.md
│   ├── 01-introduction-to-deployment-and-serving.md
│   └── 02-popular-ways.md
├── 07-deploy-and-serving-using-vms
│   ├── 02-implementing-wsgi.md
│   ├── 00-IMPORTANT.md
│   └── 01-architecture.md
├── 04-versioning-and-experiment-tracking
│   ├── 03-DVC-hands-on.md
│   ├── 02-introduction-to-dvc.md
│   └── 01-what-is-data-versioning.md
├── README.md
├── 08-kserve
│   ├── 02-architecture.md
│   ├── 01-Introduction.md
│   └── 03-end-to-end-demo.md
├── 09-SageMaker
│   ├── 01-introduction.md
│   └── 02-production-setup.md
└── 02-introduction-to-mlops
    ├── 03-what-is-mlops.md
    ├── 01-what-is-machine-learning-and-model.md
    ├── 05-ds-vs-ml-vs-mlops.md
    ├── 02-steps-to-create-a-model.md
    └── 04-machine-learning-lifecycle-overview.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | .vscode/
--------------------------------------------------------------------------------
/05-experiment-tracking/06-MLflow-and-DVC-project.md:
--------------------------------------------------------------------------------
1 | Please refer to the repository below for this lecture.
2 |
3 | https://github.com/iam-veeramalla/Wine-Prediction-Model
--------------------------------------------------------------------------------
/05-experiment-tracking/03-basic-MLflow-installation.md:
--------------------------------------------------------------------------------
1 | Please refer to the documentation below for this lecture.
2 |
3 | https://mlflow.org/docs/2.4.2/quickstart.html#install-mlflow
--------------------------------------------------------------------------------
/03-role-of-mlops/01-introduction.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 |
3 | Please refer to the repository below for all the project files and notes.
4 |
5 | https://github.com/iam-veeramalla/hello-world-mlops
--------------------------------------------------------------------------------
/05-experiment-tracking/04-basic-MLflow-on-Kubernetes.md:
--------------------------------------------------------------------------------
1 | Please refer to the documentation below for this lecture.
2 |
3 | https://community-charts.github.io/docs/charts/mlflow/basic-installation
--------------------------------------------------------------------------------
/05-experiment-tracking/05-MLflow-prod-setup.md:
--------------------------------------------------------------------------------
1 | Please refer to the document below for the next lecture.
2 |
3 | https://community-charts.github.io/docs/charts/mlflow/postgresql-backend-installation
--------------------------------------------------------------------------------
/06-fundamentals-of-model-deployment.md/03-project-for-deployment.md:
--------------------------------------------------------------------------------
1 | Please refer to the GitHub repository below for this lecture.
2 |
3 | https://github.com/iam-veeramalla/Intent-classifier-model
--------------------------------------------------------------------------------
/07-deploy-and-serving-using-vms/02-implementing-wsgi.md:
--------------------------------------------------------------------------------
1 | Please refer to the repository below for the complete project files and notes.
2 |
3 | https://github.com/iam-veeramalla/Intent-classifier-model/tree/virtual-machines
--------------------------------------------------------------------------------
/04-versioning-and-experiment-tracking/03-DVC-hands-on.md:
--------------------------------------------------------------------------------
1 | # Learn DVC using a project
2 |
3 | Please refer to the repository below for this lecture.
4 |
5 | https://github.com/iam-veeramalla/Wine-Prediction-Model
6 |
--------------------------------------------------------------------------------
/07-deploy-and-serving-using-vms/00-IMPORTANT.md:
--------------------------------------------------------------------------------
1 | # Important Note
2 |
3 | Please refer to the virtual-machines branch of the Intent Classifier repo for this section.
4 |
5 | Link:
6 |
7 | https://github.com/iam-veeramalla/Intent-classifier-model/tree/virtual-machines
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MLOps Zero to Hero
2 |
3 | Notes for my Udemy course - MLOps Zero to Hero
4 |
5 | https://www.udemy.com/user/abhishek-veeramalla/?srsltid=AfmBOopEdZFhCNtWblrQcXgZa3LAzdW2Zg7b31Tu6ruW5TQ_GdD0qdOe
6 |
7 |
8 |
--------------------------------------------------------------------------------
/08-kserve/02-architecture.md:
--------------------------------------------------------------------------------
1 | # KServe Architecture
2 |
3 |
4 |
5 |
6 |
7 |
--------------------------------------------------------------------------------
/07-deploy-and-serving-using-vms/01-architecture.md:
--------------------------------------------------------------------------------
1 | # Architecture
2 |
3 | Internet (client)
4 | │
5 | ▼
6 | Internet Gateway (IGW) attached to VPC
7 | │
8 | ▼
9 | Application Load Balancer (ALB) — internet-facing (ENIs in Public Subnets A & B)
10 | │ (Listener: HTTP 80)
11 | ▼
12 | Target Group (HTTP: 80, Health-check: /predict)
13 | │
14 | ▼
15 | Auto Scaling Group (ASG)
16 | │
17 | ▼
18 | EC2 Instance (in a Public Subnet) ──> Nginx (listen :80) ── proxy_pass──> Gunicorn (127.0.0.1:6000) ──> WSGI app (/predict)
19 |
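To make the last hop concrete, here is a minimal sketch of the WSGI app that Gunicorn serves on 127.0.0.1:6000 and that Nginx proxies to. This is illustrative only (the real code lives in the Intent Classifier repo); the model path and input field are placeholders.

```
# app.py - minimal Flask/WSGI app sitting behind Gunicorn and Nginx
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at import time, so each Gunicorn worker reuses it
# instead of reloading it on every request.
model = joblib.load("model.joblib")  # placeholder artifact path

@app.route("/predict", methods=["GET", "POST"])
def predict():
    # GET is answered with a simple OK so the ALB target-group health check
    # on /predict (see the diagram above) passes.
    if request.method == "GET":
        return jsonify({"status": "ok"})
    payload = request.get_json(silent=True) or {}
    text = payload.get("text")  # placeholder input field
    if text is None:
        return jsonify({"error": "missing 'text' field"}), 400
    return jsonify({"prediction": str(model.predict([text])[0])})

# Run with Gunicorn bound to the loopback address from the diagram:
#   gunicorn --workers 2 --bind 127.0.0.1:6000 app:app
# Nginx listens on :80 and forwards to 127.0.0.1:6000 via proxy_pass.
```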
--------------------------------------------------------------------------------
/03-role-of-mlops/04-ml-engineers-without-mlops.md:
--------------------------------------------------------------------------------
1 | # Role of an ML Engineer in a Project
2 |
3 | Once a Data Scientist builds a working model, the question becomes:
4 |
5 | “How do we let real users or applications use this model?”
6 |
7 | This is where the ML Engineer steps in.
8 |
9 | ### Turn the Model into an API
10 |
11 | A trained model by itself is just a file.
12 | An ML Engineer’s first responsibility is to wrap the model with an API (a minimal sketch appears at the end of this note).
13 |
14 | They:
15 |
16 | - Load the trained model
17 | - Accept input from users or applications (usually JSON)
18 | - Run predictions using the model
19 | - Return the result as a response
20 |
21 | Now the model can be:
22 |
23 | - Called by a frontend
24 | - Used by backend services
25 | - Integrated into real applications
26 |
27 | The model becomes usable, not just theoretical.
28 |
29 | ### Handle Input and Output Safely
30 |
31 | In the real world, users can send bad data.
32 |
33 | An ML Engineer ensures:
34 |
35 | - Inputs are validated
36 | - Missing or incorrect fields are handled
37 | - Errors don’t crash the service
38 |
39 | This prevents:
40 |
41 | - Application failures
42 | - Incorrect predictions
43 | - Production incidents
44 |
45 | ### Make the Model Fast and Efficient
46 |
47 | A model that works in a notebook may be:
48 |
49 | slow, memory-heavy, and not optimized for repeated requests.
50 |
51 | ML Engineers:
52 |
53 | - Optimize how the model is loaded
54 | - Avoid reloading the model for every request
55 | - Ensure predictions are fast enough for real users
56 |
57 |
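Put together, the three responsibilities above fit in a short service. Below is a minimal sketch (illustrative, not the course project code) using FastAPI and a hypothetical saved scikit-learn model; the file name and input fields are placeholders.

```
# serve.py - wrap a trained model with an API, validate input, load once
import joblib
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Load the trained model once at startup, not on every request.
model = joblib.load("model.joblib")  # placeholder path

class PredictRequest(BaseModel):
    # Input validation: missing or wrongly typed fields are rejected with a
    # 422 response instead of crashing the service.
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    try:
        prediction = model.predict([req.features])[0]
    except Exception as exc:
        # Unexpected-but-parseable input should return an error, not a crash.
        raise HTTPException(status_code=400, detail=str(exc))
    return {"prediction": str(prediction)}

# Run with: uvicorn serve:app
# Example request: POST /predict  {"features": [7.4, 0.7, 0.0, 1.9]}
```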
--------------------------------------------------------------------------------
/08-kserve/01-Introduction.md:
--------------------------------------------------------------------------------
1 | # Introduction to KServe
2 |
3 | Imagine you’ve trained a machine learning model, maybe a classifier, a recommender, or anything else.
4 | The next big question is: How do you deploy this model so real users or applications can send requests and get predictions?
5 |
6 | KServe is a tool that solves exactly this problem.
7 |
8 | ### What is KServe?
9 |
10 | KServe is a Kubernetes-native platform designed to deploy and serve ML models easily, reliably, and at scale.
11 |
12 | In even simpler words:
13 | KServe takes your ML model and turns it into a production-ready API running on Kubernetes without you writing a lot of server code.
14 |
15 | ### Why KServe Exists
16 |
17 | Traditional model deployment is painful:
18 | - You need to write Flask or FastAPI code
19 | - You need to containerize the app
20 | - You need to expose endpoints
21 | - You need to manage scaling, logging, networking
22 | - You need to monitor and version your models
23 |
24 | KServe removes most of this effort by providing standardized, ready-to-use model servers.
25 |
26 | ### What KServe Actually Does
27 |
28 | KServe provides:
29 |
30 | ### Standard Model Servers
31 |
32 | For popular frameworks like:
33 | - TensorFlow
34 | - PyTorch
35 | - Scikit-learn
36 | - XGBoost
37 | - ONNX
38 |
39 | You simply point KServe to your model file (a storage URI), and it deploys everything automatically.
40 |
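For example, once KServe has deployed the model, clients call it with a plain HTTP request using KServe's V1 prediction protocol. The hostname and model name below are placeholders (the sklearn-iris quickstart model is assumed).

```
# Call a deployed InferenceService over KServe's V1 REST protocol.
import requests

url = "http://sklearn-iris.ml.example.com/v1/models/sklearn-iris:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one iris flower sample

resp = requests.post(url, json=payload, timeout=10)
print(resp.json())  # e.g. {"predictions": [0]}
```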
41 | ### Automatic Scaling
42 |
43 | Your model API can:
44 | - Scale up when traffic increases
45 | - Scale down to zero when idle (saving huge costs)
46 |
47 | This is powered by Knative under the hood.
48 |
--------------------------------------------------------------------------------
/09-SageMaker/01-introduction.md:
--------------------------------------------------------------------------------
1 | # What is SageMaker?
2 |
3 | AWS SageMaker is Amazon’s fully managed platform for building, training, and deploying machine learning models at scale.
4 |
5 | In real-world ML systems, the actual training code is only 5–10% of the work.
6 |
7 | MLOps challenges include:
8 |
9 | - Environment and dependency management
10 | - Scalable training workloads
11 | - Handling large datasets
12 | - Model versioning
13 | - Model registry
14 | - Automated deployments
15 | - Monitoring predictions & model drift
16 | - Cost control for GPUs/instances
17 |
18 | SageMaker bundles these into managed services so that MLOps engineers can avoid building the entire ML control plane from scratch.
19 |
20 | ### What MLOps Engineers Actually Do With SageMaker
21 |
22 | A) Build ML Environments
23 |
24 | - Prepare Docker images with Python/ML libraries
25 | - Manage dependency consistency
26 | - Use CDK/Terraform to provision infrastructure
27 |
28 | B) Automate Training
29 |
30 | - Use SageMaker Training Jobs with CI pipelines
31 | - Configure distributed training
32 | - Use Spot instances to control cost
33 |
34 | C) Manage Model Registry
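As an illustration of (B), launching a training job with the SageMaker Python SDK might look like the sketch below. The container image, role ARN, S3 paths, and instance type are placeholders, not values from this course.

```
# Launch a managed SageMaker training job on Spot capacity.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-training:latest",
    role="arn:aws:iam::<ACCOUNT_ID>:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",
    use_spot_instances=True,  # cost control via Spot capacity
    max_run=3600,             # cap the training time (seconds)
    max_wait=7200,            # how long to wait for Spot capacity
)

# A CI pipeline (e.g. GitHub Actions) can trigger this on every approved change.
estimator.fit({"train": "s3://my-bucket/datasets/train/"})
```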
35 |
36 | - Store versioned models
37 | - Integrate approval workflows (“manual gate” for prod)
38 |
39 | D) Automate Deployments
40 |
41 | - Blue/Green deployments
42 | - Canary deployments
43 | - Event-driven retraining
44 | - Update production endpoints with zero downtime
45 |
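And a hedged sketch of (D): deploying a model artifact to a real-time endpoint with the SageMaker Python SDK. The image, model artifact, role, and endpoint name are placeholders; blue/green and canary strategies are configured on top of endpoints like this.

```
# Deploy a trained model artifact to a managed real-time endpoint.
from sagemaker.model import Model

model = Model(
    image_uri="<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-inference:latest",
    model_data="s3://my-bucket/models/model.tar.gz",
    role="arn:aws:iam::<ACCOUNT_ID>:role/SageMakerExecutionRole",
)

# SageMaker provisions the instances, health-checks them, and exposes an
# HTTPS endpoint; deploying a new model version updates it in place.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-model-prod",  # placeholder endpoint name
)
```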
46 | E) Observability & Monitoring
47 |
48 | - CloudWatch for logs/metrics
49 | - SageMaker Model Monitor for:
50 | - Data drift
51 | - Feature drift
52 | - Prediction drift
53 | - Outlier detection
54 |
55 | F) Cost Optimization
56 |
57 | - Spot training
58 | - Multi-model endpoints
59 | - Serverless endpoints
60 | - Endpoint autoscaling
61 |
62 |
63 |
--------------------------------------------------------------------------------
/04-versioning-and-experiment-tracking/02-introduction-to-dvc.md:
--------------------------------------------------------------------------------
1 | # What is DVC?
2 |
3 | Think of DVC (Data Version Control) as Git for your data.
4 |
5 | Git works great for: code and small text files.
6 |
7 | But Git cannot handle:
8 |
9 | - Large datasets
10 | - Model files
11 | - Data stored in cloud storage
12 |
13 | This is where DVC helps.
14 | DVC lets you:
15 |
16 | - Track versions of datasets
17 | - Store large files outside Git (S3, GCS, Azure, local storage)
18 | - Keep your Git repo clean and lightweight
19 | - Reproduce your ML project anytime
20 |
21 | ### Wine Prediction Example
22 |
23 | Imagine you're building a simple Wine Quality Prediction ML model.
24 |
25 | You have:
26 |
27 | - A CSV file → wine_data_sample.csv
28 | - A training script → train.py
29 | - A Git repo
30 |
31 | Your dataset may change over time:
32 |
33 | - You add more rows
34 | - You clean the data
35 | - You update features
36 |
37 | DVC allows you to version these dataset changes without storing the actual data inside Git.
38 |
39 | Without DVC → your CSV sits in your repo → Git becomes slow & heavy.
40 |
41 | With DVC → Git stores only a small metadata file:
42 |
43 | - wine_data_sample.csv.dvc
44 | - The actual data is stored in external storage (e.g., an S3 bucket)
45 |
46 | You pull/push data similar to git pull / git push.
47 |
48 | ### How DVC Works (Very Simple Flow)
49 |
50 | Add your dataset to DVC
51 |
52 | `dvc add wine_data_sample.csv`
53 |
54 | Commit the .dvc file to Git
55 |
56 | `git add wine_data_sample.csv.dvc`
57 | `git commit -m "Track dataset with DVC"`
58 |
59 | Configure remote storage (e.g., S3)
60 |
61 | `dvc remote add -d myremote s3://mybucket/dvcstore`
62 |
63 | Push data to S3
64 |
65 | `dvc push`
66 |
67 | Anyone with your Git repo simply runs:
68 |
69 | `dvc pull`
70 |
71 | …and they get the exact same dataset version.
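Besides the CLI, DVC also exposes a small Python API, so training code can read whichever dataset version the current Git commit points to. A minimal sketch, assuming the repo layout above:

```
# Read the DVC-tracked dataset from Python; DVC fetches it from the remote
# if it is not already in the local cache.
import pandas as pd
import dvc.api

with dvc.api.open("wine_data_sample.csv", repo=".", mode="r") as f:
    df = pd.read_csv(f)

print(df.shape)
```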
--------------------------------------------------------------------------------
/02-introduction-to-mlops/03-what-is-mlops.md:
--------------------------------------------------------------------------------
1 | # What is MLOps?
2 |
3 | Before understanding MLOps, it’s important to understand **where it comes from**.
4 |
5 | MLOps is **directly inspired by DevOps**.
6 |
7 | Just like DevOps transformed how we build and operate software, **MLOps brings those same principles into the Machine Learning world**.
8 |
9 | ---
10 |
11 | ## How DevOps Inspired MLOps
12 |
13 | ### What DevOps Solved
14 |
15 | Before DevOps:
16 | - Developers wrote code
17 | - Ops teams deployed and maintained it
18 | - Deployments were slow, manual, and risky
19 | - Failures were hard to debug
20 |
21 | DevOps introduced:
22 | - Automation
23 | - CI/CD pipelines
24 | - Infrastructure as Code
25 | - Monitoring and feedback loops
26 | - Shared ownership between Dev and Ops
27 |
28 | The result:
29 | - Faster releases
30 | - More reliable systems
31 | - Continuous improvement
32 |
33 | ---
34 |
35 | ## The Same Problem Happened in Machine Learning
36 |
37 | In ML, a similar gap appeared:
38 |
39 | - Data Scientists trained models in notebooks
40 | - Models worked locally
41 | - Production teams struggled to deploy them
42 | - No clear ownership after deployment
43 | - Models degraded silently over time
44 |
45 | Just like Dev vs Ops, ML had a gap between:
46 | - **Model development**
47 | - **Model operations**
48 |
49 | That gap is what **MLOps** was created to solve.
50 |
51 | ---
52 |
53 | ## MLOps = DevOps Practices for Machine Learning
54 |
55 | MLOps takes proven DevOps ideas and applies them to ML systems.
56 |
57 | | DevOps Concept | MLOps Equivalent |
58 | |----------------|------------------|
59 | | Source code versioning | Data + model versioning |
60 | | CI pipelines | Model training pipelines |
61 | | CD pipelines | Automated model deployment |
62 | | Monitoring services | Monitoring model performance |
63 | | Rollbacks | Model version rollback |
64 | | Automation | End-to-end ML lifecycle automation |
65 |
--------------------------------------------------------------------------------
/08-kserve/03-end-to-end-demo.md:
--------------------------------------------------------------------------------
1 | # KServe Demonstration for the Iris Model
2 |
3 | ### Install Cert Manager
4 |
5 | ```
6 | kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
7 | ```
8 |
9 | ### Install KServe CRDs
10 |
11 | ```
12 | kubectl create namespace kserve
13 |
14 | helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
15 | --version v0.16.0 \
16 | -n kserve \
17 | --wait
18 | ```
19 |
20 | ### Install KServe controller
21 |
22 | ```
23 | helm install kserve oci://ghcr.io/kserve/charts/kserve \
24 | --version v0.16.0 \
25 | -n kserve \
26 | --set kserve.controller.deploymentMode=RawDeployment \
27 | --wait
28 | ```
29 |
30 | ### Deploy the sklearn iris model
31 |
32 | ```
33 | kubectl create namespace ml
34 |
35 | cat <
--------------------------------------------------------------------------------
/05-experiment-tracking/02-what-is-MLflow.md:
--------------------------------------------------------------------------------
7 | > MLflow helps you answer the question:
8 | > **“Which model did we train, with what parameters, and how good was it?”**
9 |
10 | ---
11 |
12 | ## Why MLflow Exists
13 |
14 | Once you start building real ML projects, common problems appear:
15 |
16 | - Multiple experiments with different parameters
17 | - No clear record of which model performed best
18 | - Difficult to reproduce results
19 | - Models stored locally with no versioning
20 | - Hard to move models from training to deployment
21 |
22 | **MLflow solves these problems by acting as a central system of record for ML work.**
23 |
24 | ---
25 |
26 | ## Core Components of MLflow
27 |
28 | MLflow has four main components. Beginners should focus mainly on the first two.
29 |
30 | ---
31 |
32 | ### MLflow Tracking
33 |
34 | Used to **track experiments**.
35 |
36 | You can log:
37 | - Parameters (learning rate, epochs, etc.)
38 | - Metrics (accuracy, loss, F1-score)
39 | - Artifacts (model files, plots, datasets)
40 | - Source code version
41 |
42 | Each training run is stored as a **Run**.
43 |
44 | Example:
45 |
46 | import mlflow
47 |
48 | with mlflow.start_run():
49 | mlflow.log_param("learning_rate", 0.01)
50 | mlflow.log_metric("accuracy", 0.92)
51 |
52 | To view runs:
53 |
54 | mlflow ui
55 |
56 | This opens a UI where you can compare experiments.
57 |
58 | ---
59 |
60 | ### MLflow Models
61 |
62 | MLflow provides a **standard way to package models**.
63 |
64 | This allows the same model to be:
65 | - Loaded in Python
66 | - Served via REST API
67 | - Containerized using Docker
68 | - Deployed to cloud or Kubernetes
69 |
70 | This makes models **portable and production-ready**.
71 |
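For example, a scikit-learn model can be logged in the MLflow format and loaded back (or served) without any custom packaging code. A minimal sketch:

```
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run() as run:
    # Package the model in the standard MLflow format under this run.
    mlflow.sklearn.log_model(model, artifact_path="model")

# Load it back anywhere as a generic python_function model.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:2]))

# The same artifact can be served as a REST API:
#   mlflow models serve -m runs:/<run_id>/model -p 5001
```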
72 | ---
73 |
74 | ### MLflow Projects
75 |
76 | A way to package ML code with:
77 | - Environment details
78 | - Entry points
79 | - Reproducible execution
80 |
81 | Mostly useful for larger teams and advanced workflows.
82 |
83 | ---
84 |
85 | ### MLflow Model Registry
86 |
87 | A centralized place to manage models:
88 | - Model versions
89 | - Stages (Staging, Production, Archived)
90 | - Metadata and approvals
91 |
92 | Very useful in enterprise MLOps setups.
93 |
94 | ---
95 |
96 | ## Simple Real-World Example
97 |
98 | Imagine a **Wine Quality Prediction** model.
99 |
100 | You try:
101 | - Run 1: learning_rate = 0.01 → accuracy = 0.86
102 | - Run 2: learning_rate = 0.1 → accuracy = 0.89
103 | - Run 3: learning_rate = 0.001 → accuracy = 0.82
104 |
105 | Without MLflow:
106 | - You forget results
107 | - You overwrite models
108 | - You guess which model to deploy
109 |
110 | With MLflow:
111 | - Every run is logged
112 | - Metrics are compared visually
113 | - Best model is clearly identified
114 | - Model file is stored and versioned
115 |
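A sketch of what logging those three runs looks like in practice (the training function is a placeholder that simply returns the accuracies above):

```
import mlflow

mlflow.set_experiment("wine-quality")

def train_model(learning_rate: float) -> float:
    # Placeholder for real training; returns the accuracy for this run.
    return {0.01: 0.86, 0.1: 0.89, 0.001: 0.82}[learning_rate]

for lr in [0.01, 0.1, 0.001]:
    with mlflow.start_run():
        mlflow.log_param("learning_rate", lr)
        mlflow.log_metric("accuracy", train_model(lr))

# Query the runs and sort by accuracy to find the best one.
best = mlflow.search_runs(order_by=["metrics.accuracy DESC"]).iloc[0]
print(best["params.learning_rate"], best["metrics.accuracy"])
```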
116 | ---
117 |
118 | ## How MLflow Fits into MLOps
119 |
120 | MLflow helps MLOps Engineers with:
121 |
122 | - Experiment tracking
123 | - Reproducibility
124 | - Model lineage
125 | - Model packaging
126 | - Deployment readiness
127 |
128 | In real projects, MLflow is often combined with:
129 | - Git (code versioning)
130 | - DVC (data versioning)
131 | - GitHub Actions (CI/CD)
132 | - Docker (containerization)
133 | - Kubernetes (serving)
134 | - Cloud storage like S3 (remote tracking)
135 |
136 |
137 |
138 |
--------------------------------------------------------------------------------
/06-fundamentals-of-model-deployment.md/01-introduction-to-deployment-and-serving.md:
--------------------------------------------------------------------------------
1 | # Introduction to Model Deployment and Model Serving
2 |
3 | Machine learning models create value only when they can be used by real users or systems. Training a model is just one step. To make it useful, the model must be deployed and served so that applications can request predictions from it.
4 |
5 | ---
6 |
7 | ## What Is Model Deployment
8 |
9 | Model deployment is the process of taking a trained machine learning model and making it available in a production environment.
10 |
11 | In simple terms, deployment means:
12 | - Moving the model out of a local machine
13 | - Packaging it with required code and dependencies
14 | - Making it accessible to other systems or users
15 |
16 | A deployed model can be accessed by:
17 | - Web applications
18 | - Backend services
19 | - Mobile apps
20 | - Batch jobs or data pipelines
21 |
22 | ---
23 |
24 | ## What Is Model Serving
25 |
26 | Model serving is how the deployed model **runs in production** and responds to prediction requests.
27 |
28 | A model serving system typically:
29 | - Loads the trained model into memory
30 | - Accepts input data (JSON, text, images, numbers)
31 | - Runs inference on the model
32 | - Returns predictions to the caller
33 |
34 | Model serving focuses on runtime behavior such as:
35 | - Response time (latency)
36 | - Number of requests handled (throughput)
37 | - Reliability and availability
38 |
39 | ---
40 |
41 | ## Types of Model Serving
42 |
43 | ### Real-time Serving
44 | - Predictions are returned instantly
45 | - Used when low latency is required
46 | - Examples:
47 | - Fraud detection
48 | - Recommendation systems
49 | - Intent classification APIs
50 |
51 | ### Batch Serving
52 | - Predictions run on large datasets at scheduled intervals
53 | - Used for offline analytics and reports
54 | - Examples:
55 | - Nightly churn prediction
56 | - Weekly risk scoring
57 | - Bulk data enrichment
58 |
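As a sketch of batch serving (the model, file names, and feature columns are placeholders), a nightly scoring job might look like this:

```
# batch_score.py - run on a schedule (cron, Airflow, etc.)
import joblib
import pandas as pd

model = joblib.load("model.joblib")       # placeholder trained classifier
customers = pd.read_csv("customers.csv")  # placeholder input dataset

# Score the whole dataset in one pass and persist the results for reports.
features = customers[["tenure", "monthly_spend"]]  # placeholder columns
customers["churn_score"] = model.predict_proba(features)[:, 1]
customers.to_csv("churn_scores.csv", index=False)
```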
59 | ---
60 |
61 | ## Why Model Deployment and Serving Matter
62 |
63 | A model that is not deployed is just an experiment.
64 |
65 | Production systems require:
66 | - Consistent availability
67 | - Fast responses
68 | - Ability to scale with traffic
69 | - Logging and monitoring
70 | - Safe updates and rollbacks
71 |
72 | Deployment and serving help ensure:
73 | - The model can handle real user traffic
74 | - Predictions remain reliable over time
75 | - Issues can be detected and fixed quickly
76 |
77 | ---
78 |
79 | ## Common Ways to Deploy and Serve Models
80 |
81 | ### Python API-Based Serving
82 | - Flask
83 | - FastAPI
84 | - Django
85 |
86 | Simple and ideal for learning and small-scale use cases.
87 |
88 | ---
89 |
90 | ### Container-Based Deployment
91 | - Docker for packaging
92 | - Kubernetes for orchestration
93 | - Load balancers for traffic distribution
94 |
95 | Used in production environments for scalability and reliability.
96 |
97 | ---
98 |
99 | ### MLOps and Model Serving Platforms
100 | - MLflow Model Serving
101 | - KServe
102 | - Seldon Core
103 | - TensorFlow Serving
104 | - TorchServe
105 |
106 | These tools provide built-in features like:
107 | - Model versioning
108 | - Auto-scaling
109 | - Canary deployments
110 | - Metrics and monitoring
111 |
112 | ---
113 |
114 | ### Serverless Deployment
115 | - AWS Lambda
116 | - GCP Cloud Run
117 | - Azure Functions
118 |
119 | Best suited for lightweight models and variable traffic patterns.
120 |
121 | ---
122 |
123 | ## Simple Example: Model Serving Flow
124 |
125 | For a basic prediction model:
126 |
127 | - Train a model locally
128 | - Save the trained model as a file
129 | - Load the model inside an API service
130 | - Expose a `/predict` endpoint
131 | - Deploy the service to a server or container platform
132 | - Client applications send requests and receive predictions
133 |
134 | ---
135 |
136 | ## Model Deployment vs Model Serving
137 |
138 | | Concept | Description |
139 | |-------|-------------|
140 | | Model Deployment | The process of releasing the model into a production environment |
141 | | Model Serving | The system that handles prediction requests at runtime |
142 |
143 | Deployment is a one-time or versioned activity, while serving is continuous and always running.
144 |
--------------------------------------------------------------------------------
/09-SageMaker/02-production-setup.md:
--------------------------------------------------------------------------------
1 | # Set up a SageMaker Domain using the AWS CLI
2 |
3 | ### Get the Default VPC ID
4 |
5 | ```
6 | aws ec2 describe-vpcs \
7 | --filters "Name=isDefault,Values=true" \
8 | --query "Vpcs[0].VpcId" \
9 | --output text \
10 |   --region <REGION>
11 | ```
12 |
13 | ### List Subnets Under the Default VPC
14 |
15 | ```
16 | aws ec2 describe-subnets \
17 |   --filters "Name=vpc-id,Values=<VPC_ID>" \
18 |   --query "Subnets[].SubnetId" \
19 |   --output text \
20 |   --region <REGION>
21 | ```
22 |
23 | ### Create an Execution Role for SageMaker Domain
24 |
25 | Create a simple trust policy
26 |
27 | Save as trust.json:
28 |
29 | ```
30 | {
31 | "Version": "2012-10-17",
32 | "Statement": [
33 | {
34 | "Effect": "Allow",
35 | "Principal": { "Service": "sagemaker.amazonaws.com" },
36 | "Action": "sts:AssumeRole"
37 | }
38 | ]
39 | }
40 | ```
41 |
42 | Create the role
43 |
44 | ```
45 | aws iam create-role \
46 | --role-name SageMakerDomainExecutionRole \
47 | --assume-role-policy-document file://trust.json
48 | ```
49 |
50 | Attach a basic policy (beginner friendly)
51 |
52 | ```
53 | aws iam attach-role-policy \
54 | --role-name SageMakerDomainExecutionRole \
55 | --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
56 | ```
57 |
58 | Save the role ARN from:
59 |
60 | `aws iam get-role --role-name SageMakerDomainExecutionRole --query "Role.Arn" --output text`
61 |
62 | ### Create the SageMaker Domain (Using Default VPC)
63 |
64 | This is the core step.
65 |
66 | ```
67 | aws sagemaker create-domain \
68 | --domain-name my-sagemaker-domain \
69 | --auth-mode IAM \
70 |   --vpc-id <VPC_ID> \
71 |   --subnet-ids <SUBNET_IDS> \
72 |   --app-network-access-type VpcOnly \
73 |   --default-user-settings "{
74 |       \"ExecutionRole\": \"<ROLE_ARN>\"
75 |   }" \
76 |   --region <REGION>
77 | ```
78 |
79 | This returns a DomainId. If you lose it, list domains:
80 |
81 | `aws sagemaker list-domains --region <REGION>`
82 |
83 | ### Create a SageMaker UserProfile + Tag It
84 |
85 | ABAC (attribute-based access control) depends on tags.
86 |
87 | ```
88 | aws sagemaker create-user-profile \
89 |   --domain-id <DOMAIN_ID> \
90 |   --user-profile-name alice-profile \
91 |   --tags Key=studiouserid,Value=alice123 \
92 |   --region <REGION>
93 | ```
94 |
95 | ### Create the IAM User and Tag the User
96 |
97 | The IAM user must have the same tag for ABAC matching.
98 |
99 | ```
100 | aws iam create-user --user-name alice-iam-user
101 | ```
102 |
103 | Add ABAC tag
104 |
105 | ```
106 | aws iam tag-user \
107 | --user-name alice-iam-user \
108 | --tags Key=studiouserid,Value=alice123
109 | ```
110 |
111 | ### Create the ABAC Policy
112 |
113 | This policy enforces two things:
114 |
115 | The IAM user can only generate a presigned URL for a user profile whose tag matches their own (studiouserid).
116 |
117 | The IAM user can view the domain and user profile in the SageMaker console.
118 |
119 | Save this as sagemaker-abac.json:
120 |
121 | ```
122 | {
123 | "Version": "2012-10-17",
124 | "Statement": [
125 | {
126 | "Sid": "AllowConsoleListAndDescribe",
127 | "Effect": "Allow",
128 | "Action": [
129 | "sagemaker:ListDomains",
130 | "sagemaker:ListUserProfiles",
131 | "sagemaker:ListApps",
132 | "sagemaker:DescribeDomain",
133 | "sagemaker:DescribeUserProfile",
134 | "sagemaker:ListTags"
135 | ],
136 | "Resource": "*"
137 | },
138 | {
139 | "Sid": "AllowPresignedUrlWhenTagMatches",
140 | "Effect": "Allow",
141 | "Action": [
142 | "sagemaker:CreatePresignedDomainUrl"
143 | ],
144 | "Resource": "*",
145 | "Condition": {
146 | "StringEquals": {
147 | "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
148 | }
149 | }
150 | }
151 | ]
152 | }
153 | ```
154 |
155 | ### Create the IAM policy
156 |
157 | ```
158 | aws iam create-policy \
159 | --policy-name SageMaker-Studio-ABAC \
160 | --policy-document file://sagemaker-abac.json
161 | ```
162 |
163 | ### Attach the Policy to the IAM User
164 |
165 | ```
166 | aws iam attach-user-policy \
167 | --user-name alice-iam-user \
168 |   --policy-arn arn:aws:iam::<ACCOUNT_ID>:policy/SageMaker-Studio-ABAC
169 | ```
170 |
171 | ### How the IAM User Opens SageMaker Studio
172 |
173 | There are two ways now:
174 |
175 | Using the SageMaker Console (now works due to list permissions)
176 |
177 | - IAM user signs in → goes to:
178 | - Amazon SageMaker → Studio → Domains
179 | - They can now see: The domain -> The user profile
180 |
181 | Using a Presigned URL (ABAC-restricted)
182 |
183 | The user (or admin) runs:
184 |
185 | ```
186 | aws sagemaker create-presigned-domain-url \
187 |   --domain-id <DOMAIN_ID> \
188 |   --user-profile-name alice-profile \
189 |   --session-expiration-duration-in-seconds 3600 \
190 |   --region <REGION>
191 | ```
192 |
193 | This returns a URL that opens SageMaker Studio only for this UserProfile.
194 |
195 | If an IAM user tries to open another user’s profile → access denied because the tags won't match.
196 |
--------------------------------------------------------------------------------
/06-fundamentals-of-model-deployment.md/02-popular-ways.md:
--------------------------------------------------------------------------------
1 | # High-level overview of popular model serving implementations
2 |
3 | Below are concise, high-level descriptions, architectures, trade-offs, and best-practices for four common model serving approaches: Flask on VM (WSGI + autoscaling), Containerized on Kubernetes (Ingress), Amazon SageMaker, and KServe.
4 |
5 | ---
6 |
7 | ## 1) Flask app deployment on VM with WSGI and autoscaling
8 |
9 | **What it is (short):**
10 | Run a Python Flask app that loads a model and exposes prediction endpoints (REST). Serve it via a production WSGI server (e.g., Gunicorn, uWSGI) on virtual machines. Autoscale by adding/removing VM instances (manual, cloud ASG, or autoscaler).
11 |
12 | **Architecture (high level):**
13 | - Model artifact stored on disk or fetched at startup (S3, artifact store).
14 | - Flask app exposes `/predict` (REST).
15 | - WSGI server (Gunicorn/uWSGI) runs multiple worker processes/threads.
16 | - Fronted by a load-balancer (cloud LB or HAProxy/Nginx).
17 | - Autoscaling group / VM scale set increases instances based on metrics (CPU, latency, queue length).
18 |
19 | **When to use:**
20 | - Small teams or POCs.
21 | - Low to moderate traffic; simple deployment requirements.
22 | - When you need direct control over the host environment or have non-containerized infra constraints.
23 |
24 | **Pros:**
25 | - Simple and easy to debug.
26 | - Minimal infra complexity; direct control of system packages and drivers (GPU drivers on VM).
27 | - Quick to prototype.
28 |
29 | **Cons / Risks:**
30 | - Operational overhead: patching, OS maintenance, scaling logic.
31 | - Harder to achieve fast, fine-grained autoscaling (startup time of VM can be high).
32 | - Less portable and reproducible than container-based deployments.
33 | - Concurrency limited by WSGI worker model; can be CPU-bound with Python GIL for single-process workers.
34 |
35 | **Best practices:**
36 | - Use a process manager and WSGI server with multiple workers.
37 | - Load model once per process; use batching if needed.
38 | - Use health checks and graceful shutdown to avoid dropping in-flight requests.
39 | - Autoscale on application-level metrics (latency, queue length) and keep warm instances or fast startup containers/VM images.
40 | - Add logging, metrics (Prometheus, StatsD), and tracing (OpenTelemetry).
41 |
42 | ---
43 |
44 | ## 2) Containerize and deploy to Kubernetes with Ingress
45 |
46 | **What it is (short):**
47 | Package model server (Flask/FastAPI/TorchServe/Triton or custom) into a container image, run it as pods on Kubernetes. Expose via Ingress (Ingress Controller / LB). Use Horizontal Pod Autoscaler (HPA) and potentially custom autoscalers (KEDA) for scaling.
48 |
49 | **Architecture (high level):**
50 | - Container image contains model server and dependencies.
51 | - Kubernetes Deployment or StatefulSet runs pods; ConfigMaps/Secrets for config.
52 | - Service exposes pod set; Ingress routes external traffic (nginx-ingress/ingress-nginx/traefik/ALB).
53 | - Autoscaling: HPA (CPU/RPS/Custom metrics), KEDA for event-driven scaling.
54 | - Optional: GPU nodes via nodePools; use device plugin for GPU scheduling.
55 | - Observability: Prometheus, Grafana, Loki, OpenTelemetry.
56 |
57 | **When to use:**
58 | - Production-grade systems requiring elasticity, multi-tenancy, and resilience.
59 | - Teams already using Kubernetes for infra.
60 | - Need for Canary/Blue-Green deployments, rollout strategies.
61 |
62 | **Pros:**
63 | - Portability: same image across environments.
64 | - Rich ecosystem: autoscaling, service discovery, network policies, RBAC.
65 | - Easy to integrate CI/CD, rollout strategies, canary testing.
66 | - Fast horizontal scaling of pods compared to VMs (container startup faster).
67 |
68 | **Cons / Risks:**
69 | - Operational complexity (K8s cluster management).
70 | - Resource fragmentation and noisy-neighbor issues without careful resource requests/limits.
71 | - Requires solid observability and cost control.
72 |
73 | **Best practices:**
74 | - Use readiness/liveness probes; graceful termination.
75 | - Keep container images small and immutable; load model from external artifact store or use initContainers to fetch large models.
76 | - Use resource requests/limits; tune HPA using meaningful metrics (latency, queue length rather than just CPU).
77 | - Secure with network policies, PodSecurity, and RBAC.
78 | - Use multi-stage builds and CI to test image, run model-smoke tests in CI.
79 | - Consider model warm-up or preloading to avoid cold-start latency.
80 | - For high-throughput or low-latency needs, use specialized servers (Triton, TorchServe) rather than general web frameworks.
81 |
82 | ---
83 |
84 | ## 3) Amazon SageMaker
85 |
86 | **What it is (short):**
87 | A fully managed AWS service for training and serving ML models. SageMaker provides hosted endpoints (real-time), multi-model endpoints, batch transform, and serverless inference options.
88 |
89 | **Architecture (high level):**
90 | - Models are registered in SageMaker Model registry or stored in S3.
91 | - Create an Endpoint (single-model or multi-model) backed by endpoint instances (EC2) or serverless compute.
92 | - Autoscaling via SageMaker Endpoint Auto Scaling (target tracking policies).
93 | - Integration with CI/CD (SageMaker Pipelines), Model Monitor for drift detection, and Experiments for lineage.
94 |
95 | **When to use:**
96 | - Teams on AWS wanting managed end-to-end MLOps: training, deployment, monitoring.
97 | - Need for simplified operational burden and built-in features like model monitoring, A/B testing, and built-in containers for popular frameworks.
98 |
99 | **Pros:**
100 | - Managed: reduces infra ops (patching, scaling, provisioning).
101 | - Feature-rich: model registry, batch inference, built-in monitoring and explainability features.
102 | - Tight integration with other AWS services (IAM, CloudWatch, S3, ECR).
103 | - Serverless inference option reduces need to manage instance fleets for sporadic traffic.
104 |
105 | **Cons / Risks:**
106 | - Cost can increase for always-on heavy workloads unless optimized (multi-model endpoints, serverless).
107 | - Vendor lock-in to AWS APIs and patterns.
108 | - Less flexibility for very custom runtime environments (though custom containers are supported).
109 |
110 | **Best practices:**
111 | - Use multi-model endpoints for many small models to save instances, or serverless endpoints for intermittent traffic.
112 | - Enable Model Monitor for data/label drift and alarms.
113 | - Use SageMaker Model Registry for versioning and lineage.
114 | - Automate deployment with SageMaker Pipelines or Terraform/CDK and integrate CI/CD for model promotion.
115 | - Control costs by right-sizing instance types and using autoscaling policies.
116 |
117 | ---
118 |
119 | ## 4) KServe (previously KFServing)
120 |
121 | **What it is (short):**
122 | An open-source Kubernetes-native model serving framework built for cloud-native MLOps. KServe provides standardized CRDs to deploy model servers with autoscaling, multi-framework support, canary rollouts, inference graphing, and explainability.
123 |
124 | **Architecture (high level):**
125 | - KServe runs on Kubernetes and exposes an `InferenceService` CRD per model.
126 | - Behind the scenes it orchestrates a server (predictor) with autoscaling (Knative/KEDA), transformer, and explainer components.
127 | - Supports many frameworks out-of-the-box: TensorFlow, PyTorch, ONNX, Triton; can use custom containers.
128 | - Integrates with Knative Serving for request autoscaling (including scale-to-zero) and with istio/other service meshes for networking.
129 |
130 | **When to use:**
131 | - Teams standardizing on Kubernetes and wanting a declarative, extensible model-serving platform.
132 | - When you need multi-framework support, canary deployments for models, and scale-to-zero support to save costs.
133 |
134 | **Pros:**
135 | - Declarative model lifecycle using Kubernetes CRDs.
136 | - Rich features: canary rollouts, autoscale-to-zero, explainers, transformers, batch/streaming integration.
137 | - Framework abstraction: swap underlying predictor implementation without changing higher-level config.
138 | - Extensible and vendor-neutral; integrates with Kubeflow and other tools.
139 |
140 | **Cons / Risks:**
141 | - Requires Kubernetes and some maturity in K8s operations.
142 | - Complexity in advanced features (Knative setup, autoscaling tuning).
143 | - Ecosystem maturity varies; sometimes upgrades or custom integrations needed.
144 |
145 | **Best practices:**
146 | - Use `InferenceService` to standardize deployments; use canary rollouts for model updates.
147 | - Leverage Knative or KEDA for event-driven or scale-to-zero behavior to reduce cost.
148 | - Use model explainers and monitors (prometheus exporters) offered by KServe.
149 | - Manage model artifacts outside the cluster (object storage) and use init containers or model loaders.
150 | - Integrate CI/CD to create and update InferenceService resources programmatically.
151 |
152 |
--------------------------------------------------------------------------------