├── .vscode
│   └── settings.json
├── AWS MLE Study Guide.md
├── AWS Power Hour.md
├── AWS Ramp Up Guide.md
├── Exam Readiness Course.md
├── Full Exams.md
├── One Minute AWS MLE Playlist.md
├── Practical Data Science on AWS.md
├── README.md
├── Related Whitepapers.md
├── Udemy - AWS Certified Machine Learning Specialty 2022 - Hands On!.md
├── images
│   ├── 1.png
│   ├── AWS ML Flywheel.png
│   ├── Amazon Flywheel.png
│   ├── Anomaly Detection in AWS.png
│   ├── Architectures and Frameworks.png
│   ├── AutoML Workflow.png
│   ├── Automated Quality Gates.png
│   ├── Choosing Recommender Models.png
│   ├── Configure Training.png
│   ├── Debugger.png
│   ├── ML Workflow.png
│   ├── Metrics to use for Recommender.png
│   ├── Model Deployment for Drift Monitoring.png
│   ├── Recommender Development and Deployment.png
│   ├── Recommender Logging.png
│   ├── S3.png
│   ├── SPARK.png
│   ├── Sagemaker Autopilot.png
│   ├── Sagemaker Services.png
│   ├── Spark and Sagemaker.png
│   ├── Summary.png
│   ├── ml_map.png
│   └── precisionvsrecall.png
└── pdf
    └── AWS-AI-Services-2023.pdf

/.vscode/settings.json:
--------------------------------------------------------------------------------
{
    "gitdoc.enabled": true
}
--------------------------------------------------------------------------------
/AWS MLE Study Guide.md:
--------------------------------------------------------------------------------

# **[Machine Learning Engineering Guide](https://training.resources.awscloud.com/get-certified-machine-learning-specialty/aws-certified-machine-learning-specialty-exam-guide)**

This is a summarized version of the most important topics from the official exam guide. We go through each key term in the document step by step and give a high-level description and definition of each. We have the following main domains to study:

## Domain 1: Data Engineering

Identify and compare storage mediums, classes and data sources.
Know the difference between data engineering concepts such as data lakes and batch load vs streaming data.

## Domain 2: Exploratory Data Analysis

Learn how to deal with missing/corrupt data and various preprocessing techniques for NLP/CV. Learn to interpret descriptive statistics and choose appropriate graphs.

## Domain 3: Modeling

Learn when and how to use each algorithm for specific use cases. The main focus is on key terms and important concepts such as optimizers, hyperparameter tuning and model evaluation.

## Domain 4: Machine Learning Implementation

Focus on productionizing maintainable and secure ML models through AWS services. Decide when to use out-of-the-box ML models instead of building your own. Learn MLOps architectures.

(*Tip: Use CTRL+F to navigate to a specific term of interest*)

------

## **Domain 1: Data Engineering**

### **Main database storage mediums (and use case)**

- Aurora, RDS, Redshift (Relational) → For ERP/CRM

- DynamoDB (NoSQL-Key Value) → E-Commerce / Gaming / High Traffic

- ElastiCache, MemoryDB (In-Memory) → Caching / Geo-Spatial

- DocumentDB (Document) → Content Management / User Profiles

- Keyspaces (Wide Column) → Equipment / Fleets / Routes

- Neptune (Graph) → Fraud Detection / Recommenders / Social Networks

- QLDB (Ledger) → Banking / Supply Chain

(Check the [table here](https://aws.amazon.com/products/databases/) for a more comprehensive overview)

(Review [2022 ReInvent](https://www.youtube.com/playlist?list=PL2yQDdvlhXf_22xqaqPb13gDRDOq2Sjg4) for appropriate use of storage solutions)
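As a quick illustration of the key-value model DynamoDB uses, here is a minimal sketch. The table name ("Orders") and the item attributes are made up for illustration; actually sending the request needs AWS credentials and an existing table, so the boto3 call is left commented out.

```python
# Hypothetical example: building a DynamoDB put_item request for an
# e-commerce order. Table/attribute names are illustrative only.

def put_order_args(order_id: str, customer: str, total: str) -> dict:
    """Build the kwargs for DynamoDB's put_item call (a key-value item)."""
    return {
        "TableName": "Orders",
        "Item": {
            "order_id": {"S": order_id},   # partition key, string type
            "customer": {"S": customer},
            "total": {"N": total},         # DynamoDB numbers are sent as strings
        },
    }

# import boto3
# dynamodb = boto3.client("dynamodb")
# dynamodb.put_item(**put_order_args("o-1", "alice", "42.50"))
```

Note the type descriptors (`"S"`, `"N"`): the low-level DynamoDB API is schemaless apart from the key, which is what makes it a fit for high-traffic key-value workloads.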
### **Data Lakes (definition and services)**

![Data Lakes in AWS](https://d1.awsstatic.com/Data%20Lake/320x320-what-is-a-data-lake.b32634fa96e91bb5670b885be9428a2c0c40c76d.png)

Definition: A centralized repository that allows you to store all your structured and unstructured data at any scale.

In a data lake we can perform big data processing and real-time analytics for machine learning use cases.

In AWS we use the following services to create a data lake:

- S3 (Storage) → (Standard, IT, IA, One-Zone IA, Glacier, Archive).

- Athena (Analytics) → Use SQL (and Python) to query data.

- EMR (Elastic MapReduce) → Big data pipelines for streaming and analytics.

- Lake Formation (Governance) → Build and manage data lakes, with controls for security and governance.

- Glue (ETL) → Event-driven data integration tool.

(Review the [following chart here](https://aws.amazon.com/big-data/datalakes-and-analytics/) for more insights on analytics)
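To make the Athena-over-S3 idea concrete, a minimal sketch of starting a query. The database, table and results bucket are invented names; running it for real needs AWS credentials, so the boto3 call is commented out.

```python
# Hypothetical example: querying data-lake files in S3 with Athena SQL.
# "analytics_db", "events" and the results bucket are made-up names.

def athena_query_args(sql: str, database: str, output_s3: str) -> dict:
    """Build the kwargs for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

# import boto3
# athena = boto3.client("athena")
# resp = athena.start_query_execution(
#     **athena_query_args("SELECT * FROM events LIMIT 10",
#                         "analytics_db", "s3://my-athena-results/"))
```

Athena writes query results to the S3 `OutputLocation`, which is why the data lake and the query engine stay decoupled.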
### **Simple Storage Service - S3 (definition and main features)**

Storage solution used for:

- Data Lakes
- Back-Ups
- Apps
- Archive

It has the following features:

- Access Points → for large, shared datasets (with different restriction policies).

- Batch Operations → for modifying object metadata, copying objects, IAM policy management and Lambda functions.

- Block Public Access → at bucket or account level.

- Multi-Region Access Points → route requests based on proximity.

- Object Lambda → invoking Lambda functions to transform objects.

- Object Lock → prevents objects from being overwritten or deleted (WORM - write once, read many).

- Replication → of objects in buckets (fully managed and elastic).

- Analytics → using Storage Lens for insights across S3 objects.
### **S3 Storage Classes**

S3 is composed of different tiers, based on the required use case, as follows:

1. Standard → For big data analytics, game applications, dynamic sites etc.

2. Intelligent-Tiering → Automatically moves objects to the most cost-effective tier.

3. Infrequent-Access → Less frequent, but rapid access (think backups).

4. Infrequent-Access One Zone → Single AZ, costs 20% less than IA (less accessed backups).

5. Glacier (Instant Retrieval) → For rare access, but retrieval in ms (online file-sharing, disaster recovery).

6. Glacier (Flexible Retrieval) → For rare access, retrieval in minutes (expedited) to hours (standard) (digital media, legacy docs).

7. Glacier (Deep Archive) → Rare access, retrieval within 12-48 hours (legacy docs, low importance - think auditing).

8. Outposts → Pool of AWS compute and storage capacity deployed at a customer site (online multiplayer gaming).
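The storage class is just a parameter when writing an object. A minimal sketch (bucket and key are made up; the actual upload needs AWS credentials, so the boto3 call is commented out):

```python
# Hypothetical example: uploading an object directly into a chosen S3 tier.
# VALID_CLASSES is a subset of the StorageClass values boto3 accepts.

VALID_CLASSES = {"STANDARD", "STANDARD_IA", "ONEZONE_IA",
                 "INTELLIGENT_TIERING", "GLACIER", "GLACIER_IR", "DEEP_ARCHIVE"}

def put_object_args(bucket: str, key: str, body: bytes,
                    storage_class: str = "STANDARD") -> dict:
    """Build kwargs for s3.put_object, validating the storage class name."""
    if storage_class not in VALID_CLASSES:
        raise ValueError(f"unknown storage class: {storage_class}")
    return {"Bucket": bucket, "Key": key, "Body": body,
            "StorageClass": storage_class}

# import boto3
# s3 = boto3.client("s3")
# s3.put_object(**put_object_args("my-backups", "2023/dump.tar.gz",
#                                 b"...", "DEEP_ARCHIVE"))
```

In practice lifecycle rules (or Intelligent-Tiering) usually move objects between tiers automatically; setting the class on upload fits one-shot archival use cases.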
### **EFS - Elastic File System**

File storage solution which:

- Enables sharing data without provisioning & can scale automatically.

- Supports up to thousands of EC2 instances connecting to a file system concurrently.

- Can store petabytes (a petabyte is 2^50 bytes).

- Offers an EFS One Zone-Infrequent Access option → price/performance optimized for rarely accessed data (saves up to 92% on file storage costs).
### **EBS - Elastic Block Store**

Block storage solution which:

- Can be thought of as a hard drive for a single EC2 instance.

- Stores data in equally sized blocks.

- Has a performance advantage (EBS encryption with low latency).

- Use cases include long-term logs & distribution of mass content.

- Offers snapshots of backups for compliance.

- Has an automated lifecycle manager based on given policies.
**Summary**: We have many options which overlap and are context dependent. For the longest-term storage follow this pattern:

S3 → EBS → EFS (different tiers based on the given context).

For higher performance on very small objects we use the databases listed above for each scenario.

The main focus should be on the use case, tier and pricing optimization.
### **Batch Load vs Streaming**

The main difference is that:

- Batch load processes all (or most) of a dataset in large batches. Latency is usually in minutes or hours. (Use AWS Batch or Glue.) AWS Glue is an event-driven ETL tool with many features such as no-code options, data quality monitoring, data prep etc.

- Streaming processes data in real time, in micro-batches of seconds or milliseconds, for simple functions (Kinesis or MSK).

**For streaming we have the following services:**

- Kinesis Video Streams → Streaming media from connected devices to AWS for storage/ML/analytics. Fully managed ingestion, storage and processing. Good integration with MXNet, TensorFlow, OpenCV. Best use cases: Smart Home/City & Industrial Automation.

- Kinesis Data Streams → Read & process data streams in real time. Use cases: log intake, real-time metrics/analytics, powering events.

- Kinesis Data Firehose → Real-time analytics (BI & dashboards). Can also batch, compress, encrypt & scale.

- Kinesis Data Analytics → For real-time metrics, analytics and interactive data streams.

- Managed Streaming for Apache Kafka (MSK) → Managed Kafka solution if you have an existing Kafka setup integrated with your data stream.

**Summary**: For batch operations we have Glue or large AWS Batch jobs which we can schedule. We can also deploy open-source solutions (f.ex Airflow). For streaming we use Firehose mainly for loading/delivering streaming data, and Data Streams for real-time ingestion.

*Refer [here](https://www.whizlabs.com/blog/aws-kinesis-data-streams-vs-aws-kinesis-data-firehose/) for more details.*
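One detail worth internalizing about Kinesis Data Streams is that a record's partition key determines its shard (the MD5 hash of the key falls into one shard's hash range), which preserves per-key ordering. The sketch below is a simplified illustration of that idea under the assumption of equally sized hash ranges, not the AWS implementation; shard counts and keys are made up.

```python
# Simplified illustration (not the AWS internals) of partition-key routing:
# hash the key with MD5, then map the 128-bit hash onto N equal shard ranges.
import hashlib

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key onto one of num_shards contiguous hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    key_hash = int.from_bytes(digest, "big")      # 128-bit integer
    return key_hash * num_shards // 2 ** 128      # index in [0, num_shards)

# The same key always lands on the same shard, so records for one device
# (f.ex "sensor-42") stay ordered within that shard.
```

This is why a low-cardinality partition key creates "hot shards": all traffic hashes into the same range.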
### **EMR (Elastic MapReduce)**

Big data analytics platform hosted in AWS, used for petabyte-scale analytics. EMR has the following features:

- Around half the cost (-50%) of on-premises solutions.

- Up to 1.7x faster than standard Apache Spark.

- Instant scaling and fully managed resources.

- Used mainly for big data ML, clickstream analysis, ETL, etc.

*To learn more about Apache Spark try [this course](https://www.coursera.org/learn/scala-spark-big-data)*
------
## **Domain 2: Exploratory Data Analysis**

### **How to deal with missing data**

To deal with missing data we can impute, delete rows/columns or do nothing.

### **Data Imputation Methods (Numeric)**

- KNN Imputer (using the mean value from n neighbors). **Best One**

- MICE (using predictive mean matching through chained equations).

- Bayesian PCA (maximizing marginal likelihood through BPCA).

- Bayesian Linear Regression (mean imputation as a linear combination of features).

- Others (Mean, Mode, Median, Forward/Backward Fill, Interpolation).

### **Data Imputation Methods (Non-Numeric)**

- Most frequent value.

- Add 'Unknown' or NA.

- Category-specific imputation (f.ex 'animal' for dogs & cats).

### **How to deal with corrupt data**

- Delete rows/columns.

- Create a new category.

- Imputation (see above).

- Predictive model for missing values.

### **Types of missing/corrupt data**

- M.C.A.R → Missing completely at random (no pattern).

- M.A.R → Missing at random (missingness depends on observed data, f.ex a machine error, and can be inferred).

- M.N.A.R → Missing not at random (deliberately hidden info like salary).

- S.M.D → Structurally missing data (we know the pattern and can infer).

### **Types of preprocessing**

- NLP (lowercase, punctuation, numbers, special characters, emojis etc.).

- CV (greyscale, standardize images, augment data etc.).

- Most use cases are context dependent on what the scope or objective is.

### **Formatting data**

- Choose the right data format (Parquet is usually faster).

- Put similar items under the same columns.

- Avoid blank rows.

- Avoid trailing whitespaces.
- Keep format consistent (f.ex address, currency, date etc.).

- Quality > Quantity of data.
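The two simplest imputation strategies listed earlier in this domain (mean for numeric columns, most frequent value for categorical ones) can be sketched without any libraries:

```python
# Library-free sketch of mean and most-frequent imputation.
from collections import Counter

def impute_mean(values):
    """Replace None in a numeric column with the column mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_most_frequent(values):
    """Replace None in a categorical column with the modal value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# impute_mean([1.0, None, 3.0])           -> [1.0, 2.0, 3.0]
# impute_most_frequent(["a", None, "a"])  -> ["a", "a", "a"]
```

The fancier methods (KNN, MICE, BPCA) replace the "mean" step with a model fitted on the other features.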
### **Normalizing data**

- Decimal place normalization (f.ex accounting numbers in Excel).

- Data type normalization (f.ex general to currency in Excel).

- Z-score normalization (normalize relative to the standard deviation): (Value - Mean) / SD.

- Linear (min-max) normalization: (X - min(X)) / (max(X) - min(X)).

- Clipping normalization (re-assign outliers).

- Normalized standard deviation (SD / Mean).
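The two scaling formulas above are short enough to sketch directly (population standard deviation is assumed for the z-score):

```python
# Library-free sketch of z-score and min-max scaling.
import math

def z_score(values):
    """(value - mean) / standard deviation, using the population SD."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def min_max(values):
    """(x - min) / (max - min): rescales into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# min_max([10, 15, 20]) -> [0.0, 0.5, 1.0]
# z_score output always has mean 0 (and SD 1)
```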
### **Normalization vs Standardization**

- Normalization - Rescales values into a fixed range, typically [0, 1] (min-max scaling).

- Standardization - Rescales values to have zero mean and unit variance (z-score).

### **Data Augmentation Techniques**

Chiefly for text & image based data. For numerical data we can use synthetic simulated data.

**For text data:**

- Synonym replacement (ELMo, BERT).

- Lexical based replacement (NLTK, spaCy).

- Random insertion/deletion/swapping.

- Backtranslation (f.ex ENG -> FR -> ENG).

- Generative models (TextAttack, NLPAug, TextAugment).

**For image data:**

- Data warping (preserves the label; geometric/color transformations).

- Oversampling (GANs, feature space augmentation, mixing images).

- Combination of both.
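The "random insertion/deletion/swapping" idea for text is easy to sketch with just the standard library (seeded here so runs are reproducible; parameters like `p` are illustrative defaults, not from any particular library):

```python
# Library-free sketch of two simple text-augmentation operations.
import random

def random_deletion(tokens, p=0.2, seed=0):
    """Drop each token with probability p (always keep at least one)."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

def random_swap(tokens, n_swaps=1, seed=0):
    """Swap n random pairs of token positions."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out
```

Libraries like NLPAug wrap these operations (plus model-based synonym replacement) behind one interface.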
### **Labeled data:**

- Manual labeling through domain knowledge (customizable, guards against bias).

- Mechanical Turk (access to a workforce in a marketplace).

- SageMaker Ground Truth Plus (share your data and labeling requirements, and you get the data back labeled).
### **Feature Engineering Concepts:**

- Binning (group breakdown).

- Tokenization (word/character breakdown).

- Adding synthetic features.

- One-hot encoding (for categorical variables).

- PCA.
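Two of the concepts above, one-hot encoding and binning, can be sketched in a few lines (the age-band edges below are made up for illustration):

```python
# Library-free sketch of one-hot encoding and binning.

def one_hot(values):
    """Return a dict of indicator columns, one per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def bin_value(x, edges):
    """Return the index of the first bin whose upper edge exceeds x."""
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

# one_hot(["red", "blue", "red"])
#   -> {"blue": [0, 1, 0], "red": [1, 0, 1]}
# bin_value(37, [18, 35, 65]) -> 2   (bands: <18, 18-34, 35-64, 65+)
```

In practice you would reach for `pandas.get_dummies` / `pandas.cut`, but the underlying transformation is exactly this.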
### **Graphing Data:**

Use the [following guide](https://www.sqlbi.com/ref/power-bi-visuals-reference/) for which chart to use in each scenario. A summarized version is below:

- Comparison (Clustered Bar Chart).

- Change over time (Line Chart).

- Ranking (Clustered Bar Chart).

- Flow (Waterfall Chart).

- Part-to-Whole (Clustered Bar Chart).

- Distribution (Clustered Column).

- Correlation (Scatterplot).
### **Interpreting Descriptive Statistics:**

For a quick refresher you can complete [the following course](https://www.khanacademy.org/math/statistics-probability).

- Variance (variability/spread of a distribution).

- Effect size (difference between groups).

- PMF (probability mass function; used to calculate the mean and variance of a discrete distribution).

- CDF (obtained by summing the probability mass/density function, giving the cumulative probability for a random variable).

- Different distributions (Gaussian, Log-Normal, Pareto).

- PDF (probability density function; defines the probability distribution, i.e. the likelihood of outcomes, for a continuous random variable).

- KDE (application of kernel smoothing for probability density estimation).

- Raw moment vs standardized moment (a moment that is not normalized vs one that is).

- Correlation methods (Spearman's vs Pearson's).

- Other terms to know (Standard Error, Chi-Squared Tests).

*To learn more and apply the basic stats required for the exam try [this repository book](https://github.com/AllenDowney/ThinkStats2)*.
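The PMF → CDF relationship above is easiest to see built from a small sample of a discrete variable:

```python
# Library-free sketch: empirical PMF and its running sum, the CDF.
from collections import Counter

def pmf(sample):
    """Empirical probability mass function: value -> probability."""
    counts = Counter(sample)
    n = len(sample)
    return {v: c / n for v, c in sorted(counts.items())}

def cdf(sample):
    """Cumulative distribution: running sum of the PMF."""
    total, out = 0.0, {}
    for v, p in pmf(sample).items():
        total += p
        out[v] = total
    return out

# pmf([1, 2, 2, 3]) -> {1: 0.25, 2: 0.5, 3: 0.25}
# cdf([1, 2, 2, 3]) -> {1: 0.25, 2: 0.75, 3: 1.0}
```

The CDF always ends at 1.0, which is a handy sanity check when computing these by hand on the exam.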
### **Interpreting the OLS Stats Summary:**

When running an ordinary least squares (OLS) regression, the summary reports:

- Dep. Variable (what we are trying to predict).

- Model (usually OLS).

- Number of Observations.

- DF Residuals (degrees of freedom -> number of observations - number of predictors - 1).

- DF Model (number of predictors).

- Covariance Type (how the standard errors are estimated, f.ex nonrobust).

- R-Squared → how much of the variation in the dependent variable is explained by the independent variables.

- Adjusted R-Squared → penalizes for non-contributing variables.

- F-Stat → tests whether the variables are jointly statistically significant.

- Coef → positive or inverse relationship of each independent variable to the dependent variable.

- Std Error → standard deviation of the coefficient → variation of the coefficient.

- P>|t| → uses the t-stat to produce a p-value (showing statistical significance).

- Omnibus → normality of the distribution, using skew and kurtosis as measurements (0 = perfectly normal).

- Prob (Omnibus) → probability of the residuals being normally distributed.

- Skew → symmetry (0 = perfect symmetry).

- Kurtosis → tailedness of the distribution (high = heavier tails, more outliers).

- Durbin-Watson → tests autocorrelation in the residuals; values near 2 are best.

- JB (Jarque-Bera) → alternative method of measuring the normality of the distribution.

- Condition Number → measures multicollinearity.

*To learn more about OLS and interpreting the statsmodels library output for the exam read [this article](https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a)*.

- P-value → how likely it is that the data could have occurred under the null hypothesis by random chance (below 0.05 we reject the null and count the result as significant).
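Two of the summary fields above, Coef and R-Squared, can be computed by hand for the one-variable case, which is a useful check on intuition (this is a library-free sketch, not the statsmodels implementation):

```python
# Closed-form simple linear regression: slope, intercept and R-squared.

def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx                      # the "coef" of the predictor
    intercept = my - slope * mx            # the "const" coef
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot               # share of variance explained
    return slope, intercept, r2

# A perfect linear relationship y = 2x + 1 gives R^2 = 1:
# ols_fit([1, 2, 3, 4], [3, 5, 7, 9]) -> (2.0, 1.0, 1.0)
```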
------
## **Domain 3: Modeling**
### **When to use ML/AI:**

- When we have the right amount of data (and the right data).

- When we have properly assessed the ROI and similar use cases.

- Ease of support and deployment.

- Team support (Devs, Product, DevOps).

### **Supervised vs Unsupervised Learning**

- Supervised → has labeled inputs and outputs.

  - Classification (Linear Classifiers, SVM, Decision Trees, Random Forests).

  - Regression (Linear Regression, Logistic, Polynomial).

- Unsupervised → clustering unlabeled data.

  - Clustering (K-Means → grouping on similarities).

  - Association (f.ex Recommenders).

  - Dimensionality Reduction (Autoencoders / PCA).

- Other differences:

  - Goals (Supervised → prediction, Unsupervised → making sense of unlabeled data).

  - Application (Supervised → sentiment/forecasting, Unsupervised → recommenders).

  - Complexity (may vary).

  - Drawbacks (time, accuracy, domain expertise).

- Semi-Supervised differences:

  - A mix of both methods.

  - Uses labeled data for ground-truth predictions & unlabeled data to learn the shape of the distribution.

  - Self-training (modifying supervised training to work in a semi-supervised way).

  - Use case: ranking web pages (Google).

### **Selecting an Appropriate Model**

Note: Here we are only including some of the main ML/DL models. There are many more, which we can find on sites such as Hugging Face, Connected Papers etc.

We need to decide based on:

- Regression vs Classification

- Type of data

- Scope / Objective

### **[XGBoost](https://www.youtube.com/watch?v=OtD8wVaFm6E)**

- Tree-based boosting model that combines weak learners into a strong learner by iteratively adjusting weights.
- We use it when:

  - We have a large number of observations in the training data.

  - Nr. of observations > Nr. of features.

  - Features are a mixture of numeric & categorical (or just numeric).

  - The problem is supervised.

### **[Logistic Regression](https://youtu.be/yIYKR4sgzI8)**

- Simple algorithm for classifying categorical data.

- Examples include spam detection etc.

### **[K-Means](https://www.youtube.com/watch?v=4b5d3muPQmA)**

- The mean of the clustered data points is at a minimum per cluster.

- We use it when:

  - We don't have a specific outcome to predict.

  - The problem is unsupervised (lacking labels).

### **[Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU)**

- Adjusting and assessing the relationship of variables.

- Use when we want to predict the outcome of a target variable based on its relationship with predictor variables.

- A simple checklist to check whether the model is appropriate is the following:

  - Check for a linear relationship of variables.

  - Multivariate normality (close to normal distribution).

  - No multicollinearity (correlation of independent variables).

  - Autocorrelation kept to a minimum (similarity of a lagged version with itself).

  - Homoscedasticity (constant variance of the error term).

### **[Decision Trees](https://youtu.be/_L39rN6gz7Y)**

- Simple algorithm based on a decision at each node.

- We can use them when we want easier interpretation (instead of Random Forests).

- Can be used both for regression & classification.

### **[Recurrent Neural Networks (RNNs)](https://youtu.be/AsNTP8Kwu80)**

- Neural networks where the output from the previous step is fed to the current input.
- Mainly used for text, speech and generative AI.

### **[Convolutional Neural Networks (CNNs)](https://youtu.be/HGwBXDKFk9I)**

- Use convolutions to extract features.

- Have fully connected layers to make the final prediction.

- Main uses: facial recognition, self-driving cars etc.

### **[Transfer Learning](https://youtu.be/yofjFQddwHE)**

- Transferring knowledge from a previous task to a new task (f.ex vehicle to bus recognition).

### **Train/Test/Validation and Cross Validation**

- Rule of thumb:

  - Training data > 1000k

  - The best train/validation/test split is usually around 75/10/15 (adjusted for variance, data size etc.).

- Cross Validation steps:

  - Shuffle the dataset randomly.

  - Split it into k groups.

  - Take a holdout group (test) and train on the rest.

  - Fit the model.

  - Summarize the skill of the model.

- Cross Validation types:

  - Leave-p-out (using p samples for validation).

  - Leave-one-out (p=1).

  - Stratified k-fold (good for imbalanced sets, but not time series).

  - Time Series.

  - Nested (double CV).

### **[Other Important Concepts](https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/)**

- Loss Functions (measure whether models show improvement).

  - Squared Error (square of the difference between actual and predicted values).

- Gradient Descent (optimization algorithm where, over multiple epochs, weights are adjusted to find local/global minima).

  - Batch/Vanilla (takes all available training data to find minima) - but it's too slow.

  - Stochastic (shuffles and updates on individual samples, constantly aiming for new minima) - good, but still slow.

  - Mini-Batch (updates on subsets smaller than the total data set) - here optimizers come in.
- Types of optimizers:

  - Momentum → weight parameters are updated using the previous gradient, based on momentum.

  - AdaGrad → adapts the learning rate per parameter; well suited for sparse data.

  - RMSProp → improves on AdaGrad by adjusting weights automatically with a decaying average of squared gradients.

  - Adam → exploits momentum's speed while using an exponentially decaying average of past gradients.

### **CPU vs GPU**

- CPU is the better choice for general, non-DL, small-data cases.

- GPU for data with huge volume and Deep Learning.

### **Distributed vs Non-Distributed ML**

- We should not use Spark or an equivalent solution if:

  - the pandas processing time is reasonable.

  - we don't need all the data to be processed.

  - we have an equivalent solution in the cloud (EMR).

  - the solution can be implemented with simple SQL.

### **Hyper Parameter Optimization**

- [Regularization](https://www.einfochips.com/blog/regularization-make-your-machine-learning-algorithms-learn-not-memorize/) (prevents a model from overfitting by regularizing learning).

- 3 main regularization methods:

  - Ridge Regression (L2 Regularization)

    - Adds the sum of the squared weights to the loss function, creating a new loss function denoted thus:

    ![L2](https://www.einfochips.com/blog/wp-content/uploads/2019/01/L2-Regularization.png)

  - Lasso Regression (L1 Regularization)

    - Uses absolute weight values for normalization.

    - Eliminates less important features by setting their weight values to zero, performing feature selection along with regularization.

    ![Lasso](https://www.einfochips.com/blog/wp-content/uploads/2019/01/L1-Regularization.png)

  - Dropout (mainly for Neural Nets)

    - Drops connections with probability 1-p for each of the specified layers.
Here p is called the **keep probability** parameter, and it needs to be tuned.

![Dropout](https://www.einfochips.com/blog/wp-content/uploads/2019/01/Dropout.png)

### **[Hyperparameter Tuning](https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide)**

- Hyperparameter tuning determines the right combination of hyperparameters that maximizes model performance.

- Main hyperparameter tuning methods:

  - GridSearchCV (a grid of possible values tried in order). Very slow, but the best performing.

  - RandomizedSearchCV (a grid of possible hyperparameter values, but each iteration tries a random combination from this grid). Faster, but lower accuracy.

  - BayesSearchCV (uses Bayesian optimization to find the minimal point in the minimum number of steps). Faster than GridSearch & produces better results than RandomizedSearch.

### **[Neural Networks at a High Level](https://youtu.be/aircAruvnKk)**

The basic idea is to think in terms of brain synapses and how they transmit information.

In neural nets we use various formulas to adjust the weights of the layers through which we transmit the information.

![Brain](https://miro.medium.com/max/640/1*Zx0FP-qA_mlAWqxiQh7ZRw.webp)

![Neuron](https://miro.medium.com/max/640/1*39ZfHWfdv1UNhFFS9dcb1Q.webp)

*Note: Watch and keep notes of the 3B1B Neural Network [playlist here](https://youtu.be/aircAruvnKk)*.

### **Evaluating Machine Learning Models**

Overfitting & Underfitting in a nutshell!

![Over&Under](https://miro.medium.com/max/720/1*lARssDbZVTvk4S-Dk1g-eA.webp)

- [Avoiding overfitting](https://elitedatascience.com/overfitting-in-machine-learning):

  - Use Cross-Validation.

  ![CV](https://elitedatascience.com/wp-content/uploads/2017/06/Cross-Validation-Diagram-768x295.jpg)

  - Train with more data.
  - Early stopping algorithm.

  ![Stop](https://elitedatascience.com/wp-content/uploads/2017/09/early-stopping-graphic.jpg)

  - Adjust regularization.

  - Ensemble models (f.ex Bagging & Boosting).

- Avoiding underfitting (adjust the above-mentioned, but in the opposite direction):

  - Decrease regularization.

  - Increase the duration of training.

  - Increase the model complexity (f.ex parameters).

  - Shuffle data after each epoch (Neural Nets).

### **[Main Metrics](https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide)**

### **Regression Metrics**

- **Mean Squared Error (MSE)**

  - Average of the squared difference between the target value and the value predicted.

  - Due to the squaring, it penalizes large errors heavily and is more prone to outliers than other metrics; it can overstate errors.

  ![MSE](https://lh5.googleusercontent.com/UU0UymvLgNfq6va2--cOndvalbdcZQX20FuzalU2RR0qxwesRa2pjZesapeFvMnRu39KlVbGIhVk6W6w1C2o_WbwEOYoU9UZtnZCw2eS2hBQbR-4RSShqkMGGCfg9Lr3eVM_1e-8)

- **Mean Absolute Error (MAE)**

  - Average of the absolute difference between the ground truth and the predicted values.

  - MAE uses the absolute value of the residual (it doesn't overestimate errors; however, it doesn't give us an idea of the direction of the error).

  ![MAE](https://lh6.googleusercontent.com/gbSihak-qx9VaNa-ibUqxaIM6mD9nmfwI7wwxK_tRyOfUUGJ_XnH6jU_vcDqD9IgI1disL-cTIELJx5skJZ2uIX6oSC9rG2M8hKoabc4fIoBzJdNg3NkT91GBqH9yabg5sKSf4-J)

- **Root Mean Squared Error (RMSE)**

  - Square root of the average of the squared difference between the target value and the value predicted - SQRT(MSE).

  - Best of both worlds since it's differentiable (like MSE) and less prone to outliers (like MAE).
  ![RMSE](https://lh5.googleusercontent.com/pxb5gFdX2WYgW5dAvofM3bGUpJumpr_ATYdTScT3oXB-fXr-wAZ4QTOEjNaWpDtVPyU_Iyv62uJ3HlzAcT6dVj9x5ZgZ246oCgD5zVVOW65EQ8XUnESmVVHRLt7sc5szK4pIXxC_)

- **R-Squared (R²)**

  - Post-metric: how much (what %) of the total variation in Y (target) is explained by the variation in X (regression line).

  - If the sum of squared errors of the regression line is small => R² will be close to 1 (ideal), meaning the regression was able to capture nearly all of the variance in the target variable.

- **Adjusted R-Squared (Adj-R²)**

  - R² can mislead (showing improvement when really we are overfitting).

  - Adj-R² is adjusted for the number of independent variables.

  - It accounts for added predictors and only rises if there is a real improvement.

  ![Adj-R²](https://lh3.googleusercontent.com/D3TyYb1YNZKngK6sE01edMDL0u0uyJDlaaiwXu0g9haVWxJ9hqC3T01RUGTqPqelUIgIfBKI_KBAd8NyRVORsRKXmikIOJjv-MMgeka8fRHYNBW5vrVU15wrMTVgsdUrScthMACf)

### **Classification Metrics**

- **Accuracy**

  - Number of correct predictions divided by the total number of predictions, multiplied by 100.

  - Simplest metric (not many insights on distribution or breakdown).

- **Confusion Matrix**

  - Tabular representation of ground-truth labels versus model predictions.

  - First we raise a hypothesis → then decide on True/False labels as below.

  ![Confusion Matrix](https://lh5.googleusercontent.com/Zc7lXIYu0XBJG-P_VWLhiyAmWTvUL-CPLcRxiNnJy03JPoIPkJWmdGn4kxFdm6I0MBDBr7tlZW6Wlko5aO--eleGQQoy3yKQJcGSapDRqCf-W3xxiFjnGzNQXBzaHDK-y32lVUvr)

  - The components of a confusion matrix are the following:

    - True Positive (TP): number of positive class samples the model predicted correctly.

    - True Negative (TN): number of negative class samples the model predicted correctly.
    - False Positive (FP): number of negative class samples the model incorrectly predicted as positive (Type-I error).

    - False Negative (FN): number of positive class samples the model incorrectly predicted as negative (Type-II error).

  - Core formulas we need to know are the following:

    - Precision → ratio of true positives to total predicted positives: TP / (TP + FP).

      - The closer to 1, the better (>0.5 as a starting point).

      ![Precision](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/Performance-metrics-precision.png?ssl=1)

    - Recall → ratio of true positives to all actual positives: TP / (TP + FN).

      - The closer to 1, the better (>0.5 as a starting point).

  - Summary of the confusion matrix breakdown:

    ![Summary](https://www.researchgate.net/publication/336402347/figure/fig3/AS:812472659349505@1570719985505/Calculation-of-Precision-Recall-and-Accuracy-in-the-confusion-matrix.ppm)

- **F1-score**

  - Harmonic mean of precision & recall.

  - A high score shows a good balance between precision and recall and gives good results on imbalanced classification problems.

  ![F1](https://lh6.googleusercontent.com/GJd7fTEzlzcUW1z3q7KJrk5FIUwZYnSqIQ-iqejGCR2ANqdUDcvj0bhhvaSI4yfBE0bFDqdNimzRetBO5NL9fljZ0kO-BniYC5U3M_l3nJw2OORUzW6i8w6_iCjXDpyDHenIpi7Z)

- **AUC-ROC (Area under the Receiver Operating Characteristic curve)**

  - Area under the curve of Recall (TPR) plotted against Fallout (FPR) - also known as the ROC curve.

  - The bigger the area (extending toward the upper-left corner), the better the results.

  ![ROC](https://developers.google.com/static/machine-learning/crash-course/images/ROCCurve.svg)

  ![AUC](https://developers.google.com/static/machine-learning/crash-course/images/AUC.svg)

### **Clustering Evaluation**

- **Dunn Index**

  - Identifies clusters that have low variance and are compact.
The mean values of the different clusters also need to be far apart. 856 | 857 | - High computational cost with a high number of clusters. 858 | 859 | - ![Dunn](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/08/Screenshot-from-2019-08-08-15-37-22.png) 860 | 861 | - **Silhouette Coefficient** 862 | 863 | - Measures, on a scale of -1 to +1, how close each point is to its own cluster compared with the other clusters. 864 | 865 | - A score of 1 denotes the best (meaning that the data point i is very compact within the cluster to which it belongs and far away from the other clusters). 866 | 867 | - ![Silhouette](https://editor.analyticsvidhya.com/uploads/59928Untitled.png) 868 | 869 | - **Elbow Method** 870 | 871 | - Determine the number of clusters in a dataset by plotting the number of clusters on the x-axis against the percentage of variance explained on the y-axis. 872 | 873 | - Basically find the sweet spot between nr. of clusters & variance (as below): 874 | 875 | - ![Elbow](https://i0.wp.com/neptune.ai/wp-content/uploads/2022/10/clusters.png?ssl=1) 876 | 877 | - **Simple summary of bias / variance tradeoff** 878 | - ![tradeoff](https://miro.medium.com/max/720/1*wPUGn4buYw4LYISGL-TUuA.webp) 879 | 880 | **For A/B Testing read the [following article](https://towardsdatascience.com/25-a-b-testing-concepts-interview-cheat-sheet-c998a501f911) for better understanding** 881 | 882 | *For deeper understanding on evaluating machine learning models for business use cases, I highly recommend **[this book](https://www.amazon.sg/Data-Science-Business-Data-Analytic-Thinking/dp/1449361323/ref=asc_df_1449361323/?tag=googleshoppin-22&linkCode=df0&hvadid=389114203157&hvpos=&hvnetw=g&hvrand=9338476164786084356&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9062524&hvtargid=pla-448095044074&psc=1)** as a complementary reading resource. **Warning**: Don't use [this source](https://libgen.is/) for it ;)!* 883 | 884 |
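The confusion-matrix formulas above reduce to a few lines of code. A pure-Python sketch on toy labels, computing each metric straight from the TP/TN/FP/FN counts:

```python
# Toy ground-truth labels and model predictions (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correct positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correct negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type-I errors
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type-II errors

accuracy  = (tp + tn) / len(y_true)                 # 7/10 = 0.7
precision = tp / (tp + fp)                          # 3/5  = 0.6
recall    = tp / (tp + fn)                          # 3/4  = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.667
print(accuracy, precision, recall, f1)
```

With these labels precision and recall disagree (0.6 vs 0.75), which is exactly the situation where the F1 harmonic mean is more informative than accuracy alone.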
885 | 886 | ------ 887 | 888 |
889 | 890 | ## **Domain 4: Machine Learning Implementation and Operations** 891 | 892 | ### **Key Services Associated With ML in AWS** 893 | 894 |
895 | 896 | **Analytics:** 897 | 898 | - [Amazon Athena](https://aws.amazon.com/athena/faqs/?nc=sn&loc=6) 899 | 900 | - [Amazon EMR](https://aws.amazon.com/emr/faqs/?nc=sn&loc=5) 901 | 902 | - [Amazon Kinesis Data Analytics](https://aws.amazon.com/kinesis/data-analytics/faqs/?nc=sn&loc=6) 903 | 904 | - [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5) 905 | 906 | - [Amazon Kinesis Data Streams](https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=6) 907 | 908 | - [Amazon QuickSight](https://aws.amazon.com/quicksight/resources/faqs/) 909 | 910 | **Compute:** 911 | 912 | - [AWS Batch](https://aws.amazon.com/batch/faqs/?nc=sn&loc=5) 913 | 914 | - [Amazon EC2](https://aws.amazon.com/ec2/faqs/) 915 | 916 | **Containers:** 917 | 918 | - [Amazon Elastic Container Registry (Amazon ECR)](https://aws.amazon.com/ecr/faqs/) 919 | 920 | - [Amazon Elastic Container Service (Amazon ECS)](https://aws.amazon.com/ecs/faqs/) 921 | 922 | - [Amazon Elastic Kubernetes Service (Amazon EKS)](https://aws.amazon.com/eks/) 923 | 924 | **Database:** 925 | 926 | - [AWS Glue](https://aws.amazon.com/glue/faqs/) 927 | 928 | - [Amazon Redshift](https://aws.amazon.com/redshift/) 929 | 930 | **Internet of Things (IoT):** 931 | 932 | - [AWS IoT Greengrass](https://aws.amazon.com/greengrass/faqs/) 933 | 934 | **Machine Learning:** 935 | 936 | - [Amazon Comprehend](https://aws.amazon.com/comprehend/) 937 | 938 | - [AWS Deep Learning AMIs (DLAMI)](https://aws.amazon.com/machine-learning/amis/resources/) 939 | 940 | - [AWS DeepLens](https://aws.amazon.com/deeplens/faqs/) 941 | 942 | - [Amazon Forecast](https://aws.amazon.com/forecast/resources/#Documentation) 943 | 944 | - [Amazon Fraud Detector](https://aws.amazon.com/fraud-detector/faqs/) 945 | 946 | - [Amazon Lex](https://aws.amazon.com/lex/faqs/?nc=sn&loc=6) 947 | 948 | - [Amazon Polly](https://aws.amazon.com/polly/faqs/?nc=sn&loc=8) 949 | 950 | - [Amazon Rekognition](https://aws.amazon.com/rekognition/faqs/?nc=sn&loc=7) 951 | 952 | - [Amazon SageMaker](https://aws.amazon.com/sagemaker/faqs/?nc=sn&loc=4) 953 | 954 | - [Amazon Textract](https://aws.amazon.com/textract/faqs/) 955 | 956 | - [Amazon Transcribe](https://aws.amazon.com/transcribe/faqs/?nc=sn&loc=5) 957 | 958 | - [Amazon Translate](https://aws.amazon.com/translate/faqs/) 959 | 960 | **Management and Governance:** 961 | 962 | - [AWS CloudTrail](https://aws.amazon.com/cloudtrail/faqs/) 963 | 964 | - [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/faqs/) 965 | 966 | **Networking and Content Delivery:** 967 | 968 | - [Amazon VPC](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) 969 | 970 | **Security, Identity, and Compliance:** 971 | 972 | - [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/faqs/?nc=sn&loc=5) 973 | 974 | **Serverless:** 975 | 976 | - [AWS Fargate](https://aws.amazon.com/fargate/faqs/?nc=sn&loc=4) 977 | 978 | - [AWS Lambda](https://aws.amazon.com/lambda/faqs/) 979 | 980 | **Storage:** 981 | 982 | - [Amazon Elastic File System (Amazon EFS)](https://aws.amazon.com/efs/faq/) 983 | 984 | - [Amazon FSx](https://aws.amazon.com/fsx/) 985 | 986 | - [Amazon S3](https://aws.amazon.com/s3/faqs/?nc=sn&loc=7) 987 | 988 | ### The following seminars offer an overview of the best patterns for productionizing machine learning 989 | 990 |
991 | 992 | **[End to End - MLOps Architecture Patterns](https://youtu.be/UnAN35gu3Rw)** 993 | 994 | - The typical data scientist setup **(which is bad)** is the following: 995 | 996 | - Data Sources (S3, EFS, RDS, DynamoDB, Redshift, EMR etc). 997 | 998 | - Sagemaker notebooks (getting data from the sources). 999 | 1000 | - Storing in S3 model artifacts (output, files etc). 1001 | 1002 | - Sagemaker Endpoint (make real-time inferences via a REST API). 1003 | 1004 | - Create Lambda function to connect through API Gateway. 1005 | 1006 | **This setup is bad since:** 1007 | 1008 | a. We have to manually re-run cells. 1009 | 1010 | b. Code is stuck in notebooks (difficult to version & automate). 1011 | 1012 | c. No autoscaling or feedback. 1013 | 1014 |
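For reference, the serving tail of the setup above (a Lambda behind API Gateway forwarding a request to the SageMaker endpoint) is a single `invoke_endpoint` call. A hedged sketch — the endpoint name and CSV payload are hypothetical, and `boto3` is imported inside the handler only so the snippet stays importable without AWS dependencies:

```python
import json

def lambda_handler(event, context):
    """Forward the API Gateway request body to a SageMaker real-time endpoint."""
    import boto3  # AWS SDK; imported lazily so this sketch loads without it
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",   # hypothetical, already-deployed endpoint
        ContentType="text/csv",
        Body=event["body"],           # e.g. "1.0,2.0,3.0"
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```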
1015 | 1016 | - **INSTEAD WE CREATE THE FOLLOWING**: 1017 | 1018 | ![Small-Mid Architecture](1.png) 1019 | 1020 | - (Same initial 3 steps) 1021 | 1022 | - Configure Data Sources (S3, EFS, RDS, DynamoDB, Redshift, EMR etc). 1023 | 1024 | - Sagemaker notebooks (getting data from the sources). 1025 | 1026 | - Storing model artifacts in S3 (output, files etc). 1027 | 1028 | (To improve versioning, we:) 1029 | 1030 | - Use **CodeCommit** to store code. 1031 | 1032 | - Add **ECR** (Elastic Container Registry) to store Docker containers and version environments. 1033 | 1034 | (To improve automation, we:) 1035 | 1036 | - Add an orchestrator (**Pipelines, Step Functions, Airflow**). 1037 | 1038 | - Incorporate a scheduled trigger through **EventBridge**. 1039 | 1040 | - Create a model registry through Sagemaker (which keeps track of model metadata such as tuned hyperparameters). 1041 | 1042 | (Finally, for the deployment stage, we:) 1043 | 1044 | - Add **autoscaling** for the Sagemaker endpoints. 1045 | 1046 | - Add a **Lambda Function** to trigger model approval and deploy it to the endpoint. 1047 | 1048 | - Finally, we add a **CloudWatch** alarm to notify us on errors. 1049 | 1050 |
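The endpoint autoscaling step above is configured through the Application Auto Scaling API rather than SageMaker itself. A sketch of the two calls involved — the endpoint/variant names are placeholders, the target of 1000 invocations per instance is an arbitrary example value, and `boto3` is imported lazily so the snippet stays importable:

```python
def enable_endpoint_autoscaling(endpoint_name, variant="AllTraffic",
                                min_capacity=1, max_capacity=4):
    """Register a SageMaker endpoint variant with Application Auto Scaling
    and attach a target-tracking policy on invocations per instance."""
    import boto3  # imported lazily so this sketch loads without AWS deps
    client = boto3.client("application-autoscaling")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"

    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 1000.0,  # example: invocations per instance per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
```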
1051 | 1052 | **[How to productionize ML workloads at scale](https://youtu.be/fJer8dO3iFU)** 1053 | 1054 | **[AWS re:Invent 2021 - Implementing MLOps practices with Amazon SageMaker, featuring Vanguard](https://youtu.be/fuXUi_hoK78)** 1055 | 1056 | **[Automate MLOps with SageMaker Projects | Amazon Web Services](https://youtu.be/3_cHnk9VSfQ)** 1057 | 1058 | Add more videos here. 1059 | -------------------------------------------------------------------------------- /AWS Power Hour.md: -------------------------------------------------------------------------------- 1 | # **AWS Power Hour: Machine Learning EP 1 Introduction to Machine Learning on AWS** 2 | 3 | - **S3 Storage Classes** (with cost optimization in mind) 4 | 5 | - S3 Standard 6 | 7 | - Store frequently accessed data. Readily accessible. 8 | 9 | - Durability 99.(11x9)% - chance that data will be durably stored. 10 | 11 | - Availability 99.(4x9)% - chance that data will be available. 12 | 13 | - S3 Standard-IA 14 | 15 | - Less frequently accessed data (less than once/month). 16 | 17 | - S3 One Zone-IA 18 | 19 | - Data which we don't need equally available at all times. 20 | 21 | - Instead of replicating data across 3 Availability Zones, replicate in only one. 22 | 23 | - For example, previous versions of data lake objects. 24 | 25 | - S3 Glacier Instant Retrieval 26 | 27 | - Archival type of data. F.ex variety of logs, ETL jobs. 28 | 29 | - S3 Intelligent-Tiering 30 | 31 | - ML-based storage class. Shifts datasets across different storage types depending on the access patterns it observes. 32 | 33 |
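The storage class discussion above comes down to a single parameter at upload time. A sketch (bucket and key are placeholders; `boto3` is imported lazily so the snippet stays importable without AWS dependencies):

```python
def upload_with_storage_class(bucket, key, body,
                              storage_class="INTELLIGENT_TIERING"):
    """Upload an object to S3 under a chosen storage class."""
    import boto3  # imported lazily so this sketch loads without AWS deps
    s3 = boto3.client("s3")
    # Other valid values include STANDARD, STANDARD_IA, ONEZONE_IA and
    # GLACIER_IR (Glacier Instant Retrieval).
    return s3.put_object(Bucket=bucket, Key=key, Body=body,
                         StorageClass=storage_class)
```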
34 | 35 | - **Moving Data: Kinesis Services** 36 | 37 | - Kinesis Video Streams: 38 | 39 | - Securely stream video from connected devices. 40 | 41 | - Kinesis Data Streams: 42 | 43 | - High-throughput data stream ingestion and integration. 44 | 45 | - Kinesis Firehose: 46 | 47 | - Can be configured to automatically stream data into the data lake. 48 | 49 | - Kinesis Data Analytics: 50 | 51 | - Real-time analytics on streaming data. 52 | 53 | - Amazon Kinesis Data Streams producer. 54 | 55 | - **Producer** is an application that puts user data records into a Kinesis data stream (also called data ingestion). 56 | 57 | - Performance Benefits. 58 | 59 | - Consumer-Side Ease of Use. 60 | 61 | - Asynchronous Architecture. 62 | 63 | - Producer Monitoring. 64 | 65 | - Connecting to Zeppelin Notebook 66 | 67 | - Set up a schema and execute queries. 68 | 69 | - Delivering streams (Firehose): 70 | 71 | - Transforming source records with Lambda. 72 | 73 | - AWS Glue 74 | 75 | - ETL Tool. 76 | 77 | - Glue ETL integrates with the Spark analytics kernel image (can use PySpark; configure a Spark session). 78 | 79 | --- 80 | 81 | ## **AWS Power Hour: EDA on AWS** 82 | 83 | - Sagemaker Data Wrangler: 84 | 85 | - Tool from within SageMaker Studio. 86 | 87 | - Map various data stores (f.ex Redshift, Lake Formation). Can also write queries before pulling in data. 88 | 89 | - Generate insights to better understand the data that we have (data quality checks). 
90 | 91 | - Can do the following: 92 | 93 | - Balancing Data: 94 | 95 | - Random Oversampling 96 | 97 | - Random Undersampling 98 | 99 | - SMOTE (Synthetic Minority Oversampling Technique) 100 | 101 | - Encoding (Mainly for Categorical): 102 | 103 | - One-hot/dummy encoding 104 | 105 | - Label / Ordinal encoding 106 | 107 | - Target encoding 108 | 109 | - Frequency / count encoding 110 | 111 | - Binary encoding 112 | 113 | - Feature Hashing 114 | 115 | - Nominal variables → NO order 116 | 117 | - Ordinal variables → HAVE order 118 | 119 | - Process Numeric: 120 | 121 | - Standard Scaler (subtracting mean from each value and scaling to unit variance). 122 | 123 | - Robust Scaler (scaling in a way robust to outliers). 124 | 125 | - MinMax Scaler (scaling each feature to a given range). 126 | 127 | - Max Absolute Scaler (maximal absolute value of each feature in the training set will be 1.0) 128 | 129 | - Data Flow → similar to Alteryx or Knime 130 | 131 | - Sagemaker Autopilot helps automate end-to-end projects. 132 | 133 | - Feature Store: features can be stored and shared interoperably, with lineage tracking for each feature. 134 | 135 | - Spark connector enables ingesting data in bulk. 136 | 137 | - **Amazon SageMaker Clarify** 138 | 139 | - Detects bias in ML data and models and explains model predictions. 140 | 141 | - Identifies imbalance. 142 | 143 | - Two Important AWS Papers for ML: 144 | 145 | - [**Augmented AI: The Power of Human and Machine**](https://aws.amazon.com/certification/certified-machine-learning-specialty/#:~:text=Augmented%20AI%3A%20The%20Power%20of%20Human%20and%20Machine) 146 | 147 | - [**Machine Learning Lens - AWS Well-Architected Framework**](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/machine-learning-lens.html) 148 | 149 | --- 150 | 151 | ## **AWS Power Hour: Modeling on AWS** 152 | 153 | - Starting with Built-in Algorithms (AWS provides Docker images for each). 
154 | 155 | - Mapping use cases against each [**built in algorithm**](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) 156 | 157 | - Remember and visualize each of the use cases mentioned against the models. 158 | 159 | - [Sagemaker debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) (part of Studio). Helps visualize and better understand errors. 160 | 161 | - Sagemaker HyperParameter Tuning: 162 | 163 | 164 | 165 | - To enable profiling, we need to pass: 166 | 167 | - profiler_config (profiling configuration) 168 | 169 | - rules (debugger rules) 170 | 171 | - image_uri (the image's Uniform Resource Identifier) 172 | 173 | - Can download a report (training time, breakdown etc). 174 | 175 | --- 176 | 177 | ## **AWS Power Hour: Security, Operations and the Exam** 178 | 179 | - Typical ML Vulnerabilities: 180 | 181 | - Data Poisoning → ML models are trained on tampered data, leading to inaccurate model predictions. 182 | 183 | - Membership inference → ability to tell whether a data record was included in the dataset used to train the ML model. This could lead to additional privacy concerns for personal data. 184 | 185 | - Model inversion → reverse-engineering of model features and parameters. 186 | 187 | - Protecting against them: 188 | 189 | - Launch ML instances in a VPC. 190 | 191 | - Control inbound and outbound traffic flow for an isolated compute and network environment. 192 | 193 | - Use least privilege to control access to ML artifacts. 194 | 195 | - Apply IAM through identity-, resource- or service-based policies. 196 | 197 | - Use Data Encryption. 198 | 199 | - Choose among three types of customer master keys (CMKs) provided by AWS Key Management Service (KMS). 200 | 201 | - Use Secrets Manager to protect credentials. 202 | 203 | - Use AWS Secrets Manager to store your credentials, and then grant permissions to your SageMaker IAM role to access Secrets Manager from your notebook. 204 | 205 | - Monitor model input and output. 
206 | 207 | - Use Amazon SageMaker Model Monitor to detect and alert you to drifts in your data and model performance. 208 | 209 | - Enable logging for model access. 210 | 211 | - Grant API Gateway permission to read and write logs to CloudWatch, and then enable CloudWatch Logs with API Gateway. 212 | 213 | - Use version control on model artifacts. 214 | 215 | - Version control to track your code or other model artifacts (can roll back to a previous state). 216 | 217 | - Use SageMaker-provided templates as a starting point for: 218 | 219 | - Processing data, extracting features, training and testing models, registering the models in the SageMaker model registry, and deploying the models for inference. 220 | 221 | - We can modify the templates through config files. 222 | 223 | - Other pre-configurations include: image building, model monitoring etc. 224 | 225 | - Can see DAGs for the ML workflow. 226 | 227 | - Service Catalog helps with automating the setup and implementation of MLOps (for Sagemaker Projects). 228 | 229 | - Data Quality and Monitoring: 230 | 231 | ![Alt text](Model%20Deployment%20for%20Drift%20Monitoring.png) 232 | 233 | - Executed via a SageMaker processing job. Enable data capture. 234 | 235 | - Check schema, quality, baseline statistics, recommended constraints. 236 | 237 | - Sends SNS notifications when boundaries are breached. -------------------------------------------------------------------------------- /AWS Ramp Up Guide.md: -------------------------------------------------------------------------------- 1 | # **[Machine Learning Ramp Up Guide: Summary](https://training.resources.awscloud.com/get-certified-machine-learning-specialty/aws-ramp-up-guide-machine-learning-2)** 2 | 3 | ## **Step 1: Learn AWS Machine Learning (ML) fundamentals** 4 | 5 | ### a) Machine Learning Essentials for Business and Technical Decision Makers 6 | 7 | ### **Introduction to Machine Learning: Art of the Possible** 8 | 9 |
10 | 11 | - ML → using math to find patterns in data & update model and training data to improve accuracy. 12 | 13 | ![Process](https://dapperdatadig.files.wordpress.com/2020/05/ml-steps.png?w=1024) 14 | 15 | - Key Terms (M.T.T.D): 16 | 17 | **Model** → The output of an ML algorithm trained on a data set; used for data prediction. 18 | 19 | **Training** → The act of creating a model from past data. 20 | 21 | **Testing** → Measuring the performance of a model on test data. 22 | 23 | **Deploying** → Integrating a model into a production pipeline. 24 | 25 |
26 | 27 | - History of Amazon and ML: 28 | 29 | ![History](https://www.algotive.ai/hs-fs/hubfs/00%20Blog/02%20Machine%20Learning/timeline.jpg?width=800&name=timeline.jpg) 30 | 31 | **Amazon Flywheel** - How investing in specific key business operations can reinforce other processes and create a positive feedback loop. 32 | 33 | ![Amazon Flywheel](Amazon%20Flywheel.png) 34 | 35 | **AWS ML Flywheel** - Uses data collected from parts of a business operation, a model to predict future outcomes, and provides ways to continuously improve efficiency. 36 | 37 | ![ML Flywheel](AWS%20ML%20Flywheel.png) 38 | 39 | **Amazon uses ML in the following ways:** 40 | 41 | - Product recommendations and promotions. 42 | 43 | - Alexa and voice interactions through NLP. 44 | 45 | - Shipping 1.6M packages per day. 46 | 47 | **Amazon AI/ML Services:** 48 | 49 | - Amazon Forecast: Time series forecasting (just upload data within requirements). 50 | 51 | - Amazon Fraud Detector: Fully managed service that spots online payment fraud and creation of fake accounts. 52 | 53 | - Amazon Personalize: Recommender service where you select a training algorithm for the data, train a solution model, and deploy it. 54 | 55 | - Amazon Polly: Text to speech. 56 | 57 | - Amazon Transcribe: Speech to text (Polly's opposite). 58 | 59 | - Amazon SageMaker: Think of Jupyter lab on steroids. 60 | 61 |
62 | 63 | ### **How does machine learning work?** 64 | 65 | - **AI**: Automate and accelerate tasks performable by humans through natural intelligence. Two types: 66 | 67 | - **Narrow** → AI imitates human intelligence in a single context. 68 | 69 | - **General** → AI learns and behaves with intelligence across multiple contexts. 70 | 71 | - **ML vs AI: Difference**: AI broadly ingests data and imitates human intelligence; ML is the subset of AI that improves a model by learning from training data. 72 | 73 | ![Difference](https://images.squarespace-cdn.com/content/v1/5bce4071ab1a620db382773e/756e0a76-8544-483b-8757-ad53d2afa7af/Euler+Diag.png) 74 | 75 | - **ML vs Traditional Programming**: ML is teaching a computer to recognize patterns by example, rather than programming it with specific rules. 76 | 77 | ![MLvsProgramming](https://miro.medium.com/max/799/1*t6Myx_4eEwaWP9Vms_kYfg.png) 78 | 79 | - **3 Main ML Categories:** 80 | 81 | - **Supervised learning** → model learns from a data set containing input values and paired output values that you would like to predict. Could be *classification* (spam detection) or *regression* (forecasting demand). 82 | 83 |
84 | 85 | ![Supervised](https://lh4.googleusercontent.com/K17BRCQTR5hHU-qOthrs9KIQa4DLAWJh5jeXkyn6NZRQfimHnCAadWbw3EaZPZl1bit2IBQPeBv1CZURiyFYkIDPH1Z3Pb0O_qkeS9av7vrEtQLpMLWdtDJ7YNlRki8CoAsY8bmn) 86 | 87 | - **Unsupervised learning** → training model learns from data without any guidance. The objective is pattern and structure recognition. Could be *clustering* (customer segmentation) or *association* (finding regularities among products). 88 | 89 | ![Unsupervised](https://miro.medium.com/max/1400/1*4yFCbNwp0gGdGR5KbquFHA.png) 90 | 91 | - **Reinforcement learning** → training model learns from its environment by being rewarded for correct moves and punished for incorrect moves. F.ex *autonomous driving*. 92 | 93 | ![Reinforcement](https://miro.medium.com/max/702/1*4u2GtNnMa9xso1WkLh7hVA.png) 94 | 95 |
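The supervised setting above — learn from paired inputs and labels, then predict the label of a new input — fits in a few lines. A pure-Python 1-nearest-neighbor sketch (the study-hours data is made up for illustration):

```python
def nearest_neighbor_predict(train, new_point):
    """Supervised learning in miniature: return the label of the closest
    labeled training example to the new input."""
    closest = min(train, key=lambda pair: abs(pair[0] - new_point))
    return closest[1]

# Labeled training data: (hours studied, exam outcome)
train = [(1, "fail"), (2, "fail"), (8, "pass"), (9, "pass")]
print(nearest_neighbor_predict(train, 7))    # → pass
print(nearest_neighbor_predict(train, 1.5))  # → fail
```

An unsupervised method would receive only the hours column, with no outcomes, and could at best group similar values together.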
96 | 97 | ### **What are some potential problems with machine learning?** 98 | 99 | - Basically a combination of: 100 | 101 | - Poor data / lack of data. 102 | 103 | - Unexplainability (too complex) or oversimplification (low accuracy). 104 | 105 | - Failing to take uncertainty (black swan events) into account. 106 | 107 |
108 | 109 | ### **Planning a Machine Learning Project** 110 | 111 | - Is a machine learning solution appropriate for my problem? 112 | 113 | - Requires 4 main components: 114 | 115 | - **Complex logic** (f.ex recommender). 116 | 117 | - **Requires scalability** (f.ex personalized recommendations to millions of users). 118 | 119 | - **Requires personalization** (tailored specific to user). 120 | 121 | - **Requires responsiveness** (response in milliseconds to personalization). 122 | 123 | - We don't need to use ML if the problem: 124 | 125 | - Can be solved with traditional algorithms. 126 | 127 | - Does not require adapting to new data. 128 | 129 | - Requires 100% accuracy. 130 | 131 | - Requires full interpretability. 132 | 133 | - Is my data ready for machine learning? 134 | 135 | - Types of data: 136 | 137 | - Document 138 | 139 | - Audio 140 | 141 | - Images 142 | 143 | - Video 144 | 145 | - Checklist of requirements for using data: 146 | 147 | - **Availability** (not requiring significant preprocessing). ✓ 148 | 149 | - **Accessibility** (on demand with CRUD capabilities). ✓ 150 | 151 | - **Respect Privacy** (f.ex ethnicity, salary). ✓ 152 | 153 | - **Security** (f.ex respect regulations). ✓ 154 | 155 | - **Relevant** to the scope of the project. ✓ 156 | 157 | - **Fresh** and recent data. ✓ 158 | 159 | - **Representative** and encompassing features. ✓ 160 | 161 | - **Unbiased** without agenda. ✓ 162 | 163 |
164 | 165 | ### **How will machine learning impact a project timeline?** 166 | 167 | - Machine learning project expectations (weeks up to months). Keep track of model drift (changes in data distribution)! 168 | 169 | ![Lifecycle](https://www.ntconcepts.com/wp-content/uploads/ntc_ml_lifecycle-1.png) 170 | 171 | - A typical timeline for ML projects (rough benchmark). 172 | 173 | ![timeline](https://global-uploads.webflow.com/5d3ec351b1eba4332d213004/5efeef85594ffa20604a9b76_image2_s.jpg) 174 | 175 |
176 | 177 | ### **What early questions should I ask in deployment?** 178 | 179 | - What is the likely computational cost of generating predictions with your model? 180 | 181 | - How quickly does your data change? 182 | 183 | - How significant are the changes needed to deploy? 184 | 185 | - Does the model’s performance meet the business need? 186 | 187 |
188 | 189 | ### **Building a Machine Learning Ready Organization** 190 | 191 | - How can I prepare my organization for using ML? 192 | 193 | - Have a robust ML strategy. 194 | 195 | - Data strategy. 196 | 197 | - Culture of learning and collaboration. 198 | 199 | - Find the right problem (data, complexity, etc.). 200 | 201 | - Fail forward (deliberate failure - keep experimenting). 202 | 203 | - Scale beyond proofs of concept (POC). 204 | 205 |
206 | 207 | ### **Machine Learning for Business Challenges** 208 | 209 | - Key Takeaways: 210 | 211 | - Defining **scope** of ML problem: 212 | 213 | - Specific business problem we are trying to solve. 214 | 215 | - Current state. 216 | 217 | - What are the pain points and what is causing them. 218 | 219 | - Impact of the problem. 220 | 221 | - How do we define success. 222 | 223 | - **Input** gathering: 224 | 225 | - Do we have sufficient data? 226 | 227 | - Is there labeled data? 228 | 229 | - How difficult is it to obtain labeled data? 230 | 231 | - What are the main features? 232 | 233 | - Where is the data located? 234 | 235 | - Data quality check? 236 | 237 | - **Output** definitions: 238 | 239 | - What business metrics define success? 240 | 241 | - Trade-offs? 242 | 243 | - Existing baselines (if not, simplest solution)? 244 | 245 | - How important is runtime and performance? 246 | 247 | - Image **Classification** Problem: 248 | 249 | - We need training data and ground-truth labels. 250 | 251 | - Feature engineering: deciding the set of measurements for each instance. 252 | 253 | - Choose a classifier model (select the one with the highest accuracy on the *validation* set and ***good business value***): 254 | 255 | - **SVM** (creates a line which separates the data into classes). 256 | 257 | - **Naive Bayes** (uses Bayes rule together with a strong assumption that the attributes are conditionally independent). 258 | 259 | - **Logistic Regression** (predicts the probability of a binary (yes/no) event). 260 | 261 | - **Deep Neural Networks** 262 | 263 |
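Of the classifiers above, logistic regression has the simplest inference step: squash a weighted sum of features through the sigmoid to get a probability. A sketch with made-up weights standing in for a trained model:

```python
import math

def logistic_predict(weights, bias, features):
    """Logistic regression inference: P(y = 1 | x) = sigmoid(w·x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for a two-feature binary classifier
p = logistic_predict(weights=[1.5, -0.5], bias=-1.0, features=[2.0, 1.0])
print(round(p, 3))  # probability of the positive class, ≈ 0.818
```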
264 | 265 | - **Reinforcement** Problem: Training a Robot 266 | 267 | - Reward / Punish agent based on choice and reiterate. 268 | 269 | - No presentation of input or output pairs. 270 | 271 | - Agent needs to gather useful experiences. 272 | 273 | - Evaluation is often concurrent with learning. 274 | 275 |
276 | 277 | - **Automating Speech Tasks** Problem: Pollexy 278 | 279 | - Speech to task and automation. 280 | 281 |
282 | 283 | ### **Machine Learning Terminology and Process** 284 | 285 | - **Step 1: Business Problem** 286 | 287 | - What are we trying to solve (see scope, input and output above). 288 | 289 |
290 | 291 | - **Step 2: Machine Learning Problem** 292 | 293 | - What model could solve most of our issues? 294 | 295 | - Key Elements: 296 | 297 | - Attributes from dataset → **Observations** 298 | 299 | - Future outputs → **Labels** 300 | 301 | - Attributes used to predict labels → **Features** 302 | 303 | - Framing the ML problem: 304 | 305 | - ![sklearnguide](https://scikit-learn.org/stable/_static/ml_map.png) 306 | 307 | - **Step 3: Develop Datasets** 308 | 309 | - Data collection and integration 310 | 311 | - 3 different types: 312 | 313 | - **Structured** (organized in tables/databases) 314 | 315 | - **Semi-Structured** (f.ex CSV, JSON) 316 | 317 | - **Unstructured** (video, images etc.) 318 | 319 | - **Step 4: Data Preparation** 320 | 321 | - Data Cleaning: 322 | 323 | - Could introduce a new variable, 324 | 325 | - remove the record, 326 | 327 | - or impute (best guess). 328 | 329 | - Imputation: 330 | 331 | - You can refer here for a comprehensive review of the [**best imputation methods**](https://www.kaggle.com/discussions/general/375794). 332 | 333 |
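Imputation as a "best guess" is easiest to see with mean imputation. A pure-Python sketch where missing entries are `None`:

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(mean_impute([10, None, 14, None, 12]))  # → [10, 12.0, 14, 12.0, 12]
```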
334 | 335 | - Data Shuffling: 336 | 337 | - We don't want to make predictions just based on order (could lead to bias) - so we shuffle. 338 | 339 | - Test/Val/Train Split: 340 | 341 | - Predict new examples (by holding data out from our dataset). 342 | 343 | - Train ~ 70% of data. 344 | 345 | - Test ~ 20% of data. 346 | 347 | - Validation ~ 10% of data. 348 | 349 |
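The shuffle and the 70/20/10 split above can be sketched in pure Python (a fixed seed keeps the split reproducible):

```python
import random

def shuffle_split(data, train_frac=0.7, test_frac=0.2, seed=42):
    """Shuffle first (to avoid order bias), then carve out train/test/validation."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_test = int(len(data) * test_frac)
    return (data[:n_train],                  # ~70% train
            data[n_train:n_train + n_test],  # ~20% test
            data[n_train + n_test:])         # remaining ~10% validation

train_set, test_set, val_set = shuffle_split(range(100))
print(len(train_set), len(test_set), len(val_set))  # → 70 20 10
```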
350 | 351 | - Cross Validation Techniques: 352 | 353 | - Leave-One-Out (Only use one data point as our test sample, run training with other examples) - Expensive 354 | 355 | - K-Fold (For each fold we train the model and keep track of error). 356 | 357 |
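The K-Fold idea above — every fold serves as the held-out test set exactly once — can be sketched as an index generator:

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for K-Fold cross validation."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

for train_idx, test_idx in k_fold_indices(n=6, k=3):
    print(test_idx)  # → [0, 1] then [2, 3] then [4, 5]
```

Leave-One-Out is the extreme case `k = n`, which is why it gets expensive: it means training the model `n` times.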
358 | 359 | - **Step 5: Data Visualization and Analysis** 360 | 361 | - **Feature** = Attribute in training dataset. 362 | 363 | - **Label** = NOT in training; what we are trying to predict. 364 | 365 | - Types of Visualization for EDA: 366 | 367 | - Statistics 368 | 369 | - Scatterplots 370 | 371 | - Histograms 372 | 373 | - Features and Target Summary: 374 | 375 | - Numerical & Categorical. 376 | 377 | - Usually check distributions / detect outliers through histograms. 378 | 379 | - Check correlation through scatterplots. 380 | 381 |
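The correlation that a scatterplot shows visually can also be computed directly. A pure-Python Pearson correlation sketch:

```python
import statistics

def pearson_corr(xs, ys):
    """Pearson correlation: +1 perfect positive, -1 perfect negative, 0 none."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson_corr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # perfectly linear → 1.0
```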
382 | 383 | - **Step 6: Feature Engineering** 384 | 385 | - Converts raw data into a higher-level representation of the data. 386 | 387 | - F.ex transforming non-linear relationships so that linear models can capture them. 388 | 389 | - Numeric value binning into groups (age, salaries etc.) 390 | 391 | - Quadratic features (combining pairs of features together). 392 | 393 | - Non/Linear Feature Transformations (log, product/ratio, tree-path). 394 | 395 | - Domain-specific transformations (text/image/web based). 396 | 397 |
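Two of the transformations above, binning and quadratic (interaction) features, in a minimal pure-Python sketch (the age cut-offs are arbitrary examples):

```python
def bin_age(age):
    """Numeric binning: map a raw age onto a categorical group."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

def quadratic_feature(x1, x2):
    """Quadratic feature: combine two raw features into one interaction term."""
    return x1 * x2

print([bin_age(a) for a in (12, 30, 70)])  # → ['minor', 'adult', 'senior']
print(quadratic_feature(3, 4))             # → 12
```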
398 | 399 | - **Step 7: Model Training** 400 | 401 | - Parameters → tune the model to improve performance. 402 | 403 | - **Loss Function** (Calculates how far predictions are from ground truth): 404 | 405 | - **Square** (regression, classification). 406 | 407 | - **Hinge** (classification, best for outliers). 408 | 409 | - **Logistic** (classification, best for skewed data). 410 | 411 | - **Regularization** (Increases the model's ability to generalize beyond the training data): 412 | 413 | - Prevents overfitting by constraining weights to stay small. 414 | 415 | - **Learning Parameters/Decay Rate** (controls how fast our model learns): 416 | 417 | - Decaying too aggressively (algorithm never reaches the optimum). 418 | 419 | - Decaying too slowly (algorithm bounces around, never converging to an optimum). 420 | 421 |
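The learning-rate remarks above can be demonstrated on a square loss. Minimizing L(x) = (x − 3)² by gradient descent: with no decay the iterate reaches the optimum, while an aggressive decay shrinks the steps so fast it stops short — a pure-Python sketch:

```python
def gradient_descent(grad, x0, lr=0.1, decay=1.0, steps=50):
    """Gradient descent with an optional multiplicative learning-rate decay."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
        lr *= decay  # decay < 1 shrinks the step size every iteration
    return x

square_loss_grad = lambda x: 2 * (x - 3)  # gradient of (x - 3)^2, optimum at 3

print(gradient_descent(square_loss_grad, 0.0, decay=1.0))  # converges near 3
print(gradient_descent(square_loss_grad, 0.0, decay=0.9))  # stops short of 3
```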
422 | 423 | - **Step 8: Model Evaluation** 424 | 425 | - Don't fit the training data to obtain maximum accuracy. 426 | 427 | - Overfitting vs Underfitting. 428 | 429 | ![OvervsUnder](https://docs.aws.amazon.com/images/machine-learning/latest/dg/images/mlconcepts_image5.png) 430 | 431 | - Bias vs Variance: 432 | 433 | - **Bias**: Difference between the average model prediction and the target values. 434 | 435 | - **Variance**: Variation in the predictions of different models. 436 | 437 | ![BiasvsVariance](https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/images/bias_variance/bullseye.png) 438 | 439 | - **Evaluation Metrics:** 440 | 441 | - For **regression**: 442 | 443 | - Root Mean Square Error **(RMSE)** - lower is better (use only on test data) 444 | 445 | - Mean Absolute Percentage Error **(MAPE)** - lower is better (use only on test data) 446 | 447 | - R Squared $(R^2)$ - How much better is the model compared to picking the best constant? **(1 - Model MSE / Variance)** 448 | 449 | - For **classification**: 450 | 451 | - Confusion Matrix: 452 | 453 | - ![CF](https://www.researchgate.net/profile/Sebastian-Bittrich/publication/330174519/figure/fig1/AS:711883078258689@1546737560677/Confusion-matrix-Exemplified-CM-with-the-formulas-of-precision-PR-recall-RE.png) 454 | 455 | - How many data points were classified correctly and incorrectly. 456 | 457 | - ROC Curve - For binary classification prediction. 458 | 459 | - **Precision** (How correct the positive predictions are). 460 | 461 | - F.ex - In a search engine - (Precision is the quality/relevance of the viewed results). 462 | 463 | - **Recall** (Fraction of the actual positives that were found). 464 | 465 | - F.ex - In a search engine - (Recall is completeness of results, what fraction of the relevant items was found). 466 | 467 | ![PrecisionvsRecall](precisionvsrecall.png) 468 | 469 |
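The regression metrics in Step 8 written out directly (RMSE, MAPE, and R² as 1 − Model MSE / Variance) — a pure-Python sketch on a toy forecast:

```python
def rmse(y_true, y_pred):
    """Root Mean Square Error — lower is better."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error — lower is better."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R² = 1 - (sum of squared residuals) / (total variation around the mean)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10, 20, 30, 40]  # toy targets
y_pred = [12, 18, 33, 41]  # toy predictions
print(rmse(y_true, y_pred), mape(y_true, y_pred), r_squared(y_true, y_pred))
```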
470 | 471 | - **Step 9: Business Goal Evaluation** 472 | 473 | - Do we need data or feature **augmentation**? 474 | 475 | - Evaluate how the model is performing relative to business goals: 476 | 477 | - Accuracy of the model. 478 | 479 | - Interpretability. 480 | 481 | - Model generalization on unseen/unknown data. 482 | 483 | - Business success criteria (KPIs). 484 | 485 |
486 | 487 | ![Summary](Summary.png) 488 | 489 |
490 | 491 | --- 492 | 493 | ## **Step 2.1: Learn Data Platform Engineering on AWS** 494 | 495 | ### **Machine Learning Security** 496 | 497 | ### a) AWS Security Fundamentals 498 | 499 | - Cloud security principles: 500 | 501 | - **Implement a strong identity** foundation (Least privilege / enforce separation of duties). 502 | 503 | - **Enable traceability** (Monitor alerts / audit actions / integrate logs). 504 | 505 | - **Apply security at all layers** (defense-in-depth approach). 506 | 507 | - **Automate security best practices** (code version-controlled templates). 508 | 509 | - **Protect data at rest and in transit** (use encryption / access control). 510 | 511 | - **Enforce the principle of least privilege** (Deny everything, and give access only where needed). 512 | 513 | - **Prepare for security events** (Incident management / Simulations). 514 | 515 | - Shared Responsibility model 516 | 517 | - **AWS** is responsible for security **OF** the cloud. 518 | 519 | - The *customer* is responsible for security **IN** the cloud. 520 | 521 | - ![Shared Responsibility](https://explore.skillbuilder.aws/files/a/w/aws_prod1_docebosaas_com/1673427600/uHxSu_RsNM8QdLbhxHxLkA/tincan/842b44ce18500c7e75f3c09fb3a74da4d121de2f/assets/e4_6y4d7cz51Bo0Y_M4k8jdbphutKhqfN.png) 522 | 523 | - Skim through: 524 | 525 | - **AWS Cloud Adoption Framework** (whitepaper) 526 | 527 | - [**Shared Responsibility Model**](https://aws.amazon.com/compliance/shared-responsibility-model/) 528 | 529 | - AWS Global Infrastructure 530 | 531 | - $\text{Data Center} \subset \text{Availability Zone} \subset \text{Region}$ 532 | 533 | - Different costs for regions (due to regulations, legislation etc.) 
534 | 535 | - Usually choose the region closest to your users (for lower latency) 536 | 537 | - Currently we have the following statistics: 538 | 539 | - **30** Launched Regions 540 | 541 | - **96** Availability Zones 542 | 543 | - **410+** Points of Presence 544 | 545 | - Data center security layers: 546 | 547 | - Perimeter Layer (Buildings, Surveillance) 548 | 549 | - Environmental Layer (Safe from flooding, natural disasters etc.) 550 | 551 | - Infrastructure Layer (HVAC systems and fire suppression equipment). 552 | 553 | - Data Layer (Shared responsibility / NIST 800-88 techniques / Auditing). 554 | 555 | - Compliance and Governance 556 | 557 | - [**AWS Artifact**](https://aws.amazon.com/artifact/) - no-cost, self-service portal for access to the AWS security and compliance reports and select online agreements (**SOC/PCI**) 558 | 559 |
560 | 561 | - Entry Points on AWS (Concepts): 562 | 563 | - Endpoint → URL of the entry point for an AWS web service. 564 | 565 | - Regional Endpoint (f.ex ). 566 | 567 | - General endpoints (f.ex ec2.amazonaws.com). 568 | 569 | - Global services (Do not support regions) 570 | 571 | - IAM (Identity Access Management): 572 | 573 | - Centralized mechanism for creating and managing permissions. 574 | 575 | - Types of IAM credentials: 576 | 577 | - Password policy (rules, f.ex requiring special characters) 578 | 579 | - Multi-factor authentication (MFA) 580 | 581 | - Access Keys (access key ID and a secret key) 582 | 583 | - EC2 Key Pair 584 | 585 | - Services you need to know related to IAM: 586 | 587 | - **AWS Secrets Manager** → manages credentials, passwords, third-party API keys, and even arbitrary text. 588 | 589 | - **AWS Single Sign-On** → manages SSO access to multiple AWS accounts. 590 | 591 | - **AWS Security Token Service (STS)** → temporary, limited-privilege credentials for IAM or federated users. 592 | 593 | - **AWS Directory Service** → domain resource management built on actual Microsoft Active Directory. 594 | 595 | - **AWS Organizations** → manage and enforce policies for multiple AWS accounts. 596 | 597 | - **Amazon Cognito** → add user sign-up, sign-in, and access controls to your web and mobile apps. 598 | 599 |
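Least privilege becomes concrete as an IAM policy document. Below is a minimal sketch of an identity-based policy granting only the two S3 actions a workload actually needs; the bucket name is hypothetical:

```python
import json

# A minimal identity-based IAM policy (deny is the implicit default, so
# only the listed actions are allowed). The bucket name is hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadTrainingData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-ml-bucket",    # bucket itself (ListBucket)
                "arn:aws:s3:::example-ml-bucket/*",  # objects inside (GetObject)
            ],
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Anything not explicitly allowed (f.ex `s3:DeleteObject`, or any other bucket) is denied by default, which is the "deny everything, grant only what is needed" principle above.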
600 | 601 | - Detective Controls 602 | 603 | - Part of governance frameworks and can be used to identify a potential security threat or incident. 604 | 605 | - **CloudTrail**: Use log files to track changes to AWS resources, including creation, modification, and deletion of AWS resources. 606 | 607 | - Check IAM user / When / Where / What happened (response element). 608 | 609 | - TIP - Track changes to AWS resources, including creation, modification, and deletion of AWS resources. (Make sure no one can cover their tracks). 610 | 611 | - Use **CloudWatch** to monitor resources and logs, send notifications, and initiate automated actions. 612 | 613 | Example of CloudWatch remediation: 614 | 615 | ![Cloudwatch](https://explore.skillbuilder.aws/files/a/w/aws_prod1_docebosaas_com/1673445600/-e58pODQVrD0EPW11nw7PA/tincan/842b44ce18500c7e75f3c09fb3a74da4d121de2f/assets/w3YSCbvi0z905BU__Roqy_g34KZ-lDIXq.png) 616 | 617 | - Main services to audit in AWS: 618 | 619 | - **S3** → through access logs 620 | 621 | - **Elastic Load Balancer** → log captures IP addresses, latency & server responses. 622 | 623 | - **CloudWatch Logs & Events** 624 | 625 | - Logs monitor operating systems and applications (as well as phrases, patterns, values). 626 | 627 | - Events track resource activity in near-real time and route it to targets via rules. 628 | 629 | - **VPC Flow Logs** → ensures network access rules are configured properly. 630 | 631 | - **CloudTrail** → keeps track of API calls. 632 | 633 | - **Amazon GuardDuty** → uses ML to report unusual API calls or unauthorized deployments. 634 | 635 | - **AWS Trusted Advisor** → gives feedback on optimizing resources and services. 636 | 637 | - **AWS Security Hub** → aggregates, organizes, and prioritizes security alerts from multiple services. 638 | 639 | - **AWS Config** → detects non-compliant configurations (current or historical).
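As a sketch of how a CloudWatch alarm is defined, the dictionary below mirrors the keyword arguments of the CloudWatch `PutMetricAlarm` API (as called via boto3's `put_metric_alarm`); the alarm name and SNS topic ARN are hypothetical, while the metric is a standard EC2 metric:

```python
# Hypothetical alarm definition: notify an SNS topic when average CPU
# stays above 80% for two consecutive 5-minute periods.
alarm = {
    "AlarmName": "training-instance-high-cpu",  # hypothetical name
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,               # seconds per datapoint
    "EvaluationPeriods": 2,      # must breach twice in a row
    "Threshold": 80.0,           # percent CPU
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}

# With real credentials this would be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["AlarmName"])
```

The `AlarmActions` target is what turns monitoring into remediation: an SNS topic can fan out to email, or trigger automation, as in the remediation diagram above.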
640 | 641 | - Services for Infrastructure Protection 642 | 643 | - **AWS Systems Manager** → secure end-to-end management solutions (applications, operations, change, Node). 644 | 645 | - **AWS Firewall Manager** → configure and manage AWS WAF rules. 646 | 647 | - **AWS Direct Connect** → securely connect AWS with on-premises infrastructure. 648 | 649 | - **AWS CloudFormation** → automates and simplifies the task of repeatedly creating and deploying AWS resources. 650 | 651 | - **Amazon Inspector** → assesses applications for vulnerabilities or deviations from best practices (gives list based on severity). 652 | 653 | - AWS Services for Data Protection 654 | 655 | - **AWS CloudHSM** → generate, store, import, export, and manage cryptographic keys. 656 | 657 | - **Amazon S3 Glacier** → enforce compliance controls for individual Amazon S3 Glacier vaults with a *vault lock policy*. 658 | 659 | - **AWS Certificate Manager** → creates *SSL/TLS* certificates for your AWS based websites and applications. 660 | 661 | - **Amazon Macie** → uses machine learning to automatically discover, classify, and protect sensitive data in AWS. (Provides dashboards & alerts). 662 | 663 | - **AWS KMS** → create and control the keys used in data encryption (avoid infinite loop of creating encrypted keys). 664 | 665 | - Services for protection *against DDoS*: 666 | 667 | - **Amazon Route 53** → scalable traffic flow, latency-based routing, weighted round-robin, Geo DNS, health checks, and monitoring. 668 | 669 | - **Amazon CloudFront** → content delivery network (CDN) service that can deliver data, including entire websites, to end users. 670 | 671 | - **AWS Shield** → DDoS protection service that safeguards web applications that run on AWS. 672 | 673 | - **AWS Web Application Firewall (WAF)** → helps protect web applications by giving control over which traffic to allow or block through customizable web security rules.
674 | 675 | --- 676 | 677 | ### **Developing Machine Learning Applications** 678 | 679 | - **Sagemaker** 680 | 681 | - Hosted Jupyter Notebook that doesn't require setup. (Key libraries preloaded) 682 | 683 | - Details about creating instance: 684 | 685 | - Instance name (unique) 686 | 687 | - Instance type (smallest vs largest) 688 | 689 | - Granting permissions (**IAM roles** - created automatically if you don't have one) 690 | 691 | - **VPC** connected for providing access to additional resources. 692 | 693 | - Secure data through **KMS** 694 | 695 | - **Demo summary** 696 | 697 | - You can attach IAM roles through notebook 698 | 699 | - Standard Preprocessing, EDA, TrainTestVal Split, Training and tuning models. 700 | 701 | - Storing model artifacts in the back-end 702 | 703 | - Deploy in production by setting up the inference image and specifying model artifact location. 704 | 705 | - Set up endpoint (how many models we are putting, and compute resources for each model). 706 | 707 | - **Amazon Sagemaker Neo** 708 | 709 | - Many challenges for Machine Learning in Organizations: 710 | 711 | - Choosing the right framework, models, integrating and deploying. 712 | 713 | - Many-to-many problem: Numerous frameworks only running separately on numerous platforms. 714 | 715 | - Solution: **[Amazon Sagemaker Neo](https://aws.amazon.com/sagemaker/neo/)** 716 | 717 | - Freezes the model from the framework and optimizes it for running on the target hardware. 718 | 719 | - Process: 720 | 721 | - Compiler reads models in various formats. 722 | 723 | - Turns them into a generalized, framework-agnostic representation. 724 | 725 | - Optimization for the various operating systems and processors it will be deployed on. 726 | 727 | - Use cases: 728 | 729 | - Accelerates in the cloud and on the edge. 730 | 731 | - Better and faster, optimized development for IoT (Image Classification, Anomaly Detection etc.).
732 | 733 | - Integration of ML with databases (f.ex Neo API) 734 | 735 | - **Machine Learning Algorithms** 736 | 737 | - **Supervised** - We have labels, we train on the labeled data (think of a teacher → supervisor). F.ex Customer churn prediction, failure of a system prediction. Types: 738 | 739 | - **Linear Supervised Algorithms** - f.ex SVM, Logistic, Perceptrons. 740 | 741 | - [**AWS Linear Learner Service**](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) 742 | 743 | - **Non-Linear Supervised Algorithms** - f.ex Tree-based models (XGB, DT, RF). 744 | 745 | - **Unsupervised** - No labels, or *teacher* → we just have unlabeled data trying to make sense of it. 746 | 747 | - **Clustering** - Grouping data based on similarity. K-Means, PCA (reducing dimensionality). 748 | 749 | - **Anomaly detection** - Labeling normal and outliers. 750 | 751 | - **[New - Random Cut Forests](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html)** - Constructs a model of randomly cut trees for detecting anomalous data points. 752 | 753 | - **Topic Modeling for NLP** (LDA) 754 | 755 | - **Reinforcement Learning** - An agent receives penalties or rewards for each step it takes. 756 | 757 | - **Deep Learning** - Composed of numerous *neurons*, which apply a weighted sum followed by an activation function to connect to the output. 758 | 759 | - Use backpropagation for improving accuracy. 760 | 761 | - Often many layers and millions to billions of parameters. 762 | 763 | - Many types, for different cases: 764 | 765 | - **Convolutional neural network (CNN)** - Mainly for images → uses convolution (merging of two sets of information) to recognize patterns. 766 | 767 | - **Recurrent Neural Networks (RNN)** - Feeds output back into the input (hence recurrent). F.ex LSTM. Good for speech recognition and translation.
768 | 769 | - **[AWS Sockeye](https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/)** - Very useful for building, training, and running state-of-the-art sequence-to-sequence models. 770 | 771 |
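The "weighted sum plus activation function" neuron described above can be sketched in a few lines of plain Python:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs, then a
    sigmoid activation squashing the result into (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(0) is exactly 0.5, so zero weights and zero bias give 0.5.
print(neuron([1.0, 2.0], [0.0, 0.0], 0.0))  # -> 0.5
```

A network is just many of these wired in layers; backpropagation adjusts the `weights` and `bias` values to reduce the output error.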
772 | 773 | - **[Automated Sagemaker Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)** 774 | 775 | - Typical hyperparameters to tune: 776 | 777 | - For **Neural Networks**: 778 | 779 | - **Learning Rate** (how much to change the model in response to the estimated error for each update). 780 | 781 | - **Layers** (number and structure of the layers of neurons that take in and pass on information). 782 | 783 | - **Regularization** (prevents overfitting by penalizing large weights). 784 | 785 | - **Drop-Out** (Randomly deactivating neurons - prevents overfitting). 786 | 787 | - For **Trees**: 788 | 789 | - **Number** (number of trees in the ensemble) 790 | 791 | - **Depth** (How many levels each tree splits into) 792 | 793 | - **Boosting Step Size** (How aggressively weak learners are combined into a strong learner; smaller steps help prevent overfitting) 794 | 795 | - For **Clustering**: 796 | 797 | - **Initialization** (initial centroids to start with). 798 | 799 | - **Number** (number of clusters set). 800 | 801 | - **Pre-Processing Steps** 802 | 803 | - Hands-on walkthrough using SageMaker. 804 | 805 | - Hyperparameter tuning can be very costly if not operating efficiently. Use **Sagemaker Model Tuning** to save time, effort and resources. 806 | 807 |
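A tiny experiment makes the learning-rate trade-off tangible. Minimizing $f(w) = w^2$ by gradient descent (a generic sketch, not SageMaker-specific): a small rate converges slowly, a moderate one quickly, and a too-large one diverges.

```python
def gradient_descent(lr, steps=50, w=10.0):
    """Minimize f(w) = w**2 (gradient: 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w  # step against the gradient
    return w

slow = gradient_descent(lr=0.01)     # crawls toward the minimum at 0
fast = gradient_descent(lr=0.1)      # converges quickly
diverged = gradient_descent(lr=1.1)  # overshoots: |w| grows every step
print(slow, fast, diverged)
```

This is why learning rate is usually the first hyperparameter a tuning job searches over, and why automated tuning beats guessing by hand.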
808 | 809 | - **Advanced Analytics with Amazon SageMaker** 810 | 811 | - Using Spark together with Sagemaker (through SDK). [**Check hands on here**](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb) 812 | 813 | - Spark hybrid connection with Sagemaker: 814 | 815 | ![Spark](Spark%20and%20Sagemaker.png) 816 | 817 | - Connecting services for anomaly detection on AWS: 818 | 819 | ![Anomaly Detection AWS](Anomaly%20Detection%20in%20AWS.png) 820 | 821 | - Building a Recommender model with MXNet and Gluon. (Summary slides) 822 | 823 | ![Choosing Model](Choosing%20Recommender%20Models.png) 824 | 825 | ![Metrics](Metrics%20to%20use%20for%20Recommender.png) 826 | 827 | --- 828 | 829 | ### **Math Required for MLE (Topics to Study)** 830 | 831 | - **Vectors** 832 | 833 | - Row vs Column vectors difference 834 | 835 | - Dimensions 836 | 837 | - Matrices 838 | 839 | - Operations 840 | 841 | - Scalar Multiplication 842 | 843 | - Addition 844 | 845 | - Zero Vector 846 | 847 | 848 | 849 | - **Geometry of Column Vectors** 850 | 851 | - Addition as Displacement 852 | 853 | - Scalar Multiplication 854 | 855 | - Subtraction as Mapping 856 | 857 | - **Measures of Magnitude** 858 | 859 | - Norm Properties 860 | 861 | - Euclidean 862 | 863 | - $L_p$ Norm 864 | 865 | - $L_\infty$ Norm 866 | 867 | - $L_0$ Norm 868 | 869 | - **Matrices** 870 | 871 | - Dot Products and how to extract angles 872 | 873 | - Orthogonality 874 | 875 | - Hyperplane 876 | 877 | - Decision Plane 878 | 879 | - Matrix multiplication and examples 880 | 881 | - Hadamard product 882 | 883 | - Matrix product properties 884 | 885 | - Distributivity 886 | 887 | - Associativity 888 | 889 | - The Identity Matrix 890 | 891 | - Non-commutativity 892 | 893 | - Geometry of matrix operations 894 | 895 | - Determinant 896 | 897 | - Intuition in 2 dimensions 898 | 899 | - Determinant computation 900 | 901 | -
The Two-by-Two 902 | 903 | - Matrix invertibility 904 | 905 | - Linear dependency 906 | 907 |
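The two-by-two determinant and its link to invertibility and linear dependency can be checked directly:

```python
def det2(m):
    """Determinant of a 2x2 matrix [[a, b], [c, d]] = a*d - b*c."""
    (a, b), (c, d) = m
    return a * d - b * c

def inverse2(m):
    """Inverse of a 2x2 matrix; exists only when the determinant != 0."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular (rows are linearly dependent)")
    return [[d / det, -b / det], [-c / det, a / det]]

print(det2([[1, 2], [3, 4]]))  # 1*4 - 2*3 = -2, so invertible
print(det2([[1, 2], [2, 4]]))  # 0: second row is 2x the first
print(inverse2([[1, 2], [3, 4]]))
```

A zero determinant collapses the plane onto a line (zero area), which is exactly the linear-dependency case where no inverse exists.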
908 | 909 | - **Probability** 910 | 911 | - Axioms of probability 912 | 913 | - Probability represented with Venn diagrams 914 | 915 | - Conditional probability 916 | 917 | - Bayes’ rule 918 | 919 | - Independent events and notation 920 | 921 | - Random variables 922 | 923 | - Chebyshev’s inequality 924 | 925 | - Entropy 926 | 927 | - Continuous random variables and probability density function 928 | 929 | - The Gaussian curve 930 | 931 | - Building machine learning models 932 | 933 | - **Univariate Derivatives** 934 | 935 | - **Multivariate Derivatives** -------------------------------------------------------------------------------- /Exam Readiness Course.md: -------------------------------------------------------------------------------- 1 | # Exam Readiness for MLE 2 | 3 | A brief refresher on what will be tested. Below we cover a concise version of what will be included and what we need to know in depth: 4 | 5 | --- 6 | 7 | ## 1. Course Introduction 8 | 9 |
10 | 11 | - Understand basic algorithms & hyper-parameter tuning. 12 | 13 | - Understanding ML Pipeline. 14 | 15 | - Experience with ML and Deep Learning frameworks. 16 | 17 | - Understanding of and experience in model training, deployment & operational best practices. 18 | 19 | - Use this guide as a starting point (or before exam) - to identify weaknesses and dig deeper. 20 | 21 |
22 | 23 | ## 2. Exam Overview and Test-Taking Strategies 24 | 25 |
26 | 27 | - Focus on key phrases and qualifiers to easily discard distractor options. 28 | 29 | - Focus on AWS Services, and see if there is a correlation between the assumption and the service. 30 | 31 | - Be prepared to do some minor calculations by hand. 32 | 33 | - Read and understand the question before reading answer options (pretend the answer options aren't even there at first). 34 | 35 | - Identify the key phrases and qualifiers in the question. 36 | 37 | - Try to answer the question before even looking at the answer choices, then see if any of those answer choices match your original answer. 38 | 39 | - Eliminate answer options based on what you know about the question, including the key phrases and qualifiers you highlighted earlier. 40 | 41 | - If you still don't know the answer, consider flagging the question and moving on to easier questions. But remember to answer all questions before the time is up on the exam, as there are no penalties for guessing. 42 | 43 |
44 | 45 | ## 3. Domains Which Will Be Covered 46 | 47 |
48 | 49 | ### **Data Engineering** 50 | 51 |
52 | 53 | - **Create data repositories for Machine Learning**. 54 | 55 | - Think of a Data Lake as an all-encompassing solution for ML tasks. 56 | 57 | - *Lake Formation* as a single place to manage access controls for data in your data lake 58 | 59 | - Recall storage solutions and use cases. 60 | 61 | - ![S3](images/S3.png) 62 | 63 | - *Amazon FSx* for Lustre → for running training jobs several times using different algorithms and parameters (when data is already on S3). 64 | 65 | - If data is already in EFS → use that as training data source (faster training start times). 66 | 67 | - Compare images per second that each file system can load (Amazon FSx fastest → S3 slowest). 68 | 69 | - Topics to study in depth: 70 | 71 | - AWS Lake Formation ☐ 72 | 73 | - Amazon S3 (as storage for a data lake) ☐ 74 | 75 | - Amazon FSx for Lustre ☐ 76 | 77 | - Amazon EFS ☐ 78 | 79 | - Amazon EBS volumes ☐ 80 | 81 | - Amazon S3 lifecycle configuration ☐ 82 | 83 |
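As a sketch of the S3 lifecycle configuration mentioned in the checklist (the structure accepted by boto3's `put_bucket_lifecycle_configuration`), the rule below transitions objects to cheaper storage classes over time and eventually expires them; the prefix and bucket are hypothetical:

```python
# Hypothetical lifecycle rule: move cold training data to cheaper
# storage classes, then delete it a year after creation.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# With real credentials this would be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-ml-bucket", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["ID"])
```

Lifecycle rules are the usual answer when an exam question asks how to cut storage costs for data that is accessed less and less over time.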
84 | 85 | - **Identify and implement a data-ingestion solution**. 86 | 87 | - Batch vs Streaming Ingestion 88 | 89 | - Batch has the following attributes: 90 | 91 | - Cheaper, simpler method for periodically ingesting data. 92 | 93 | - Services which can enable batch ingestion: 94 | 95 | - AWS Glue (ETL Service). 96 | 97 | - AWS Database Migration Service (Reads historical data from source systems, such as RDS, Warehouses, and NoSQL databases, at any desired interval). 98 | 99 | - AWS Step Functions (to automate the abovementioned). 100 | 101 | - Streaming has the following attributes: 102 | 103 | - Data is sourced, manipulated, and loaded as soon as it is created or recognized. 104 | 105 | - More expensive, harder to maintain. 106 | 107 | - The following services are associated with streaming data: 108 | 109 | - Kinesis Video Stream 110 | 111 | - Kinesis Data Analytics 112 | 113 | - Kinesis Firehose 114 | 115 | - Kinesis Data Stream 116 | 117 | - Topics to study more in depth: 118 | 119 | - Amazon Kinesis Data Streams ☐ 120 | 121 | - Amazon Kinesis Data Firehose ☐ 122 | 123 | - Amazon Kinesis Data Analytics ☐ 124 | 125 | - Amazon Kinesis Video Streams ☐ 126 | 127 | - AWS Glue ☐ 128 | 129 | - Apache Kafka ☐ 130 | 131 | - Identify and implement a data-transformation solution. 132 | 133 |
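For the streaming side, Kinesis Data Streams routes each record to a shard using the 128-bit MD5 hash of its partition key against the shards' hash-key ranges. A simplified sketch, assuming the shards split the keyspace evenly:

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Kinesis-style routing: the MD5 hash of the partition key selects
    the shard whose hash-key range contains it. With evenly split
    shards that reduces to integer division of the hash."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return min(h // (2 ** 128 // num_shards), num_shards - 1)

# The same key always lands on the same shard, which is what preserves
# per-key (f.ex per-device) ordering within a stream.
for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for(key, 4))
```

This is why choosing a high-cardinality partition key matters: too few distinct keys leaves some shards idle while others become hot.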
134 | 135 | --- 136 | 137 |
138 | 139 | ### **Domain 2: Exploratory Data Analysis** 140 | 141 |
142 | 143 | - Sanitize and prepare data for modeling. 144 | 145 | - Multivariate statistics (correlation & relationships). 146 | 147 | - Attribute statistics (f.ex mean, SD). 148 | 149 | - Individual statistics (f.ex rows, columns). 150 | 151 | - Clean data 152 | 153 | - Same scale (f.ex miles vs km). 154 | 155 | - Columns don't have multiple features (f.ex date, text). 156 | 157 | - Handle outliers. 158 | 159 | - Imputation & dealing with missing data. 160 | 161 | - Topics to study more: 162 | 163 | - Dataset generation 164 | 165 | - Amazon SageMaker Ground Truth 166 | 167 | - Amazon Mechanical Turk 168 | 169 | - Amazon Kinesis Data Analytics 170 | 171 | - Amazon Kinesis Video Streams 172 | 173 | - Data augmentation 174 | 175 | - Descriptive statistics 176 | 177 | - Informative statistics 178 | 179 | - Handling missing values and outliers 180 | 181 | - **Perform feature engineering.** 182 | 183 | - Reduce features (PCA, t-distributed stochastic neighbor embedding). 184 | 185 | - For numerical features we can transform (multiply, square, cube). 186 | 187 | - Categorical Feature Engineering: 188 | 189 | - Ordinal → ORDER MATTERS. 190 | 191 | - Nominal → ORDER DOESN'T MATTER. 192 | 193 | - Common techniques for scaling (clues in names): 194 | 195 | - Mean/variance standardization. 196 | 197 | - MinMax scaling. 198 | 199 | - Maxabs scaling. 200 | 201 | - Robust scaling. 202 | 203 | - Normalizer. 204 | 205 | - Topics to study: 206 | 207 | - Scaling. 208 | 209 | - Normalizing. 210 | 211 | - Dimensionality Reduction. 212 | 213 | - Date Formatting. 214 | 215 | - One-Hot Encoding. 216 | 217 |
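The scaling and encoding techniques listed above can be sketched in plain Python:

```python
def min_max_scale(xs):
    """MinMax scaling: rescale values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Mean/variance standardization: (x - mean) / std, so the result
    has mean 0 and unit variance."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

def one_hot(value, categories):
    """One-hot encode a nominal category (order doesn't matter)."""
    return [1 if value == c else 0 for c in categories]

miles = [3, 10, 25, 50]
print(min_max_scale(miles))   # first value 0.0, last value 1.0
print(standardize(miles))     # centered around 0
print(one_hot("red", ["red", "green", "blue"]))  # [1, 0, 0]
```

MinMax is sensitive to outliers (they define the range), which is the gap robust scaling is meant to close.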
218 | 219 | ### **Domain 3: Modeling** 220 | 221 |
222 | 223 | - **Frame business problems as ML problems.** 224 | 225 | - When to use ML: 226 | 227 | - If we have tons of data, and we can make predictions. 228 | 229 | - If we cannot code rules. 230 | 231 | - If we cannot scale the current solution. 232 | 233 | - ML Algorithms: 234 | 235 | - Supervised: 236 | 237 | - Binary Classification. 238 | 239 | - Multiclass Classification. 240 | 241 | - Regression problems. 242 | 243 | - Unsupervised 244 | 245 | - Reinforcement 246 | 247 | - Topics to study: 248 | 249 | - Supervised learning 250 | 251 | - Regression and classification 252 | 253 | - Unsupervised learning 254 | 255 | - Clustering 256 | 257 | - Anomaly detection 258 | 259 | - Deep learning 260 | 261 | - Perceptron 262 | 263 | - Components of an artificial neuron 264 | 265 | - **Select appropriate model(s) for given problem.** 266 | 267 | ![ML Map](images/ml_map.png) 268 | 269 | - **Train ML models.** 270 | 271 | - Splitting data (train/test/val): 272 | 273 | - 80%:10%:10% or 70%:15%:15% 274 | 275 | - Cross Validation (compare the performance of multiple models) 276 | 277 | - K-Fold (split the input data into k subsets of data). 278 | 279 | - Topics to study: 280 | 281 | - Amazon SageMaker workflow for training jobs. 282 | 283 | - Running a training job using containers. 284 | 285 | - Build your own containers. 286 | 287 | - P3 instances. 288 | 289 | - Components of an ML training job for deep learning. 290 | 291 |
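The k-fold idea above can be sketched as follows: each of the k folds is held out once for validation while the remaining k-1 folds train the model, so every sample is validated on exactly once.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross validation:
    each of the k folds is held out once while the other k-1 folds
    form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]

for train, val in k_fold_indices(n=10, k=5):
    print("train:", sorted(train), "val:", val)
```

Averaging a metric over the k validation folds gives a more stable comparison between models than a single train/test split.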
292 | 293 | - **Perform hyperparameter optimization.** 294 | 295 | - Different types of Hyperparameters: 296 | 297 | - Model Hyperparameters (filter size, pooling, architecture). 298 | 299 | - Optimizers (how the model learns - Adagrad, Xavier Init etc.). 300 | 301 | - Data Hyperparameters (augmentation - cropping, resizing). 302 | 303 | - Topics to study: 304 | 305 | - Amazon SageMaker hyperparameter tuning jobs 306 | 307 | - Common hyperparameters to tune 308 | 309 | - Momentum 310 | 311 | - Optimizers 312 | 313 | - Activation functions 314 | 315 | - Dropout 316 | 317 | - Learning rate 318 | 319 | - Regularization 320 | 321 | - Dropout 322 | 323 | - L1/L2 324 | 325 | - **Evaluate ML Models.** 326 | 327 | - Confusion Matrix. 328 | 329 | - Accuracy ((TP + TN) / All) → don't use when we have many true negatives (class imbalance). 330 | 331 | - Precision (TP / (TP + FP)) → when the cost of false positives is high (hiring for FAANG). 332 | 333 | - Recall (TP / (TP + FN)) → when the cost of false negatives is high (f.ex fraud detection). 334 | 335 | - F1-Score 336 | 337 | - Topics to Study 338 | 339 | - Metrics for regression: sum of squared errors, RMSE 340 | 341 | - Sensitivity 342 | 343 | - Specificity 344 | 345 | - Neural network functions like Softmax for the last layer 346 | 347 |
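The confusion-matrix metrics above can be computed directly; the fraud-style example below shows why accuracy is misleading when true negatives dominate:

```python
def metrics(tp, fp, fn, tn):
    """Confusion-matrix metrics: precision penalizes false positives,
    recall penalizes false negatives, F1 is their harmonic mean."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Fraud-style imbalance: 1000 transactions, only 50 fraudulent.
# Accuracy looks great even though 40 of 50 fraud cases were missed.
acc, prec, rec, f1 = metrics(tp=10, fp=5, fn=40, tn=945)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Here accuracy is 0.955 while recall is only 0.2, which is exactly the "many true negatives" trap: optimize recall when false negatives are the expensive error.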
348 | 349 | ### **Domain 4: ML Implementation and Operations** 350 | 351 |
352 | 353 | - **Build ML solutions for performance, availability, scalability, resiliency and fault tolerance.** 354 | 355 | - Design for high-availability & fault tolerance: 356 | 357 | - High Availability → system will keep working even when some component in the architecture stops working. 358 | 359 | - Fault Tolerance → ensures no degradation (despite failure in architecture). 360 | 361 | - Decoupling resources in a distributed fashion (f.ex storage & training jobs). 362 | 363 | - Use queues (Amazon SQS) or workflows (AWS Step Functions). 364 | 365 | - Monitor with CloudWatch (logs, alarms, events). 366 | 367 | - Use AWS CloudTrail to capture API calls and related events on behalf of your AWS account (store in S3). 368 | 369 | - Common practices for designing for failure: 370 | 371 | - Decouple ETL process from ML pipeline (AWS Glue & Amazon EMR) → use Apache Spark to handle large amounts. 372 | 373 | - Deploy Amazon SageMaker Endpoints backed by multiple instances across availability zones. 374 | 375 | - Containerize ML models for both inference & training in Sagemaker. 376 | 377 | - Use AWS Auto-Scaling 378 | 379 | - Topics to study in depth: 380 | 381 | - Amazon Deep Learning containers 382 | 383 | - AWS Deep Learning AMI (Amazon Machine Image) 384 | 385 | - AWS Auto Scaling 386 | 387 | - AWS GPU (P2 and P3) and CPU instances 388 | 389 | - Amazon CloudWatch 390 | 391 | - AWS CloudTrail 392 | 393 | - **Recommend and implement the appropriate ML services and features for a given problem**.
394 | 395 | ![Alt text](images/Architectures%20and%20Frameworks.png) 396 | 397 | ![Alt text](images/Sagemaker%20Services.png) 398 | 399 | - Topics to study more: 400 | 401 | - Amazon SageMaker Spark containers 402 | 403 | - Amazon SageMaker build your own containers 404 | 405 | - Amazon AI services 406 | 407 | - Amazon Translate 408 | 409 | - Amazon Lex 410 | 411 | - Amazon Polly 412 | 413 | - Amazon Transcribe 414 | 415 | - Amazon Rekognition 416 | 417 | - Amazon Comprehend 418 | 419 | - **Apply basic AWS security practices for ML solutions.** 420 | 421 | - IAM Role-Based Access (Least Privilege Access). 422 | 423 | - Launch Instances in customer managed VPC. 424 | 425 | - Specify subnets & security groups (creates elastic network interfaces associated with them). 426 | 427 | - Encrypt data at rest with SageMaker with AWS KMS (create, import, rotate, disable, delete, define usage policies for, and audit the use of encryption keys). 428 | 429 | - Ways to manage encryption keys with Amazon S3: 430 | 431 | - S3 built-in 432 | 433 | - SSE-S3 requires that Amazon S3 manage the data and master encryption keys. 434 | 435 | - AWS KMS 436 | 437 | - SSE-KMS requires that AWS manage the data key, but you manage the customer master key in AWS KMS. 438 | 439 | - Customer-provided 440 | 441 | - SSE-C requires that you manage the encryption key. 442 | 443 | - Security features integrated with SageMaker: 444 | 445 | - Authentication (IAM federation). 446 | 447 | - Gaining Insight (Restrict access by IAM policy and condition keys). 448 | 449 | - Audit (API logs to AWS CloudTrail - exception of InvokeEndpoint).
450 | 451 | - Data protection at rest 452 | 453 | - AWS KMS-based encryption for: 454 | 455 | - Notebooks 456 | 457 | - Training jobs 458 | 459 | - Amazon S3 location to store models / Endpoints 460 | 461 | - Data protection in motion 462 | 463 | - HTTPS for: 464 | 465 | - API/Console 466 | 467 | - Notebooks 468 | 469 | - VPC-enabled 470 | 471 | - Interface endpoint 472 | 473 | - Limit by IP / Training jobs & endpoints 474 | 475 | - Compliance programs 476 | 477 | - PCI DSS 478 | 479 | - HIPAA-eligible with BAA 480 | 481 | - ISO 482 | 483 | - Topics related to this subdomain: 484 | 485 | - Security on Amazon SageMaker 486 | 487 | - Infrastructure security on Amazon SageMaker 488 | 489 | - What is a: 490 | 491 | - VPC 492 | 493 | - Security Group 494 | 495 | - NAT gateway 496 | 497 | - Internet Gateway 498 | 499 | - AWS Key Management Service (AWS KMS) 500 | 501 | - AWS Identity and Access Management (IAM) 502 | 503 | - **Deploy and operationalize ML solutions.** 504 | 505 | - Apply all software engineering practices (f.ex security, logging and monitoring, task management, API versioning). 506 | 507 | - Add error recovery code and make sure that tests for unexpected data inputs exist (Unit Testing, Quality Assurance, UAT). 508 | 509 | - Automate system (AWS CodeBuild and AWS CodeCommit). 510 | 511 | - Track, identify, and account for changes in data sources. 512 | 513 | - Perform ongoing monitoring and evaluation of results. 514 | 515 | - Create methods to collect data from production inferences that can be used to improve future models. 516 | 517 | - Manage the following practices: 518 | 519 | - End-to-end and A/B testing 520 | 521 | - API versioning, if multiple versions of the model are used 522 | 523 | - Reliability and failover 524 | 525 | - Ongoing maintenance 526 | 527 | - Cloud infrastructure best practices, such as continuous integration/continuous deployment (CI/CD). 528 | 529 | - Deploy a model using Sagemaker hosting services: 530 | 531 | - S3 Path to store artifacts.
532 | 533 | - Docker registry path for image with inference code. 534 | 535 | - Name for deployment steps. 536 | 537 | - Create an endpoint configuration for an HTTPS endpoint 538 | 539 | - Define and apply a scaling policy that uses Amazon CloudWatch metrics: 540 | 541 | - Load-test your automatic scaling configuration. 542 | 543 | - Automatic scaling uses the policy to adjust the number of instances up or down in response to actual workloads. 544 | 545 | - Topics related to this subdomain: 546 | 547 | - A/B testing with Amazon SageMaker 548 | 549 | - Amazon SageMaker endpoints 550 | 551 | - Production variants 552 | 553 | - Endpoint configuration 554 | 555 | - Using Lambda with Amazon SageMaker 556 | 557 | --- 558 | -------------------------------------------------------------------------------- /Full Exams.md: -------------------------------------------------------------------------------- 1 | # List of Full Exams 2 | 3 | List of exams gathered from various sources (books, builders, youtube and official questions). 4 | Put your score on each and check if you make more than the 70% mark. 
5 | 6 | ## Exam #1 7 | 8 | https://testmoz.com/12500822 9 | 10 | Enter your score: 11 | 12 | ## Exam #2 13 | 14 | https://testmoz.com/12501020 15 | 16 | Enter your score: 17 | 18 | ## Exam #3 19 | 20 | https://testmoz.com/12501284 21 | 22 | --- 23 | 24 | (*Take note of weaker areas and go deep into them for a day or two → repeat again after a week*) -------------------------------------------------------------------------------- /One Minute AWS MLE Playlist.md: -------------------------------------------------------------------------------- 1 | # AWS Services for MLE Specialty 2 | 3 | Playlist of ~1-Minute Videos On Most Important AWS Services for the MLE Specialty 4 | 5 | --- 6 | 7 | - **[Amazon Rekognition](https://www.youtube.com/watch?v=Jw2zF_oj-I8)** 8 | 9 | - **[Amazon Textract](https://www.youtube.com/watch?v=Qz2Rdho0VIM)** 10 | 11 | - **[Amazon Transcribe](https://www.youtube.com/watch?v=oHNRrXq5ZD0)** 12 | 13 | - **[Amazon Translate](https://www.youtube.com/watch?v=e4R7UUcTVs4)** 14 | 15 | - **[Amazon Polly](https://www.youtube.com/watch?v=ba0fzNEu76I)** 16 | 17 | - **[Amazon Lex](https://www.youtube.com/watch?v=ePn-1hHXC3s)** 18 | 19 | - **[Amazon Kendra](https://www.youtube.com/watch?v=zmccRoe82FE)** 20 | 21 | - **[Amazon CodeGuru](https://www.youtube.com/watch?v=LqCoZlnZMGA)** 22 | 23 | - **[AWS Augmented AI](https://youtu.be/2stgxmvQ7Og)** 24 | 25 | - **[AWS DeepLens](https://www.youtube.com/watch?v=T6xtgiByC_o)** 26 | 27 | - **AWS DeepRacer** 28 | 29 | - **AWS DeepComposer** 30 | 31 | - **AWS Panorama Device and SDK** 32 | 33 | - **Amazon SageMaker** 34 | 35 | --- 36 | -------------------------------------------------------------------------------- /Practical Data Science on AWS.md: -------------------------------------------------------------------------------- 1 | # **[Practical Data Science on the AWS Cloud Specialization](https://www.coursera.org/specializations/practical-data-science)** 2 | 3 | Summary of the Coursera Series.
Composed of three parts: 4 | 5 | - Analyze Datasets and Train ML Models using AutoML. 6 | 7 | - Build, Train, and Deploy ML Pipelines using BERT. 8 | 9 | - Optimize ML Models and Deploy Human-in-the-Loop Pipelines. 10 | 11 | ***Highly recommended to do the hands on lab!*** 12 | 13 |
14 | 15 | ## Analyze Datasets and Train ML Models using AutoML 16 | 17 |
18 | 19 | - Focus on massive data which cannot be run locally. (Elastic **pay-as-you-go** infrastructure). 20 | 21 | - Ingest and analyze data: 22 | 23 | - Data Exploration & Bias Detection: 24 | 25 | - Amazon S3 & Amazon Athena. 26 | 27 | - **Athena** → serverless running SQL queries (petabytes). No data movement required. 28 | 29 | - AWS Glue. 30 | 31 | - **Glue Catalog** → creates a reference mapping of S3 data (metadata about schema etc.) 32 | 33 | - **Glue Crawler** → automatically infers data schema and updates the data catalog. 34 | 35 | - Sagemaker (Data Wrangler, Clarify). 36 | 37 | - **Data Wrangler** → can ingest data from data lakes, warehouses, databases. 38 | 39 | - Prepare & transform: 40 | 41 | - Feature Engineering and Feature Store: 42 | 43 | - Sagemaker **Data Wrangler**. 44 | 45 | - Sagemaker **Feature Store**. 46 | 47 | - Sagemaker **Processing Jobs**. 48 | 49 | - Train & Tune: 50 | 51 | - Automated ML & Model Train/Tune 52 | 53 | - Sagemaker **Autopilot**. 54 | 55 | - Sagemaker **Training & Debugger**. 56 | 57 | - Sagemaker **Hyperparameter Tuning**. 58 | 59 | - Deployment & Production: 60 | 61 | - Model Deployment & Automated Pipelines 62 | 63 | - Sagemaker **Endpoints**. 64 | 65 | - Sagemaker **Batch Transform**. 66 | 67 | - Sagemaker **Pipelines**. 68 | 69 | - Typical ML workflow and tools: 70 | 71 | ![Alt text](images/ML%20Workflow.png) 72 | 73 | - Statistical bias and feature importance: 74 | 75 | - **Statistical Bias**: Tendency to *overestimate/underestimate* a parameter. 76 | 77 | - Biased datasets → biased models. F.ex vastly more product reviews for A than B. 78 | 79 | - Different Types of Biases: 80 | 81 | - **Activity** Bias (f.ex popularity of product B over A). 82 | 83 | - **Social** Bias (f.ex preconceived notions about background). 84 | 85 | - **Selection** Bias (f.ex streaming movie recommendation wolves vs favorite actors).
86 | 87 | - Data Drift types (data distribution significantly varies from the training baseline): 88 | 89 | - *Covariate Drift* → distribution of features changes. 90 | 91 | - *Prior probability Drift* → distribution of target variable changes. 92 | 93 | - *Concept Drift* → relationship between both changes (f.ex age, geography location). 94 | 95 | - *Class Imbalance* → disproportionate number of examples per class (f.ex reviews). 96 | 97 | - *DPL* - *Difference in Proportions of Labels* (imbalance in positive outcomes between facets, f.ex one product receiving far higher ratings). 98 | 99 | - Main service to use: **Sagemaker Clarify** (import as library). bias config → run_pre_training_bias 100 | 101 | - **Bias** Detection 102 | 103 | - ML **explainability** 104 | 105 | - **Report** generation 106 | 107 | - Detecting statistical biases: 108 | 109 | - Sagemaker Data Wrangler → UI based flow; launch bias detection. 110 | 111 | - Sagemaker Clarify → API based approach, ability to scale (distributed processing jobs). 112 | 113 | - Feature Importance: 114 | 115 | - **SHAP** → game theory approach; multiplayer game where *outcome of the play is the ML prediction*. (Local & Global). 116 | 117 | - Analysis → Data Wrangler can create *feature importance reports*. 118 | 119 | - **Auto-ML** 120 | 121 | - Reduce Time-to-Market. 122 | 123 | - Iterate quickly using ML and automation. 124 | 125 | - Lower the ML barrier to entry for non-Data Scientists. 126 | 127 | - Save scarce resources for more vital use cases. 128 | 129 | - Fully transparent. Walks you through data processing, modeling etc. 130 | 131 | ![Alt text](images/AutoML%20Workflow.png) 132 | 133 | - Can be fully managed, or used only up to feature engineering. 134 | 135 | - Visibility into the optimized hyper-parameter tuning: 136 | 137 | ![Autopilot](images/Sagemaker%20Autopilot.png) 138 | 139 | - Model Hosting: 140 | 141 | - Batch & Real-Time Deployment: 142 | 143 | - Multiple Containers (Pipeline Model): 144 | 145 | - *Data transformation* container. 146 | 147 | - *Algorithm* Container. 
148 | 149 | - *Inverse Label* Transformer Container. 150 | 151 | - *Inference* Pipeline. 152 | 153 | - **Built-in algorithms** 154 | 155 | - Choose built-in algorithms when: 156 | 157 | - We need highly optimized and scalable solutions. 158 | 159 | - We need generalized solutions without much customization. 160 | 161 | - Built-in vs Bring Code vs Bring Containers. 162 | 163 | 
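The *DPL* metric from the bias section above is simple enough to compute by hand; a minimal pure-Python sketch (illustrative only — in practice SageMaker Clarify computes this for you):

```python
def dpl(labels_a, labels_b):
    """Difference in Proportions of Labels: gap between the fraction
    of positive outcomes observed for facet A vs facet B."""
    q_a = sum(labels_a) / len(labels_a)  # positive-outcome rate, facet A
    q_b = sum(labels_b) / len(labels_b)  # positive-outcome rate, facet B
    return q_a - q_b

# Facet A: 8/10 positive reviews; facet B: 3/10 positive reviews.
print(round(dpl([1] * 8 + [0] * 2, [1] * 3 + [0] * 7), 3))  # 0.5
```

A value near zero means the facets receive positive labels at similar rates; large magnitudes flag a label imbalance worth investigating before training.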
164 | 165 | --- 166 | 167 |
168 | 169 | ## Build, Train, and Deploy ML Pipelines using BERT 170 | 171 |
172 | 173 | - Feature Engineering (Main Steps): 174 | 175 | - **Feature Selection** → reduce dimensionality for faster training. 176 | 177 | - Feature selection *score* through **Data Wrangler** 178 | 179 |
180 | 181 | - **Feature Creation** → combine existing features or infer new attributes to increase accuracy of predictions. 182 | 183 | - **Feature Transformation** → imputing, scaling or transforming. 184 | 185 | - Feature Engineering Pipeline: 186 | 187 | - Select Features & Labels (Input of Raw Data) 188 | 189 | - Balance Dataset by Label 190 | 191 | - Split Dataset 192 | 193 | - Transform (Output of Features to be used for Training) 194 | 195 | 
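The Balance and Split steps above can be sketched in plain Python (a toy illustration; in practice this would run inside a Data Wrangler flow or a Processing Job):

```python
import random

def balance_by_label(rows, label_key="label", seed=42):
    """Naively undersample majority classes so every label ends up
    with as many rows as the rarest one."""
    random.seed(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    n = min(len(v) for v in by_label.values())
    balanced = [r for v in by_label.values() for r in random.sample(v, n)]
    random.shuffle(balanced)
    return balanced

def split(rows, train=0.8, val=0.1):
    """Split rows into train/validation/test partitions."""
    a, b = int(len(rows) * train), int(len(rows) * (train + val))
    return rows[:a], rows[a:b], rows[b:]
```

Undersampling is only one balancing strategy; oversampling the minority class or reweighting the loss are common alternatives when data is scarce.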
196 | 197 | - BERT 198 | 199 | - Based on Transformer Architecture. 200 | 201 | - Operates on Sentence Level. 202 | 203 | - Using Bi-Directional form, it can capture context. 204 | 205 | - RoBERTa - A Robustly Optimized BERT Pretraining Approach 206 | 207 | - Potential Challenge: Performing feature engineering at scale! 208 | 209 | - Sagemaker Processing 210 | 211 | - Performs preprocessing, postprocessing & data evaluation at scale. 212 | 213 | - Can scale through distributed clusters. 214 | 215 | - Built-in Sklearn container. 216 | 217 | - Feature Store 218 | 219 | - Repository to store engineered features. 220 | 221 | - Centralized (many people can contribute). 222 | 223 | - Reusable (can be used in multiple projects). 224 | 225 | - Discoverable (so that people can easily access it). 226 | 227 | - Can be retained, and deleted after training. 228 | 229 | - Sagemaker Feature Store 230 | 231 | - Centralized repository for depositing features. 232 | 233 | - Easily scalable. 234 | 235 | - Real-time & batch ability to look up features. 236 | 237 | 
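The Feature Store idea — feature records keyed by id and event time, with a latest-value lookup — can be illustrated with a toy in-memory version (this is the concept only, not the SageMaker Feature Store API):

```python
class ToyFeatureStore:
    """Toy feature store: keeps every (record_id, event_time) write and
    answers online lookups with the latest record per id."""

    def __init__(self):
        self._rows = []

    def ingest(self, record_id, event_time, features):
        # Writes are append-only, so history stays available for batch reads.
        self._rows.append((record_id, event_time, features))

    def get_latest(self, record_id):
        # Online lookup: return the most recent features for this id.
        matches = [(t, f) for rid, t, f in self._rows if rid == record_id]
        return max(matches)[1] if matches else None
```

The real service adds an offline store (S3) for training queries on top of the online one, but the id-plus-event-time record model is the same.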
238 | 239 | - Model Training & Tuning 240 | 241 | - Using pre-trained models, helps reduce training time & costs (adapting to our use case). 242 | 243 | - Built-in vs Pre-Trained Difference (Code vs Model) 244 | 245 | - Sagemaker Jumpstart 246 | 247 | - Has pre-trained models from TensorFlow Hub & PyTorch Hub. 248 | 249 | - Lets us deploy and/or fine-tune (NLP, CV etc.) in one click. 250 | 251 | - Storing training images 252 | 253 | - Configure dataset & evaluation metrics. 254 | 255 | - Train/Test/Val Split 256 | 257 | - Use Sagemaker Training Input to configure data input flow, for the training. 258 | 259 | - Use CloudWatch to define RegEx expressions to capture the evaluation metrics. 260 | 261 | - Evaluation Metrics (Validation Loss VS Validation Accuracy) 262 | 263 | - Configure hyper-parameters. 264 | 265 | - Number of Epochs, learning rate etc. 266 | 267 |
268 | 269 | - Provide training script. 270 | 271 | - Importing transformers, model configurations, model name, train model. 272 | 273 | - Fit the model. 274 | 275 | - Import Sagemaker PytorchEstimator. 276 | 277 | - Add requirements & instance types. 278 | 279 | - Pass defined hyperparameters. 280 | 281 | - Add estimator.fit. 282 | 283 | ![Alt text](images/Configure%20Training.png) 284 | 285 | 286 | - Debugging & Profiling 287 | 288 | - Common training errors: 289 | 290 | - Vanishing Gradients → As the network gains more layers, the product of derivatives shrinks until the partial derivative of the loss function approaches a value close to zero and effectively vanishes. 291 | 292 | - Exploding Gradients → the inverse of vanishing gradients; occurs when large error gradients accumulate, resulting in extremely large updates to the model weights during training. As a result, the model is unstable and incapable of learning from the training data. 293 | 294 | - Bad Initialization → identical initialization values cause issues; too-small or too-large values lead to vanishing/exploding gradients. 295 | 296 | - Overfitting/Underfitting. 297 | 298 | - System resources issues (bottlenecks): 299 | 300 | - I/O usage (for loading data) 301 | 302 | - CPU & Memory Usage (when processing data) 303 | 304 | - GPU Usage (training data) 305 | 306 | - Sagemaker Debugger 307 | 308 | - Monitors & profiles system resources (CPU, GPU, network, memory) in real time. 309 | 310 | - Gives recommendations on reallocating resources. 311 | 312 | - Captures debugging metrics (data, framework, output tensors etc.) 313 | 
315 | 316 | - Debugger Building Rules (Import rules & rules_config). 317 | 318 | ![Debugger](images/Debugger.png) 319 | 320 |
321 | 322 | - MLOps 323 | 324 | - Different from Software Development Lifecycle 325 | 326 | - Additional Pipeline tasks (new data generated). 327 | 328 | - Considerations (People, Tech, Process). 329 | 330 |
331 | 332 | ![Automated Quality Gate](images/Automated%20Quality%20Gates.png) 333 | 334 | - Best practice for MLOps: 335 | 336 | - Use Data Lakes. 337 | 338 | - Enable Traceability (Code Versioning & Data Versioning). 339 | 340 | - Pipeline Checks (Bias, Schema, Quality). 341 | 342 | - Log (Model, System Data). 343 | 344 | - Monitoring (Collect Metrics, Setup Alerts, Trigger Events). 345 | 346 | - Model Lineage & Artifacts. 347 | 348 | - Sagemaker Pipelines: 349 | 350 | - Create and visualize workflows. 351 | 352 | - Choose best performing model to deploy. 353 | 354 | - Automatic tracking of models. 355 | 356 | - Bring CI/CD to Machine Learning. 357 | 358 | - Step by Step Integration: 359 | 360 | - Data Processing (integrates with sagemaker processing job). 361 | 362 | - Model Training (integrates with sagemaker training job). 363 | 364 | - Evaluation (using sagemaker processing job to evaluate on the holdout test set). 365 | 366 | - Add Accuracy Condition (sagemaker workflow condition). 367 | 368 | - Register Model → Create Model. 369 | 370 | - Best practice → make use of the already built-in MLOps templates. 371 | 372 | 
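The Accuracy Condition step is essentially an automated quality gate: evaluate the candidate, and only register it if it clears a threshold. A minimal framework-free sketch (function names are illustrative, not the SageMaker Pipelines API):

```python
def quality_gate(evaluate, register, threshold=0.9):
    """Run evaluation; register the model only if accuracy clears the bar."""
    accuracy = evaluate()
    if accuracy >= threshold:
        register(accuracy)
        return True
    return False

registered = []
quality_gate(lambda: 0.93, lambda acc: registered.append(acc))
print(registered)  # [0.93]
```

In a real pipeline, `evaluate` would be a Processing Job scoring the holdout set and `register` a Model Registry step; the gate keeps underperforming candidates out of production automatically.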
373 | 374 | --- 375 | 376 |
377 | 378 | ## Optimize ML Models and Deployment 379 | 380 |
381 | 382 | - Automatic Model Tuning 383 | 384 | - GridSearchCV → tests every combination (time consuming, but most accurate). 385 | 386 | - RandomSearchCV → tests random combinations within search space. (faster, but less accurate). 387 | 388 | - BayesianOptimizationCV → solves HPT as a regression problem. (continuous improvement, but might get stuck in local minima). 389 | 390 | - HyperBand → multi-armed bandits approach (explore / exploit). (Probably best for time, but might leave good candidates out early). 391 | 392 | - Sagemaker Hyper Parameter Tuner. 393 | 394 | - Best Practices: 395 | 396 | - Start with small range & number of HP. 397 | 398 | - Warm start. (Uses result from previous jobs). 399 | 400 | - Enable early stop. 401 | 402 | - Use small number of concurrent training jobs. 403 | 404 | - Right size of compute resources. 405 | 406 | - Checkpointing: 407 | 408 | - Not re-running the whole thing (just from a given checkpoint). 409 | 410 | - Snapshots of: 411 | 412 | - Model Architecture 413 | 414 | - Model Weights 415 | 416 | - Training Configurations 417 | 418 | - Optimizer 419 | 420 | - Beware of high frequency and number of checkpoints (storage). 421 | 422 | - Training at Scale: 423 | 424 | - Data Parallelism & Model Parallelism 425 | 426 | - Deployment Strategies: 427 | 428 | - Blue / Green → shifting from original deployment to a new one through load balancer. (Rollback / Swap). However, if the new deployment is faulty, all shifted traffic is affected until rollback. 429 | 430 | - Shadow / Challenger → both models serve requests, but only one returns responses. We analyze the challenger first; if it performs well, we shift traffic. 431 | 432 | - Canary → split traffic between smaller specific groups, while the majority stays on the current model. Good for validating a bit first, before going all in. 433 | 434 | - A/B Testing → we split traffic between larger groups, to cross compare (even more than 2). Gathering live performance data. 435 | 436 | - Multi-Armed Bandits → the most dynamic and quickly adaptable. Reinforcement learning. 
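A canary (or A/B) split is, at its core, weighted random routing of requests across model variants; a toy sketch:

```python
import random

def route(variants, rng):
    """Pick a variant name according to its traffic weight (weights sum to 1)."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in variants.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding

# 95% of traffic stays on the current model; 5% hits the canary.
rng = random.Random(0)
weights = {"current": 0.95, "canary": 0.05}
counts = {"current": 0, "canary": 0}
for _ in range(10_000):
    counts[route(weights, rng)] += 1

print(sum(counts.values()))  # 10000
```

SageMaker endpoints implement this natively via production-variant weights; the canary's weight is increased only after its live metrics look healthy.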
437 | 438 | - Amazon Sagemaker Hosting: 439 | 440 | - Autoscaling workload demands. 441 | 442 | - Add cloudwatch metrics to configure the endpoints. 443 | 444 | - Sagemaker batch transform job: 445 | 446 | - Has an inference pipeline (chaining together multiple models). 447 | 448 | - Data transformation before sending to the model for prediction. 449 | 450 | - Monitoring ML Workloads: 451 | 452 | - Business 453 | 454 | - System 455 | 456 | - Models: 457 | 458 | - Decayed or changed information (degradation). 459 | 460 | - Concept drift (change in environment - label). 461 | 462 | - Continue to gather ground truth and keep checking performance. 463 | 464 | - Data Drift (change in input / features). 465 | 466 | - Deequ (open source library) - data profiling of what's normal. 467 | 468 | 
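A crude version of the Deequ-style "profile what's normal" idea: compare a live window's feature mean against the training baseline and flag when it moves too far (the tolerance is illustrative):

```python
def drift_alert(baseline, window, tolerance=0.2):
    """Flag drift when the live mean moves more than `tolerance`
    (as a fraction of the baseline mean) away from the baseline."""
    base_mean = sum(baseline) / len(baseline)
    live_mean = sum(window) / len(window)
    return abs(live_mean - base_mean) > tolerance * abs(base_mean)

print(drift_alert([10, 11, 9, 10], [10, 10, 11]))  # False
print(drift_alert([10, 11, 9, 10], [15, 16, 14]))  # True
```

Production tooling compares whole distributions (not just means) using statistical distance tests, but the pattern — baseline, live window, threshold, alert — is the same one Model Monitor schedules for you.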
469 | 470 | - Sagemaker Model Monitor: 471 | 472 | - Data Quality Monitor: 473 | 474 | - Deploy endpoint. 475 | 476 | - Enable data capture. 477 | 478 | - Set baselines. 479 | 480 | - Set up monitor schedule. 481 | 482 | - Interpreting Results (could be with CloudWatch). 483 | 484 | - Model Quality Monitor: 485 | 486 | - Same steps as above. 487 | 488 | - We need to configure collecting ground truth data to evaluate performance. 489 | 490 | - Statistical Bias Monitoring: 491 | 492 | - Integrate SageMaker Clarify with SageMaker Model Monitor - and repeat same steps. 493 | 494 | - Feature Attribution Monitor: 495 | 496 | - SageMaker Clarify + Model Monitor & Using SHAP values for baselining. 497 | 498 | - Humans in the Loop 499 | 500 | - Data Labeling (identifying raw data and adding informative labels, f.ex X-Ray, Cat Images etc.) 501 | 502 | - Automated + Human Labeling: 503 | 504 | - Amazon SageMaker GroundTruth → Pointer to S3 → define labeling tasks. 505 | 506 | - Human Workforce → 3rd party, or own workers (still S3). Manifest file (list of instructions & source ref). 507 | 508 | - Type of labeling: 509 | 510 | - Image 511 | 512 | - Single Label 513 | 514 | - Multi Label 515 | 516 | - Bounding Box 517 | 518 | - Semantic Segmentation 519 | 520 | - Label Verification 521 | 522 | - Video 523 | 524 | - Clip Classification. 525 | 526 | - Object Detection. 527 | 528 | - Text 529 | 530 | - NER 531 | 532 | - Single/Multi Label 533 | 534 | - Custom → GroundTruth Templates 535 | 536 | - Lambda → GroundTruth → Lambda 537 | 538 | 
539 | 540 | - Best Labeling practices: 541 | 542 | - Provide clear instructions. 543 | 544 | - Consolidate annotations to improve label quality. 545 | 546 | - Verify and adjust labels. 547 | 548 | - Use active learning / automated data labeling on large datasets. 549 | 550 | - Re-use prior labeling jobs. 551 | 552 | - Amazon Augmented AI (A2I) 553 | 554 | - Provides built-in human review workflow. 555 | 556 | - Allows human reviewer to step in and audit. 557 | 558 | - Define workforce (turk, private, vendor). -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Comprehensive Guide to AWS Certified Machine Learning –Specialty (MLS-C01) 2 | 3 | Here is a summary of the main resources used on each separate file for accomplishing the AWS MLE Specialty Exam preparation. Follow separately each of them to better prepare for the official exam. On each folder you can find my summarized version for each topic. Feel free to contribute and add other resources which you are using or have used to pass the exam. 
4 | 5 | ## **Start the Studying by Clicking the Items Below** 6 | 7 | - ### [Official Exam Guide](https://github.com/Xns140/AWS-MLE-Docs/blob/master/AWS%20MLE%20Study%20Guide.md) ☑ 8 | 9 | - ### [AWS Ramp Up Guide](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/AWS%20Ramp%20Up%20Guide.md) ☑ 10 | 11 | - ### [One Minute AWS MLE Playlist](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/One%20Minute%20AWS%20MLE%20Playlist.md) 12 | 13 | - ### [AWS Power Hour: 4 Episodes](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/AWS%20Power%20Hour.md) ☑ 14 | 15 | - ### [Practical Data Science on the AWS Cloud Specialization](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/Practical%20Data%20Science%20on%20AWS.md) ☑ 16 | 17 | - ### [Exam Readiness Course](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/Exam%20Readiness%20Course.md) ☑ 18 | 19 | - ### [AWS Certified Machine Learning Specialty 2022 - Hands On!](https://github.com/JShollaj/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/Udemy%20-%20AWS%20Certified%20Machine%20Learning%20Specialty%202022%20-%20Hands%20On!.md) ☑ 20 | 21 | - ### [Related Whitepapers](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/Related%20Whitepapers.md) ☑ 22 | 23 | - ### [Amazon Machine Learning - Developer Guide](https://docs.aws.amazon.com/machine-learning/latest/dg/what-is-amazon-machine-learning.html) ☑ 24 | 25 | - ### [AWS AI Services Guide](https://github.com/koreva-liubov/AWS-Certified-Machine-Learning-Specialty-Guide/blob/aca0e1ebd08a693d9d83a24606d65483eb7b9e6c/pdf/AWS-AI-Services-2023.pdf) ☑ 26 | 27 | - ### [Full Exams](https://github.com/Xns140/AWS-Certified-Machine-Learning-Specialty-Guide/blob/master/Full%20Exams.md) ☑ 28 | 29 | -------------------------------------------------------------------------------- 
/Related Whitepapers.md: -------------------------------------------------------------------------------- 1 | ## [**Augmented AI: The Power of Human and Machine**](https://d1.awsstatic.com/whitepapers/augmented-ai-the-power-of-human-and-machine.pdf) 2 | 3 | 4 | --- 5 | 6 | ## [**Machine Learning Lens - AWS Well-Architected Framework**](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/machine-learning-lens.html) -------------------------------------------------------------------------------- /Udemy - AWS Certified Machine Learning Specialty 2022 - Hands On!.md: -------------------------------------------------------------------------------- 1 | # AWS Certified Machine Learning Specialty Course MLS C01 2 | 3 | ## Data Engineering: Moving, Storing and Processing data in AWS 4 | 5 | ### Amazon S3 6 | 7 | #### **Overview** 8 | 9 | - Simple Storage Service (S3) 10 | - Store objects (files) in buckets (directories). 11 | - Buckets must have a globally unique name. 12 | - Max object size is 5TB. 13 | 14 | - Durability and Availability: 15 | 16 | - Durability of: **11 x 9s** → If we have 10 million objects, lose 1 every 10,000 years. 17 | 18 | - Availability of: **4 x 9s** → Not available ~53 minutes in a year. 19 | 20 | #### **S3 For ML** 21 | 22 | - We can create **Data Lakes** with S3 as the storage. 23 | 24 | - Fully managed & 11 x 9s Durability 25 | - Storage is decoupled from Computing resources. 26 | 27 | - Object storage supports any file format. 28 | 29 | - Best files formats for performance are Avro & Parquet. 30 | 31 | - We can perform data partitioning to optimize performance. 32 | 33 | - By Date (Hourly, Daily, Monthly). 34 | - By Product (ID, Family). 35 | - Query Patterns 36 | 37 | #### **S3 Storage Classes** 38 | 39 | - Amazon S3 **Standard - General Purpose** 40 | 41 | - Used for frequently accessed data 42 | - Sustain 2 concurrent facility failures. 43 | - Low latency & High throughput. 
44 | - Used for Big Data Analytics, Mobile & Gaming etc. 45 | 46 | - Amazon S3 **Standard - Infrequent Access (IA)** 47 | 48 | - Less frequently accessed, but requires rapid access when needed. 49 | - Lower cost than S3 Standard. 50 | - Lower Availability (99.9%). 51 | - Used for Disaster Recovery, Backups etc. 52 | 53 | - Amazon S3 One Zone - Infrequent Access 54 | 55 | - High durability (11x9s), but only in a single AZ. 56 | - Lower cost than S3 Standard IA. 57 | - Storing secondary data. 58 | 59 | - Amazon S3 Glacier - Instant Retrieval 60 | 61 | - Low-cost archive storage for data needing immediate access. 62 | - Millisecond retrieval, but objects need to be stored for at least 90 days. 63 | 64 | - Amazon S3 Glacier - Flexible Retrieval 65 | 66 | - Minimum storage duration (90 days). 67 | - Different types of retrievals: 68 | - Expedited (1-5 minutes) 69 | - Standard (3-5 hours) 70 | - Bulk (5-12 hours) - Free tier 71 | 72 | - Amazon S3 Glacier - Deep Archive 73 | 74 | - Cheapest (about $1 per TB per month) 75 | - Longest term storage. 76 | - Minimum storage duration (180 days). 77 | - For regulatory and compliance cases. 78 | - Data retrieval options: 79 | - Standard (12 hours) 80 | - Bulk (48 hours) 81 | 82 | - Amazon S3 Intelligent Tiering 83 | 84 | - Shifts data through different storage tiers. 85 | - Has 5 different access tiers, based on frequency: 86 | 87 | - **Frequent Access** tier (within 30 days) 88 | - **Infrequent Access** tier (> 30 days, but still low latency & high throughput) 89 | - **Archive Instant Access** tier (> 90 days, but still low latency & high throughput) 90 | - **Archive Access** tier (> 90 days; we choose async access, to save costs; 3-5 hours to access) 91 | - **Deep Archive Access** tier (> 180 days, 12 hours to access) 92 | 93 | - We can also configure lifecycle rules based on: 94 | 95 | - Start with S3 Storage Analysis to create rules. 96 | - Transition to storage class. 97 | - Costs. 98 | - Object based. 99 | - Expiration/Deletion based on inactivity. 
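The lifecycle rules above boil down to a rule document attached to the bucket. A sketch of one such rule (bucket name and prefix are hypothetical); with credentials configured it would be applied through boto3's `put_bucket_lifecycle_configuration`:

```python
# Hypothetical rule: move objects under logs/ to Standard-IA after 30 days,
# to Glacier Flexible Retrieval after 90, and delete them after 365.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-old-logs",  # illustrative rule name
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it (requires AWS credentials, so left commented out):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket", LifecycleConfiguration=lifecycle)
```

Transition days must increase with colder storage classes, matching the tiering order in the list above.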
100 | 101 | #### **S3 Security** 102 | 103 | - S3 Encryption for Objects (4 Types) 104 | 105 | - **SSE-S3**: encrypts S3 objects using keys handled & managed by AWS. 106 | 107 | - **SSE-KMS**: use AWS Key Management Service to manage encryption keys (audit trail & increased security). 108 | 109 | - **SSE-C**: client manages encryption keys. 110 | 111 | - **Client-Side** 112 | 113 | - S3 Security Layers 114 | 115 | - User Based (IAM Policies) 116 | - Resource Based (Object & Bucket policies) 117 | - JSON Based (Objects/Buckets, Actions, Effect, Principal) 118 | - Networking (Allowing traffic within configured VPC) 119 | 120 | --- 121 | 122 | ### Amazon Kinesis 123 | 124 | #### **Kinesis Overview** 125 | 126 | - Real-time processing service for big data. 127 | - Composed of 4 main sub-groups: 128 | 129 | - Kinesis Streams (Ingesting in Real Time) 130 | 131 | - Ingests data in divided shards. 132 | - Provisioned Mode: throughput is provisioned per shard (1MB/s write, 2MB/s read). 133 | - On-Demand Mode: scales automatically (4MB/s write by default) - we pay per stream/hr & data instead of shards. 134 | - Limits: 135 | - 1MB/s or 1000 messages/s at write per shard (Producer) 136 | - 2MB/s at read PER SHARD across all consumers 137 | - 5 API calls per second PER SHARD across all consumers 138 | - Data retention (24hrs - 365days) 139 | 140 | - Kinesis Analytics (Analytics in Real-Time) 141 | - Serverless responsive analytics in real time. 142 | - Continuous metric generation (f.ex live leaderboard). 143 | - Can be configured through IAM for permissions. 144 | - Lambda can be used for preprocessing. 145 | - Kinesis Firehose (Storing in Real-Time) 146 | - Automatic scaling & fully managed data storing for *near* real time. 147 | - Supports data conversion, transformation & compression. 148 | - Kinesis Video Streams (Video Streaming) 149 | - Keeps data for 1hr - 10years. 150 | - Used for security cameras, RADAR etc. 
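Given the per-shard limits above, sizing a provisioned stream is a ceiling calculation over the three constraints; a quick sketch:

```python
import math

def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Provisioned-mode sizing: 1 MB/s and 1,000 records/s write per shard,
    2 MB/s read per shard — the tightest constraint wins."""
    return max(
        math.ceil(write_mb_per_s / 1.0),
        math.ceil(records_per_s / 1000.0),
        math.ceil(read_mb_per_s / 2.0),
    )

# 5 MB/s writes, 3,500 records/s, 12 MB/s reads → reads dominate: 6 shards.
print(shards_needed(write_mb_per_s=5, records_per_s=3500, read_mb_per_s=12))  # 6
```

This kind of back-of-envelope shard math shows up in exam questions; on-demand mode makes it unnecessary at the cost of per-GB pricing.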
151 | 152 | --- 153 | 154 | ### Amazon Glue (Batch) 155 | 156 | - **Glue Data Catalog** 157 | 158 | - Automated and versioned schema inference (useful for DB). 159 | - Glue Crawlers to build the catalog on a schedule or on demand. 160 | - They need an IAM role to access data stores. 161 | - Works with JSON, Parquet, CSV & S3, Redshift, RDS. 162 | - Extracts partitions based on how data is organized in S3. 163 | 164 | - **Glue ETL** 165 | 166 | - Serverless ETL Tool, used for batch transformations. 167 | - Bundled Transformations (Drop/Filter/Join/Map). 168 | - ML Transformations (Matching Records). 169 | - Apache Spark Transformations. 170 | - Format transformations (CSV, JSON, Avro, Parquet, ORC, XML). 171 | 
218 | - Can be integrated to schedule jobs with CloudWatch/StepFunctions 219 | 220 | 221 | -------------------------------------------------------------------------------- /images/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/1.png -------------------------------------------------------------------------------- /images/AWS ML Flywheel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/AWS ML Flywheel.png -------------------------------------------------------------------------------- /images/Amazon Flywheel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Amazon Flywheel.png -------------------------------------------------------------------------------- /images/Anomaly Detection in AWS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Anomaly Detection in AWS.png -------------------------------------------------------------------------------- /images/Architectures and Frameworks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Architectures and Frameworks.png -------------------------------------------------------------------------------- /images/AutoML Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/AutoML Workflow.png -------------------------------------------------------------------------------- /images/Automated Quality Gates.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Automated Quality Gates.png -------------------------------------------------------------------------------- /images/Choosing Recommender Models.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Choosing Recommender Models.png -------------------------------------------------------------------------------- /images/Configure Training.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Configure Training.png -------------------------------------------------------------------------------- /images/Debugger.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Debugger.png -------------------------------------------------------------------------------- /images/ML Workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/ML Workflow.png -------------------------------------------------------------------------------- /images/Metrics to use for Recommender.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Metrics to use for Recommender.png -------------------------------------------------------------------------------- /images/Model Deployment for Drift Monitoring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Model Deployment for Drift Monitoring.png 
-------------------------------------------------------------------------------- /images/Recommender Development and Deployment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Recommender Development and Deployment.png -------------------------------------------------------------------------------- /images/Recommender Logging.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Recommender Logging.png -------------------------------------------------------------------------------- /images/S3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/S3.png -------------------------------------------------------------------------------- /images/SPARK.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/SPARK.png -------------------------------------------------------------------------------- /images/Sagemaker Autopilot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Sagemaker Autopilot.png -------------------------------------------------------------------------------- /images/Sagemaker Services.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Sagemaker Services.png -------------------------------------------------------------------------------- /images/Spark and Sagemaker.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Spark and Sagemaker.png -------------------------------------------------------------------------------- /images/Summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/Summary.png -------------------------------------------------------------------------------- /images/ml_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/ml_map.png -------------------------------------------------------------------------------- /images/precisionvsrecall.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/images/precisionvsrecall.png -------------------------------------------------------------------------------- /pdf/AWS-AI-Services-2023.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JShollaj/AWS-Machine-Learning/HEAD/pdf/AWS-AI-Services-2023.pdf --------------------------------------------------------------------------------