├── 01-kaggle-dataset-download.html
├── 02-extract-transform-load.html
├── 03-data-visualization.html
├── 04-train-model-pipelines.html
├── 05-model-serving.html
├── Architecture-flow.png
├── README.md
└── workflow.html
/Architecture-flow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdrakiburrahman/azure-databricks-malware-prediction/b82994c5b74ef93b959b7423be575e90a7dd24ee/Architecture-flow.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# azure-databricks-malware-prediction
End-to-end Machine Learning Pipeline demo using Delta Lake, MLflow and AzureML in Azure Databricks

# The Problem
The problem I set out to solve is this public [Kaggle competition](https://www.kaggle.com/c/microsoft-malware-prediction) hosted by Microsoft earlier this year. Essentially, Microsoft has provided datasets containing Windows telemetry data for a variety of machines; in other words, a dump of various Windows features (OS build, patch version, etc.) for machines like our laptops. The idea is to use the train.csv and test.csv datasets to develop a Machine Learning model that predicts a Windows machine's probability of getting infected with various families of malware.

## Architecture
![Architecture flow](Architecture-flow.png)

# 01-kaggle-dataset-download: Connecting to Kaggle via API and copying competition files to Azure Blob Storage

The Kaggle API allows us to connect to various competitions and datasets hosted on the platform: [API documentation](https://github.com/Kaggle/kaggle-api).

**Pre-requisite**: You should have downloaded the _kaggle.json_ file containing your API *username* and *key*, and updated the notebook below with those credentials.
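For reference, here is a minimal sketch of wiring those credentials into the notebook via Databricks secrets; the secret scope name `kaggle` and its key names are hypothetical placeholders for whatever you set up in your workspace:

```python
import os

# `dbutils` is provided by the Databricks notebook runtime.
# The Kaggle client reads these environment variables as an
# alternative to a local ~/.kaggle/kaggle.json file.
os.environ["KAGGLE_USERNAME"] = dbutils.secrets.get(scope="kaggle", key="username")
os.environ["KAGGLE_KEY"] = dbutils.secrets.get(scope="kaggle", key="key")
```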
In this notebook, we will -
1. Mount a container called `bronze` in Azure Blob Storage
2. Import the competition data set in .zip format from Kaggle to the mounted container
3. Unzip the downloaded data set and remove the zip file (all three steps are sketched below)
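
A minimal sketch of those three steps, reusing the credentials set above. The storage account name, mount point `/mnt/bronze`, and secret key name are illustrative placeholders; the zip filename follows the Kaggle convention of `<competition-slug>.zip`:

```python
import os
import zipfile

# 1. Mount the `bronze` container (substitute your own account/secret).
storage_account = "<storage-account>"
if not any(m.mountPoint == "/mnt/bronze" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"wasbs://bronze@{storage_account}.blob.core.windows.net",
        mount_point="/mnt/bronze",
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                dbutils.secrets.get(scope="kaggle", key="storage-account-key")
        },
    )

# 2. Download the competition zip straight to the mount.
#    The /dbfs/... prefix exposes DBFS paths to local-file APIs
#    such as the Kaggle client.
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()  # picks up KAGGLE_USERNAME / KAGGLE_KEY from the environment
api.competition_download_files(
    "microsoft-malware-prediction", path="/dbfs/mnt/bronze", quiet=False
)

# 3. Unzip the archive, then remove it.
zip_path = "/dbfs/mnt/bronze/microsoft-malware-prediction.zip"
with zipfile.ZipFile(zip_path) as z:
    z.extractall("/dbfs/mnt/bronze")
os.remove(zip_path)
```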

# 02-extract-transform-load: EXTRACT, TRANSFORM, LOAD from BRONZE to SILVER Zone

Here is the Data Lake Architecture we are emulating:

*(diagram: multi-zone Data Lake architecture, with raw data landing in the BRONZE zone and curated data in the SILVER zone)*

**Pre-requisite**: You should have run `01-kaggle-dataset-download` to download the Kaggle dataset to BRONZE Zone.
In this notebook, we will -
1. **EXTRACT** the downloaded Kaggle `train.csv` dataset from BRONZE Zone into a dataframe
2. Perform various **TRANSFORMATIONS** on the dataframe to enhance/clean the data
3. **LOAD** the data into SILVER Zone in Delta Lake format
4. Repeat the above three steps for `test.csv`
5. Take the curated `test.csv` data and enhance it further for ML scoring later on (steps 1-3 are sketched below)
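
A minimal sketch of steps 1-3, assuming the hypothetical `/mnt/bronze` and `/mnt/silver` mount points from earlier. The cleanup shown (dropping near-empty columns, normalizing the `SmartScreen` strings) is just one representative transformation; the notebook performs many more:

```python
from pyspark.sql import functions as F

# EXTRACT: read the raw CSV from the BRONZE mount. Schema inference is
# convenient here; a hand-written schema would be faster on a file this size.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/bronze/train.csv"))

# TRANSFORM: drop columns that are almost entirely null, then
# normalize a noisy string column.
null_fractions = raw.select(
    [(F.count(F.when(F.col(c).isNull(), c)) / F.count(F.lit(1))).alias(c)
     for c in raw.columns]
).first().asDict()
keep = [c for c, frac in null_fractions.items() if frac < 0.99]
curated = raw.select(keep).withColumn(
    "SmartScreen", F.lower(F.trim(F.col("SmartScreen")))
)

# LOAD: persist to the SILVER zone in Delta Lake format.
(curated.write
 .format("delta")
 .mode("overwrite")
 .save("/mnt/silver/malware_train_delta"))
```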

# 03-data-visualization: Data Visualization

I'm leveraging a lot of the visualization/data exploration done by the brilliant folks over at [Kaggle](https://www.kaggle.com/c/microsoft-malware-prediction/notebooks) who have already spent a lot of time exploring this dataset.

**Pre-requisite**: You should have run `02-extract-transform-load` and have the curated data ready to go in SILVER Zone.
In this notebook, we will -
1. Import the `malware_train_delta` training dataset from SILVER Zone into a dataframe
2. Explore live visualization capabilities built into the Databricks GUI
3. Explore the top 10 features most correlated with the `HasDetections` column - the column of interest
4. Generate a correlation heatmap for the top 10 features (steps 3 and 4 are sketched after this list)
5. Explore the top 3 features via various plots to visualize the data
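
A minimal sketch of the correlation exploration (steps 3 and 4), assuming the hypothetical SILVER-zone Delta path from the ETL notebook and that the label was read as a numeric column; the data is sampled down before the pandas conversion because the full train set has millions of rows:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = spark.read.format("delta").load("/mnt/silver/malware_train_delta")

# Keep the numeric columns (HasDetections included) and sample
# a small fraction before converting to pandas.
numeric_cols = [f.name for f in df.schema.fields
                if f.dataType.typeName() in ("integer", "long", "float", "double")]
sample_pd = df.select(numeric_cols).sample(fraction=0.01, seed=42).toPandas()

# Rank features by absolute correlation with the label.
corr_with_label = (sample_pd.corr()["HasDetections"]
                   .drop("HasDetections").abs()
                   .sort_values(ascending=False))
top10 = corr_with_label.head(10).index.tolist()

# Correlation heatmap over the top-10 features plus the label.
sns.heatmap(sample_pd[top10 + ["HasDetections"]].corr(),
            annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Top 10 features vs HasDetections")
plt.show()
```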

# 04-train-model-pipelines: Use MLflow to create Machine Learning Pipelines and Track Experiments

## Tracking Experiments with MLflow

MLflow Tracking is one of the three main components of MLflow. It is a logging API specific to machine learning, agnostic to the libraries and environments that do the training. It is organized around the concept of **runs**, which are executions of data science code. Runs are aggregated into **experiments**: many runs can be part of a given experiment, and an MLflow server can host many experiments.
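
A minimal sketch of how a run logs to an experiment; the experiment path, run name, parameter names, metric value, and artifact file are all placeholders:

```python
import mlflow

# Experiments group runs; this workspace path is hypothetical.
mlflow.set_experiment("/Users/<you>/malware-prediction")

with mlflow.start_run(run_name="logistic-regression-baseline"):
    # Everything logged inside this block is attached to the run.
    mlflow.log_param("regParam", 0.01)
    mlflow.log_param("maxIter", 50)
    mlflow.log_metric("areaUnderROC", 0.71)  # placeholder value
    mlflow.log_artifact("roc_curve.png")     # any local file produced earlier
```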

MLflow Tracking also serves as a **model registry**, so tracked models can easily be stored and, as necessary, deployed into production.
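
A hedged sketch of logging and registering a tracked model in one step, assuming `pipeline_model` is a fitted Spark ML PipelineModel from the training run and `malware-prediction` is an illustrative registry name:

```python
import mlflow.spark

# `registered_model_name` both logs the model as a run artifact and
# creates/increments a version in the model registry.
mlflow.spark.log_model(
    pipeline_model,
    artifact_path="model",
    registered_model_name="malware-prediction",
)
```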