├── .github └── FUNDING.yml ├── .gitignore ├── BinaryClassification ├── DiabetesDetection │ ├── README.md │ ├── assets │ │ └── data.png │ └── diabetes.csv ├── FraudDetection │ ├── README.md │ └── assets │ │ └── data.png ├── HeartDiseasePrediction │ ├── Heart.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ │ └── data.png │ └── processed.cleveland.data.csv ├── SpamDetection │ ├── Program.fs │ ├── README.md │ ├── SpamDetection.fsproj │ ├── assets │ │ └── data.png │ └── spam.tsv └── TitanicPrediction │ ├── Program.fs │ ├── README.md │ ├── TitanicPrediction.fsproj │ ├── assets │ ├── data.jpg │ └── titanic.jpeg │ ├── test_data.csv │ └── train_data.csv ├── Clustering └── IrisFlower │ ├── IrisFlower.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── flowers.png │ └── iris-data.csv ├── LoadingData └── CaliforniaHousing │ ├── CaliforniaHousing.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── plot.png │ └── california_housing.csv ├── MulticlassClassification ├── DigitRecognition │ ├── Mnist.fsproj │ ├── Program.fs │ ├── README.md │ └── assets │ │ ├── datafile.png │ │ ├── mnist.png │ │ └── mnist_hard.png └── FlagToxicComments │ ├── README.md │ └── assets │ └── data.png ├── README.md ├── Recommendation └── MovieRecommender │ ├── MovieRecommender.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── movies.png │ ├── recommendation-movies.csv │ ├── recommendation-ratings-test.csv │ └── recommendation-ratings-train.csv ├── Regression ├── BikeDemandPrediction │ ├── BikeDemand.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ │ ├── bikesharing.jpeg │ │ └── data.png │ └── bikedemand.csv ├── HousePricePrediction │ ├── README.md │ ├── assets │ │ └── data.png │ └── data.csv └── TaxiFarePrediction │ ├── Program.fs │ ├── README.md │ ├── TaxiFarePrediction.fsproj │ ├── assets │ └── data.png │ └── yellow_tripdata_2018-12_small.csv └── assets └── DSC-FS.jpg /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [mdfarragher] 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | BinaryClassification/HeartDiseasePrediction/bin/ 2 | BinaryClassification/HeartDiseasePrediction/obj/ 3 | BinaryClassification/SpamDetection/bin/ 4 | BinaryClassification/SpamDetection/obj/ 5 | BinaryClassification/TitanicPrediction/bin/ 6 | BinaryClassification/TitanicPrediction/obj/ 7 | Clustering/IrisFlower/obj/ 8 | MulticlassClassification/DigitRecognition/bin/ 9 | MulticlassClassification/DigitRecognition/obj/ 10 | Regression/BikeDemandPrediction/bin/ 11 | Regression/BikeDemandPrediction/obj/ 12 | Regression/TaxiFarePrediction/bin/ 13 | Regression/TaxiFarePrediction/obj/ 14 | MulticlassClassification/DigitRecognition/mnist_test.csv 15 | MulticlassClassification/DigitRecognition/mnist_train.csv 16 | Clustering/IrisFlower/bin/ 17 | Recommendation/MovieRecommender/bin/ 18 | Recommendation/MovieRecommender/obj/ 19 | LoadingData/CaliforniaHousing/bin/ 20 | LoadingData/CaliforniaHousing/obj/ 21 | Regression/TaxiFarePrediction/yellow_tripdata_2018-12.csv 22 | -------------------------------------------------------------------------------- /BinaryClassification/DiabetesDetection/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | The Pima are a tribe of North 
American Indians who traditionally lived along the Gila and Salt rivers in Arizona, U.S., in what was the core area of the prehistoric Hohokam culture. They speak a Uto-Aztecan language, call themselves the River People, and are usually considered to be the descendants of the Hohokam. 4 | 5 | But there's a weird thing about the Pima: they have the highest reported prevalence of diabetes of any population in the world. Their diabetes is exclusively type 2 diabetes, with no evidence of type 1 diabetes, even in very young children with an early onset of the disease. 6 | 7 | This suggests that the Pima carry a specific gene mutation that makes them extremely susceptible to diabetes. The tribe has been the focus of many medical studies over the years. 8 | 9 | In this case study, you're going to participate in one of these medical studies. You will build an app that loads a dataset of Pima medical records and tries to predict from the data who has diabetes and who does not. 10 | 11 | How accurate will your app be? Do you think you will be able to correctly predict every single diabetes case? 12 | 13 | That's for you to find out! 14 | 15 | # The dataset 16 | 17 | ![The dataset](./assets/data.png) 18 | 19 | In this case study you'll be working with a dataset containing the medical records of 768 Pima women. 20 | 21 | There is a single file in the dataset: 22 | * [diabetes.csv](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) which contains 768 records, 8 input features, and 1 output label. You will use this file to train and test your model. 23 | 24 | You'll need to [download the dataset](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) and save it in your project folder to get started. 25 | 26 | Here's a description of all columns in the file: 27 | * **Pregnancies**: the number of times the woman got pregnant 28 | * **Glucose**: the plasma glucose concentration at 2 hours in an oral glucose tolerance test 29 | * **BloodPressure**: the diastolic blood pressure (mm Hg) 30 | * **SkinThickness**: the triceps skin fold thickness (mm) 31 | * **Insulin**: the 2-hour serum insulin concentration (mu U/ml) 32 | * **BMI**: the body mass index (weight in kg/(height in m)^2) 33 | * **DiabetesPedigreeFunction**: the diabetes pedigree function 34 | * **Age**: the age (years) 35 | * **Outcome**: the label you need to predict - 1 if the woman has diabetes, 0 if she does not 36 | 37 | 38 | # Getting started 39 | Go to the console and set up a new console application: 40 | 41 | ```bash 42 | $ dotnet new console --language F# --output DiabetesDetection 43 | $ cd DiabetesDetection 44 | ``` 45 | 46 | Then install the ML.NET NuGet packages: 47 | 48 | ```bash 49 | $ dotnet add package Microsoft.ML 50 | $ dotnet add package Microsoft.ML.FastTree 51 | ``` 52 | 53 | And launch the Visual Studio Code editor: 54 | 55 | ```bash 56 | $ code . 57 | ``` 58 | 59 | The rest is up to you! 60 | 61 | # Your assignment 62 | I want you to build an app that reads the data file and splits it for training and testing. Reserve 80% of all records for training and 20% for testing. 63 | 64 | Process the data and train a binary classifier on the training partition. Then use the fully-trained model to generate predictions for the records in the testing partition. 65 | 66 | Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group.
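If you get stuck, here is a minimal, hedged sketch of the core train/evaluate flow. It assumes you've defined a `DiabetesData` record with **LoadColumn** attributes for the nine columns above and assembled a training `pipeline`, in the same style as the other assignments in this repository — all names here are suggestions, not prescribed:

```fsharp
// a minimal sketch of the train/evaluate flow (DiabetesData and pipeline are assumed)
let context = MLContext()
let data = context.Data.LoadFromTextFile<DiabetesData>("diabetes.csv", hasHeader = true, separatorChar = ',')
let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)   // 80% train, 20% test
let model = partitions.TrainSet |> pipeline.Fit
let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
printfn "AUC: %f" metrics.AreaUnderRocCurve
```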
67 | 68 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 69 | 70 | Good luck! -------------------------------------------------------------------------------- /BinaryClassification/DiabetesDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/DiabetesDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/FraudDetection/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | It is very important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 4 | 5 | Credit card fraud happens a lot. During two days in September 2013 in Europe, credit card networks recorded at least 492 fraud cases out of a total of 284,807 transactions. That's 246 fraud cases per day! 6 | 7 | In this case study, you're going to help credit card companies detect fraud in real time. You will build an app and train it on detected fraud cases, and then test your predictions on a new set of transactions. 8 | 9 | How accurate will your app be? Do you think you will be able to detect financial fraud in real time? 10 | 11 | That's for you to find out! 12 | 13 | # The dataset 14 | 15 | ![The dataset](./assets/data.png) 16 | 17 | In this case study you'll be working with a dataset containing transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. 18 | 19 | Note that the dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. 20 | 21 | The dataset contains 285k records, 30 feature columns, and a single label indicating if the transaction is fraudulent or not. You can use any combination of features you like to generate your fraud predictions. 22 | 23 | There is a single file in the dataset: 24 | * [creditcard.csv](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) which contains 285k records, 30 input features, and one output label. You will use this file to train and test your model. 25 | 26 | The file is about 150 MB in size. You'll need to [download it from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet.
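Once you've downloaded the file, you'll need a record type that tells ML.NET how to load each column (the full column description follows right below). Here's a hedged sketch — the type and field names are my own suggestions:

```fsharp
open Microsoft.ML.Data

/// One possible input record for creditcard.csv (a sketch, not prescribed)
[<CLIMutable>]
type Transaction = {
    [<LoadColumn(0)>] Time : float32
    [<LoadColumn(1, 28)>] [<VectorType(28)>] Features : float32[]   // V1-V28 loaded as one vector
    [<LoadColumn(29)>] Amount : float32
    [<LoadColumn(30)>] Class : float32   // 1 = fraud, 0 = legitimate
}
```

Loading V1-V28 as a single vector saves you 28 separate fields, and you can map the numeric **Class** column to a boolean label with a **CustomMapping**, just like the other assignments in this section do.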
27 | 28 | Here's a description of all 31 columns in the data file: 29 | * Time: Number of seconds elapsed between this transaction and the first transaction in the dataset 30 | * V1-V28: A feature of the transaction, processed to a number to protect user identities and sensitive information 31 | * Amount: Transaction amount 32 | * Class: 1 for fraudulent transactions, 0 otherwise 33 | 34 | # Getting started 35 | Go to the console and set up a new console application: 36 | 37 | ```bash 38 | $ dotnet new console --language F# --output FraudDetection 39 | $ cd FraudDetection 40 | ``` 41 | 42 | Then install the ML.NET NuGet packages: 43 | 44 | ```bash 45 | $ dotnet add package Microsoft.ML 46 | $ dotnet add package Microsoft.ML.FastTree 47 | ``` 48 | 49 | And launch the Visual Studio Code editor: 50 | 51 | ```bash 52 | $ code . 53 | ``` 54 | 55 | The rest is up to you! 56 | 57 | # Your assignment 58 | I want you to build an app that reads the data file into memory and splits it. Use 80% for training and 20% for testing. 59 | 60 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns. 61 | 62 | Process the selected input features, train a binary classifier on the data, and generate predictions for the transactions in the testing partition. 63 | 64 | Use the trained model to make fraud predictions on the test data. Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group. 65 | 66 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 67 | 68 | Good luck! -------------------------------------------------------------------------------- /BinaryClassification/FraudDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/FraudDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/Heart.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | 6 | /// The HeartData record holds one single heart data record. 7 | [<CLIMutable>] 8 | type HeartData = { 9 | [<LoadColumn(0)>] Age : float32 10 | [<LoadColumn(1)>] Sex : float32 11 | [<LoadColumn(2)>] Cp : float32 12 | [<LoadColumn(3)>] TrestBps : float32 13 | [<LoadColumn(4)>] Chol : float32 14 | [<LoadColumn(5)>] Fbs : float32 15 | [<LoadColumn(6)>] RestEcg : float32 16 | [<LoadColumn(7)>] Thalac : float32 17 | [<LoadColumn(8)>] Exang : float32 18 | [<LoadColumn(9)>] OldPeak : float32 19 | [<LoadColumn(10)>] Slope : float32 20 | [<LoadColumn(11)>] Ca : float32 21 | [<LoadColumn(12)>] Thal : float32 22 | [<LoadColumn(13)>] Diagnosis : float32 23 | } 24 | 25 | /// The HeartPrediction class contains a single heart data prediction. 26 | [<CLIMutable>] 27 | type HeartPrediction = { 28 | [<ColumnName("PredictedLabel")>] Prediction : bool 29 | Probability : float32 30 | Score : float32 31 | } 32 | 33 | /// The ToLabel class is a helper class for a column transformation.
34 | [<CLIMutable>] 35 | type ToLabel = { 36 | mutable Label : bool 37 | } 38 | 39 | /// file paths to data files (assumes os = windows!) 40 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory 41 | 42 | /// The main application entry point. 43 | [<EntryPoint>] 44 | let main argv = 45 | 46 | // set up a machine learning context 47 | let context = new MLContext() 48 | 49 | // load training and test data 50 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',') 51 | 52 | // split the data into a training and test partition 53 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 54 | 55 | // set up a training pipeline 56 | let pipeline = 57 | EstimatorChain() 58 | 59 | // step 1: convert the label value to a boolean 60 | .Append( 61 | context.Transforms.CustomMapping( 62 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f), 63 | "LabelMapping")) 64 | 65 | // step 2: concatenate all feature columns 66 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal")) 67 | 68 | // step 3: set up a fast tree learner 69 | .Append(context.BinaryClassification.Trainers.FastTree()) 70 | 71 | // train the model 72 | let model = partitions.TrainSet |> pipeline.Fit 73 | 74 | // make predictions and compare with the ground truth 75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 76 | 77 | // report the results 78 | printfn "Model metrics:" 79 | printfn "  Accuracy:          %f" metrics.Accuracy 80 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 81 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 82 | printfn "  F1Score:           %f" metrics.F1Score 83 | printfn "  LogLoss:           %f" metrics.LogLoss 84 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 85 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 86 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 87 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 88 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 89 | 90 | // set up a prediction engine 91 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model 92 | 93 | // create a sample patient 94 | let sample = { 95 | Age = 36.0f 96 | Sex = 1.0f 97 | Cp = 4.0f 98 | TrestBps = 145.0f 99 | Chol = 210.0f 100 | Fbs = 0.0f 101 | RestEcg = 2.0f 102 | Thalac = 148.0f 103 | Exang = 1.0f 104 | OldPeak = 1.9f 105 | Slope = 2.0f 106 | Ca = 1.0f 107 | Thal = 7.0f 108 | Diagnosis = 0.0f // unused 109 | } 110 | 111 | // make the prediction 112 | let prediction = sample |> predictionEngine.Predict 113 | 114 | // report the results 115 | printfn "\r" 116 | printfn "Single prediction:" 117 | printfn "  Prediction:  %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk") 118 | printfn "  Probability: %f" prediction.Probability 119 | 120 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict heart disease risk 2 | 3 | In this assignment you're going to build an app that can predict the heart disease risk in a group of patients. 4 | 5 | The first thing you will need for your app is a data file with patients, their medical info, and their heart disease risk assessment.
We're going to use the famous [UCI Heart Disease Dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) which has real-life data from 303 patients. 6 | 7 | Download the [Processed Cleveland Data](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data) file and save it as **processed.cleveland.data.csv**. 8 | 9 | The data file looks like this: 10 | 11 | ![Processed Cleveland Data](./assets/data.png) 12 | 13 | It’s a CSV file with 14 columns of information: 14 | 15 | * Age 16 | * Sex: 1 = male, 0 = female 17 | * Chest Pain Type: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic 18 | * Resting blood pressure in mm Hg on admission to the hospital 19 | * Serum cholesterol in mg/dl 20 | * Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false 21 | * Resting EKG results: 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria 22 | * Maximum heart rate achieved 23 | * Exercise induced angina: 1 = yes; 0 = no 24 | * ST depression induced by exercise relative to rest 25 | * Slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping 26 | * Number of major vessels (0–3) colored by fluoroscopy 27 | * Thallium heart scan results: 3 = normal, 6 = fixed defect, 7 = reversible defect 28 | * Diagnosis of heart disease: 0 = normal risk, 1-4 = elevated risk 29 | 30 | The first 13 columns are patient diagnostic information, and the last column is the diagnosis: 0 means a healthy patient, and values 1-4 mean an elevated risk of heart disease. 31 | 32 | You are going to build a binary classification machine learning model that reads in all 13 columns of patient information, and then makes a prediction for the heart disease risk. 33 | 34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 35 | 36 | ```bash 37 | $ dotnet new console --language F# --output Heart 38 | $ cd Heart 39 | ``` 40 | 41 | Now install the following ML.NET packages: 42 | 43 | ```bash 44 | $ dotnet add package Microsoft.ML 45 | $ dotnet add package Microsoft.ML.FastTree 46 | ``` 47 | 48 | Now you are ready to add some types. You’ll need one to hold patient info, and one to hold your model predictions. 49 | 50 | Replace the contents of the Program.fs file with this: 51 | 52 | ```fsharp 53 | open System 54 | open System.IO 55 | open Microsoft.ML 56 | open Microsoft.ML.Data 57 | 58 | /// The HeartData record holds one single heart data record. 59 | [<CLIMutable>] 60 | type HeartData = { 61 | [<LoadColumn(0)>] Age : float32 62 | [<LoadColumn(1)>] Sex : float32 63 | [<LoadColumn(2)>] Cp : float32 64 | [<LoadColumn(3)>] TrestBps : float32 65 | [<LoadColumn(4)>] Chol : float32 66 | [<LoadColumn(5)>] Fbs : float32 67 | [<LoadColumn(6)>] RestEcg : float32 68 | [<LoadColumn(7)>] Thalac : float32 69 | [<LoadColumn(8)>] Exang : float32 70 | [<LoadColumn(9)>] OldPeak : float32 71 | [<LoadColumn(10)>] Slope : float32 72 | [<LoadColumn(11)>] Ca : float32 73 | [<LoadColumn(12)>] Thal : float32 74 | [<LoadColumn(13)>] Diagnosis : float32 75 | } 76 | 77 | /// The HeartPrediction class contains a single heart data prediction. 78 | [<CLIMutable>] 79 | type HeartPrediction = { 80 | [<ColumnName("PredictedLabel")>] Prediction : bool 81 | Probability : float32 82 | Score : float32 83 | } 84 | 85 | // the rest of the code goes here.... 86 | ``` 87 | 88 | The **HeartData** class holds one single patient record. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 89 | 90 | There's also a **HeartPrediction** class which will hold a single heart disease prediction.
There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction. 91 | 92 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 93 | 94 | Now look at the final **Diagnosis** column in the data file. Our label is an integer value between 0-4, with 0 meaning 'no risk' and 1-4 meaning 'elevated risk'. 95 | 96 | But you're building a Binary Classifier, which means your model needs to be trained on boolean labels. 97 | 98 | So you'll have to somehow convert the 'raw' numeric label (stored in the **Diagnosis** field) to a boolean value. 99 | 100 | To set that up, you'll need a helper type: 101 | 102 | ```fsharp 103 | /// The ToLabel class is a helper class for a column transformation. 104 | [<CLIMutable>] 105 | type ToLabel = { 106 | mutable Label : bool 107 | } 108 | 109 | // the rest of the code goes here.... 110 | ``` 111 | 112 | The **ToLabel** type contains the label converted to a boolean value. We'll set up that conversion in a minute. 113 | 114 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 115 | 116 | Now you're going to load the training data into memory: 117 | 118 | ```fsharp 119 | /// file paths to data files (assumes os = windows!) 120 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory 121 | 122 | /// The main application entry point. 123 | [<EntryPoint>] 124 | let main argv = 125 | 126 | // set up a machine learning context 127 | let context = new MLContext() 128 | 129 | // load training and test data 130 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',') 131 | 132 | // split the data into a training and test partition 133 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 134 | 135 | // the rest of the code goes here.... 136 | 137 | 0 // return value 138 | ``` 139 | 140 | This code uses the method **LoadFromTextFile** to load the CSV data directly into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **HeartData** class. 141 | 142 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data.
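Before moving on, you can sanity-check the split if you like. This is an optional sketch of my own (it uses ML.NET's **CreateEnumerable** helper, which materializes an **IDataView** as a sequence):

```fsharp
// optional: count the rows in each partition to verify the 80/20 split
let countRows (view : IDataView) =
    context.Data.CreateEnumerable<HeartData>(view, reuseRowObject = false) |> Seq.length

printfn "Training rows: %i, test rows: %i" (countRows partitions.TrainSet) (countRows partitions.TestSet)
```

Note that **TrainTestSplit** samples randomly, so the counts will only be approximately 80/20.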
143 | 144 | Now you’re ready to start building the machine learning model: 145 | 146 | ```fsharp 147 | // set up a training pipeline 148 | let pipeline = 149 | EstimatorChain() 150 | 151 | // step 1: convert the label value to a boolean 152 | .Append( 153 | context.Transforms.CustomMapping( 154 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f), 155 | "LabelMapping")) 156 | 157 | // step 2: concatenate all feature columns 158 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal")) 159 | 160 | // step 3: set up a fast tree learner 161 | .Append(context.BinaryClassification.Trainers.FastTree()) 162 | 163 | // train the model 164 | let model = partitions.TrainSet |> pipeline.Fit 165 | 166 | // the rest of the code goes here.... 167 | ``` 168 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 169 | 170 | This pipeline has the following components: 171 | 172 | * A **CustomMapping** that transforms the numeric label to a boolean value. We define 0 values as healthy, and anything above 0 as an elevated risk. 173 | * **Concatenate** which combines all input data columns into a single column called 'Features'. This is a required step because ML.NET can only train on a single input column. 174 | * A **FastTree** classification learner which will train the model to make accurate predictions. 175 | 176 | The **FastTree** binary classification trainer is a very nice training algorithm that uses gradient boosting, a machine learning technique for classification problems. 177 | 178 | With the pipeline fully assembled, we can train the model by piping the **TrainSet** into the **Fit** function. 179 | 180 | You now have a fully-trained model. So now it's time to take the test partition, predict the diagnosis for each patient, and calculate the accuracy metrics of the model: 181 | 182 | ```fsharp 183 | // make predictions and compare with the ground truth 184 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 185 | 186 | // report the results 187 | printfn "Model metrics:" 188 | printfn "  Accuracy:          %f" metrics.Accuracy 189 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 190 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 191 | printfn "  F1Score:           %f" metrics.F1Score 192 | printfn "  LogLoss:           %f" metrics.LogLoss 193 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 194 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 195 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 196 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 197 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 198 | 199 | // the rest of the code goes here.... 200 | ``` 201 | 202 | This code pipes the **TestSet** into **model.Transform** to set up a prediction for every patient in the set, and then pipes the predictions into **Evaluate** to compare these predictions to the ground truth and automatically calculate all evaluation metrics: 203 | 204 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 205 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
206 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 207 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 208 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 209 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much better the model’s predictions are than random guessing. 210 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 211 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of all positive cases that the model correctly predicts. This is a good metric to use when the cost of a false negative is high. 212 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 213 | * **NegativeRecall**: this is the fraction of all negative cases that the model correctly predicts. 214 | 215 | When monitoring heart disease, you definitely want to avoid false negatives because you don’t want to be sending high-risk patients home and telling them everything is okay. 216 | 217 | You also want to avoid false positives, but they are a lot better than false negatives because later tests would probably discover that the patient is healthy after all. 218 | 219 | To wrap up, you’re going to create a new patient record and ask the model to make a prediction: 220 | 221 | ```fsharp 222 | // set up a prediction engine 223 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model 224 | 225 | // create a sample patient 226 | let sample = { 227 | Age = 36.0f 228 | Sex = 1.0f 229 | Cp = 4.0f 230 | TrestBps = 145.0f 231 | Chol = 210.0f 232 | Fbs = 0.0f 233 | RestEcg = 2.0f 234 | Thalac = 148.0f 235 | Exang = 1.0f 236 | OldPeak = 1.9f 237 | Slope = 2.0f 238 | Ca = 1.0f 239 | Thal = 7.0f 240 | Diagnosis = 0.0f // unused 241 | } 242 | 243 | // make the prediction 244 | let prediction = sample |> predictionEngine.Predict 245 | 246 | // report the results 247 | printfn "\r" 248 | printfn "Single prediction:" 249 | printfn "  Prediction:  %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk") 250 | printfn "  Probability: %f" prediction.Probability 251 | ``` 252 | 253 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then creates a new patient record for a 36-year-old male with asymptomatic chest pain and a bunch of other medical info. 254 | 255 | We then pipe the patient record into the **Predict** function and display the diagnosis. 256 | 257 | What’s the model going to predict? 258 | 259 | Time to find out. Go to your terminal and run your code: 260 | 261 | ```bash 262 | $ dotnet run 263 | ``` 264 | 265 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value? 266 | 267 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model? 268 | 269 | And what about our patient? What did your model predict? 270 | 271 | Think about the code in this assignment.
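One easy experiment is to configure the **FastTree** trainer explicitly instead of accepting its defaults. A hedged sketch — these particular values are just a starting guess, not a recommended answer:

```fsharp
// step 3 (variant): a fast tree learner with explicit hyperparameters
let trainer =
    context.BinaryClassification.Trainers.FastTree(
        numberOfLeaves = 30,             // more leaves = more complex trees
        numberOfTrees = 200,             // more trees = a stronger ensemble, slower training
        minimumExampleCountPerLeaf = 5,  // smaller = more sensitive to rare patterns
        learningRate = 0.1)              // smaller = slower but steadier learning
```

Swap this trainer into step 3 of the pipeline and compare the metrics you get.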
How could you improve the accuracy of the model? What are your best AUC and AUCPRC values? 272 | 273 | Share your results in our group! 274 | -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/HeartDiseasePrediction/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/processed.cleveland.data.csv: -------------------------------------------------------------------------------- 1 | 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0 2 | 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2 3 | 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1 4 | 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0 5 | 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0 6 | 56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0 7 | 62.0,0.0,4.0,140.0,268.0,0.0,2.0,160.0,0.0,3.6,3.0,2.0,3.0,3 8 | 57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0 9 | 63.0,1.0,4.0,130.0,254.0,0.0,2.0,147.0,0.0,1.4,2.0,1.0,7.0,2 10 | 53.0,1.0,4.0,140.0,203.0,1.0,2.0,155.0,1.0,3.1,3.0,0.0,7.0,1 11 | 57.0,1.0,4.0,140.0,192.0,0.0,0.0,148.0,0.0,0.4,2.0,0.0,6.0,0 12 | 56.0,0.0,2.0,140.0,294.0,0.0,2.0,153.0,0.0,1.3,2.0,0.0,3.0,0 13 | 56.0,1.0,3.0,130.0,256.0,1.0,2.0,142.0,1.0,0.6,2.0,1.0,6.0,2 14 | 44.0,1.0,2.0,120.0,263.0,0.0,0.0,173.0,0.0,0.0,1.0,0.0,7.0,0 15 | 52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,0.0,7.0,0 16 | 57.0,1.0,3.0,150.0,168.0,0.0,0.0,174.0,0.0,1.6,1.0,0.0,3.0,0 17 | 48.0,1.0,2.0,110.0,229.0,0.0,0.0,168.0,0.0,1.0,3.0,0.0,7.0,1 18 | 54.0,1.0,4.0,140.0,239.0,0.0,0.0,160.0,0.0,1.2,1.0,0.0,3.0,0 19 | 48.0,0.0,3.0,130.0,275.0,0.0,0.0,139.0,0.0,0.2,1.0,0.0,3.0,0 20 | 49.0,1.0,2.0,130.0,266.0,0.0,0.0,171.0,0.0,0.6,1.0,0.0,3.0,0 21 | 64.0,1.0,1.0,110.0,211.0,0.0,2.0,144.0,1.0,1.8,2.0,0.0,3.0,0 22 | 58.0,0.0,1.0,150.0,283.0,1.0,2.0,162.0,0.0,1.0,1.0,0.0,3.0,0 23 | 58.0,1.0,2.0,120.0,284.0,0.0,2.0,160.0,0.0,1.8,2.0,0.0,3.0,1 24 | 58.0,1.0,3.0,132.0,224.0,0.0,2.0,173.0,0.0,3.2,1.0,2.0,7.0,3 25 | 60.0,1.0,4.0,130.0,206.0,0.0,2.0,132.0,1.0,2.4,2.0,2.0,7.0,4 26 | 50.0,0.0,3.0,120.0,219.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0 27 | 58.0,0.0,3.0,120.0,340.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0 28 | 66.0,0.0,1.0,150.0,226.0,0.0,0.0,114.0,0.0,2.6,3.0,0.0,3.0,0 29 | 43.0,1.0,4.0,150.0,247.0,0.0,0.0,171.0,0.0,1.5,1.0,0.0,3.0,0 30 | 40.0,1.0,4.0,110.0,167.0,0.0,2.0,114.0,1.0,2.0,2.0,0.0,7.0,3 31 | 69.0,0.0,1.0,140.0,239.0,0.0,0.0,151.0,0.0,1.8,1.0,2.0,3.0,0 32 | 60.0,1.0,4.0,117.0,230.0,1.0,0.0,160.0,1.0,1.4,1.0,2.0,7.0,2 33 | 64.0,1.0,3.0,140.0,335.0,0.0,0.0,158.0,0.0,0.0,1.0,0.0,3.0,1 34 | 59.0,1.0,4.0,135.0,234.0,0.0,0.0,161.0,0.0,0.5,2.0,0.0,7.0,0 35 | 44.0,1.0,3.0,130.0,233.0,0.0,0.0,179.0,1.0,0.4,1.0,0.0,3.0,0 36 | 42.0,1.0,4.0,140.0,226.0,0.0,0.0,178.0,0.0,0.0,1.0,0.0,3.0,0 37 | 43.0,1.0,4.0,120.0,177.0,0.0,2.0,120.0,1.0,2.5,2.0,0.0,7.0,3 38 | 57.0,1.0,4.0,150.0,276.0,0.0,2.0,112.0,1.0,0.6,2.0,1.0,6.0,1 39 | 55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0,3 40 | 61.0,1.0,3.0,150.0,243.0,1.0,0.0,137.0,1.0,1.0,2.0,0.0,3.0,0 41 | 65.0,0.0,4.0,150.0,225.0,0.0,2.0,114.0,0.0,1.0,2.0,3.0,7.0,4 42 | 40.0,1.0,1.0,140.0,199.0,0.0,0.0,178.0,1.0,1.4,1.0,0.0,7.0,0 43 | 
71.0,0.0,2.0,160.0,302.0,0.0,0.0,162.0,0.0,0.4,1.0,2.0,3.0,0 44 | 59.0,1.0,3.0,150.0,212.0,1.0,0.0,157.0,0.0,1.6,1.0,0.0,3.0,0 45 | 61.0,0.0,4.0,130.0,330.0,0.0,2.0,169.0,0.0,0.0,1.0,0.0,3.0,1 46 | 58.0,1.0,3.0,112.0,230.0,0.0,2.0,165.0,0.0,2.5,2.0,1.0,7.0,4 47 | 51.0,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0,0.0,3.0,0 48 | 50.0,1.0,4.0,150.0,243.0,0.0,2.0,128.0,0.0,2.6,2.0,0.0,7.0,4 49 | 65.0,0.0,3.0,140.0,417.0,1.0,2.0,157.0,0.0,0.8,1.0,1.0,3.0,0 50 | 53.0,1.0,3.0,130.0,197.0,1.0,2.0,152.0,0.0,1.2,3.0,0.0,3.0,0 51 | 41.0,0.0,2.0,105.0,198.0,0.0,0.0,168.0,0.0,0.0,1.0,1.0,3.0,0 52 | 65.0,1.0,4.0,120.0,177.0,0.0,0.0,140.0,0.0,0.4,1.0,0.0,7.0,0 53 | 44.0,1.0,4.0,112.0,290.0,0.0,2.0,153.0,0.0,0.0,1.0,1.0,3.0,2 54 | 44.0,1.0,2.0,130.0,219.0,0.0,2.0,188.0,0.0,0.0,1.0,0.0,3.0,0 55 | 60.0,1.0,4.0,130.0,253.0,0.0,0.0,144.0,1.0,1.4,1.0,1.0,7.0,1 56 | 54.0,1.0,4.0,124.0,266.0,0.0,2.0,109.0,1.0,2.2,2.0,1.0,7.0,1 57 | 50.0,1.0,3.0,140.0,233.0,0.0,0.0,163.0,0.0,0.6,2.0,1.0,7.0,1 58 | 41.0,1.0,4.0,110.0,172.0,0.0,2.0,158.0,0.0,0.0,1.0,0.0,7.0,1 59 | 54.0,1.0,3.0,125.0,273.0,0.0,2.0,152.0,0.0,0.5,3.0,1.0,3.0,0 60 | 51.0,1.0,1.0,125.0,213.0,0.0,2.0,125.0,1.0,1.4,1.0,1.0,3.0,0 61 | 51.0,0.0,4.0,130.0,305.0,0.0,0.0,142.0,1.0,1.2,2.0,0.0,7.0,2 62 | 46.0,0.0,3.0,142.0,177.0,0.0,2.0,160.0,1.0,1.4,3.0,0.0,3.0,0 63 | 58.0,1.0,4.0,128.0,216.0,0.0,2.0,131.0,1.0,2.2,2.0,3.0,7.0,1 64 | 54.0,0.0,3.0,135.0,304.0,1.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 65 | 54.0,1.0,4.0,120.0,188.0,0.0,0.0,113.0,0.0,1.4,2.0,1.0,7.0,2 66 | 60.0,1.0,4.0,145.0,282.0,0.0,2.0,142.0,1.0,2.8,2.0,2.0,7.0,2 67 | 60.0,1.0,3.0,140.0,185.0,0.0,2.0,155.0,0.0,3.0,2.0,0.0,3.0,1 68 | 54.0,1.0,3.0,150.0,232.0,0.0,2.0,165.0,0.0,1.6,1.0,0.0,7.0,0 69 | 59.0,1.0,4.0,170.0,326.0,0.0,2.0,140.0,1.0,3.4,3.0,0.0,7.0,2 70 | 46.0,1.0,3.0,150.0,231.0,0.0,0.0,147.0,0.0,3.6,2.0,0.0,3.0,1 71 | 65.0,0.0,3.0,155.0,269.0,0.0,0.0,148.0,0.0,0.8,1.0,0.0,3.0,0 72 | 67.0,1.0,4.0,125.0,254.0,1.0,0.0,163.0,0.0,0.2,2.0,2.0,7.0,3 73 | 62.0,1.0,4.0,120.0,267.0,0.0,0.0,99.0,1.0,1.8,2.0,2.0,7.0,1 74 | 65.0,1.0,4.0,110.0,248.0,0.0,2.0,158.0,0.0,0.6,1.0,2.0,6.0,1 75 | 44.0,1.0,4.0,110.0,197.0,0.0,2.0,177.0,0.0,0.0,1.0,1.0,3.0,1 76 | 65.0,0.0,3.0,160.0,360.0,0.0,2.0,151.0,0.0,0.8,1.0,0.0,3.0,0 77 | 60.0,1.0,4.0,125.0,258.0,0.0,2.0,141.0,1.0,2.8,2.0,1.0,7.0,1 78 | 51.0,0.0,3.0,140.0,308.0,0.0,2.0,142.0,0.0,1.5,1.0,1.0,3.0,0 79 | 48.0,1.0,2.0,130.0,245.0,0.0,2.0,180.0,0.0,0.2,2.0,0.0,3.0,0 80 | 58.0,1.0,4.0,150.0,270.0,0.0,2.0,111.0,1.0,0.8,1.0,0.0,7.0,3 81 | 45.0,1.0,4.0,104.0,208.0,0.0,2.0,148.0,1.0,3.0,2.0,0.0,3.0,0 82 | 53.0,0.0,4.0,130.0,264.0,0.0,2.0,143.0,0.0,0.4,2.0,0.0,3.0,0 83 | 39.0,1.0,3.0,140.0,321.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0 84 | 68.0,1.0,3.0,180.0,274.0,1.0,2.0,150.0,1.0,1.6,2.0,0.0,7.0,3 85 | 52.0,1.0,2.0,120.0,325.0,0.0,0.0,172.0,0.0,0.2,1.0,0.0,3.0,0 86 | 44.0,1.0,3.0,140.0,235.0,0.0,2.0,180.0,0.0,0.0,1.0,0.0,3.0,0 87 | 47.0,1.0,3.0,138.0,257.0,0.0,2.0,156.0,0.0,0.0,1.0,0.0,3.0,0 88 | 53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0 89 | 53.0,0.0,4.0,138.0,234.0,0.0,2.0,160.0,0.0,0.0,1.0,0.0,3.0,0 90 | 51.0,0.0,3.0,130.0,256.0,0.0,2.0,149.0,0.0,0.5,1.0,0.0,3.0,0 91 | 66.0,1.0,4.0,120.0,302.0,0.0,2.0,151.0,0.0,0.4,2.0,0.0,3.0,0 92 | 62.0,0.0,4.0,160.0,164.0,0.0,2.0,145.0,0.0,6.2,3.0,3.0,7.0,3 93 | 62.0,1.0,3.0,130.0,231.0,0.0,0.0,146.0,0.0,1.8,2.0,3.0,7.0,0 94 | 44.0,0.0,3.0,108.0,141.0,0.0,0.0,175.0,0.0,0.6,2.0,0.0,3.0,0 95 | 63.0,0.0,3.0,135.0,252.0,0.0,2.0,172.0,0.0,0.0,1.0,0.0,3.0,0 96 | 
52.0,1.0,4.0,128.0,255.0,0.0,0.0,161.0,1.0,0.0,1.0,1.0,7.0,1 97 | 59.0,1.0,4.0,110.0,239.0,0.0,2.0,142.0,1.0,1.2,2.0,1.0,7.0,2 98 | 60.0,0.0,4.0,150.0,258.0,0.0,2.0,157.0,0.0,2.6,2.0,2.0,7.0,3 99 | 52.0,1.0,2.0,134.0,201.0,0.0,0.0,158.0,0.0,0.8,1.0,1.0,3.0,0 100 | 48.0,1.0,4.0,122.0,222.0,0.0,2.0,186.0,0.0,0.0,1.0,0.0,3.0,0 101 | 45.0,1.0,4.0,115.0,260.0,0.0,2.0,185.0,0.0,0.0,1.0,0.0,3.0,0 102 | 34.0,1.0,1.0,118.0,182.0,0.0,2.0,174.0,0.0,0.0,1.0,0.0,3.0,0 103 | 57.0,0.0,4.0,128.0,303.0,0.0,2.0,159.0,0.0,0.0,1.0,1.0,3.0,0 104 | 71.0,0.0,3.0,110.0,265.0,1.0,2.0,130.0,0.0,0.0,1.0,1.0,3.0,0 105 | 49.0,1.0,3.0,120.0,188.0,0.0,0.0,139.0,0.0,2.0,2.0,3.0,7.0,3 106 | 54.0,1.0,2.0,108.0,309.0,0.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0 107 | 59.0,1.0,4.0,140.0,177.0,0.0,0.0,162.0,1.0,0.0,1.0,1.0,7.0,2 108 | 57.0,1.0,3.0,128.0,229.0,0.0,2.0,150.0,0.0,0.4,2.0,1.0,7.0,1 109 | 61.0,1.0,4.0,120.0,260.0,0.0,0.0,140.0,1.0,3.6,2.0,1.0,7.0,2 110 | 39.0,1.0,4.0,118.0,219.0,0.0,0.0,140.0,0.0,1.2,2.0,0.0,7.0,3 111 | 61.0,0.0,4.0,145.0,307.0,0.0,2.0,146.0,1.0,1.0,2.0,0.0,7.0,1 112 | 56.0,1.0,4.0,125.0,249.0,1.0,2.0,144.0,1.0,1.2,2.0,1.0,3.0,1 113 | 52.0,1.0,1.0,118.0,186.0,0.0,2.0,190.0,0.0,0.0,2.0,0.0,6.0,0 114 | 43.0,0.0,4.0,132.0,341.0,1.0,2.0,136.0,1.0,3.0,2.0,0.0,7.0,2 115 | 62.0,0.0,3.0,130.0,263.0,0.0,0.0,97.0,0.0,1.2,2.0,1.0,7.0,2 116 | 41.0,1.0,2.0,135.0,203.0,0.0,0.0,132.0,0.0,0.0,2.0,0.0,6.0,0 117 | 58.0,1.0,3.0,140.0,211.0,1.0,2.0,165.0,0.0,0.0,1.0,0.0,3.0,0 118 | 35.0,0.0,4.0,138.0,183.0,0.0,0.0,182.0,0.0,1.4,1.0,0.0,3.0,0 119 | 63.0,1.0,4.0,130.0,330.0,1.0,2.0,132.0,1.0,1.8,1.0,3.0,7.0,3 120 | 65.0,1.0,4.0,135.0,254.0,0.0,2.0,127.0,0.0,2.8,2.0,1.0,7.0,2 121 | 48.0,1.0,4.0,130.0,256.0,1.0,2.0,150.0,1.0,0.0,1.0,2.0,7.0,3 122 | 63.0,0.0,4.0,150.0,407.0,0.0,2.0,154.0,0.0,4.0,2.0,3.0,7.0,4 123 | 51.0,1.0,3.0,100.0,222.0,0.0,0.0,143.0,1.0,1.2,2.0,0.0,3.0,0 124 | 55.0,1.0,4.0,140.0,217.0,0.0,0.0,111.0,1.0,5.6,3.0,0.0,7.0,3 125 | 65.0,1.0,1.0,138.0,282.0,1.0,2.0,174.0,0.0,1.4,2.0,1.0,3.0,1 126 | 45.0,0.0,2.0,130.0,234.0,0.0,2.0,175.0,0.0,0.6,2.0,0.0,3.0,0 127 | 56.0,0.0,4.0,200.0,288.0,1.0,2.0,133.0,1.0,4.0,3.0,2.0,7.0,3 128 | 54.0,1.0,4.0,110.0,239.0,0.0,0.0,126.0,1.0,2.8,2.0,1.0,7.0,3 129 | 44.0,1.0,2.0,120.0,220.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 130 | 62.0,0.0,4.0,124.0,209.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 131 | 54.0,1.0,3.0,120.0,258.0,0.0,2.0,147.0,0.0,0.4,2.0,0.0,7.0,0 132 | 51.0,1.0,3.0,94.0,227.0,0.0,0.0,154.0,1.0,0.0,1.0,1.0,7.0,0 133 | 29.0,1.0,2.0,130.0,204.0,0.0,2.0,202.0,0.0,0.0,1.0,0.0,3.0,0 134 | 51.0,1.0,4.0,140.0,261.0,0.0,2.0,186.0,1.0,0.0,1.0,0.0,3.0,0 135 | 43.0,0.0,3.0,122.0,213.0,0.0,0.0,165.0,0.0,0.2,2.0,0.0,3.0,0 136 | 55.0,0.0,2.0,135.0,250.0,0.0,2.0,161.0,0.0,1.4,2.0,0.0,3.0,0 137 | 70.0,1.0,4.0,145.0,174.0,0.0,0.0,125.0,1.0,2.6,3.0,0.0,7.0,4 138 | 62.0,1.0,2.0,120.0,281.0,0.0,2.0,103.0,0.0,1.4,2.0,1.0,7.0,3 139 | 35.0,1.0,4.0,120.0,198.0,0.0,0.0,130.0,1.0,1.6,2.0,0.0,7.0,1 140 | 51.0,1.0,3.0,125.0,245.0,1.0,2.0,166.0,0.0,2.4,2.0,0.0,3.0,0 141 | 59.0,1.0,2.0,140.0,221.0,0.0,0.0,164.0,1.0,0.0,1.0,0.0,3.0,0 142 | 59.0,1.0,1.0,170.0,288.0,0.0,2.0,159.0,0.0,0.2,2.0,0.0,7.0,1 143 | 52.0,1.0,2.0,128.0,205.0,1.0,0.0,184.0,0.0,0.0,1.0,0.0,3.0,0 144 | 64.0,1.0,3.0,125.0,309.0,0.0,0.0,131.0,1.0,1.8,2.0,0.0,7.0,1 145 | 58.0,1.0,3.0,105.0,240.0,0.0,2.0,154.0,1.0,0.6,2.0,0.0,7.0,0 146 | 47.0,1.0,3.0,108.0,243.0,0.0,0.0,152.0,0.0,0.0,1.0,0.0,3.0,1 147 | 57.0,1.0,4.0,165.0,289.0,1.0,2.0,124.0,0.0,1.0,2.0,3.0,7.0,4 148 | 41.0,1.0,3.0,112.0,250.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 149 | 
45.0,1.0,2.0,128.0,308.0,0.0,2.0,170.0,0.0,0.0,1.0,0.0,3.0,0 150 | 60.0,0.0,3.0,102.0,318.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,0 151 | 52.0,1.0,1.0,152.0,298.0,1.0,0.0,178.0,0.0,1.2,2.0,0.0,7.0,0 152 | 42.0,0.0,4.0,102.0,265.0,0.0,2.0,122.0,0.0,0.6,2.0,0.0,3.0,0 153 | 67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,0 154 | 55.0,1.0,4.0,160.0,289.0,0.0,2.0,145.0,1.0,0.8,2.0,1.0,7.0,4 155 | 64.0,1.0,4.0,120.0,246.0,0.0,2.0,96.0,1.0,2.2,3.0,1.0,3.0,3 156 | 70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,1 157 | 51.0,1.0,4.0,140.0,299.0,0.0,0.0,173.0,1.0,1.6,1.0,0.0,7.0,1 158 | 58.0,1.0,4.0,125.0,300.0,0.0,2.0,171.0,0.0,0.0,1.0,2.0,7.0,1 159 | 60.0,1.0,4.0,140.0,293.0,0.0,2.0,170.0,0.0,1.2,2.0,2.0,7.0,2 160 | 68.0,1.0,3.0,118.0,277.0,0.0,0.0,151.0,0.0,1.0,1.0,1.0,7.0,0 161 | 46.0,1.0,2.0,101.0,197.0,1.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0 162 | 77.0,1.0,4.0,125.0,304.0,0.0,2.0,162.0,1.0,0.0,1.0,3.0,3.0,4 163 | 54.0,0.0,3.0,110.0,214.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0 164 | 58.0,0.0,4.0,100.0,248.0,0.0,2.0,122.0,0.0,1.0,2.0,0.0,3.0,0 165 | 48.0,1.0,3.0,124.0,255.0,1.0,0.0,175.0,0.0,0.0,1.0,2.0,3.0,0 166 | 57.0,1.0,4.0,132.0,207.0,0.0,0.0,168.0,1.0,0.0,1.0,0.0,7.0,0 167 | 52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0 168 | 54.0,0.0,2.0,132.0,288.0,1.0,2.0,159.0,1.0,0.0,1.0,1.0,3.0,0 169 | 35.0,1.0,4.0,126.0,282.0,0.0,2.0,156.0,1.0,0.0,1.0,0.0,7.0,1 170 | 45.0,0.0,2.0,112.0,160.0,0.0,0.0,138.0,0.0,0.0,2.0,0.0,3.0,0 171 | 70.0,1.0,3.0,160.0,269.0,0.0,0.0,112.0,1.0,2.9,2.0,1.0,7.0,3 172 | 53.0,1.0,4.0,142.0,226.0,0.0,2.0,111.0,1.0,0.0,1.0,0.0,7.0,0 173 | 59.0,0.0,4.0,174.0,249.0,0.0,0.0,143.0,1.0,0.0,2.0,0.0,3.0,1 174 | 62.0,0.0,4.0,140.0,394.0,0.0,2.0,157.0,0.0,1.2,2.0,0.0,3.0,0 175 | 64.0,1.0,4.0,145.0,212.0,0.0,2.0,132.0,0.0,2.0,2.0,2.0,6.0,4 176 | 57.0,1.0,4.0,152.0,274.0,0.0,0.0,88.0,1.0,1.2,2.0,1.0,7.0,1 177 | 52.0,1.0,4.0,108.0,233.0,1.0,0.0,147.0,0.0,0.1,1.0,3.0,7.0,0 178 | 56.0,1.0,4.0,132.0,184.0,0.0,2.0,105.0,1.0,2.1,2.0,1.0,6.0,1 179 | 43.0,1.0,3.0,130.0,315.0,0.0,0.0,162.0,0.0,1.9,1.0,1.0,3.0,0 180 | 53.0,1.0,3.0,130.0,246.0,1.0,2.0,173.0,0.0,0.0,1.0,3.0,3.0,0 181 | 48.0,1.0,4.0,124.0,274.0,0.0,2.0,166.0,0.0,0.5,2.0,0.0,7.0,3 182 | 56.0,0.0,4.0,134.0,409.0,0.0,2.0,150.0,1.0,1.9,2.0,2.0,7.0,2 183 | 42.0,1.0,1.0,148.0,244.0,0.0,2.0,178.0,0.0,0.8,1.0,2.0,3.0,0 184 | 59.0,1.0,1.0,178.0,270.0,0.0,2.0,145.0,0.0,4.2,3.0,0.0,7.0,0 185 | 60.0,0.0,4.0,158.0,305.0,0.0,2.0,161.0,0.0,0.0,1.0,0.0,3.0,1 186 | 63.0,0.0,2.0,140.0,195.0,0.0,0.0,179.0,0.0,0.0,1.0,2.0,3.0,0 187 | 42.0,1.0,3.0,120.0,240.0,1.0,0.0,194.0,0.0,0.8,3.0,0.0,7.0,0 188 | 66.0,1.0,2.0,160.0,246.0,0.0,0.0,120.0,1.0,0.0,2.0,3.0,6.0,2 189 | 54.0,1.0,2.0,192.0,283.0,0.0,2.0,195.0,0.0,0.0,1.0,1.0,7.0,1 190 | 69.0,1.0,3.0,140.0,254.0,0.0,2.0,146.0,0.0,2.0,2.0,3.0,7.0,2 191 | 50.0,1.0,3.0,129.0,196.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 192 | 51.0,1.0,4.0,140.0,298.0,0.0,0.0,122.0,1.0,4.2,2.0,3.0,7.0,3 193 | 43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1 194 | 62.0,0.0,4.0,138.0,294.0,1.0,0.0,106.0,0.0,1.9,2.0,3.0,3.0,2 195 | 68.0,0.0,3.0,120.0,211.0,0.0,2.0,115.0,0.0,1.5,2.0,0.0,3.0,0 196 | 67.0,1.0,4.0,100.0,299.0,0.0,2.0,125.0,1.0,0.9,2.0,2.0,3.0,3 197 | 69.0,1.0,1.0,160.0,234.0,1.0,2.0,131.0,0.0,0.1,2.0,1.0,3.0,0 198 | 45.0,0.0,4.0,138.0,236.0,0.0,2.0,152.0,1.0,0.2,2.0,0.0,3.0,0 199 | 50.0,0.0,2.0,120.0,244.0,0.0,0.0,162.0,0.0,1.1,1.0,0.0,3.0,0 200 | 59.0,1.0,1.0,160.0,273.0,0.0,2.0,125.0,0.0,0.0,1.0,0.0,3.0,1 201 | 50.0,0.0,4.0,110.0,254.0,0.0,2.0,159.0,0.0,0.0,1.0,0.0,3.0,0 202 | 
64.0,0.0,4.0,180.0,325.0,0.0,0.0,154.0,1.0,0.0,1.0,0.0,3.0,0 203 | 57.0,1.0,3.0,150.0,126.0,1.0,0.0,173.0,0.0,0.2,1.0,1.0,7.0,0 204 | 64.0,0.0,3.0,140.0,313.0,0.0,0.0,133.0,0.0,0.2,1.0,0.0,7.0,0 205 | 43.0,1.0,4.0,110.0,211.0,0.0,0.0,161.0,0.0,0.0,1.0,0.0,7.0,0 206 | 45.0,1.0,4.0,142.0,309.0,0.0,2.0,147.0,1.0,0.0,2.0,3.0,7.0,3 207 | 58.0,1.0,4.0,128.0,259.0,0.0,2.0,130.0,1.0,3.0,2.0,2.0,7.0,3 208 | 50.0,1.0,4.0,144.0,200.0,0.0,2.0,126.0,1.0,0.9,2.0,0.0,7.0,3 209 | 55.0,1.0,2.0,130.0,262.0,0.0,0.0,155.0,0.0,0.0,1.0,0.0,3.0,0 210 | 62.0,0.0,4.0,150.0,244.0,0.0,0.0,154.0,1.0,1.4,2.0,0.0,3.0,1 211 | 37.0,0.0,3.0,120.0,215.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 212 | 38.0,1.0,1.0,120.0,231.0,0.0,0.0,182.0,1.0,3.8,2.0,0.0,7.0,4 213 | 41.0,1.0,3.0,130.0,214.0,0.0,2.0,168.0,0.0,2.0,2.0,0.0,3.0,0 214 | 66.0,0.0,4.0,178.0,228.0,1.0,0.0,165.0,1.0,1.0,2.0,2.0,7.0,3 215 | 52.0,1.0,4.0,112.0,230.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,1 216 | 56.0,1.0,1.0,120.0,193.0,0.0,2.0,162.0,0.0,1.9,2.0,0.0,7.0,0 217 | 46.0,0.0,2.0,105.0,204.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0 218 | 46.0,0.0,4.0,138.0,243.0,0.0,2.0,152.0,1.0,0.0,2.0,0.0,3.0,0 219 | 64.0,0.0,4.0,130.0,303.0,0.0,0.0,122.0,0.0,2.0,2.0,2.0,3.0,0 220 | 59.0,1.0,4.0,138.0,271.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0 221 | 41.0,0.0,3.0,112.0,268.0,0.0,2.0,172.0,1.0,0.0,1.0,0.0,3.0,0 222 | 54.0,0.0,3.0,108.0,267.0,0.0,2.0,167.0,0.0,0.0,1.0,0.0,3.0,0 223 | 39.0,0.0,3.0,94.0,199.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 224 | 53.0,1.0,4.0,123.0,282.0,0.0,0.0,95.0,1.0,2.0,2.0,2.0,7.0,3 225 | 63.0,0.0,4.0,108.0,269.0,0.0,0.0,169.0,1.0,1.8,2.0,2.0,3.0,1 226 | 34.0,0.0,2.0,118.0,210.0,0.0,0.0,192.0,0.0,0.7,1.0,0.0,3.0,0 227 | 47.0,1.0,4.0,112.0,204.0,0.0,0.0,143.0,0.0,0.1,1.0,0.0,3.0,0 228 | 67.0,0.0,3.0,152.0,277.0,0.0,0.0,172.0,0.0,0.0,1.0,1.0,3.0,0 229 | 54.0,1.0,4.0,110.0,206.0,0.0,2.0,108.0,1.0,0.0,2.0,1.0,3.0,3 230 | 66.0,1.0,4.0,112.0,212.0,0.0,2.0,132.0,1.0,0.1,1.0,1.0,3.0,2 231 | 52.0,0.0,3.0,136.0,196.0,0.0,2.0,169.0,0.0,0.1,2.0,0.0,3.0,0 232 | 55.0,0.0,4.0,180.0,327.0,0.0,1.0,117.0,1.0,3.4,2.0,0.0,3.0,2 233 | 49.0,1.0,3.0,118.0,149.0,0.0,2.0,126.0,0.0,0.8,1.0,3.0,3.0,1 234 | 74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,0 235 | 54.0,0.0,3.0,160.0,201.0,0.0,0.0,163.0,0.0,0.0,1.0,1.0,3.0,0 236 | 54.0,1.0,4.0,122.0,286.0,0.0,2.0,116.0,1.0,3.2,2.0,2.0,3.0,3 237 | 56.0,1.0,4.0,130.0,283.0,1.0,2.0,103.0,1.0,1.6,3.0,0.0,7.0,2 238 | 46.0,1.0,4.0,120.0,249.0,0.0,2.0,144.0,0.0,0.8,1.0,0.0,7.0,1 239 | 49.0,0.0,2.0,134.0,271.0,0.0,0.0,162.0,0.0,0.0,2.0,0.0,3.0,0 240 | 42.0,1.0,2.0,120.0,295.0,0.0,0.0,162.0,0.0,0.0,1.0,0.0,3.0,0 241 | 41.0,1.0,2.0,110.0,235.0,0.0,0.0,153.0,0.0,0.0,1.0,0.0,3.0,0 242 | 41.0,0.0,2.0,126.0,306.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 243 | 49.0,0.0,4.0,130.0,269.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 244 | 61.0,1.0,1.0,134.0,234.0,0.0,0.0,145.0,0.0,2.6,2.0,2.0,3.0,2 245 | 60.0,0.0,3.0,120.0,178.0,1.0,0.0,96.0,0.0,0.0,1.0,0.0,3.0,0 246 | 67.0,1.0,4.0,120.0,237.0,0.0,0.0,71.0,0.0,1.0,2.0,0.0,3.0,2 247 | 58.0,1.0,4.0,100.0,234.0,0.0,0.0,156.0,0.0,0.1,1.0,1.0,7.0,2 248 | 47.0,1.0,4.0,110.0,275.0,0.0,2.0,118.0,1.0,1.0,2.0,1.0,3.0,1 249 | 52.0,1.0,4.0,125.0,212.0,0.0,0.0,168.0,0.0,1.0,1.0,2.0,7.0,3 250 | 62.0,1.0,2.0,128.0,208.0,1.0,2.0,140.0,0.0,0.0,1.0,0.0,3.0,0 251 | 57.0,1.0,4.0,110.0,201.0,0.0,0.0,126.0,1.0,1.5,2.0,0.0,6.0,0 252 | 58.0,1.0,4.0,146.0,218.0,0.0,0.0,105.0,0.0,2.0,2.0,1.0,7.0,1 253 | 64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,0 254 | 51.0,0.0,3.0,120.0,295.0,0.0,2.0,157.0,0.0,0.6,1.0,0.0,3.0,0 255 | 
43.0,1.0,4.0,115.0,303.0,0.0,0.0,181.0,0.0,1.2,2.0,0.0,3.0,0 256 | 42.0,0.0,3.0,120.0,209.0,0.0,0.0,173.0,0.0,0.0,2.0,0.0,3.0,0 257 | 67.0,0.0,4.0,106.0,223.0,0.0,0.0,142.0,0.0,0.3,1.0,2.0,3.0,0 258 | 76.0,0.0,3.0,140.0,197.0,0.0,1.0,116.0,0.0,1.1,2.0,0.0,3.0,0 259 | 70.0,1.0,2.0,156.0,245.0,0.0,2.0,143.0,0.0,0.0,1.0,0.0,3.0,0 260 | 57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,1 261 | 44.0,0.0,3.0,118.0,242.0,0.0,0.0,149.0,0.0,0.3,2.0,1.0,3.0,0 262 | 58.0,0.0,2.0,136.0,319.0,1.0,2.0,152.0,0.0,0.0,1.0,2.0,3.0,3 263 | 60.0,0.0,1.0,150.0,240.0,0.0,0.0,171.0,0.0,0.9,1.0,0.0,3.0,0 264 | 44.0,1.0,3.0,120.0,226.0,0.0,0.0,169.0,0.0,0.0,1.0,0.0,3.0,0 265 | 61.0,1.0,4.0,138.0,166.0,0.0,2.0,125.0,1.0,3.6,2.0,1.0,3.0,4 266 | 42.0,1.0,4.0,136.0,315.0,0.0,0.0,125.0,1.0,1.8,2.0,0.0,6.0,2 267 | 52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,2 268 | 59.0,1.0,3.0,126.0,218.0,1.0,0.0,134.0,0.0,2.2,2.0,1.0,6.0,2 269 | 40.0,1.0,4.0,152.0,223.0,0.0,0.0,181.0,0.0,0.0,1.0,0.0,7.0,1 270 | 42.0,1.0,3.0,130.0,180.0,0.0,0.0,150.0,0.0,0.0,1.0,0.0,3.0,0 271 | 61.0,1.0,4.0,140.0,207.0,0.0,2.0,138.0,1.0,1.9,1.0,1.0,7.0,1 272 | 66.0,1.0,4.0,160.0,228.0,0.0,2.0,138.0,0.0,2.3,1.0,0.0,6.0,0 273 | 46.0,1.0,4.0,140.0,311.0,0.0,0.0,120.0,1.0,1.8,2.0,2.0,7.0,2 274 | 71.0,0.0,4.0,112.0,149.0,0.0,0.0,125.0,0.0,1.6,2.0,0.0,3.0,0 275 | 59.0,1.0,1.0,134.0,204.0,0.0,0.0,162.0,0.0,0.8,1.0,2.0,3.0,1 276 | 64.0,1.0,1.0,170.0,227.0,0.0,2.0,155.0,0.0,0.6,2.0,0.0,7.0,0 277 | 66.0,0.0,3.0,146.0,278.0,0.0,2.0,152.0,0.0,0.0,2.0,1.0,3.0,0 278 | 39.0,0.0,3.0,138.0,220.0,0.0,0.0,152.0,0.0,0.0,2.0,0.0,3.0,0 279 | 57.0,1.0,2.0,154.0,232.0,0.0,2.0,164.0,0.0,0.0,1.0,1.0,3.0,1 280 | 58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0 281 | 57.0,1.0,4.0,110.0,335.0,0.0,0.0,143.0,1.0,3.0,2.0,1.0,7.0,2 282 | 47.0,1.0,3.0,130.0,253.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 283 | 55.0,0.0,4.0,128.0,205.0,0.0,1.0,130.0,1.0,2.0,2.0,1.0,7.0,3 284 | 35.0,1.0,2.0,122.0,192.0,0.0,0.0,174.0,0.0,0.0,1.0,0.0,3.0,0 285 | 61.0,1.0,4.0,148.0,203.0,0.0,0.0,161.0,0.0,0.0,1.0,1.0,7.0,2 286 | 58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,4 287 | 58.0,0.0,4.0,170.0,225.0,1.0,2.0,146.0,1.0,2.8,2.0,2.0,6.0,2 288 | 58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0 289 | 56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0 290 | 56.0,1.0,2.0,120.0,240.0,0.0,0.0,169.0,0.0,0.0,3.0,0.0,3.0,0 291 | 67.0,1.0,3.0,152.0,212.0,0.0,2.0,150.0,0.0,0.8,2.0,0.0,7.0,1 292 | 55.0,0.0,2.0,132.0,342.0,0.0,0.0,166.0,0.0,1.2,1.0,0.0,3.0,0 293 | 44.0,1.0,4.0,120.0,169.0,0.0,0.0,144.0,1.0,2.8,3.0,0.0,6.0,2 294 | 63.0,1.0,4.0,140.0,187.0,0.0,2.0,144.0,1.0,4.0,1.0,2.0,7.0,2 295 | 63.0,0.0,4.0,124.0,197.0,0.0,0.0,136.0,1.0,0.0,2.0,0.0,3.0,1 296 | 41.0,1.0,2.0,120.0,157.0,0.0,0.0,182.0,0.0,0.0,1.0,0.0,3.0,0 297 | 59.0,1.0,4.0,164.0,176.0,1.0,2.0,90.0,0.0,1.0,2.0,2.0,6.0,3 298 | 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1 299 | 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1 300 | 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2 301 | 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3 302 | 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1 303 | 38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0 304 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open 
Microsoft.ML.Data 5 | 6 | /// The SpamInput class contains one single message which may be spam or ham. 7 | [<CLIMutable>] 8 | type SpamInput = { 9 | [<LoadColumn(0)>] Verdict : string 10 | [<LoadColumn(1)>] Message : string 11 | } 12 | 13 | /// The SpamPrediction class contains one single spam prediction. 14 | [<CLIMutable>] 15 | type SpamPrediction = { 16 | [<ColumnName("PredictedLabel")>] IsSpam : bool 17 | Score : float32 18 | Probability : float32 19 | } 20 | 21 | /// This class describes what output columns we want to produce. 22 | [<CLIMutable>] 23 | type ToLabel = { 24 | mutable Label : bool 25 | } 26 | 27 | /// Helper function to cast the ML pipeline to an estimator 28 | let castToEstimator (x : IEstimator<_>) = 29 | match x with 30 | | :? IEstimator<ITransformer> as y -> y 31 | | _ -> failwith "Cannot cast pipeline to IEstimator" 32 | 33 | /// file paths to data files (assumes os = windows!) 34 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory 35 | 36 | [<EntryPoint>] 37 | let main argv = 38 | 39 | // set up a machine learning context 40 | let context = new MLContext() 41 | 42 | // load the spam dataset in memory 43 | let data = context.Data.LoadFromTextFile<SpamInput>(dataPath, hasHeader = true, separatorChar = '\t') 44 | 45 | // use 80% for training and 20% for testing 46 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 47 | 48 | // set up a training pipeline 49 | let pipeline = 50 | EstimatorChain() 51 | 52 | // step 1: transform the 'spam' and 'ham' values to true and false 53 | .Append( 54 | context.Transforms.CustomMapping( 55 | Action<SpamInput, ToLabel>(fun input output -> output.Label <- input.Verdict = "spam"), 56 | "MyLambda")) 57 | 58 | // step 2: featurize the input text 59 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message")) 60 | 61 | // step 3: use a stochastic dual coordinate ascent learner 62 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression()) 63 | 64 | // test the full data set by performing k-fold cross validation 65 | printfn "Performing cross validation:" 66 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5) 67 | 68 | // report the results 69 | cvResults |> Seq.iter(fun f -> printfn "  Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve) 70 | 71 | // train the model on the training set 72 | let model = partitions.TrainSet |> pipeline.Fit 73 | 74 | // evaluate the model on the test set 75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 76 | 77 | // report the results 78 | printfn "Model metrics:" 79 | printfn "  Accuracy:          %f" metrics.Accuracy 80 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 81 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 82 | printfn "  F1Score:           %f" metrics.F1Score 83 | printfn "  LogLoss:           %f" metrics.LogLoss 84 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 85 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 86 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 87 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 88 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 89 | 90 | // set up a prediction engine 91 | let engine = context.Model.CreatePredictionEngine<SpamInput, SpamPrediction> model 92 | 93 | // create sample messages 94 | let messages = [ 95 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" } 96 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" } 97 | { Message = "Home in 30 mins.
Need anything from store?"; Verdict = "" } 98 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" } 99 | ] 100 | 101 | // make the predictions 102 | printfn "Model predictions:" 103 | messages |> List.iter(fun m -> 104 | let p = engine.Predict m 105 | printfn "  %f %s" p.Probability m.Message) 106 | 107 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Detect spam SMS messages 2 | 3 | In this assignment you're going to build an app that can automatically detect spam SMS messages. 4 | 5 | The first thing you'll need is a file with lots of SMS messages, correctly labelled as being spam or not spam. You will use a dataset compiled by Caroline Tagg in her [2009 PhD thesis](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf). This dataset has 5574 messages. 6 | 7 | Download the [list of messages](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/SpamDetection/spam.tsv) and save it as **spam.tsv**. 8 | 9 | The data file looks like this: 10 | 11 | ![Spam message list](./assets/data.png) 12 | 13 | It’s a TSV file with only 2 columns of information: 14 | 15 | * Label: ‘spam’ for a spam message and ‘ham’ for a normal message. 16 | * Message: the full text of the SMS message. 17 | 18 | You will build a binary classification model that reads in all messages and then makes a prediction for each message if it is spam or ham. 19 | 20 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 21 | 22 | ```bash 23 | $ dotnet new console --language F# --output SpamDetection 24 | $ cd SpamDetection 25 | ``` 26 | 27 | Now install the following ML.NET package: 28 | 29 | ```bash 30 | $ dotnet add package Microsoft.ML 31 | ``` 32 | 33 | Now you are ready to add some classes. You’ll need one to hold a labelled message, and one to hold the model predictions. 34 | 35 | Replace the contents of the Program.fs file with this: 36 | 37 | ```fsharp 38 | open System 39 | open System.IO 40 | open Microsoft.ML 41 | open Microsoft.ML.Data 42 | 43 | /// The SpamInput class contains one single message which may be spam or ham. 44 | [<CLIMutable>] 45 | type SpamInput = { 46 | [<LoadColumn(0)>] Verdict : string 47 | [<LoadColumn(1)>] Message : string 48 | } 49 | 50 | /// The SpamPrediction class contains one single spam prediction. 51 | [<CLIMutable>] 52 | type SpamPrediction = { 53 | [<ColumnName("PredictedLabel")>] IsSpam : bool 54 | Score : float32 55 | Probability : float32 56 | } 57 | 58 | // the rest of the code goes here.... 59 | ``` 60 | 61 | The **SpamInput** class holds one single message. Note how each field is tagged with a **LoadColumn** attribute that tells the data loading code which column to import data from. 62 | 63 | There's also a **SpamPrediction** class which will hold a single spam prediction. There's a boolean **IsSpam**, a **Probability** value, and the **Score** the model will assign to the prediction. 64 | 65 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 66 | 67 | Now look at the first column in the data file.
Our label is a string with the value 'spam' meaning it's a spam message, and 'ham' meaning it's a normal message. 68 | 69 | But you're building a Binary Classifier which needs to be trained on boolean labels. 70 | 71 | So you'll have to somehow convert the 'raw' text labels (stored in the **Verdict** field) to a boolean value. 72 | 73 | To set that up, you'll need a helper type: 74 | 75 | ```fsharp 76 | /// This class describes what output columns we want to produce. 77 | [] 78 | type ToLabel ={ 79 | mutable Label : bool 80 | } 81 | 82 | // the rest of the code goes here.... 83 | ``` 84 | 85 | Note how the **ToLabel** type contains a **Label** field with the converted boolean label value. We will set up this conversion in a minute. 86 | 87 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 88 | 89 | We need one more helper function before we can load the dataset. Add the following code: 90 | 91 | ```fsharp 92 | /// Helper function to cast the ML pipeline to an estimator 93 | let castToEstimator (x : IEstimator<_>) = 94 | match x with 95 | | :? IEstimator as y -> y 96 | | _ -> failwith "Cannot cast pipeline to IEstimator" 97 | 98 | // the rest of the code goes here 99 | ``` 100 | 101 | The **castToEstimator** function takes an **IEstimator<>** argument and uses pattern matching to cast the value to an **IEstimator\** type. You'll see in a minute why we need this helper function. 102 | 103 | Now you're ready to load the training data in memory: 104 | 105 | ```fsharp 106 | /// file paths to data files (assumes os = windows!) 107 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory 108 | 109 | [] 110 | let main arv = 111 | 112 | // set up a machine learning context 113 | let context = new MLContext() 114 | 115 | // load the spam dataset in memory 116 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = '\t') 117 | 118 | // use 80% for training and 20% for testing 119 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 120 | 121 | 122 | // the rest of the code goes here.... 123 | ``` 124 | 125 | This code uses the **LoadFromTextFile** function to load the TSV data directly into memory. The field annotations in the **SpamInput** type tell the function how to store the loaded data. 126 | 127 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data. 128 | 129 | Now you’re ready to start building the machine learning model: 130 | 131 | ```fsharp 132 | // set up a training pipeline 133 | let pipeline = 134 | EstimatorChain() 135 | 136 | // step 1: transform the 'spam' and 'ham' values to true and false 137 | .Append( 138 | context.Transforms.CustomMapping( 139 | Action(fun input output -> output.Label <- input.Verdict = "spam"), 140 | "MyLambda")) 141 | 142 | // step 2: featureize the input text 143 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message")) 144 | 145 | // step 3: use a stochastic dual coordinate ascent learner 146 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression()) 147 | 148 | // the rest of the code goes here.... 
149 | ``` 150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 151 | 152 | This pipeline has the following components: 153 | 154 | * A **CustomMapping** that transforms the text label to a boolean value. We define 'spam' values as spam and anything else as normal messages. 155 | * **FeaturizeText** which calculates a numerical value for each message. This is a required step because machine learning models cannot handle text data directly. 156 | * A **SdcaLogisticRegression** classification learner which will train the model to make accurate predictions. 157 | 158 | The FeaturizeText component is a very nice solution for handling text input data. The component performs a number of transformations on the text to prepare it for model training: 159 | 160 | * Normalize the text (=remove punctuation, diacritics, switching to lowercase etc.) 161 | * Tokenize each word. 162 | * Remove all stopwords 163 | * Extract Ngrams and skip-grams 164 | * TF-IDF rescaling 165 | * Bag of words conversion 166 | 167 | The result is that each message is converted to a vector of numeric values that can easily be processed by the model. 168 | 169 | Before you start training, you're going to perform a quick check to see if the dataset has enough data to reliably train a binary classification model. 170 | 171 | We have 5574 messages which makes this a very small dataset. We'd prefer to have between 10k-100k records for reliable training. For small datasets like this one, we'll have to perform **K-Fold Cross Validation** to make sure we have enough data to work with. 172 | 173 | Let's set that up right now: 174 | 175 | ```fsharp 176 | // test the full data set by performing k-fold cross validation 177 | printfn "Performing cross validation:" 178 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5) 179 | 180 | // report the results 181 | cvResults |> Seq.iter(fun f -> printfn " Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve) 182 | 183 | // the rest of the code goes here.... 184 | ``` 185 | 186 | This code calls the **CrossValidate** method to perform K-Fold Cross Validation on the training partition using 5 folds. Note how we call **castToEstimator** to cast the pipeline to an **IEstimator\** type. 187 | 188 | We need to do this because the **EstimatorChain** function we use every time to build the machine learning pipeline produces a type that cannot be read directly by **CrossValidate**. And the F# compiler is unable to perform the type cast for us automatically, so we need the helper function to perform the cast explicitly. 189 | 190 | Next, the code reports the individual AUC for each fold. For a well-balanced dataset we expect to see roughly identical AUC values for each fold. Any outliers are hints that the dataset may be unbalanced and too small to train on. 
191 | 192 | Now let's train the model and get some validation metrics: 193 | 194 | ```fsharp 195 | // train the model on the training set 196 | let model = partitions.TrainSet |> pipeline.Fit 197 | 198 | // evaluate the model on the test set 199 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 200 | 201 | // report the results 202 | printfn "Model metrics:" 203 | printfn " Accuracy: %f" metrics.Accuracy 204 | printfn " Auc: %f" metrics.AreaUnderRocCurve 205 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 206 | printfn " F1Score: %f" metrics.F1Score 207 | printfn " LogLoss: %f" metrics.LogLoss 208 | printfn " LogLossReduction: %f" metrics.LogLossReduction 209 | printfn " PositivePrecision: %f" metrics.PositivePrecision 210 | printfn " PositiveRecall: %f" metrics.PositiveRecall 211 | printfn " NegativePrecision: %f" metrics.NegativePrecision 212 | printfn " NegativeRecall: %f" metrics.NegativeRecall 213 | 214 | // the rest of the code goes here 215 | ``` 216 | 217 | This code trains the model by piping the training data into the **Fit** function. Then it pipes the test data into the **Transform** function to make a prediction for every message in the validation partition. 218 | 219 | The code pipes these predictions into the **Evaluate** function to compare these predictions to the ground truth and calculate the following metrics: 220 | 221 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 222 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good. 223 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 224 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 225 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 226 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 227 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 228 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high. 229 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 230 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases. 231 | 232 | When filtering spam, you definitely want to avoid false positives because you don’t want to be sending important emails into the junk folder. 233 | 234 | You also want to avoid false negatives but they are not as bad as a false positive. Having some spam slipping through the filter is not the end of the world. 
235 | 236 | To wrap up, You’re going to create a couple of messages and ask the model to make a prediction: 237 | 238 | ```fsharp 239 | // set up a prediction engine 240 | let engine = context.Model.CreatePredictionEngine model 241 | 242 | // create sample messages 243 | let messages = [ 244 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" } 245 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" } 246 | { Message = "Home in 30 mins. Need anything from store?"; Verdict = "" } 247 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" } 248 | ] 249 | 250 | // make the predictions 251 | printfn "Model predictions:" 252 | let predictions = messages |> List.iter(fun m -> 253 | let p = engine.Predict m 254 | printfn " %f %s" p.Probability m.Message) 255 | ``` 256 | 257 | This code calls the **CreatePredictionEngine** function to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction. 258 | 259 | The code creates four new test messages and calls **List.iter** to make spam predictions for each message. What’s the result going to be? 260 | 261 | Time to find out. Go to your terminal and run your code: 262 | 263 | ```bash 264 | $ dotnet run 265 | ``` 266 | 267 | What results do you get? What are your five AUC values from K-Fold Cross Validation and the average AUC over all folds? Are there any outliers? Are the five values grouped close together? 268 | 269 | What can you conclude from your cross-validation results? Do we have enough data to make reliable spam predictions? 270 | 271 | Based on the results of cross-validation, would you say this dataset is well-balanced? And what does this say about the metrics you should use to evaluate your model? 272 | 273 | Which metrics did you pick to evaluate the model? And what do the values say about the accuracy of your model? 274 | 275 | And what about the four test messages? Dit the model accurately predict which ones are spam? 276 | 277 | Think about the code in this assignment. How could you improve the accuracy of the model even more? What are your best AUC values after optimization? 278 | 279 | Share your results in our group! 280 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/SpamDetection.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/SpamDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | open Microsoft.ML.Transforms 6 | 7 | /// The Passenger class represents one passenger on the Titanic. 8 | [] 9 | type Passenger = { 10 | [] Label : bool 11 | [] Pclass : float32 12 | [] Sex : string 13 | [] RawAge : string // not a float! 
14 | [] SibSp : float32 15 | [] Parch : float32 16 | [] Ticket : string 17 | [] Fare : float32 18 | [] Cabin : string 19 | [] Embarked : string 20 | } 21 | 22 | /// The PassengerPrediction class represents one model prediction. 23 | [] 24 | type PassengerPrediction = { 25 | [] Prediction : bool 26 | Probability : float32 27 | Score : float32 28 | } 29 | 30 | /// The ToAge class is a helper class for a column transformation. 31 | [] 32 | type ToAge = { 33 | mutable Age : string 34 | } 35 | 36 | /// file path to the train data file (assumes os = windows!) 37 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory 38 | 39 | /// file path to the test data file (assumes os = windows!) 40 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory 41 | 42 | [] 43 | let main argv = 44 | 45 | // set up a machine learning context 46 | let context = new MLContext() 47 | 48 | // load the training and testing data in memory 49 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 50 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 51 | 52 | // set up a training pipeline 53 | let pipeline = 54 | EstimatorChain() 55 | 56 | // step 1: replace missing ages with '?' 57 | .Append( 58 | context.Transforms.CustomMapping( 59 | Action(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge), 60 | "AgeMapping")) 61 | 62 | // step 2: convert string ages to floats 63 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single)) 64 | 65 | // step 3: replace missing age values with the mean age 66 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean)) 67 | 68 | // step 4: replace string columns with one-hot encoded vectors 69 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex")) 70 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket")) 71 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin")) 72 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked")) 73 | 74 | // step 5: concatenate everything into a single feature column 75 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked")) 76 | 77 | // step 6: use a fasttree trainer 78 | .Append(context.BinaryClassification.Trainers.FastTree()) 79 | 80 | // train the model 81 | let model = trainData |> pipeline.Fit 82 | 83 | // make predictions and compare with ground truth 84 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate 85 | 86 | // report the results 87 | printfn "Model metrics:" 88 | printfn " Accuracy: %f" metrics.Accuracy 89 | printfn " Auc: %f" metrics.AreaUnderRocCurve 90 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 91 | printfn " F1Score: %f" metrics.F1Score 92 | printfn " LogLoss: %f" metrics.LogLoss 93 | printfn " LogLossReduction: %f" metrics.LogLossReduction 94 | printfn " PositivePrecision: %f" metrics.PositivePrecision 95 | printfn " PositiveRecall: %f" metrics.PositiveRecall 96 | printfn " NegativePrecision: %f" metrics.NegativePrecision 97 | printfn " NegativeRecall: %f" metrics.NegativeRecall 98 | 99 | // set up a prediction engine 100 | let engine = context.Model.CreatePredictionEngine model 101 | 102 | // create a sample record 103 | let passenger = { 104 | Pclass = 1.0f 105 | Sex = 
"male" 106 | RawAge = "48" 107 | SibSp = 0.0f 108 | Parch = 0.0f 109 | Ticket = "B" 110 | Fare = 70.0f 111 | Cabin = "123" 112 | Embarked = "S" 113 | Label = false // unused! 114 | } 115 | 116 | // make the prediction 117 | let prediction = engine.Predict passenger 118 | 119 | // report the results 120 | printfn "Model prediction:" 121 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished") 122 | printfn " Probability: %f" prediction.Probability 123 | 124 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict who survived the Titanic disaster 2 | 3 | The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. 4 | 5 | ![Sinking Titanic](./assets/titanic.jpeg) 6 | 7 | In this assignment you're going to build an app that can predict which Titanic passengers survived the disaster. You will use a decision tree classifier to make your predictions. 8 | 9 | The first thing you will need for your app is the passenger manifest of the Titanic's last voyage. You will use the famous [Kaggle Titanic Dataset](https://github.com/sbaidachni/MLNETTitanic/tree/master/MLNetTitanic) which has data for a subset of 891 passengers. 10 | 11 | Download the [test_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/test_data.csv) and [train_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/train_data.csv) files and save them to your project folder. 12 | 13 | The training data file looks like this: 14 | 15 | ![Training data](./assets/data.jpg) 16 | 17 | It’s a CSV file with 12 columns of information: 18 | 19 | * The passenger identifier 20 | * The label column containing ‘1’ if the passenger survived and ‘0’ if the passenger perished 21 | * The class of travel (1–3) 22 | * The name of the passenger 23 | * The gender of the passenger (‘male’ or ‘female’) 24 | * The age of the passenger, or ‘0’ if the age is unknown 25 | * The number of siblings and/or spouses aboard 26 | * The number of parents and/or children aboard 27 | * The ticket number 28 | * The fare paid 29 | * The cabin number 30 | * The port in which the passenger embarked 31 | 32 | The second column is the label: 0 means the passenger perished, and 1 means the passenger survived. All other columns are input features from the passenger manifest. 33 | 34 | You're gooing to build a binary classification model that reads in all columns and then predicts for each passenger if he or she survived. 35 | 36 | Let’s get started. Here’s how to set up a new console project in NET Core: 37 | 38 | ```bash 39 | $ dotnet new console --language F# --output TitanicPrediction 40 | $ cd TitanicPrediction 41 | ``` 42 | 43 | Next, you need to install the correct NuGet packages: 44 | 45 | ``` 46 | $ dotnet add package Microsoft.ML 47 | $ dotnet add package Microsoft.ML.FastTree 48 | ``` 49 | 50 | Now you are ready to add some classes. You’ll need one to hold passenger data, and one to hold your model predictions. 
51 | 52 | Replace the contents of the Program.fs file with this: 53 | 54 | ```fsharp 55 | open System 56 | open System.IO 57 | open Microsoft.ML 58 | open Microsoft.ML.Data 59 | open Microsoft.ML.Transforms 60 | 61 | /// The Passenger class represents one passenger on the Titanic. 62 | [] 63 | type Passenger = { 64 | [] Label : bool 65 | [] Pclass : float32 66 | [] Sex : string 67 | [] RawAge : string // not a float! 68 | [] SibSp : float32 69 | [] Parch : float32 70 | [] Ticket : string 71 | [] Fare : float32 72 | [] Cabin : string 73 | [] Embarked : string 74 | } 75 | 76 | /// The PassengerPrediction class represents one model prediction. 77 | [] 78 | type PassengerPrediction = { 79 | [] Prediction : bool 80 | Probability : float32 81 | Score : float32 82 | } 83 | 84 | // the rest of the code goes here... 85 | ``` 86 | 87 | The **Passenger** type holds one single passenger record. There's also a **PassengerPrediction** type which will hold a single passenger prediction. There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction. 88 | 89 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 90 | 91 | Now look at the age column in the data file. It's a number, but for some passengers in the manifest the age is not known and the column is empty. 92 | 93 | ML.NET can automatically load and process missing numeric values, but only if they are present in the CSV file as a '?'. 94 | 95 | The Titanic datafile uses an empty string to denote missing values, so we'll have to perform a feature conversion 96 | 97 | Notice how the age is loaded as s string into a Passenger class field called **RawAge**. 98 | 99 | We will process the missing values later in our app. To prepare for this, we'll need an additional helper type: 100 | 101 | ```fsharp 102 | /// The ToAge class is a helper class for a column transformation. 103 | [] 104 | type ToAge = { 105 | mutable Age : string 106 | } 107 | 108 | // the rest of the code goes here... 109 | ``` 110 | 111 | The **ToAge** type will contain the converted age values. We will set up this conversion in a minute. 112 | 113 | Note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 114 | 115 | Now you're going to load the training data in memory: 116 | 117 | ```fsharp 118 | /// file path to the train data file (assumes os = windows!) 119 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory 120 | 121 | /// file path to the test data file (assumes os = windows!) 
122 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory 123 | 124 | [] 125 | let main argv = 126 | 127 | // set up a machine learning context 128 | let context = new MLContext() 129 | 130 | // load the training and testing data in memory 131 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 132 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 133 | 134 | // the rest of the code goes here... 135 | 136 | 0 // return value 137 | ``` 138 | 139 | This code calls the **LoadFromTextFile** function twice to load the training and testing datasets in memory. 140 | 141 | ML.NET expects missing data in CSV files to appear as a ‘?’, but unfortunately the Titanic file uses an empty string to indicate an unknown age. So the first thing you need to do is replace all empty age strings occurrences with ‘?’. 142 | 143 | Add the following code: 144 | 145 | ```fsharp 146 | // set up a training pipeline 147 | let pipeline = 148 | EstimatorChain() 149 | 150 | // step 1: replace missing ages with '?' 151 | .Append( 152 | context.Transforms.CustomMapping( 153 | Action(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge), 154 | "AgeMapping")) 155 | 156 | // the rest of the code goes here... 157 | ``` 158 | 159 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 160 | 161 | The **CustomMapping** component converts empty age strings to ‘?’ values. 162 | 163 | Now ML.NET is happy with the age values. You will now convert the string ages to numeric values and instruct ML.NET to replace any missing values with the mean age over the entire dataset. 164 | 165 | Add the following code, and make sure you match the indentation level of the previous **Append** function exactly. Indentation is significant in F# and the wrong indentation level will lead to compiler errors: 166 | 167 | ```fsharp 168 | // step 2: convert string ages to floats 169 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single)) 170 | 171 | // step 3: replace missing age values with the mean age 172 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean)) 173 | 174 | // the rest of the code goes here... 175 | ``` 176 | 177 | The **ConvertType** component converts the Age column to a single-precision floating point value. And the **ReplaceMissingValues** component replaces any missing values with the mean value of all ages in the entire dataset. 178 | 179 | Now let's process the rest of the data columns. The Sex, Ticket, Cabin, and Embarked columns are enumerations of string values. As you've already learned, you'll need to one-hot encode them: 180 | 181 | ```fsharp 182 | // step 4: replace string columns with one-hot encoded vectors 183 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex")) 184 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket")) 185 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin")) 186 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked")) 187 | 188 | // the rest of the code goes here... 189 | ``` 190 | 191 | The **OneHotEncoding** components take an input column, one-hot encode all values, and produce a new column with the same name holding the one-hot vectors. 
192 | 193 | Now let's wrap up the pipeline: 194 | 195 | ```fsharp 196 | // step 5: concatenate everything into a single feature column 197 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked")) 198 | 199 | // step 6: use a fasttree trainer 200 | .Append(context.BinaryClassification.Trainers.FastTree()) 201 | 202 | // the rest of the code goes here (indented back 2 levels!)... 203 | ``` 204 | 205 | The **Concatenate** component concatenates all remaining feature columns into a single column for training. This is required because ML.NET can only train on a single input column. 206 | 207 | And the **FastTreeBinaryClassificationTrainer** is the algorithm that's going to train the model. You're going to build a decision tree classifier that uses the Fast Tree algorithm to train on the data and configure the tree. 208 | 209 | Note the indentation level of the 'the rest of the code...' comment. Make sure that when you add the remaining code you indent this code back by two levels to match the indentation level of the **main** function. 210 | 211 | Now all you need to do now is train the model on the entire dataset, compare the predictions with the labels, and compute a bunch of metrics that describe how accurate the model is: 212 | 213 | ```fsharp 214 | // train the model 215 | let model = trainData |> pipeline.Fit 216 | 217 | // make predictions and compare with ground truth 218 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate 219 | 220 | // report the results 221 | printfn "Model metrics:" 222 | printfn " Accuracy: %f" metrics.Accuracy 223 | printfn " Auc: %f" metrics.AreaUnderRocCurve 224 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 225 | printfn " F1Score: %f" metrics.F1Score 226 | printfn " LogLoss: %f" metrics.LogLoss 227 | printfn " LogLossReduction: %f" metrics.LogLossReduction 228 | printfn " PositivePrecision: %f" metrics.PositivePrecision 229 | printfn " PositiveRecall: %f" metrics.PositiveRecall 230 | printfn " NegativePrecision: %f" metrics.NegativePrecision 231 | printfn " NegativeRecall: %f" metrics.NegativeRecall 232 | 233 | // the rest of the code goes here... 234 | ``` 235 | 236 | This code pipes the training data into the **Fit** function to train the model on the entire dataset. 237 | 238 | We then pipe the test data into the **Transform** function to set up a prediction for each passenger, and pipe these predictions into the **Evaluate** function to compare them to the label and automatically calculate evaluation metrics. 239 | 240 | We then display the following metrics: 241 | 242 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 243 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good. 244 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 245 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 246 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. 
A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 247 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 248 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 249 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high. 250 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 251 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases. 252 | 253 | To wrap up, let's have some fun and pretend that I’m going to take a trip on the Titanic too. I will embark in Southampton and pay $70 for a first-class cabin. I travel on my own without parents, children, or my spouse. 254 | 255 | What are my odds of surviving? 256 | 257 | Add the following code: 258 | 259 | ```fsharp 260 | // set up a prediction engine 261 | let engine = context.Model.CreatePredictionEngine model 262 | 263 | // create a sample record 264 | let passenger = { 265 | Pclass = 1.0f 266 | Sex = "male" 267 | RawAge = "48" 268 | SibSp = 0.0f 269 | Parch = 0.0f 270 | Ticket = "B" 271 | Fare = 70.0f 272 | Cabin = "123" 273 | Embarked = "S" 274 | Label = false // unused! 275 | } 276 | 277 | // make the prediction 278 | let prediction = engine.Predict passenger 279 | 280 | // report the results 281 | printfn "Model prediction:" 282 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished") 283 | printfn " Probability: %f" prediction.Probability 284 | ``` 285 | 286 | This code uses the **CreatePredictionEngine** method to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction. 287 | 288 | The code sets up a new passenger record with my information and then calls **Predict** to make a prediction about my survival chances. 289 | 290 | So would I have survived the Titanic disaster? 291 | 292 | Time to find out. Go to your terminal and run your code: 293 | 294 | ```bash 295 | $ dotnet run 296 | ``` 297 | 298 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value? 299 | 300 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model? 301 | 302 | And what about me? Did I survive the disaster? 303 | 304 | Do you think a decision tree is a good choice to predict Titanic survivors? Which other algorithms could you use instead? Do they give a better result? 305 | 306 | Share your results in our group! 
307 | -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/TitanicPrediction.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/assets/data.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/data.jpg -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/assets/titanic.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/titanic.jpeg -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/test_data.csv: -------------------------------------------------------------------------------- 1 | "PassengerId","Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked" 2 | 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)","female","38",1,0,"PC 17599","71.2833","C85","C" 3 | 3,1,3,"Heikkinen, Miss. Laina","female","26",0,0,"STON/O2. 3101282","7.925","","S" 4 | 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","female","27",0,2,"347742","11.1333","","S" 5 | 11,1,3,"Sandstrom, Miss. Marguerite Rut","female","4",1,1,"PP 9549","16.7","G6","S" 6 | 18,1,2,"Williams, Mr. Charles Eugene","male","",0,0,"244373","13","","S" 7 | 25,0,3,"Palsson, Miss. Torborg Danira","female","8",3,1,"349909","21.075","","S" 8 | 30,0,3,"Todoroff, Mr. Lalio","male","",0,0,"349216","7.8958","","S" 9 | 31,0,1,"Uruchurtu, Don. Manuel E","male","40",0,0,"PC 17601","27.7208","","C" 10 | 34,0,2,"Wheadon, Mr. Edward H","male","66",0,0,"C.A. 24579","10.5","","S" 11 | 38,0,3,"Cann, Mr. Ernest Charles","male","21",0,0,"A./5. 2152","8.05","","S" 12 | 43,0,3,"Kraeff, Mr. Theodor","male","",0,0,"349253","7.8958","","C" 13 | 49,0,3,"Samaan, Mr. Youssef","male","",2,0,"2662","21.6792","","C" 14 | 51,0,3,"Panula, Master. Juha Niilo","male","7",4,1,"3101295","39.6875","","S" 15 | 55,0,1,"Ostby, Mr. Engelhart Cornelius","male","65",0,1,"113509","61.9792","B30","C" 16 | 60,0,3,"Goodwin, Master. William Frederick","male","11",5,2,"CA 2144","46.9","","S" 17 | 64,0,3,"Skoog, Master. Harald","male","4",3,2,"347088","27.9","","S" 18 | 67,1,2,"Nye, Mrs. (Elizabeth Ramell)","female","29",0,0,"C.A. 29395","10.5","F33","S" 19 | 72,0,3,"Goodwin, Miss. Lillian Amy","female","16",5,2,"CA 2144","46.9","","S" 20 | 76,0,3,"Moen, Mr. Sigurd Hansen","male","25",0,0,"348123","7.65","F G73","S" 21 | 78,0,3,"Moutal, Mr. Rahamin Haim","male","",0,0,"374746","8.05","","S" 22 | 81,0,3,"Waelens, Mr. Achille","male","22",0,0,"345767","9","","S" 23 | 85,1,2,"Ilett, Miss. Bertha","female","17",0,0,"SO/C 14885","10.5","","S" 24 | 87,0,3,"Ford, Mr. William Neal","male","16",1,3,"W./C. 6608","34.375","","S" 25 | 93,0,1,"Chaffee, Mr. Herbert Fuller","male","46",1,0,"W.E.P. 5734","61.175","E31","S" 26 | 95,0,3,"Coxon, Mr. Daniel","male","59",0,0,"364500","7.25","","S" 27 | 99,1,2,"Doling, Mrs. 
John T (Ada Julia Bone)","female","34",0,1,"231919","23","","S" 28 | 113,0,3,"Barton, Mr. David John","male","22",0,0,"324669","8.05","","S" 29 | 121,0,2,"Hickman, Mr. Stanley George","male","21",2,0,"S.O.C. 14879","73.5","","S" 30 | 123,0,2,"Nasser, Mr. Nicholas","male","32.5",1,0,"237736","30.0708","","C" 31 | 136,0,2,"Richard, Mr. Emile","male","23",0,0,"SC/PARIS 2133","15.0458","","C" 32 | 140,0,1,"Giglio, Mr. Victor","male","24",0,0,"PC 17593","79.2","B86","C" 33 | 144,0,3,"Burke, Mr. Jeremiah","male","19",0,0,"365222","6.75","","Q" 34 | 146,0,2,"Nicholls, Mr. Joseph Charles","male","19",1,1,"C.A. 33112","36.75","","S" 35 | 148,0,3,"Ford, Miss. Robina Maggie ""Ruby""","female","9",2,2,"W./C. 6608","34.375","","S" 36 | 156,0,1,"Williams, Mr. Charles Duane","male","51",0,1,"PC 17597","61.3792","","C" 37 | 157,1,3,"Gilnagh, Miss. Katherine ""Katie""","female","16",0,0,"35851","7.7333","","Q" 38 | 158,0,3,"Corn, Mr. Harry","male","30",0,0,"SOTON/OQ 392090","8.05","","S" 39 | 166,1,3,"Goldsmith, Master. Frank John William ""Frankie""","male","9",0,2,"363291","20.525","","S" 40 | 167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)","female","",0,1,"113505","55","E33","S" 41 | 168,0,3,"Skoog, Mrs. William (Anna Bernhardina Karlsson)","female","45",1,4,"347088","27.9","","S" 42 | 195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)","female","44",0,0,"PC 17610","27.7208","B4","C" 43 | 201,0,3,"Vande Walle, Mr. Nestor Cyriel","male","28",0,0,"345770","9.5","","S" 44 | 206,0,3,"Strom, Miss. Telma Matilda","female","2",0,1,"347054","10.4625","G6","S" 45 | 210,1,1,"Blank, Mr. Henry","male","40",0,0,"112277","31","A31","C" 46 | 218,0,2,"Jacobsohn, Mr. Sidney Samuel","male","42",1,0,"243847","27","","S" 47 | 223,0,3,"Green, Mr. George Henry","male","51",0,0,"21440","8.05","","S" 48 | 241,0,3,"Zabour, Miss. Thamine","female","",1,0,"2665","14.4542","","C" 49 | 243,0,2,"Coleridge, Mr. Reginald Charles","male","29",0,0,"W./C. 14263","10.5","","S" 50 | 251,0,3,"Reed, Mr. James George","male","",0,0,"362316","7.25","","S" 51 | 255,0,3,"Rosblom, Mrs. Viktor (Helena Wilhelmina)","female","41",0,2,"370129","20.2125","","S" 52 | 265,0,3,"Henry, Miss. Delia","female","",0,0,"382649","7.75","","Q" 53 | 266,0,2,"Reeves, Mr. David","male","36",0,0,"C.A. 17248","10.5","","S" 54 | 271,0,1,"Cairns, Mr. Alexander","male","",0,0,"113798","31","","S" 55 | 279,0,3,"Rice, Master. Eric","male","7",4,1,"382652","29.125","","Q" 56 | 285,0,1,"Smith, Mr. Richard William","male","",0,0,"113056","26","A19","S" 57 | 296,0,1,"Lewy, Mr. Ervin G","male","",0,0,"PC 17612","27.7208","","C" 58 | 305,0,3,"Williams, Mr. Howard Hugh ""Harry""","male","",0,0,"A/5 2466","8.05","","S" 59 | 306,1,1,"Allison, Master. Hudson Trevor","male","0.92",1,2,"113781","151.55","C22 C26","S" 60 | 311,1,1,"Hays, Miss. Margaret Bechstein","female","24",0,0,"11767","83.1583","C54","C" 61 | 314,0,3,"Hendekovic, Mr. Ignjac","male","28",0,0,"349243","7.8958","","S" 62 | 315,0,2,"Hart, Mr. Benjamin","male","43",1,1,"F.C.C. 13529","26.25","","S" 63 | 333,0,1,"Graham, Mr. George Edward","male","38",0,1,"PC 17582","153.4625","C91","S" 64 | 335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinsheimer)","female","",1,0,"PC 17611","133.65","","S" 65 | 337,0,1,"Pears, Mr. Thomas Clinton","male","29",1,0,"113776","66.6","C2","S" 66 | 341,1,2,"Navratil, Master. Edmond Roger","male","2",1,1,"230080","26","F2","S" 67 | 344,0,2,"Sedgwick, Mr. Charles Frederick Waddington","male","25",0,0,"244361","13","","S" 68 | 345,0,2,"Fox, Mr. 
Stanley Hubert","male","36",0,0,"229236","13","","S" 69 | 359,1,3,"McGovern, Miss. Mary","female","",0,0,"330931","7.8792","","Q" 70 | 365,0,3,"O'Brien, Mr. Thomas","male","",1,0,"370365","15.5","","Q" 71 | 366,0,3,"Adahl, Mr. Mauritz Nils Martin","male","30",0,0,"C 7076","7.25","","S" 72 | 367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)","female","60",1,0,"110813","75.25","D37","C" 73 | 374,0,1,"Ringhini, Mr. Sante","male","22",0,0,"PC 17760","135.6333","","C" 74 | 375,0,3,"Palsson, Miss. Stina Viola","female","3",3,1,"349909","21.075","","S" 75 | 376,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)","female","",1,0,"PC 17604","82.1708","","C" 76 | 383,0,3,"Tikkanen, Mr. Juho","male","32",0,0,"STON/O 2. 3101293","7.925","","S" 77 | 387,0,3,"Goodwin, Master. Sidney Leonard","male","1",5,2,"CA 2144","46.9","","S" 78 | 393,0,3,"Gustafsson, Mr. Johan Birger","male","28",2,0,"3101277","7.925","","S" 79 | 396,0,3,"Johansson, Mr. Erik","male","22",0,0,"350052","7.7958","","S" 80 | 401,1,3,"Niskanen, Mr. Juha","male","39",0,0,"STON/O 2. 3101289","7.925","","S" 81 | 407,0,3,"Widegren, Mr. Carl/Charles Peter","male","51",0,0,"347064","7.75","","S" 82 | 408,1,2,"Richards, Master. William Rowe","male","3",1,1,"29106","18.75","","S" 83 | 414,0,2,"Cunningham, Mr. Alfred Fleming","male","",0,0,"239853","0","","S" 84 | 419,0,2,"Matthews, Mr. William John","male","30",0,0,"28228","13","","S" 85 | 422,0,3,"Charters, Mr. David","male","21",0,0,"A/5. 13032","7.7333","","Q" 86 | 423,0,3,"Zimmerman, Mr. Leo","male","29",0,0,"315082","7.875","","S" 87 | 427,1,2,"Clarke, Mrs. Charles V (Ada Maria Winfield)","female","28",1,0,"2003","26","","S" 88 | 428,1,2,"Phillips, Miss. Kate Florence (""Mrs Kate Louise Phillips Marshall"")","female","19",0,0,"250655","26","","S" 89 | 434,0,3,"Kallio, Mr. Nikolai Erland","male","17",0,0,"STON/O 2. 3101274","7.125","","S" 90 | 437,0,3,"Ford, Miss. Doolina Margaret ""Daisy""","female","21",2,2,"W./C. 6608","34.375","","S" 91 | 438,1,2,"Richards, Mrs. Sidney (Emily Hocking)","female","24",2,3,"29106","18.75","","S" 92 | 441,1,2,"Hart, Mrs. Benjamin (Esther Ada Bloomfield)","female","45",1,1,"F.C.C. 13529","26.25","","S" 93 | 446,1,1,"Dodge, Master. Washington","male","4",0,2,"33638","81.8583","A34","S" 94 | 448,1,1,"Seward, Mr. Frederic Kimber","male","34",0,0,"113794","26.55","","S" 95 | 449,1,3,"Baclini, Miss. Marie Catherine","female","5",2,1,"2666","19.2583","","C" 96 | 462,0,3,"Morley, Mr. William","male","34",0,0,"364506","8.05","","S" 97 | 465,0,3,"Maisner, Mr. Simon","male","",0,0,"A/S 2816","8.05","","S" 98 | 483,0,3,"Rouse, Mr. Richard Henry","male","50",0,0,"A/5 3594","8.05","","S" 99 | 493,0,1,"Molson, Mr. Harry Markland","male","55",0,0,"113787","30.5","C30","S" 100 | 495,0,3,"Stanley, Mr. Edward Roland","male","21",0,0,"A/4 45380","8.05","","S" 101 | 497,1,1,"Eustis, Miss. Elizabeth Mussey","female","54",1,0,"36947","78.2667","D20","C" 102 | 507,1,2,"Quick, Mrs. Frederick Charles (Jane Richards)","female","33",0,2,"26360","26","","S" 103 | 508,1,1,"Bradley, Mr. George (""George Arthur Brayton"")","male","",0,0,"111427","26.55","","S" 104 | 512,0,3,"Webber, Mr. James","male","",0,0,"SOTON/OQ 3101316","8.05","","S" 105 | 518,0,3,"Ryan, Mr. Patrick","male","",0,0,"371110","24.15","","Q" 106 | 522,0,3,"Vovk, Mr. Janko","male","22",0,0,"349252","7.8958","","S" 107 | 530,0,2,"Hocking, Mr. Richard George","male","23",2,1,"29104","11.5","","S" 108 | 531,1,2,"Quick, Miss. Phyllis May","female","2",1,1,"26360","26","","S" 109 | 532,0,3,"Toufik, Mr. 
Nakli","male","",0,0,"2641","7.2292","","C" 110 | 538,1,1,"LeRoy, Miss. Bertha","female","30",0,0,"PC 17761","106.425","","C" 111 | 543,0,3,"Andersson, Miss. Sigrid Elisabeth","female","11",4,2,"347082","31.275","","S" 112 | 547,1,2,"Beane, Mrs. Edward (Ethel Clarke)","female","19",1,0,"2908","26","","S" 113 | 551,1,1,"Thayer, Mr. John Borland Jr","male","17",0,2,"17421","110.8833","C70","C" 114 | 558,0,1,"Robbins, Mr. Victor","male","",0,0,"PC 17757","227.525","","C" 115 | 561,0,3,"Morrow, Mr. Thomas Rowan","male","",0,0,"372622","7.75","","Q" 116 | 570,1,3,"Jonsson, Mr. Carl","male","32",0,0,"350417","7.8542","","S" 117 | 574,1,3,"Kelly, Miss. Mary","female","",0,0,"14312","7.75","","Q" 118 | 589,0,3,"Gilinski, Mr. Eliezer","male","22",0,0,"14973","8.05","","S" 119 | 591,0,3,"Rintamaki, Mr. Matti","male","35",0,0,"STON/O 2. 3101273","7.125","","S" 120 | 592,1,1,"Stephenson, Mrs. Walter Bertram (Martha Eustis)","female","52",1,0,"36947","78.2667","D20","C" 121 | 600,1,1,"Duff Gordon, Sir. Cosmo Edmund (""Mr Morgan"")","male","49",1,0,"PC 17485","56.9292","A20","C" 122 | 602,0,3,"Slabenoff, Mr. Petco","male","",0,0,"349214","7.8958","","S" 123 | 609,1,2,"Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)","female","22",1,2,"SC/Paris 2123","41.5792","","C" 124 | 616,1,2,"Herman, Miss. Alice","female","24",1,2,"220845","65","","S" 125 | 619,1,2,"Becker, Miss. Marion Louise","female","4",2,1,"230136","39","F4","S" 126 | 635,0,3,"Skoog, Miss. Mabel","female","9",3,2,"347088","27.9","","S" 127 | 641,0,3,"Jensen, Mr. Hans Peder","male","20",0,0,"350050","7.8542","","S" 128 | 647,0,3,"Cor, Mr. Liudevit","male","19",0,0,"349231","7.8958","","S" 129 | 648,1,1,"Simonius-Blumer, Col. Oberst Alfons","male","56",0,0,"13213","35.5","A26","C" 130 | 650,1,3,"Stanley, Miss. Amy Zillah Elsie","female","23",0,0,"CA. 2314","7.55","","S" 131 | 655,0,3,"Hegarty, Miss. Hanora ""Nora""","female","18",0,0,"365226","6.75","","Q" 132 | 657,0,3,"Radeff, Mr. Alexander","male","",0,0,"349223","7.8958","","S" 133 | 661,1,1,"Frauenthal, Dr. Henry William","male","50",2,0,"PC 17611","133.65","","S" 134 | 664,0,3,"Coleff, Mr. Peju","male","36",0,0,"349210","7.4958","","S" 135 | 673,0,2,"Mitchell, Mr. Henry Michael","male","70",0,0,"C.A. 24580","10.5","","S" 136 | 675,0,2,"Watson, Mr. Ennis Hastings","male","",0,0,"239856","0","","S" 137 | 679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)","female","43",1,6,"CA 2144","46.9","","S" 138 | 688,0,3,"Dakic, Mr. Branko","male","19",0,0,"349228","10.1708","","S" 139 | 698,1,3,"Mullens, Miss. Katherine ""Katie""","female","",0,0,"35852","7.7333","","Q" 140 | 705,0,3,"Hansen, Mr. Henrik Juul","male","26",1,0,"350025","7.8542","","S" 141 | 713,1,1,"Taylor, Mr. Elmer Zebley","male","48",1,0,"19996","52","C126","S" 142 | 720,0,3,"Johnson, Mr. Malkolm Joackim","male","33",0,0,"347062","7.775","","S" 143 | 727,1,2,"Renouf, Mrs. Peter Henry (Lillian Jefferys)","female","30",3,0,"31027","21","","S" 144 | 732,0,3,"Hassan, Mr. Houssein G N","male","11",0,0,"2699","18.7875","","C" 145 | 740,0,3,"Nankoff, Mr. Minko","male","",0,0,"349218","7.8958","","S" 146 | 741,1,1,"Hawksford, Mr. Walter James","male","",0,0,"16988","30","D45","S" 147 | 742,0,1,"Cavendish, Mr. Tyrell William","male","36",1,0,"19877","78.85","C46","S" 148 | 744,0,3,"McNamee, Mr. Neal","male","24",1,0,"376566","16.1","","S" 149 | 748,1,2,"Sinkkonen, Miss. Anna","female","30",0,0,"250648","13","","S" 150 | 751,1,2,"Wells, Miss. Joan","female","4",1,1,"29103","23","","S" 151 | 752,1,3,"Moor, Master. 
Meier","male","6",0,1,"392096","12.475","E121","S" 152 | 762,0,3,"Nirva, Mr. Iisakki Antino Aijo","male","41",0,0,"SOTON/O2 3101272","7.125","","S" 153 | 763,1,3,"Barah, Mr. Hanna Assi","male","20",0,0,"2663","7.2292","","C" 154 | 769,0,3,"Moran, Mr. Daniel J","male","",1,0,"371110","24.15","","Q" 155 | 770,0,3,"Gronnestad, Mr. Daniel Danielsen","male","32",0,0,"8471","8.3625","","S" 156 | 783,0,1,"Long, Mr. Milton Clyde","male","29",0,0,"113501","30","D6","S" 157 | 786,0,3,"Harmer, Mr. Abraham (David Lishin)","male","25",0,0,"374887","7.25","","S" 158 | 792,0,2,"Gaskell, Mr. Alfred","male","16",0,0,"239865","26","","S" 159 | 795,0,3,"Dantcheff, Mr. Ristiu","male","25",0,0,"349203","7.8958","","S" 160 | 797,1,1,"Leader, Dr. Alice (Farnham)","female","49",0,0,"17465","25.9292","D17","S" 161 | 801,0,2,"Ponesell, Mr. Martin","male","34",0,0,"250647","13","","S" 162 | 810,1,1,"Chambers, Mrs. Norman Campbell (Bertha Griggs)","female","33",1,0,"113806","53.1","E8","S" 163 | 812,0,3,"Lester, Mr. James","male","39",0,0,"A/4 48871","24.15","","S" 164 | 815,0,3,"Tomlin, Mr. Ernest Portage","male","30.5",0,0,"364499","8.05","","S" 165 | 821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gregg)","female","52",1,1,"12749","93.5","B69","S" 166 | 829,1,3,"McCormack, Mr. Thomas Joseph","male","",0,0,"367228","7.75","","Q" 167 | 832,1,2,"Richards, Master. George Sibley","male","0.83",1,1,"29106","18.75","","S" 168 | 845,0,3,"Culumovic, Mr. Jeso","male","17",0,0,"315090","8.6625","","S" 169 | 850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)","female","",1,0,"17453","89.1042","C92","C" 170 | 851,0,3,"Andersson, Master. Sigvard Harald Elias","male","4",4,2,"347082","31.275","","S" 171 | 853,0,3,"Boulos, Miss. Nourelain","female","9",1,1,"2678","15.2458","","C" 172 | 857,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)","female","45",1,1,"36928","164.8667","","S" 173 | 858,1,1,"Daly, Mr. Peter Denis ","male","51",0,0,"113055","26.55","E17","S" 174 | 860,0,3,"Razi, Mr. Raihed","male","",0,0,"2629","7.2292","","C" 175 | 865,0,2,"Gill, Mr. John William","male","24",0,0,"233866","13","","S" 176 | 867,1,2,"Duran y More, Miss. Asuncion","female","27",1,0,"SC/PARIS 2149","13.8583","","C" 177 | 874,0,3,"Vander Cruyssen, Mr. Victor","male","47",0,0,"345765","9","","S" 178 | 879,0,3,"Laleff, Mr. Kristo","male","",0,0,"349217","7.8958","","S" 179 | 882,0,3,"Markun, Mr. Johann","male","33",0,0,"349257","7.8958","","S" 180 | 886,0,3,"Rice, Mrs. William (Margaret Norton)","female","39",0,5,"382652","29.125","","Q" 181 | -------------------------------------------------------------------------------- /Clustering/IrisFlower/IrisFlower.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /Clustering/IrisFlower/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | 5 | /// A type that holds a single iris flower. 6 | [] 7 | type IrisData = { 8 | [] SepalLength : float32 9 | [] SepalWidth : float32 10 | [] PetalLength : float32 11 | [] PetalWidth : float32 12 | [] Label : string 13 | } 14 | 15 | /// A type that holds a single model prediction. 16 | [] 17 | type IrisPrediction = { 18 | PredictedLabel : uint32 19 | Score : float32[] 20 | } 21 | 22 | /// file paths to data files (assumes os = windows!) 
23 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory 24 | 25 | [] 26 | let main argv = 27 | 28 | // get the machine learning context 29 | let context = new MLContext(); 30 | 31 | // read the iris flower data from a text file 32 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = false, separatorChar = ',') 33 | 34 | // split the data into a training and testing partition 35 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 36 | 37 | // set up a learning pipeline 38 | let pipeline = 39 | EstimatorChain() 40 | 41 | // step 1: concatenate features into a single column 42 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")) 43 | 44 | // step 2: use k-means clustering to find the iris types 45 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3)) 46 | 47 | // train the model on the training data 48 | let model = partitions.TrainSet |> pipeline.Fit 49 | 50 | // get predictions and compare to ground truth 51 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate 52 | 53 | // show results 54 | printfn "Nodel results" 55 | printfn " Average distance: %f" metrics.AverageDistance 56 | printfn " Davies Bouldin index: %f" metrics.DaviesBouldinIndex 57 | 58 | // set up a prediction engine 59 | let engine = context.Model.CreatePredictionEngine model 60 | 61 | // grab 3 flowers from the dataset 62 | let flowers = context.Data.CreateEnumerable(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq 63 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ] 64 | 65 | // show predictions for the three flowers 66 | printfn "Predictions for the 3 test flowers:" 67 | printfn " Label\t\t\tPredicted\tScores" 68 | testFlowers |> Seq.iter(fun f -> 69 | let p = engine.Predict f 70 | printf " %-15s\t%i\t\t" f.Label p.PredictedLabel 71 | p.Score |> Seq.iter(fun s -> printf "%f\t" s) 72 | printfn "") 73 | 74 | 0 // return value -------------------------------------------------------------------------------- /Clustering/IrisFlower/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Cluster Iris flowers 2 | 3 | In this assignment you are going to build an unsupervised learning app that clusters Iris flowers into discrete groups. 4 | 5 | There are three types of Iris flowers: Versicolor, Setosa, and Virginica. Each flower has two sets of leaves: the inner Petals and the outer Sepals. 6 | 7 | Your goal is to build an app that can identify an Iris flower by its sepal and petal size. 8 | 9 | ![MNIST digits](./assets/flowers.png) 10 | 11 | Your challenge is that you're not going to use the dataset labels. Your app has to recognize patterns in the dataset and cluster the flowers into three groups without any help. 12 | 13 | Clustering is an example of **unsupervised learning** where the data science model has to figure out the labels on its own. 14 | 15 | The first thing you will need for your app is a data file with Iris flower petal and sepal sizes. You can use this [CSV file](https://github.com/mdfarragher/DSC/blob/master/Clustering/IrisFlower/iris-data.csv). Save it as **iris-data.csv** in your project folder. 
16 | 17 | The file looks like this: 18 | 19 | ![Data file](./assets/data.png) 20 | 21 | It’s a CSV file with 5 columns: 22 | 23 | * The length of the Sepal in centimeters 24 | * The width of the Sepal in centimeters 25 | * The length of the Petal in centimeters 26 | * The width of the Petal in centimeters 27 | * The type of Iris flower 28 | 29 | You are going to build a clustering data science model that reads the data and then guesses the label for each flower in the dataset. 30 | 31 | Of course the app won't know the real names of the flowers, so it's just going to number them: 1, 2, and 3. 32 | 33 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new NET Core console project: 34 | 35 | ```bash 36 | $ dotnet new console --language F# --output IrisFlowers 37 | $ cd IrisFlowers 38 | ``` 39 | 40 | Now install the ML.NET package: 41 | 42 | ```bash 43 | $ dotnet add package Microsoft.ML 44 | ``` 45 | 46 | Now you are ready to add some types. You’ll need one to hold a flower and one to hold your model prediction. 47 | 48 | Edit the Program.fs file and replace its contents with this: 49 | 50 | ```fsharp 51 | open System 52 | open Microsoft.ML 53 | open Microsoft.ML.Data 54 | 55 | /// A type that holds a single iris flower. 56 | [] 57 | type IrisData = { 58 | [] SepalLength : float32 59 | [] SepalWidth : float32 60 | [] PetalLength : float32 61 | [] PetalWidth : float32 62 | [] Label : string 63 | } 64 | 65 | /// A type that holds a single model prediction. 66 | [] 67 | type IrisPrediction = { 68 | PredictedLabel : uint32 69 | Score : float32[] 70 | } 71 | 72 | // the rest of the code goes here.... 73 | ``` 74 | 75 | The **IrisData** type holds one single flower. Note how the fields are tagged with the **LoadColumn** attribute that tells ML.NET how to load the data from the data file. 76 | 77 | We are loading the label in the 5th column, but we won't be using the label during training because we want the model to figure out the iris flower types on its own. 78 | 79 | There's also an **IrisPrediction** type which will hold a prediction for a single flower. The prediction consists of the ID of the cluster that the flower belongs to. Clusters are numbered from 1 upwards. And notice how the score field is an array? Each individual score value represents the distance of the flower to one specific cluster. 80 | 81 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 82 | 83 | Next you'll need to load the data in memory: 84 | 85 | ```fsharp 86 | /// file paths to data files (assumes os = windows!) 87 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory 88 | 89 | [] 90 | let main argv = 91 | 92 | // get the machine learning context 93 | let context = new MLContext(); 94 | 95 | // read the iris flower data from a text file 96 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = false, separatorChar = ',') 97 | 98 | // split the data into a training and testing partition 99 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 100 | 101 | // the rest of the code goes here.... 
102 | 103 | 0 // return value 104 | ``` 105 | 106 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory, and then calls **TrainTestSplit** to split the dataset into an 80% training partition and a 20% test partition. 107 | 108 | Now let’s build the data science pipeline: 109 | 110 | ```fsharp 111 | // set up a learning pipeline 112 | let pipeline = 113 | EstimatorChain() 114 | 115 | // step 1: concatenate features into a single column 116 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")) 117 | 118 | // step 2: use k-means clustering to find the iris types 119 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3)) 120 | 121 | // train the model on the training data 122 | let model = partitions.TrainSet |> pipeline.Fit 123 | 124 | // the rest of the code goes here... 125 | ``` 126 | 127 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 128 | 129 | This pipeline has two components: 130 | 131 | * **Concatenate** which combines the four flower measurements (SepalLength, SepalWidth, PetalLength, and PetalWidth) into a single column called Features. This is a required step because ML.NET can only train on a single input column. 132 | * A **KMeans** component which performs K-Means Clustering on the data and tries to find all Iris flower types. 133 | 134 | With the pipeline fully assembled, the code trains the model by piping the training set into the **Fit** function. 135 | 136 | You now have a fully-trained model. So now it's time to take the test set, predict the type of each flower, and calculate the accuracy metrics of the model: 137 | 138 | ```fsharp 139 | // get predictions and compare to ground truth 140 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate 141 | 142 | // show results 143 | printfn "Model results" 144 | printfn "   Average distance:     %f" metrics.AverageDistance 145 | printfn "   Davies Bouldin index: %f" metrics.DaviesBouldinIndex 146 | 147 | // the rest of the code goes here.... 148 | ``` 149 | 150 | This code pipes the test set into the **Transform** function to set up predictions for every flower in the test set. Then it pipes these predictions into the **Evaluate** function to compare each prediction with its label and automatically calculate two metrics: 151 | 152 | * **AverageDistance**: this is the average distance of a flower to the center point of its cluster, averaged over all clusters in the dataset. It is a measure of the 'tightness' of the clusters. Lower values are better and mean more concentrated clusters. 153 | * **DaviesBouldinIndex**: this metric is the average 'similarity' of each cluster with its most similar cluster. Similarity is defined as the ratio of within-cluster distances to between-cluster distances. So in other words, clusters which are farther apart and more concentrated will result in a better score. Low values indicate better clustering. 154 | 155 | So Average Distance measures how concentrated the clusters are in the dataset, and the Davies Bouldin Index measures both concentration and how far apart the clusters are spaced. For both metrics lower values are better, with zero being the perfect score. 156 | 157 | To wrap up, let’s use the model to make predictions. 158 | 159 | You will pick three arbitrary flowers from the test set, run them through the model, and compare the predictions with the labels provided in the data file. 
160 | 161 | Here’s how to do it: 162 | 163 | ```fsharp 164 | // set up a prediction engine 165 | let engine = context.Model.CreatePredictionEngine<IrisData, IrisPrediction> model 166 | 167 | // grab 3 flowers from the dataset 168 | let flowers = context.Data.CreateEnumerable<IrisData>(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq 169 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ] 170 | 171 | // show predictions for the three flowers 172 | printfn "Predictions for the 3 test flowers:" 173 | printfn "  Label\t\t\tPredicted\tScores" 174 | testFlowers |> Seq.iter(fun f -> 175 | let p = engine.Predict f 176 | printf "  %-15s\t%i\t\t" f.Label p.PredictedLabel 177 | p.Score |> Seq.iter(fun s -> printf "%f\t" s) 178 | printfn "") 179 | ``` 180 | 181 | This code calls **CreatePredictionEngine** to set up a prediction engine. This is a type that can generate individual predictions from sample data. 182 | 183 | Then we call the **CreateEnumerable** function to convert the test partition into an array of **IrisData** instances. Note the **Array.ofSeq** function at the end which converts the enumeration to an array. 184 | 185 | Next, we pick three test flowers and pipe them into **Seq.iter**. For each flower, we generate a prediction, print the predicted label (a cluster ID between 1 and 3) and then use a second **Seq.iter** to write the three scores to the console. 186 | 187 | That's it, you're done! 188 | 189 | Go to your terminal and run your code: 190 | 191 | ```bash 192 | $ dotnet run 193 | ``` 194 | 195 | What results do you get? What is your average distance and your Davies Bouldin index? 196 | 197 | What do you think this says about the quality of the clusters? 198 | 199 | What did the 3 flower predictions look like? Does the cluster prediction match the label every time? 200 | 201 | Now change the code and check the predictions for every flower (see the sketch below). How often does the model get it wrong? Which Iris types are the most confusing to the model? 202 | 203 | Share your results in our group. 
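To tally the predictions for every flower, as referenced above, here's a minimal sketch of one possible approach. It reuses the **flowers** array and the **engine** from the code above, and counts how often each combination of true label and predicted cluster occurs. Keep in mind that K-Means numbers its clusters arbitrarily, so the cluster IDs can change from run to run:

```fsharp
// tally how many flowers of each label end up in each cluster
flowers
|> Seq.map(fun f -> (f.Label, (engine.Predict f).PredictedLabel))
|> Seq.countBy id
|> Seq.sortBy fst
|> Seq.iter(fun ((label, cluster), count) ->
    printfn "  %-15s -> cluster %i: %i flowers" label cluster count)
```

If the clustering worked well, each Iris type should end up almost entirely in its own cluster, with most of the remaining confusion between Versicolor and Virginica.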
-------------------------------------------------------------------------------- /Clustering/IrisFlower/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/data.png -------------------------------------------------------------------------------- /Clustering/IrisFlower/assets/flowers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/flowers.png -------------------------------------------------------------------------------- /Clustering/IrisFlower/iris-data.csv: -------------------------------------------------------------------------------- 1 | 5.1,3.5,1.4,0.2,Iris-setosa 2 | 4.9,3.0,1.4,0.2,Iris-setosa 3 | 4.7,3.2,1.3,0.2,Iris-setosa 4 | 4.6,3.1,1.5,0.2,Iris-setosa 5 | 5.0,3.6,1.4,0.2,Iris-setosa 6 | 5.4,3.9,1.7,0.4,Iris-setosa 7 | 4.6,3.4,1.4,0.3,Iris-setosa 8 | 5.0,3.4,1.5,0.2,Iris-setosa 9 | 4.4,2.9,1.4,0.2,Iris-setosa 10 | 4.9,3.1,1.5,0.1,Iris-setosa 11 | 5.4,3.7,1.5,0.2,Iris-setosa 12 | 4.8,3.4,1.6,0.2,Iris-setosa 13 | 4.8,3.0,1.4,0.1,Iris-setosa 14 | 4.3,3.0,1.1,0.1,Iris-setosa 15 | 5.8,4.0,1.2,0.2,Iris-setosa 16 | 5.7,4.4,1.5,0.4,Iris-setosa 17 | 5.4,3.9,1.3,0.4,Iris-setosa 18 | 5.1,3.5,1.4,0.3,Iris-setosa 19 | 5.7,3.8,1.7,0.3,Iris-setosa 20 | 5.1,3.8,1.5,0.3,Iris-setosa 21 | 5.4,3.4,1.7,0.2,Iris-setosa 22 | 5.1,3.7,1.5,0.4,Iris-setosa 23 | 4.6,3.6,1.0,0.2,Iris-setosa 24 | 5.1,3.3,1.7,0.5,Iris-setosa 25 | 4.8,3.4,1.9,0.2,Iris-setosa 26 | 5.0,3.0,1.6,0.2,Iris-setosa 27 | 5.0,3.4,1.6,0.4,Iris-setosa 28 | 5.2,3.5,1.5,0.2,Iris-setosa 29 | 5.2,3.4,1.4,0.2,Iris-setosa 30 | 4.7,3.2,1.6,0.2,Iris-setosa 31 | 4.8,3.1,1.6,0.2,Iris-setosa 32 | 5.4,3.4,1.5,0.4,Iris-setosa 33 | 5.2,4.1,1.5,0.1,Iris-setosa 34 | 5.5,4.2,1.4,0.2,Iris-setosa 35 | 4.9,3.1,1.5,0.1,Iris-setosa 36 | 5.0,3.2,1.2,0.2,Iris-setosa 37 | 5.5,3.5,1.3,0.2,Iris-setosa 38 | 4.9,3.1,1.5,0.1,Iris-setosa 39 | 4.4,3.0,1.3,0.2,Iris-setosa 40 | 5.1,3.4,1.5,0.2,Iris-setosa 41 | 5.0,3.5,1.3,0.3,Iris-setosa 42 | 4.5,2.3,1.3,0.3,Iris-setosa 43 | 4.4,3.2,1.3,0.2,Iris-setosa 44 | 5.0,3.5,1.6,0.6,Iris-setosa 45 | 5.1,3.8,1.9,0.4,Iris-setosa 46 | 4.8,3.0,1.4,0.3,Iris-setosa 47 | 5.1,3.8,1.6,0.2,Iris-setosa 48 | 4.6,3.2,1.4,0.2,Iris-setosa 49 | 5.3,3.7,1.5,0.2,Iris-setosa 50 | 5.0,3.3,1.4,0.2,Iris-setosa 51 | 7.0,3.2,4.7,1.4,Iris-versicolor 52 | 6.4,3.2,4.5,1.5,Iris-versicolor 53 | 6.9,3.1,4.9,1.5,Iris-versicolor 54 | 5.5,2.3,4.0,1.3,Iris-versicolor 55 | 6.5,2.8,4.6,1.5,Iris-versicolor 56 | 5.7,2.8,4.5,1.3,Iris-versicolor 57 | 6.3,3.3,4.7,1.6,Iris-versicolor 58 | 4.9,2.4,3.3,1.0,Iris-versicolor 59 | 6.6,2.9,4.6,1.3,Iris-versicolor 60 | 5.2,2.7,3.9,1.4,Iris-versicolor 61 | 5.0,2.0,3.5,1.0,Iris-versicolor 62 | 5.9,3.0,4.2,1.5,Iris-versicolor 63 | 6.0,2.2,4.0,1.0,Iris-versicolor 64 | 6.1,2.9,4.7,1.4,Iris-versicolor 65 | 5.6,2.9,3.6,1.3,Iris-versicolor 66 | 6.7,3.1,4.4,1.4,Iris-versicolor 67 | 5.6,3.0,4.5,1.5,Iris-versicolor 68 | 5.8,2.7,4.1,1.0,Iris-versicolor 69 | 6.2,2.2,4.5,1.5,Iris-versicolor 70 | 5.6,2.5,3.9,1.1,Iris-versicolor 71 | 5.9,3.2,4.8,1.8,Iris-versicolor 72 | 6.1,2.8,4.0,1.3,Iris-versicolor 73 | 6.3,2.5,4.9,1.5,Iris-versicolor 74 | 6.1,2.8,4.7,1.2,Iris-versicolor 75 | 6.4,2.9,4.3,1.3,Iris-versicolor 76 | 6.6,3.0,4.4,1.4,Iris-versicolor 77 | 6.8,2.8,4.8,1.4,Iris-versicolor 78 | 6.7,3.0,5.0,1.7,Iris-versicolor 79 
| 6.0,2.9,4.5,1.5,Iris-versicolor 80 | 5.7,2.6,3.5,1.0,Iris-versicolor 81 | 5.5,2.4,3.8,1.1,Iris-versicolor 82 | 5.5,2.4,3.7,1.0,Iris-versicolor 83 | 5.8,2.7,3.9,1.2,Iris-versicolor 84 | 6.0,2.7,5.1,1.6,Iris-versicolor 85 | 5.4,3.0,4.5,1.5,Iris-versicolor 86 | 6.0,3.4,4.5,1.6,Iris-versicolor 87 | 6.7,3.1,4.7,1.5,Iris-versicolor 88 | 6.3,2.3,4.4,1.3,Iris-versicolor 89 | 5.6,3.0,4.1,1.3,Iris-versicolor 90 | 5.5,2.5,4.0,1.3,Iris-versicolor 91 | 5.5,2.6,4.4,1.2,Iris-versicolor 92 | 6.1,3.0,4.6,1.4,Iris-versicolor 93 | 5.8,2.6,4.0,1.2,Iris-versicolor 94 | 5.0,2.3,3.3,1.0,Iris-versicolor 95 | 5.6,2.7,4.2,1.3,Iris-versicolor 96 | 5.7,3.0,4.2,1.2,Iris-versicolor 97 | 5.7,2.9,4.2,1.3,Iris-versicolor 98 | 6.2,2.9,4.3,1.3,Iris-versicolor 99 | 5.1,2.5,3.0,1.1,Iris-versicolor 100 | 5.7,2.8,4.1,1.3,Iris-versicolor 101 | 6.3,3.3,6.0,2.5,Iris-virginica 102 | 5.8,2.7,5.1,1.9,Iris-virginica 103 | 7.1,3.0,5.9,2.1,Iris-virginica 104 | 6.3,2.9,5.6,1.8,Iris-virginica 105 | 6.5,3.0,5.8,2.2,Iris-virginica 106 | 7.6,3.0,6.6,2.1,Iris-virginica 107 | 4.9,2.5,4.5,1.7,Iris-virginica 108 | 7.3,2.9,6.3,1.8,Iris-virginica 109 | 6.7,2.5,5.8,1.8,Iris-virginica 110 | 7.2,3.6,6.1,2.5,Iris-virginica 111 | 6.5,3.2,5.1,2.0,Iris-virginica 112 | 6.4,2.7,5.3,1.9,Iris-virginica 113 | 6.8,3.0,5.5,2.1,Iris-virginica 114 | 5.7,2.5,5.0,2.0,Iris-virginica 115 | 5.8,2.8,5.1,2.4,Iris-virginica 116 | 6.4,3.2,5.3,2.3,Iris-virginica 117 | 6.5,3.0,5.5,1.8,Iris-virginica 118 | 7.7,3.8,6.7,2.2,Iris-virginica 119 | 7.7,2.6,6.9,2.3,Iris-virginica 120 | 6.0,2.2,5.0,1.5,Iris-virginica 121 | 6.9,3.2,5.7,2.3,Iris-virginica 122 | 5.6,2.8,4.9,2.0,Iris-virginica 123 | 7.7,2.8,6.7,2.0,Iris-virginica 124 | 6.3,2.7,4.9,1.8,Iris-virginica 125 | 6.7,3.3,5.7,2.1,Iris-virginica 126 | 7.2,3.2,6.0,1.8,Iris-virginica 127 | 6.2,2.8,4.8,1.8,Iris-virginica 128 | 6.1,3.0,4.9,1.8,Iris-virginica 129 | 6.4,2.8,5.6,2.1,Iris-virginica 130 | 7.2,3.0,5.8,1.6,Iris-virginica 131 | 7.4,2.8,6.1,1.9,Iris-virginica 132 | 7.9,3.8,6.4,2.0,Iris-virginica 133 | 6.4,2.8,5.6,2.2,Iris-virginica 134 | 6.3,2.8,5.1,1.5,Iris-virginica 135 | 6.1,2.6,5.6,1.4,Iris-virginica 136 | 7.7,3.0,6.1,2.3,Iris-virginica 137 | 6.3,3.4,5.6,2.4,Iris-virginica 138 | 6.4,3.1,5.5,1.8,Iris-virginica 139 | 6.0,3.0,4.8,1.8,Iris-virginica 140 | 6.9,3.1,5.4,2.1,Iris-virginica 141 | 6.7,3.1,5.6,2.4,Iris-virginica 142 | 6.9,3.1,5.1,2.3,Iris-virginica 143 | 5.8,2.7,5.1,1.9,Iris-virginica 144 | 6.8,3.2,5.9,2.3,Iris-virginica 145 | 6.7,3.3,5.7,2.5,Iris-virginica 146 | 6.7,3.0,5.2,2.3,Iris-virginica 147 | 6.3,2.5,5.0,1.9,Iris-virginica 148 | 6.5,3.0,5.2,2.0,Iris-virginica 149 | 6.2,3.4,5.4,2.3,Iris-virginica 150 | 5.9,3.0,5.1,1.8,Iris-virginica 151 | -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/CaliforniaHousing.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | open FSharp.Plotly 5 | 6 | /// The HouseBlockData class holds one single housing block data record. 
7 | [<CLIMutable>] 8 | type HouseBlockData = { 9 | [<LoadColumn(0)>] Longitude : float32 10 | [<LoadColumn(1)>] Latitude : float32 11 | [<LoadColumn(2)>] HousingMedianAge : float32 12 | [<LoadColumn(3)>] TotalRooms : float32 13 | [<LoadColumn(4)>] TotalBedrooms : float32 14 | [<LoadColumn(5)>] Population : float32 15 | [<LoadColumn(6)>] Households : float32 16 | [<LoadColumn(7)>] MedianIncome : float32 17 | [<LoadColumn(8)>] MedianHouseValue : float32 18 | } 19 | 20 | /// The ToMedianHouseValue class is used in a column data conversion. 21 | [<CLIMutable>] 22 | type ToMedianHouseValue = { 23 | mutable NormalizedMedianHouseValue : float32 24 | } 25 | 26 | /// The ToRoomsPerPerson class is used in a column data conversion. 27 | [<CLIMutable>] 28 | type ToRoomsPerPerson = { 29 | mutable RoomsPerPerson : float32 30 | } 31 | 32 | /// The FromLocation class is used in a column data conversion. 33 | [<CLIMutable>] 34 | type FromLocation = { 35 | EncodedLongitude : float32[] 36 | EncodedLatitude : float32[] 37 | } 38 | 39 | /// The ToLocation class is used in a column data conversion. 40 | [<CLIMutable>] 41 | type ToLocation = { 42 | mutable Location : float32[] 43 | } 44 | 45 | /// file paths to data files (assumes os = windows!) 46 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory 47 | 48 | [<EntryPoint>] 49 | let main argv = 50 | 51 | // create the machine learning context 52 | let context = new MLContext() 53 | 54 | // load the dataset 55 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',') 56 | 57 | // keep only records with a median house value < 500,000 58 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0) 59 | 60 | // get an array of housing data 61 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false) 62 | 63 | // // plot median house value by median income 64 | // Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue))) 65 | // |> Chart.withX_AxisStyle "Median income" 66 | // |> Chart.withY_AxisStyle "Median house value" 67 | // |> Chart.Show 68 | 69 | // build a data loading pipeline 70 | let pipeline = 71 | EstimatorChain() 72 | 73 | // step 1: divide the median house value by 1000 74 | .Append( 75 | context.Transforms.CustomMapping( 76 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f), 77 | "MedianHouseValue")) 78 | 79 | // get a 10-record preview of the transformed data 80 | let model = data |> pipeline.Fit 81 | let preview = (data |> model.Transform).Preview(maxRows = 10) 82 | 83 | // // show the preview 84 | // preview.ColumnView |> Seq.iter(fun c -> 85 | // printf "%-30s|" c.Column.Name 86 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 87 | // printfn "") 88 | 89 | // // plot median house value by longitude 90 | // Chart.Point(houses |> Seq.map(fun h -> (h.Longitude, h.MedianHouseValue))) 91 | // |> Chart.withX_AxisStyle "Longitude" 92 | // |> Chart.withY_AxisStyle "Median house value" 93 | // |> Chart.Show 94 | 95 | // step 2: bin the longitude 96 | let pipeline2 = 97 | pipeline 98 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10)) 99 | 100 | // step 3: bin the latitude 101 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10)) 102 | 103 | // step 4: one-hot encode the longitude 104 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude")) 105 | 106 | // step 5: one-hot encode the latitude 107 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude")) 108 | 109 | .Append( 110 
| context.Transforms.CustomMapping( 111 | Action<FromLocation, ToLocation>(fun input output -> 112 | output.Location <- [| for x in input.EncodedLongitude do 113 | for y in input.EncodedLatitude do 114 | x * y |] ), 115 | "Location")) 116 | 117 | // get a 10-record preview of the transformed data 118 | let model = data |> pipeline2.Fit 119 | let preview = (data |> model.Transform).Preview(maxRows = 10) 120 | 121 | // // show the preview 122 | // preview.ColumnView |> Seq.iter(fun c -> 123 | // printf "%-30s|" c.Column.Name 124 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 125 | // printfn "") 126 | 127 | // show the dense vector 128 | preview.RowView |> Seq.iter(fun r -> 129 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32> 130 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v)) 131 | printfn "") 132 | 133 | 0 // return value -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Load California housing data 2 | 3 | In this assignment you're going to build an app that can load a dataset with the prices of houses in California. The data is not ready for training yet and needs a bit of processing. 4 | 5 | The first thing you'll need is a data file with house prices. The data from the 1990 California census has exactly what we need. 6 | 7 | Download the [California 1990 housing census](https://github.com/mdfarragher/DSC/blob/master/LoadingData/CaliforniaHousing/california_housing.csv) and save it as **california_housing.csv**. 8 | 9 | This is a CSV file with 17,000 records that looks like this: 10 |  11 | ![Data File](./assets/data.png) 12 | 13 | The file contains information on 17k housing blocks all over the state of California: 14 | 15 | * Column 1: The longitude of the housing block 16 | * Column 2: The latitude of the housing block 17 | * Column 3: The median age of all the houses in the block 18 | * Column 4: The total number of rooms in all houses in the block 19 | * Column 5: The total number of bedrooms in all houses in the block 20 | * Column 6: The total number of people living in all houses in the block 21 | * Column 7: The total number of households in all houses in the block 22 | * Column 8: The median income of all people living in all houses in the block 23 | * Column 9: The median house value for all houses in the block 24 | 25 | We can use this data to train an app to predict the value of any house in and outside the state of California. 26 | 27 | Unfortunately we cannot train on this dataset directly. The data needs to be processed first to make it suitable for training. This is what you will do in this assignment. 28 | 29 | Let's get started. 30 | 31 | In these assignments you will not be using the code in GitHub. Instead, you'll be building all the applications 100% from scratch. So please make sure to create a new folder somewhere to hold all of your assignments. 32 | 33 | Now please open a console window. You are going to create a new subfolder for this assignment and set up a blank console application: 34 | 35 | ```bash 36 | $ dotnet new console --language F# --output LoadingData 37 | $ cd LoadingData 38 | ``` 39 | 40 | Also make sure to copy the dataset file(s) into this folder because the code you're going to type next will expect them here. 
41 | 42 | Now install the following packages: 43 | 44 | ```bash 45 | $ dotnet add package Microsoft.ML 46 | $ dotnet add package FSharp.Plotly 47 | ``` 48 | 49 | **Microsoft.ML** is the Microsoft machine learning package. We will use it to build all our applications in this course. And **FSharp.Plotly** is an advanced scientific plotting library. 50 | 51 | Now you are ready to add types. You’ll need one type to hold all the information for a single housing block. 52 | 53 | Edit the Program.fs file with Visual Studio Code and add the following code: 54 | 55 | ```fsharp 56 | open System 57 | open Microsoft.ML 58 | open Microsoft.ML.Data 59 | open FSharp.Plotly 60 | 61 | /// The HouseBlockData class holds one single housing block data record. 62 | [<CLIMutable>] 63 | type HouseBlockData = { 64 | [<LoadColumn(0)>] Longitude : float32 65 | [<LoadColumn(1)>] Latitude : float32 66 | [<LoadColumn(2)>] HousingMedianAge : float32 67 | [<LoadColumn(3)>] TotalRooms : float32 68 | [<LoadColumn(4)>] TotalBedrooms : float32 69 | [<LoadColumn(5)>] Population : float32 70 | [<LoadColumn(6)>] Households : float32 71 | [<LoadColumn(7)>] MedianIncome : float32 72 | [<LoadColumn(8)>] MedianHouseValue : float32 73 | } 74 | ``` 75 | 76 | The **HouseBlockData** class holds all the data for one single housing block. Note that we're loading each column as a 32-bit floating point number, and that every field is tagged with a **LoadColumn** attribute that will tell the CSV data loading code which column to import data from. 77 | 78 | We also need the **CLIMutable** attribute to tell F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 79 | 80 | Next you need to load the data in memory: 81 | 82 | ```fsharp 83 | /// file paths to data files (assumes os = windows!) 84 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory 85 | 86 | [<EntryPoint>] 87 | let main argv = 88 | 89 | // create the machine learning context 90 | let context = new MLContext() 91 | 92 | // load the dataset 93 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',') 94 | 95 | // the rest of the code goes here... 96 | 97 | 0 // return value 98 | ``` 99 | 100 | This code sets up the **main** function which is the main entry point of the application. The code calls the **LoadFromTextFile** method to load the CSV data in memory. Note the **HouseBlockData** type argument that tells the method which class to use to load the data. 101 | 102 | Also note that **dataPath** uses a Windows path separator to access the data file. Change this accordingly if you're using macOS or Linux. 103 | 104 | So now we have the data in memory. Let's plot the median house value as a function of median income and see what happens. 105 | 106 | Add the following code: 107 | 108 | ```fsharp 109 | // get an array of housing data 110 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false) 111 | 112 | // plot median house value by median income 113 | Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue))) 114 | |> Chart.withX_AxisStyle "Median income" 115 | |> Chart.withY_AxisStyle "Median house value" 116 | |> Chart.Show 117 | 118 | // the rest of the code goes here 119 | ``` 120 | 121 | The housing data is stored in memory as a data view, but we want to work with the **HouseBlockData** records directly. 
So we call **CreateEnumerable** to convert the data view to an enumeration of **HouseBlockData** instances. 122 | 123 | The **Chart.Point** method then sets up a scatterplot. We pipe the **houses** enumeration into the **Seq.map** function and project a tuple for every housing block. The tuples contain the median income and median house value for every block, and **Chart.Point** will use these as X and Y coordinates. 124 | 125 | The **Chart.withX_AxisStyle** and **Chart.withY_AxisStyle** functions set the chart axis titles, and **Chart.Show** renders the chart on screen. Your app will open a web browser and display the chart there. 126 | 127 | This is a good moment to save your work ;) 128 | 129 | We're now ready to run the app. Open a PowerShell terminal and make sure you're in the project folder. Then type the following: 130 | 131 | ```bash 132 | $ dotnet build 133 | ``` 134 | 135 | This will build the project and populate the bin folder. 136 | 137 | Then type the following: 138 | 139 | ```bash 140 | $ dotnet run 141 | ``` 142 | 143 | Your app will run and open the chart in a new browser window. It should look like this: 144 | 145 | ![Median house value by median income](./assets/plot.png) 146 | 147 | As the median income level increases, the median house value also increases. There's still a big spread in the house values, but a vague 'cigar' shape is visible which suggests a linear relationship between these two variables. 148 | 149 | But look at the horizontal line at 500,000. What's that all about? 150 | 151 | This is what **clipping** looks like. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 back down to $500,000. We see this appear in the graph as a horizontal line that disrupts the linear 'cigar' shape. 152 | 153 | Let's start by using **data scrubbing** to get rid of these clipped records. Add the following code: 154 | 155 | ```fsharp 156 | // keep only records with a median house value < 500,000 157 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0) 158 | 159 | // the rest of the code goes here... 160 | ``` 161 | 162 | The **FilterRowsByColumn** method will keep only those records with a median house value below $500,000, and remove all other records from the dataset. 163 | 164 | Move your plotting code BELOW this code fragment and run your app again. 165 | 166 | Did this fix the problem? Is the clipping line gone? 167 | 168 | Now let's take a closer look at the CSV file. Notice how all the columns are numbers in the range of 0..3000, but the median house value is in a range of 0..500,000. 169 | 170 | Remember how, when we talked about training data science models, we discussed having all data in a similar range? 171 | 172 | So let's fix that now by using **data scaling**. We're going to divide the median house value by 1,000 to bring it down to a range more in line with the other data columns. 173 | 174 | Start by adding the following type: 175 | 176 | ```fsharp 177 | /// The ToMedianHouseValue class is used in a column data conversion. 
178 | [<CLIMutable>] 179 | type ToMedianHouseValue = { 180 | mutable NormalizedMedianHouseValue : float32 181 | } 182 | ``` 183 | 184 | And then add the following code at the bottom of your **main** function: 185 | 186 | ```fsharp 187 | // build a data loading pipeline 188 | let pipeline = 189 | EstimatorChain() 190 | 191 | // step 1: divide the median house value by 1000 192 | .Append( 193 | context.Transforms.CustomMapping( 194 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f), 195 | "MedianHouseValue")) 196 | 197 | // the rest of the code goes here... 198 | ``` 199 | 200 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 201 | 202 | This pipeline has only one component: 203 | 204 | * **CustomMapping** which takes the median house values, divides them by 1,000, and stores them in a new column called **NormalizedMedianHouseValue**. Note that we need the new **ToMedianHouseValue** type to access this new column in code. 205 | 206 | Also note the **mutable** keyword in the type definition for **ToMedianHouseValue**. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 207 | 208 | If we had left out the keyword, the **output.NormalizedMedianHouseValue <- ...** line would fail. 209 | 210 | Now let's see if the conversion worked. Add the following code at the bottom of the **main** function: 211 | 212 | ```fsharp 213 | // get a 10-record preview of the transformed data 214 | let model = data |> pipeline.Fit 215 | let preview = (data |> model.Transform).Preview(maxRows = 10) 216 | 217 | // show the preview 218 | preview.ColumnView |> Seq.iter(fun c -> 219 | printf "%-30s|" c.Column.Name 220 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 221 | printfn "") 222 | 223 | // the rest of the code goes here... 224 | ``` 225 | 226 | The **pipeline.Fit** method sets up the pipeline, creates a data science model and stores it in the **model** variable. The **model.Transform** method then runs the dataset through the pipeline and creates predictions for every housing block. And finally the **Preview** method extracts a 10-row preview from the collection of predictions. 227 | 228 | Next, we use **Seq.iter** to enumerate every column in the preview. We print the column name and then use a second **Seq.iter** to show all the preview values in this column. 229 | 230 | This will print a transposed view of the preview data with the columns stacked vertically and the rows stacked horizontally. Flipping the preview makes it easier to read, despite the very long column names. 231 | 232 | Now run your code. 233 | 234 | Find the MedianHouseValue and NormalizedMedianHouseValue columns in the output. Do they contain the correct values? Does the normalized column contain the original house values divided by 1,000? 235 | 236 | Now let's fix the latitude and longitude. We're reading them in directly, but remember that we discussed how **Geo data should always be binned, one-hot encoded, and crossed?** 237 | 238 | Let's do that now. Add the following types at the top of the file: 239 | 240 | ```fsharp 241 | /// The FromLocation class is used in a column data conversion. 
242 | [<CLIMutable>] 243 | type FromLocation = { 244 | EncodedLongitude : float32[] 245 | EncodedLatitude : float32[] 246 | } 247 | 248 | /// The ToLocation class is used in a column data conversion. 249 | [<CLIMutable>] 250 | type ToLocation = { 251 | mutable Location : float32[] 252 | } 253 | ``` 254 | 255 | Note the **mutable** keyword again, which indicates that we're going to modify the **Location** property of the **ToLocation** type after construction. 256 | 257 | We will use these types in the next code snippet. 258 | 259 | Now scroll down to the bottom of the **main** function and add the following code just before the final line that returns a zero return value: 260 | 261 | ```fsharp 262 | // step 2: bin the longitude 263 | let pipeline2 = 264 | pipeline 265 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10)) 266 | 267 | // step 3: bin the latitude 268 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10)) 269 | 270 | // step 4: one-hot encode the longitude 271 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude")) 272 | 273 | // step 5: one-hot encode the latitude 274 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude")) 275 | 276 | // step 6: cross the longitude and latitude vectors 277 | .Append( 278 | context.Transforms.CustomMapping( 279 | Action<FromLocation, ToLocation>(fun input output -> 280 | output.Location <- [| for x in input.EncodedLongitude do 281 | for y in input.EncodedLatitude do 282 | x * y |] ), 283 | "Location")) 284 | 285 | // the rest of the code goes here... 286 | ``` 287 | 288 | Note how we're extending the data loading pipeline with extra components. The new components are: 289 | 290 | * Two **NormalizeBinning** components that bin the longitude and latitude values into 10 bins 291 | 292 | * Two **OneHotEncoding** components that one-hot encode the longitude and latitude bins 293 | 294 | * One **CustomMapping** component that multiplies (crosses) the longitude and latitude vectors to create a feature cross: a 100-element vector with all zeroes except for a single '1' value. 295 | 296 | Note how the custom mapping uses two nested for-loops inside the **[| ... |]** array brackets. This sets up an inline enumerator that multiplies the two longitude and latitude vectors and produces a 1-dimensional array with 100 elements. 297 | 298 | Let's see if this worked. Add the following code to the bottom of the **main** function: 299 | 300 | ```fsharp 301 | // get a 10-record preview of the transformed data 302 | let model = data |> pipeline2.Fit 303 | let preview = (data |> model.Transform).Preview(maxRows = 10) 304 | 305 | // show the preview 306 | preview.ColumnView |> Seq.iter(fun c -> 307 | printf "%-30s|" c.Column.Name 308 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 309 | printfn "") 310 | 311 | // the rest of the code goes here... 312 | ``` 313 | 314 | This is the same code you used previously to create predictions, get a preview, and display the preview on the console. But now you're using **pipeline2** instead. 315 | 316 | Now run your app. 317 | 318 | What does the data look like now? Can you spot the new columns with the binned and one-hot encoded longitude and latitude values? 319 | 320 | And is the new **Location** column present? 321 | 322 | You should see the new **Location** column, but the code can't display its contents properly. This is because **Location** is a vector column, and the **%O** format specifier in the preview code just calls **ToString** on each value instead of expanding the vector into its individual elements. 
323 | 324 | So let's fix that. Add the following code to display all the individual values in the **Location** vector: 325 | 326 | ```fsharp 327 | // show the dense vector 328 | preview.RowView |> Seq.iter(fun r -> 329 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32> 330 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v)) 331 | printfn "") 332 | ``` 333 | 334 | We use **Seq.iter** to enumerate every row in the preview. And note the **:?>** operator which casts the value to a **VBuffer** of floats. With this cast value we can call the **DenseValues** function which returns all the elements in the vector as a sequence of floats. So we pipe that sequence into a second **Seq.iter** to print the values. 335 | 336 | Now run your app. What do you see? Did it work? Are there 100 digits in the **Location** column? And is there only a single '1' digit in each row? 337 | 338 | Post your results in our group. -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/data.png -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/assets/plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/plot.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/Mnist.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | open Microsoft.ML.Transforms 6 | 7 | /// The Digit class represents one mnist digit. 8 | [<CLIMutable>] 9 | type Digit = { 10 | [<LoadColumn(0)>] Number : float32 11 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[] 12 | } 13 | 14 | /// The DigitPrediction class represents one digit prediction. 15 | [<CLIMutable>] 16 | type DigitPrediction = { 17 | Score : float32[] 18 | } 19 | 20 | /// file paths to train and test data files (assumes os = windows!) 
21 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory 22 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory 23 | 24 | [<EntryPoint>] 25 | let main argv = 26 | 27 | // create a machine learning context 28 | let context = new MLContext() 29 | 30 | // load the datafiles 31 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',') 32 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',') 33 | 34 | // build a training pipeline 35 | let pipeline = 36 | EstimatorChain() 37 | 38 | // step 1: map the number column to a key value and store in the label column 39 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue)) 40 | 41 | // step 2: concatenate all feature columns 42 | .Append(context.Transforms.Concatenate("Features", "PixelValues")) 43 | 44 | // step 3: cache data to speed up training 45 | .AppendCacheCheckpoint(context) 46 | 47 | // step 4: train the model with SDCA 48 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy()) 49 | 50 | // step 5: map the label key value back to a number 51 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label")) 52 | 53 | // train the model 54 | let model = trainData |> pipeline.Fit 55 | 56 | // get predictions and compare them to the ground truth 57 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate 58 | 59 | // show evaluation metrics 60 | printfn "Evaluation metrics" 61 | printfn "  MicroAccuracy:    %f" metrics.MicroAccuracy 62 | printfn "  MacroAccuracy:    %f" metrics.MacroAccuracy 63 | printfn "  LogLoss:          %f" metrics.LogLoss 64 | printfn "  LogLossReduction: %f" metrics.LogLossReduction 65 | 66 | // grab five digits from the test data 67 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq 68 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ] 69 | 70 | // create a prediction engine 71 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction> model 72 | 73 | // show predictions 74 | printfn "Model predictions:" 75 | printf "  #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn "" 76 | testDigits |> Seq.iter( 77 | fun digit -> 78 | printf "  %i\t" (int digit.Number) 79 | let p = engine.Predict digit 80 | p.Score |> Seq.iter (fun s -> printf "%f\t" s) 81 | printfn "") 82 | 83 | 0 // return value -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Recognize handwritten digits 2 | 3 | In this assignment, you are going to build an app that recognizes handwritten digits from the famous MNIST machine learning dataset: 4 | 5 | ![MNIST digits](./assets/mnist.png) 6 | 7 | Your app must read these images of handwritten digits and correctly predict which digit is visible in each image. 8 | 9 | This may seem like an easy challenge, but look at this: 10 | 11 | ![Difficult MNIST digits](./assets/mnist_hard.png) 12 | 13 | These are a couple of digits from the dataset. Are you able to identify each one? It probably won’t surprise you to hear that the human error rate on this exercise is around 2.5%. 14 | 15 | The first thing you will need for your app is a data file with images of handwritten digits. 
We will not use the original MNIST data because it's stored in a nonstandard binary format. 16 | 17 | Instead, we'll use these excellent [CSV files](https://www.kaggle.com/oddrationale/mnist-in-csv/) prepared by Daniel Dato on Kaggle. 18 | 19 | Create a Kaggle account if you don't have one yet, then download **mnist_train.csv** and **mnist_test.csv** and save them in your project folder. 20 | 21 | There are 60,000 images in the training file and 10,000 in the test file. Each image is monochrome and resized to 28x28 pixels. 22 | 23 | The training file looks like this: 24 | 25 | ![Data file](./assets/datafile.png) 26 | 27 | It’s a CSV file with 785 columns: 28 | 29 | * The first column contains the label. It tells us which one of the 10 possible digits is visible in the image. 30 | * The next 784 columns are the pixel intensity values (0..255) for each pixel in the image, counting from left to right and top to bottom. 31 | 32 | You are going to build a multiclass classification machine learning model that reads in all 785 columns, and then makes a prediction for each digit in the dataset. 33 | 34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 35 | 36 | ```bash 37 | $ dotnet new console --language F# --output Mnist 38 | $ cd Mnist 39 | ``` 40 | 41 | Now install the ML.NET package: 42 | 43 | ```bash 44 | $ dotnet add package Microsoft.ML 45 | ``` 46 | 47 | Now you are ready to add types. You’ll need one to hold a digit, and one to hold your model prediction. 48 | 49 | Replace the contents of the Program.fs file with this: 50 | 51 | ```fsharp 52 | open System 53 | open System.IO 54 | open Microsoft.ML 55 | open Microsoft.ML.Data 56 | open Microsoft.ML.Transforms 57 | 58 | /// The Digit class represents one mnist digit. 59 | [<CLIMutable>] 60 | type Digit = { 61 | [<LoadColumn(0)>] Number : float32 62 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[] 63 | } 64 | 65 | /// The DigitPrediction class represents one digit prediction. 66 | [<CLIMutable>] 67 | type DigitPrediction = { 68 | Score : float32[] 69 | } 70 | ``` 71 | 72 | The **Digit** type holds one single MNIST digit image. Note how the **PixelValues** field is tagged with a **VectorType** attribute. This tells ML.NET to combine the 784 individual pixel columns into a single vector value. 73 | 74 | There's also a **DigitPrediction** type which will hold a single prediction. And notice how the prediction score is actually an array? The model will generate 10 scores, one for every possible digit value. 75 | 76 | Also note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 77 | 78 | Next you'll need to load the data in memory: 79 | 80 | ```fsharp 81 | /// file paths to train and test data files (assumes os = windows!) 
82 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory 83 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory 84 | 85 | [<EntryPoint>] 86 | let main argv = 87 | 88 | // create a machine learning context 89 | let context = new MLContext() 90 | 91 | // load the datafiles 92 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',') 93 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',') 94 | 95 | // the rest of the code goes here.... 96 | 97 | 0 // return value 98 | ``` 99 | 100 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory. We call this function twice to load the training and testing datasets separately. 101 | 102 | Now let’s build the machine learning pipeline: 103 | 104 | ```fsharp 105 | // build a training pipeline 106 | let pipeline = 107 | EstimatorChain() 108 | 109 | // step 1: map the number column to a key value and store in the label column 110 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue)) 111 | 112 | // step 2: concatenate all feature columns 113 | .Append(context.Transforms.Concatenate("Features", "PixelValues")) 114 | 115 | // step 3: cache data to speed up training 116 | .AppendCacheCheckpoint(context) 117 | 118 | // step 4: train the model with SDCA 119 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy()) 120 | 121 | // step 5: map the label key value back to a number 122 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label")) 123 | 124 | // train the model 125 | let model = trainData |> pipeline.Fit 126 | 127 | // the rest of the code goes here.... 128 | ``` 129 | 130 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 131 | 132 | This pipeline has the following components: 133 | 134 | * **MapValueToKey** which reads the **Number** column and builds a dictionary of unique values. It then produces an output column called **Label** which contains the dictionary key for each number value. We need this step because we can only train a multiclass classifier on keys. 135 | * **Concatenate** which converts the PixelValues vector into a single column called Features. This is a required step because ML.NET can only train on a single input column. 136 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm which comes next. 137 | * A **SdcaMaximumEntropy** classification learner which will train the model to make accurate predictions. 138 | * A final **MapKeyToValue** step which converts the keys in the **Label** column back to the original number values. We need this step to show the numbers when making predictions. 139 | 140 | With the pipeline fully assembled, we can train the model by piping the training data into the **Fit** function. 141 | 142 | You now have a fully-trained model. 
So now it's time to take the test set, predict the number for each digit image, and calculate the accuracy metrics of the model: 143 | 144 | ```fsharp 145 | // get predictions and compare them to the ground truth 146 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate 147 | 148 | // show evaluation metrics 149 | printfn "Evaluation metrics" 150 | printfn "  MicroAccuracy:    %f" metrics.MicroAccuracy 151 | printfn "  MacroAccuracy:    %f" metrics.MacroAccuracy 152 | printfn "  LogLoss:          %f" metrics.LogLoss 153 | printfn "  LogLossReduction: %f" metrics.LogLossReduction 154 | 155 | // the rest of the code goes here.... 156 | ``` 157 | 158 | This code pipes the test data into the **Transform** function to set up predictions for every single image in the test set. Then it pipes these predictions into the **Evaluate** function to compare these predictions to the actual labels and automatically calculate four metrics: 159 | 160 | * **MicroAccuracy**: this is the average accuracy (=the number of correct predictions divided by the total number of predictions) for every digit in the dataset. 161 | * **MacroAccuracy**: this is calculated by first calculating the average accuracy for each unique prediction value, and then taking the average of those averages. 162 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 163 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 164 | 165 | We can compare the micro- and macro accuracy to discover if the dataset is biased. In an unbiased set each unique label value will appear roughly the same number of times, and the micro- and macro accuracy values will be close together. 166 | 167 | If the values are far apart, this suggests that there is some kind of bias in the data that we need to deal with. 168 | 169 | To wrap up, let’s use the model to make a prediction. 170 | 171 | You will pick five arbitrary digits from the test set, run them through the model, and make a prediction for each one. 172 | 173 | Here’s how to do it: 174 | 175 | ```fsharp 176 | // grab five digits from the test data 177 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq 178 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ] 179 | 180 | // create a prediction engine 181 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction> model 182 | 183 | // show predictions 184 | printfn "Model predictions:" 185 | printf "  #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn "" 186 | testDigits |> Seq.iter( 187 | fun digit -> 188 | printf "  %i\t" (int digit.Number) 189 | let p = engine.Predict digit 190 | p.Score |> Seq.iter (fun s -> printf "%f\t" s) 191 | printfn "") 192 | ``` 193 | 194 | This code calls the **CreateEnumerable** function to convert the test dataview to an array of **Digit** instances. Then it picks five arbitrary digits for testing. 195 | 196 | We then call the **CreatePredictionEngine** function to set up a prediction engine. 197 | 198 | The code then calls **Seq.iter** to print column headings for each of the 10 possible digit values. 
We then pipe the 5 test digits into another **Seq.iter**, make a prediction for each test digit, and then use a third **Seq.iter** to display the 10 prediction scores. 199 | 200 | This will produce a table with 5 rows of test digits, and 10 columns of prediction scores. The column with the highest score represents the prediction for that particular test digit. 201 | 202 | That's it, you're done! 203 | 204 | Go to your terminal and run your code: 205 | 206 | ```bash 207 | $ dotnet run 208 | ``` 209 | 210 | What results do you get? What are your micro- and macro accuracy values? Which logloss and logloss reduction did you get? 211 | 212 | Do you think the dataset is biased? 213 | 214 | What can you say about the accuracy? Is this a good model? How far away are you from the human accuracy rate? Is this a superhuman or subhuman AI? 215 | 216 | What did the 5 digit predictions look like? Do you understand why the model gets confused sometimes? 217 | 218 | Think about the code in this assignment. How could you improve the accuracy of the model even further? 219 | 220 | Share your results in our group! 221 | -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/datafile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/datafile.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/mnist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/mnist_hard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist_hard.png -------------------------------------------------------------------------------- /MulticlassClassification/FlagToxicComments/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | Online discussions about things you care about can be difficult. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions. Many platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. 4 | 5 | The Conversation AI team is a research initiative founded by Jigsaw and Google. It is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments that are rude, disrespectful or likely to make someone leave a discussion. 6 | 7 | The team has built a range of public tools to detect toxicity. But the current apps still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding. 8 | 9 | In this case study, you’re going to build an app that is capable of detecting different types of toxicity like threats, obscenity, insults, and hate. 
You’ll be using a dataset of comments from Wikipedia’s talk page edits. 10 | 11 | How accurate will your app be? Do you think you will be able to flag every toxic comment? 12 | 13 | That's for you to find out! 14 | 15 | # The dataset 16 | 17 | ![The dataset](./assets/data.png) 18 | 19 | In this case study you'll be working with a dataset containing over 313,000 comments from Wikipedia talk pages. 20 | 21 | There are two files in the dataset: 22 | * [train.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/train.csv) which contains 160k records, 2 input features, and 6 output labels. You will use this file to train your model. 23 | * [test.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/test.csv) which contains 153k records and 2 input features. You will use this file to test your model. 24 | 25 | You'll need to [download the dataset from Kaggle](https://www.kaggle.com/c/8076/download-all) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet. 26 | 27 | Here's a description of all columns in the training file: 28 | * **id**: the identifier of the comment 29 | * **comment_text**: the text of the comment 30 | * **toxic**: 1 if the comment is toxic, 0 if it is not 31 | * **severe_toxic**: 1 if the comment is severely toxic, 0 if it is not 32 | * **obscene**: 1 if the comment is obscene, 0 if it is not 33 | * **threat**: 1 if the comment is threatening, 0 if it is not 34 | * **insult**: 1 if the comment is insulting, 0 if it is not 35 | * **identity_hate**: 1 if the comment expresses identity hatred, 0 if it does not 36 | 37 | # Getting started 38 | Go to the console and set up a new console application: 39 | 40 | ```bash 41 | $ dotnet new console --language F# --output FlagToxicComments 42 | $ cd FlagToxicComments 43 | ``` 44 | 45 | Then install the ML.NET NuGet package: 46 | 47 | ```bash 48 | $ dotnet add package Microsoft.ML 49 | $ dotnet add package Microsoft.ML.FastTree 50 | ``` 51 | 52 | And launch the Visual Studio Code editor: 53 | 54 | ```bash 55 | $ code . 56 | ``` 57 | 58 | The rest is up to you! 59 | 60 | # Hint 61 | To process text data, you'll need to add a **FeaturizeText** component to your machine learning pipeline. 62 | 63 | Your code should look something like this: 64 | 65 | ```fsharp 66 | // Assume we have a partial pipeline in the variable 'partialPipe' 67 | // This line adds a text featurizer to the pipeline. It reads the 'CommentText' column and 68 | // transforms it to a numeric vector and stores it in the 'Features' column 69 | let completePipe = partialPipe.Append(context.Transforms.Text.FeaturizeText("Features", "CommentText")) 70 | ``` 71 | 72 | FeaturizeText is a handy all-in-one component that can read text columns, process them, and convert them to numeric vectors 73 | that are ready for model training. 74 | 75 | # Your assignment 76 | I want you to build an app that reads the training and testing files in memory and featurizes the comments to prepare them for analysis. 77 | 78 | Then train a multiclass classifier on the training data and generate predictions for the comments in the testing file. 79 | 80 | Measure the micro- and macro accuracy. Report your best values in our group. 81 | 82 | See if you can get the accuracies as close to 1 as possible. Share in our group how you did it. Which learning algorithm did you select, and how did you configure your model? 83 | 84 | Good luck! 
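Oh, and one more hint. The comment texts in this dataset can contain commas, quotes, and line breaks, so make sure to enable quoting when you load the file. Here's a minimal sketch of what the data loading could look like. Note that the **CommentRecord** type and its column indices below are assumptions based on the dataset description above, so please verify them against the actual file:

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

/// A hypothetical type for one comment record. The column indices are assumed
/// from the dataset description above; check them against the actual file.
[<CLIMutable>]
type CommentRecord = {
    [<LoadColumn(1)>] CommentText : string
    [<LoadColumn(2)>] Toxic : float32
    // ...load the other five label columns in the same way...
}

let context = new MLContext()

// the comments contain commas and quoted strings, so allowQuoting is essential
let trainData =
    context.Data.LoadFromTextFile<CommentRecord>(
        "train.csv", hasHeader = true, separatorChar = ',', allowQuoting = true)
```

From here you can **Append** the **FeaturizeText** component from the hint above, followed by a learning algorithm of your choice.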
-------------------------------------------------------------------------------- /MulticlassClassification/FlagToxicComments/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/FlagToxicComments/assets/data.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science with F# and ML.NET 2 | 3 | ![Data Science with F# and ML.NET](./assets/DSC-FS.jpg) 4 | 5 | This repository contains all course assignments of my **Data Science with F# and ML.NET** course and will get you up to speed with Microsoft's new ML.NET library. 6 | 7 | By working through the code examples, you will learn how to design, train, and evaluate complex AI models with simple F# code. I'll provide you with all the code, libraries, and data sets you need to get started. 8 | 9 | Please note that this repository only contains code examples with no additional support. 10 | 11 | If you prefer a full-featured e-learning experience with live coaching, please check out my online course here: 12 | 13 | https://www.machinelearningadvantage.com/datascience-with-fsharp 14 | 15 | 16 | # Table of contents 17 | 18 | Transforming data: [Processing California housing data](./LoadingData/CaliforniaHousing) 19 | 20 | Regression: [Predict taxi fares in New York](./Regression/TaxiFarePrediction) 21 | 22 | Case study: [Predict house prices in Iowa](./Regression/HousePricePrediction) 23 | 24 | Binary classification: [Predict heart disease in Ohio](./BinaryClassification/HeartDiseasePrediction) 25 | 26 | Case study: [Detect credit card fraud in Europe](./BinaryClassification/FraudDetection) 27 | 28 | Multiclass classification: [Recognize handwriting](./MulticlassClassification/DigitRecognition) 29 | 30 | Evaluating models: [Detect SMS spam messages](./BinaryClassification/SpamDetection) 31 | 32 | Case study: [Flag toxic comments on Wikipedia](./MulticlassClassification/FlagToxicComments) 33 | 34 | Decision trees: [Predict Titanic survivors](./BinaryClassification/TitanicPrediction) 35 | 36 | Case study: [Predict Diabetes in Pima Indians](./BinaryClassification/DiabetesDetection) 37 | 38 | Ensembles: [Predict bike demand in Washington DC](./Regression/BikeDemandPrediction) 39 | 40 | Clustering: [Classify Iris flowers](./Clustering/IrisFlower) 41 | 42 | Recommendation: [Build a movie recommender](./Recommendation/MovieRecommender) 43 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/MovieRecommender.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Trainers 4 | open Microsoft.ML.Data 5 | 6 | /// The MovieRating class holds a single movie rating. 7 | [<CLIMutable>] 8 | type MovieRating = { 9 | [<LoadColumn(0)>] UserID : float32 10 | [<LoadColumn(1)>] MovieID : float32 11 | [<LoadColumn(2)>] Label : float32 12 | } 13 | 14 | /// The MovieRatingPrediction class holds a single movie prediction. 
15 | [] 16 | type MovieRatingPrediction = { 17 | Label : float32 18 | Score : float32 19 | } 20 | 21 | /// The MovieTitle class holds a single movie title. 22 | [] 23 | type MovieTitle = { 24 | [] MovieID : float32 25 | [] Title : string 26 | [] Genres: string 27 | } 28 | 29 | // file paths to data files (assumes os = windows!) 30 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory 31 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory 32 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory 33 | 34 | [] 35 | let main argv = 36 | 37 | // set up a new machine learning context 38 | let context = new MLContext() 39 | 40 | // load training and test data 41 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',') 42 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',') 43 | 44 | // prepare matrix factorization options 45 | let options = 46 | MatrixFactorizationTrainer.Options( 47 | MatrixColumnIndexColumnName = "UserIDEncoded", 48 | MatrixRowIndexColumnName = "MovieIDEncoded", 49 | LabelColumnName = "Label", 50 | NumberOfIterations = 20, 51 | ApproximationRank = 100) 52 | 53 | // set up a training pipeline 54 | let pipeline = 55 | EstimatorChain() 56 | 57 | // step 1: map userId and movieId to keys 58 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID")) 59 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID")) 60 | 61 | // step 2: find recommendations using matrix factorization 62 | .Append(context.Recommendation().Trainers.MatrixFactorization(options)) 63 | 64 | // train the model 65 | let model = trainData |> pipeline.Fit 66 | 67 | // calculate predictions and compare them to the ground truth 68 | let metrics = testData |> model.Transform |> context.Regression.Evaluate 69 | 70 | // show model metrics 71 | printfn "Model metrics:" 72 | printfn " RMSE: %f" metrics.RootMeanSquaredError 73 | printfn " MAE: %f" metrics.MeanAbsoluteError 74 | printfn " MSE: %f" metrics.MeanSquaredError 75 | 76 | // set up a prediction engine 77 | let engine = context.Model.CreatePredictionEngine model 78 | 79 | // check if Mark likes 'GoldenEye' 80 | printfn "Does Mark like GoldenEye?" 81 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f } 82 | printfn " Score: %f" p.Score 83 | 84 | // load all movie titles 85 | let movieData = context.Data.LoadFromTextFile(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 86 | let movies = context.Data.CreateEnumerable(movieData, reuseRowObject = false) 87 | 88 | // find Mark's top 5 movies 89 | let marksMovies = 90 | movies |> Seq.map(fun m -> 91 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f } 92 | (m.Title, p2.Score)) 93 | |> Seq.sortByDescending(fun t -> snd t) 94 | 95 | // print the results 96 | printfn "What are Mark's top-5 movies?" 
97 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t)) 98 | 99 | 0 // return value 100 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Recommend new movies to film fans 2 | 3 | In this assignment you're going to build a movie recommendation system that can recommend new movies to film fans. 4 | 5 | The first thing you'll need is a data file with thousands of movies rated by many different users. The [MovieLens Project](https://movielens.org) has exactly what you need. 6 | 7 | Download the [movie ratings for training](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-train.csv), [movie ratings for testing](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-test.csv), and the [movie dictionary](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-movies.csv) and save these files in your project folder. You now have 100,000 movie ratings with 99,980 set aside for training and 20 for testing. 8 | 9 | The training and testing files are in CSV format and look like this: 10 |  11 | 12 | ![Data File](./assets/data.png) 13 | 14 | There are only four columns of data: 15 | 16 | * The ID of the user 17 | * The ID of the movie 18 | * The movie rating on a scale from 1–5 19 | * The timestamp of the rating 20 | 21 | There's also a movie dictionary in CSV format with all the movie IDs and titles: 22 | 23 | 24 | ![Data File](./assets/movies.png) 25 | 26 | You are going to build a data science model that reads in each user ID, movie ID, and rating, and then predicts the ratings each user would give for every movie in the dataset. 27 | 28 | Once you have a fully trained model, you can easily add a new user with a couple of favorite movies and then ask the model to generate predictions for any of the other movies in the dataset. 29 | 30 | And in fact this is exactly how the recommendation systems on Netflix and Amazon work. 31 | 32 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 33 | 34 | ```bash 35 | $ dotnet new console --language F# --output MovieRecommender 36 | $ cd MovieRecommender 37 | ``` 38 | 39 | Now install the following packages: 40 | 41 | ```bash 42 | $ dotnet add package Microsoft.ML 43 | $ dotnet add package Microsoft.ML.Recommender 44 | ``` 45 | 46 | Now you're ready to add some types. You will need one type to hold a movie rating, and one to hold your model’s predictions. 47 | 48 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 49 | 50 | ```fsharp 51 | open System 52 | open Microsoft.ML 53 | open Microsoft.ML.Trainers 54 | open Microsoft.ML.Data 55 | 56 | /// The MovieRating class holds a single movie rating. 57 | [<CLIMutable>] 58 | type MovieRating = { 59 | [<LoadColumn(0)>] UserID : float32 60 | [<LoadColumn(1)>] MovieID : float32 61 | [<LoadColumn(2)>] Label : float32 62 | } 63 | 64 | /// The MovieRatingPrediction class holds a single movie prediction. 65 | [<CLIMutable>] 66 | type MovieRatingPrediction = { 67 | Label : float32 68 | Score : float32 69 | } 70 | 71 | // the rest of the code goes here... 72 | ``` 73 | 74 | The **MovieRating** type holds one single movie rating.
Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 75 | 76 | You're also declaring a **MovieRatingPrediction** type which will hold a single movie rating prediction. 77 | 78 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 79 | 80 | Before we continue, we need to set up a third type that will hold our movie dictionary: 81 | 82 | ```fsharp 83 | /// The MovieTitle class holds a single movie title. 84 | [<CLIMutable>] 85 | type MovieTitle = { 86 | [<LoadColumn(0)>] MovieID : float32 87 | [<LoadColumn(1)>] Title : string 88 | [<LoadColumn(2)>] Genres: string 89 | } 90 | 91 | // the rest of the code goes here 92 | ``` 93 | 94 | This **MovieTitle** type contains a movie ID value and its corresponding title and genres. We will use this type later in our code to map movie IDs to their corresponding titles. 95 | 96 | Now you need to load the dataset into memory: 97 | 98 | ```fsharp 99 | // file paths to data files (assumes os = windows!) 100 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory 101 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory 102 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory 103 | 104 | [<EntryPoint>] 105 | let main argv = 106 | 107 | // set up a new machine learning context 108 | let context = new MLContext() 109 | 110 | // load training and test data 111 | let trainData = context.Data.LoadFromTextFile<MovieRating>(trainDataPath, hasHeader = true, separatorChar = ',') 112 | let testData = context.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader = true, separatorChar = ',') 113 | 114 | // the rest of the code goes here... 115 | 116 | 0 // return value 117 | ``` 118 | 119 | This code calls the **LoadFromTextFile** function twice to load the training and testing CSV data into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **MovieRating** class. 120 | 121 | Now you're ready to start building the machine learning model: 122 | 123 | ```fsharp 124 | // prepare matrix factorization options 125 | let options = 126 | MatrixFactorizationTrainer.Options( 127 | MatrixColumnIndexColumnName = "UserIDEncoded", 128 | MatrixRowIndexColumnName = "MovieIDEncoded", 129 | LabelColumnName = "Label", 130 | NumberOfIterations = 20, 131 | ApproximationRank = 100) 132 | 133 | // set up a training pipeline 134 | let pipeline = 135 | EstimatorChain() 136 | 137 | // step 1: map userId and movieId to keys 138 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID")) 139 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID")) 140 | 141 | // step 2: find recommendations using matrix factorization 142 | .Append(context.Recommendation().Trainers.MatrixFactorization(options)) 143 | 144 | // train the model 145 | let model = trainData |> pipeline.Fit 146 | 147 | // the rest of the code goes here... 148 | ``` 149 | 150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
151 | 152 | This pipeline has the following components: 153 | 154 | * **MapValueToKey** which reads the UserID column and builds a dictionary of unique ID values. It then produces an output column called UserIDEncoded containing an encoding for each ID. This step converts the IDs to numbers that the model can work with. 155 | * Another **MapValueToKey** which reads the MovieID column, encodes it, and stores the encodings in an output column called MovieIDEncoded. 156 | * A **MatrixFactorization** component that performs matrix factorization on the encoded ID columns and the ratings. This step calculates the movie rating predictions for every user and movie. 157 | 158 | With the pipeline fully assembled, you train the model by piping the training data into the **Fit** function. 159 | 160 | You now have a fully-trained model. So now you need to load the validation data, predict the rating for each user and movie, and calculate the accuracy metrics of the model: 161 | 162 | ```fsharp 163 | // calculate predictions and compare them to the ground truth 164 | let metrics = testData |> model.Transform |> context.Regression.Evaluate 165 | 166 | // show model metrics 167 | printfn "Model metrics:" 168 | printfn " RMSE: %f" metrics.RootMeanSquaredError 169 | printfn " MAE: %f" metrics.MeanAbsoluteError 170 | printfn " MSE: %f" metrics.MeanSquaredError 171 | 172 | // the rest of the code goes here... 173 | ``` 174 | 175 | This code pipes the test data into the **Transform** function to make predictions for every user and movie in the test dataset. It then pipes these predictions into the **Evaluate** function to compare them to the actual ratings. 176 | 177 | The **Evaluate** function calculates the following three metrics: 178 | 179 | * **RootMeanSquaredError**: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 180 | * **MeanAbsoluteError**: this is the mean absolute prediction error, expressed as a rating. 181 | * **MeanSquaredError**: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE. 182 | 183 | To wrap up, let’s use the model to make a prediction about me. Here are 6 movies I like: 184 | 185 | * Blade Runner 186 | * True Lies 187 | * Speed 188 | * Twelve Monkeys 189 | * Things to do in Denver when you're dead 190 | * Cloud Atlas 191 | 192 | And 6 more movies I really didn't like at all: 193 | 194 | * Ace Ventura: when nature calls 195 | * Naked Gun 33 1/3 196 | * Highlander II 197 | * Throw momma from the train 198 | * Jingle all the way 199 | * Dude, where's my car? 200 | 201 | You'll find my ratings at the very end of the training file. I added myself as user 999. 202 | 203 | So based on this list, do you think I would enjoy the James Bond movie ‘GoldenEye’? 204 | 205 | Let's write some code to find out: 206 | 207 | ```fsharp 208 | // set up a prediction engine 209 | let engine = context.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(model) 210 | 211 | // check if Mark likes 'GoldenEye' 212 | printfn "Does Mark like GoldenEye?" 213 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f } 214 | printfn " Score: %f" p.Score 215 | 216 | // the rest of the code goes here...
217 | ``` 218 | 219 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then calls **Predict** to create a prediction for user 999 (me) and movie 10 (GoldenEye). 220 | 221 | Let’s do one more thing and ask the model to predict my top-5 favorite movies. 222 | 223 | We can ask the model to predict my favorite movies, but it will just produce movie ID values. So now's the time to load that movie dictionary that will help us convert movie IDs to their corresponding titles: 224 | 225 | ```fsharp 226 | // load all movie titles 227 | let movieData = context.Data.LoadFromTextFile<MovieTitle>(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 228 | let movies = context.Data.CreateEnumerable<MovieTitle>(movieData, reuseRowObject = false) 229 | 230 | // the rest of the code goes here... 231 | ``` 232 | 233 | This code calls **LoadFromTextFile** to load the movie dictionary into memory, and then calls **CreateEnumerable** to create an enumeration of **MovieTitle** instances. 234 | 235 | We can now find my favorite movies like this: 236 | 237 | ```fsharp 238 | // find Mark's top 5 movies 239 | let marksMovies = 240 | movies |> Seq.map(fun m -> 241 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f } 242 | (m.Title, p2.Score)) 243 | |> Seq.sortByDescending(fun t -> snd t) 244 | 245 | // print the results 246 | printfn "What are Mark's top-5 movies?" 247 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t)) 248 | ``` 249 | 250 | The code pipes the movie dictionary into **Seq.map** to create an enumeration of tuples. The first tuple element is the movie title and the second element is the rating the model thinks I would give to that movie. 251 | 252 | The code then pipes the enumeration of tuples into **Seq.sortByDescending** to sort the list by rating. This will put my favorite movies at the top of the list. 253 | 254 | Finally, the code pipes the rated movie list into **Seq.take** to grab the top-5, and then prints out the title and corresponding rating. 255 | 256 | That's it, your code is done. Go to your terminal and run the app: 257 | 258 | ```bash 259 | $ dotnet run 260 | ``` 261 | 262 | Which training and validation metrics did you get? What are your RMSE and MAE values? Now look at how the data has been partitioned into training and validation sets. Do you think this is a good result? What could you improve? 263 | 264 | What rating did the model predict I would give to the movie GoldenEye? And what are my 5 favorite movies according to the model?
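One thing worth noting: the test file holds only 20 ratings, so the evaluation metrics will be quite noisy. As an experiment, you could ignore the test file and carve a larger test partition out of the training data instead. Here is a minimal sketch of that idea, reusing the **context**, **trainData**, and **pipeline** values from the code above (the 20% fraction is just an assumption to play with):

```fsharp
// a minimal sketch: split the training data 80/20 and evaluate on the
// larger 20% partition instead of the 20-record test file
let partitions = context.Data.TrainTestSplit(trainData, testFraction = 0.2)
let model2 = partitions.TrainSet |> pipeline.Fit
let metrics2 = partitions.TestSet |> model2.Transform |> context.Regression.Evaluate
printfn " RMSE on the 20%% partition: %f" metrics2.RootMeanSquaredError
```

Compare the RMSE you get this way with the RMSE from the 20-record test file and see how much the numbers move.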
265 | 266 | Share your results in our group and then ask me if the predictions are correct ;) 267 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/data.png -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/assets/movies.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/movies.png -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/recommendation-movies.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/recommendation-movies.csv -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/recommendation-ratings-test.csv: -------------------------------------------------------------------------------- 1 | userId,movieId,rating,timestamp 2 | 1,1097,5,964981680 3 | 1,1127,4,964982513 4 | 1,1136,5,964981327 5 | 1,1196,5,964981827 6 | 1,1197,5,964981872 7 | 1,1198,5,964981827 8 | 1,1206,5,964983737 9 | 1,1208,4,964983250 10 | 1,1210,5,964980499 11 | 1,1213,5,964982951 12 | 1,1214,4,964981855 13 | 2,114060,2,1445715276 14 | 2,115713,3.5,1445714854 15 | 2,122882,5,1445715272 16 | 2,131724,5,1445714851 17 | 3,2105,2,1306463559 18 | 3,2288,4,1306463631 19 | 3,2851,5,1306463925 20 | 3,2424,0.5,1306464293 21 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/BikeDemand.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | 6 | /// The DemandObservation class holds one single bike demand observation record. 7 | [] 8 | type DemandObservation = { 9 | [] Season : float32 10 | [] Year : float32 11 | [] Month : float32 12 | [] Hour : float32 13 | [] Holiday : float32 14 | [] Weekday : float32 15 | [] WorkingDay : float32 16 | [] Weather : float32 17 | [] Temperature : float32 18 | [] NormalizedTemperature : float32 19 | [] Humidity : float32 20 | [] Windspeed : float32 21 | [] [] Count : float32 22 | } 23 | 24 | /// The DemandPrediction class holds one single bike demand prediction. 25 | [] 26 | type DemandPrediction = { 27 | [] PredictedCount : float32; 28 | } 29 | 30 | // file paths to data files (assumes os = windows!) 31 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory 32 | 33 | /// The main application entry point. 
34 | [] 35 | let main argv = 36 | 37 | // create the machine learning context 38 | let context = new MLContext(); 39 | 40 | // load the dataset 41 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = ',') 42 | 43 | // split the dataset into 80% training and 20% testing 44 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 45 | 46 | // build a training pipeline 47 | let pipeline = 48 | EstimatorChain() 49 | 50 | // step 1: concatenate all feature columns 51 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed")) 52 | 53 | // step 2: cache the data to speed up training 54 | .AppendCacheCheckpoint(context) 55 | 56 | // step 3: use a fast forest learner 57 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10)) 58 | 59 | // train the model 60 | let model = partitions.TrainSet |> pipeline.Fit 61 | 62 | // evaluate the model 63 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 64 | 65 | // show evaluation metrics 66 | printfn "Model metrics:" 67 | printfn " RMSE:%f" metrics.RootMeanSquaredError 68 | printfn " MSE: %f" metrics.MeanSquaredError 69 | printfn " MAE: %f" metrics.MeanAbsoluteError 70 | 71 | // set up a sample observation 72 | let sample ={ 73 | Season = 3.0f 74 | Year = 1.0f 75 | Month = 8.0f 76 | Hour = 10.0f 77 | Holiday = 0.0f 78 | Weekday = 4.0f 79 | WorkingDay = 1.0f 80 | Weather = 1.0f 81 | Temperature = 0.8f 82 | NormalizedTemperature = 0.7576f 83 | Humidity = 0.55f 84 | Windspeed = 0.2239f 85 | Count = 0.0f // the field to predict 86 | } 87 | 88 | // create a prediction engine 89 | let engine = context.Model.CreatePredictionEngine model 90 | 91 | // make the prediction 92 | let prediction = sample |> engine.Predict 93 | 94 | // show the prediction 95 | printfn "\r" 96 | printfn "Single prediction:" 97 | printfn " Predicted bike count: %f" prediction.PredictedCount 98 | 99 | 0 // return value -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict bike sharing demand in Washington DC 2 | 3 | In this assignment you're going to build an app that can predict bike sharing demand in Washington DC. 4 | 5 | A bike-sharing system is a service in which bicycles are made available to individuals on a short term. Users borrow a bike from a dock and return it at another dock belonging to the same system. Docks are bike racks that lock the bike, and only release it by computer control. 6 | 7 | You’ve probably seen docks around town, they look like this: 8 | 9 | ![Bike sharing rack](./assets/bikesharing.jpeg) 10 | 11 | Bike sharing companies try to even out supply by manually distributing bikes across town, but they need to know how many bikes will be in demand at any given time in the city. 12 | 13 | So let’s give them a hand with a machine learning model! 14 | 15 | You are going to train a forest of regression decision trees on a dataset of bike sharing demand. Then you’ll use the fully-trained model to make a prediction for a given date and time. 16 | 17 | The first thing you will need is a data file with lots of bike sharing demand numbers. 
We are going to use the [UCI Bike Sharing Dataset](http://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) from [Capital Bikeshare](https://www.capitalbikeshare.com/) in Metro DC. This dataset has 17,380 bike sharing records that span a 2-year period. 18 | 19 | [Download the dataset](https://github.com/mdfarragher/DSC/blob/master/Regression/BikeDemandPrediction/bikedemand.csv) and save it in your project folder as **bikedemand.csv**. 20 | 21 | The file looks like this: 22 | 23 | ![Data File](./assets/data.png) 24 | 25 | It’s a comma-separated file with 17 columns: 26 | 27 | * Instant: the record index 28 | * Date: the date of the observation 29 | * Season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter) 30 | * Year: the year of the observation (0 = 2011, 1 = 2012) 31 | * Month: the month of the observation (1 to 12) 32 | * Hour: the hour of the observation (0 to 23) 33 | * Holiday: if the date is a holiday or not 34 | * Weekday: the day of the week of the observation 35 | * WorkingDay: if the date is a working day 36 | * Weather: the weather during the observation (1 = clear, 2 = mist, 3 = light snow/rain, 4 = heavy rain) 37 | * Temperature: the normalized temperature in Celsius 38 | * ATemperature: the normalized feeling temperature in Celsius 39 | * Humidity: the normalized humidity 40 | * Windspeed: the normalized wind speed 41 | * Casual: the number of casual bike users at the time 42 | * Registered: the number of registered bike users at the time 43 | * Count: the total number of rental bikes in operation at the time 44 | 45 | You can ignore the record index, the date, and the number of casual and registered bikes, and use everything else as input features. The final column **Count** is the label you're trying to predict. 46 | 47 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 48 | 49 | ```bash 50 | $ dotnet new console --language F# --output BikeDemand 51 | $ cd BikeDemand 52 | ``` 53 | 54 | Now install the following packages: 55 | 56 | ```bash 57 | $ dotnet add package Microsoft.ML 58 | $ dotnet add package Microsoft.ML.FastTree 59 | ``` 60 | 61 | Now you are ready to add some types. You’ll need one to hold a bike demand record, and one to hold your model predictions. 62 | 63 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 64 | 65 | ```fsharp 66 | open System 67 | open System.IO 68 | open Microsoft.ML 69 | open Microsoft.ML.Data 70 | 71 | /// The DemandObservation class holds one single bike demand observation record. 72 | [<CLIMutable>] 73 | type DemandObservation = { 74 | [<LoadColumn(2)>] Season : float32 75 | [<LoadColumn(3)>] Year : float32 76 | [<LoadColumn(4)>] Month : float32 77 | [<LoadColumn(5)>] Hour : float32 78 | [<LoadColumn(6)>] Holiday : float32 79 | [<LoadColumn(7)>] Weekday : float32 80 | [<LoadColumn(8)>] WorkingDay : float32 81 | [<LoadColumn(9)>] Weather : float32 82 | [<LoadColumn(10)>] Temperature : float32 83 | [<LoadColumn(11)>] NormalizedTemperature : float32 84 | [<LoadColumn(12)>] Humidity : float32 85 | [<LoadColumn(13)>] Windspeed : float32 86 | [<LoadColumn(16)>] [<ColumnName("Label")>] Count : float32 87 | } 88 | 89 | /// The DemandPrediction class holds one single bike demand prediction. 90 | [<CLIMutable>] 91 | type DemandPrediction = { 92 | [<ColumnName("Score")>] PredictedCount : float32; 93 | } 94 | 95 | // the rest of the code goes here... 96 | ``` 97 | 98 | The **DemandObservation** type holds one single bike demand record. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.
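By the way, once the dataset has been loaded (you'll see the loading code in a moment), you can sanity-check this column mapping by materializing a few rows with **CreateEnumerable**. This is just an optional debugging sketch; it assumes the **context** and **data** values from the loading code below:

```fsharp
// optional sanity check: materialize the first three rows and print a few
// fields to verify that the LoadColumn mapping points at the right columns
context.Data.CreateEnumerable<DemandObservation>(data, reuseRowObject = false)
|> Seq.truncate 3
|> Seq.iter(fun row -> printfn "Season=%f Hour=%f Count=%f" row.Season row.Hour row.Count)
```

If the printed values don't match the first rows of **bikedemand.csv**, one of the column indices is off.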
99 | 100 | You're also declaring a **DemandPrediction** type which will hold a single bike demand prediction. 101 | 102 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 103 | 104 | Now you need to load the training data into memory: 105 | 106 | ```fsharp 107 | // file paths to data files (assumes os = windows!) 108 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory 109 | 110 | /// The main application entry point. 111 | [<EntryPoint>] 112 | let main argv = 113 | 114 | // create the machine learning context 115 | let context = new MLContext(); 116 | 117 | // load the dataset 118 | let data = context.Data.LoadFromTextFile<DemandObservation>(dataPath, hasHeader = true, separatorChar = ',') 119 | 120 | // split the dataset into 80% training and 20% testing 121 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 122 | 123 | // the rest of the code goes here... 124 | 125 | 0 // return value 126 | ``` 127 | 128 | This code uses the method **LoadFromTextFile** to load the data directly into memory. The field annotations we set up earlier tell the method how to store the loaded data in the **DemandObservation** class. 129 | 130 | The code then calls **TrainTestSplit** to reserve 80% of the data for training and 20% for testing. 131 | 132 | Now let’s build the machine learning pipeline: 133 | 134 | ```fsharp 135 | // build a training pipeline 136 | let pipeline = 137 | EstimatorChain() 138 | 139 | // step 1: concatenate all feature columns 140 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed")) 141 | 142 | // step 2: cache the data to speed up training 143 | .AppendCacheCheckpoint(context) 144 | 145 | // step 3: use a fast forest learner 146 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10)) 147 | 148 | // train the model 149 | let model = partitions.TrainSet |> pipeline.Fit 150 | 151 | // the rest of the code goes here... 152 | ``` 153 | 154 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 155 | 156 | This pipeline has the following components: 157 | 158 | * **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column. 159 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm. 160 | * A final **FastForest** regression learner which will train the model to make accurate predictions using a forest of decision trees. 161 | 162 | The **FastForest** learner is a very nice training algorithm that builds a random forest of decision trees. 163 | 164 | A random forest trains a large number of decision trees in parallel, each on a random sample of the training data. To make a prediction, it runs the input through every tree in the forest and averages the individual predictions. This is different from gradient boosting (used by the **FastTree** learner), which stacks trees so that each new tree corrects the errors of the one before it.
165 | 166 | The result is a fairly strong prediction model that is much more stable than any single decision tree, because the averaging step cancels out most of the overfitting of the individual trees. 167 | 168 | Note the use of hyperparameters to configure the learner: 169 | 170 | * **NumberOfLeaves** is the maximum number of leaf nodes each decision tree will have. In this forest each tree will have at most 20 leaf nodes. 171 | * **NumberOfTrees** is the total number of decision trees to create in the forest. This forest will hold 100 trees. 172 | * **MinimumExampleCountPerLeaf** is the minimum number of data points required to form a leaf node. In this model a node is only split when each resulting leaf holds at least 10 qualifying data points. 173 | 174 | These hyperparameters are the default for the **FastForest** learner, but you can tweak them if you want. 175 | 176 | With the pipeline fully assembled, you can pipe the training data into the **Fit** function to train the model. 177 | 178 | You now have a fully-trained model. So next, you'll have to load the test data, predict the bike demand, and calculate the accuracy of your model: 179 | 180 | ```fsharp 181 | // evaluate the model 182 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 183 | 184 | // show evaluation metrics 185 | printfn "Model metrics:" 186 | printfn " RMSE:%f" metrics.RootMeanSquaredError 187 | printfn " MSE: %f" metrics.MeanSquaredError 188 | printfn " MAE: %f" metrics.MeanAbsoluteError 189 | 190 | // the rest of the code goes here... 191 | ``` 192 | 193 | This code pipes the test data into the **Transform** function to set up predictions for every single bike demand record in the test partition. The code then pipes these predictions into the **Evaluate** function, which compares them to the actual bike demand and automatically calculates these metrics: 194 | 195 | * **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 196 | * **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE. 197 | * **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in number of bikes. 198 | 199 | To wrap up, let’s use the model to make a prediction. 200 | 201 | I want to rent a bike in the fall of 2012, on a Thursday in August at 10am, in clear weather. What will the bike demand be on that day?
202 | 203 | Here’s how to make that prediction: 204 | 205 | ```fsharp 206 | // set up a sample observation 207 | let sample = { 208 | Season = 3.0f 209 | Year = 1.0f 210 | Month = 8.0f 211 | Hour = 10.0f 212 | Holiday = 0.0f 213 | Weekday = 4.0f 214 | WorkingDay = 1.0f 215 | Weather = 1.0f 216 | Temperature = 0.8f 217 | NormalizedTemperature = 0.7576f 218 | Humidity = 0.55f 219 | Windspeed = 0.2239f 220 | Count = 0.0f // the field to predict 221 | } 222 | 223 | // create a prediction engine 224 | let engine = context.Model.CreatePredictionEngine<DemandObservation, DemandPrediction>(model) 225 | 226 | // make the prediction 227 | let prediction = sample |> engine.Predict 228 | 229 | // show the prediction 230 | printfn "\r" 231 | printfn "Single prediction:" 232 | printfn " Predicted bike count: %f" prediction.PredictedCount 233 | ``` 234 | 235 | This code sets up a new bike demand observation, and then uses the **CreatePredictionEngine** function to set up a prediction engine and calls **Predict** to make a demand prediction. 236 | 237 | What will the model prediction be? 238 | 239 | Time to find out. Go to your terminal and run your code: 240 | 241 | ```bash 242 | $ dotnet run 243 | ``` 244 | 245 | What results do you get? What are your RMSE and MAE values? Is this a good result? 246 | 247 | And what bike demand does your model predict on the day I wanted to take my bike ride? 248 | 249 | Now take a look at the hyperparameters. Try to change the behavior of the fast forest learner and see what happens to the accuracy of your model. Did your model improve or get worse? 250 | 251 | Share your results in our group! 252 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/assets/bikesharing.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/bikesharing.jpeg -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/data.png -------------------------------------------------------------------------------- /Regression/HousePricePrediction/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But a detailed analysis of houses and sales prices actually proves that these metrics have a much greater influence on price negotiations than the number of bedrooms or a white-picket fence. 4 | 5 | In this case study, you're going to answer the age-old question: what exactly determines the sales price of a house? 6 | 7 | And once you have your fully-trained app up and running, you can use it to predict the sales price of any house. Just plug in the relevant numbers and your app will generate a sales price prediction. 8 | 9 | But how accurate will these predictions be? Can you actually use this app in a realtor business? 10 | 11 | That's for you to find out! 12 | 13 | # The dataset 14 | 15 | ![The dataset](./assets/data.png) 16 | 17 | In this case study you'll be working with the Iowa House Price dataset.
This data set describes the sale of individual residential property in Ames, Iowa from 2006 to 2010. 18 | 19 | The data set contains 1460 records and a large number of feature columns involved in assessing home values. You can use any combination of features you like to generate your house price predictions. 20 | 21 | There is 1 file in the dataset: 22 | * [data.csv](https://github.com/mdfarragher/DSC/blob/master/Regression/HousePricePrediction/data.csv) which contains 1460 records, 80 input features, and one output label. You will use this file to train and evaluate your model. 23 | 24 | Download the file and save it in your project folder. 25 | 26 | Here's a description of all 81 columns in the training file: 27 | * SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict. 28 | * MSSubClass: The building class 29 | * MSZoning: The general zoning classification 30 | * LotFrontage: Linear feet of street connected to property 31 | * LotArea: Lot size in square feet 32 | * Street: Type of road access 33 | * Alley: Type of alley access 34 | * LotShape: General shape of property 35 | * LandContour: Flatness of the property 36 | * Utilities: Type of utilities available 37 | * LotConfig: Lot configuration 38 | * LandSlope: Slope of property 39 | * Neighborhood: Physical locations within Ames city limits 40 | * Condition1: Proximity to main road or railroad 41 | * Condition2: Proximity to main road or railroad (if a second is present) 42 | * BldgType: Type of dwelling 43 | * HouseStyle: Style of dwelling 44 | * OverallQual: Overall material and finish quality 45 | * OverallCond: Overall condition rating 46 | * YearBuilt: Original construction date 47 | * YearRemodAdd: Remodel date 48 | * RoofStyle: Type of roof 49 | * RoofMatl: Roof material 50 | * Exterior1st: Exterior covering on house 51 | * Exterior2nd: Exterior covering on house (if more than one material) 52 | * MasVnrType: Masonry veneer type 53 | * MasVnrArea: Masonry veneer area in square feet 54 | * ExterQual: Exterior material quality 55 | * ExterCond: Present condition of the material on the exterior 56 | * Foundation: Type of foundation 57 | * BsmtQual: Height of the basement 58 | * BsmtCond: General condition of the basement 59 | * BsmtExposure: Walkout or garden level basement walls 60 | * BsmtFinType1: Quality of basement finished area 61 | * BsmtFinSF1: Type 1 finished square feet 62 | * BsmtFinType2: Quality of second finished area (if present) 63 | * BsmtFinSF2: Type 2 finished square feet 64 | * BsmtUnfSF: Unfinished square feet of basement area 65 | * TotalBsmtSF: Total square feet of basement area 66 | * Heating: Type of heating 67 | * HeatingQC: Heating quality and condition 68 | * CentralAir: Central air conditioning 69 | * Electrical: Electrical system 70 | * 1stFlrSF: First Floor square feet 71 | * 2ndFlrSF: Second floor square feet 72 | * LowQualFinSF: Low quality finished square feet (all floors) 73 | * GrLivArea: Above grade (ground) living area square feet 74 | * BsmtFullBath: Basement full bathrooms 75 | * BsmtHalfBath: Basement half bathrooms 76 | * FullBath: Full bathrooms above grade 77 | * HalfBath: Half baths above grade 78 | * Bedroom: Number of bedrooms above basement level 79 | * Kitchen: Number of kitchens 80 | * KitchenQual: Kitchen quality 81 | * TotRmsAbvGrd: Total rooms above grade (does not include * bathrooms) 82 | * Functional: Home functionality rating 83 | * Fireplaces: Number of fireplaces 84 | * FireplaceQu: Fireplace quality 85 | * GarageType: 
Garage location 86 | * GarageYrBlt: Year garage was built 87 | * GarageFinish: Interior finish of the garage 88 | * GarageCars: Size of garage in car capacity 89 | * GarageArea: Size of garage in square feet 90 | * GarageQual: Garage quality 91 | * GarageCond: Garage condition 92 | * PavedDrive: Paved driveway 93 | * WoodDeckSF: Wood deck area in square feet 94 | * OpenPorchSF: Open porch area in square feet 95 | * EnclosedPorch: Enclosed porch area in square feet 96 | * 3SsnPorch: Three season porch area in square feet 97 | * ScreenPorch: Screen porch area in square feet 98 | * PoolArea: Pool area in square feet 99 | * PoolQC: Pool quality 100 | * Fence: Fence quality 101 | * MiscFeature: Miscellaneous feature not covered in other categories 102 | * MiscVal: $Value of miscellaneous feature 103 | * MoSold: Month Sold 104 | * YrSold: Year Sold 105 | * SaleType: Type of sale 106 | * SaleCondition: Condition of sale 107 | 108 | # Getting started 109 | Go to the console and set up a new console application: 110 | 111 | ```bash 112 | $ dotnet new console --language F# --output HousePricePrediction 113 | $ cd HousePricePrediction 114 | ``` 115 | 116 | Then install the ML.NET NuGet package: 117 | 118 | ```bash 119 | $ dotnet add package Microsoft.ML 120 | $ dotnet add package Microsoft.ML.FastTree 121 | ``` 122 | 123 | And launch the Visual Studio Code editor: 124 | 125 | ```bash 126 | $ code . 127 | ``` 128 | 129 | The rest is up to you! 130 | 131 | # Your assignment 132 | I want you to build an app that reads the data file, processes it, and then trains a linear regression model on the data. 133 | 134 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns. 135 | 136 | Partition the data and use the trained model to make house price predictions on all the houses in the test partition. Calculate the best possible **RMSE** and **MAE** and share it in our group. 137 | 138 | See if you can get the RMSE as low as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 139 | 140 | Good luck! -------------------------------------------------------------------------------- /Regression/HousePricePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/HousePricePrediction/assets/data.png -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | 5 | /// The TaxiTrip class represents a single taxi trip. 6 | [] 7 | type TaxiTrip = { 8 | [] VendorId : string 9 | [] RateCode : string 10 | [] PassengerCount : float32 11 | [] TripDistance : float32 12 | [] PaymentType : string 13 | [] [] FareAmount : float32 14 | } 15 | 16 | /// The TaxiTripFarePrediction class represents a single far prediction. 17 | [] 18 | type TaxiTripFarePrediction = { 19 | [] FareAmount : float32 20 | } 21 | 22 | // file paths to data files (assumes os = windows!) 23 | let dataPath = sprintf "%s\\yellow_tripdata_2018-12.csv" Environment.CurrentDirectory 24 | 25 | /// The main application entry point. 
26 | [] 27 | let main argv = 28 | 29 | // create the machine learning context 30 | let context = new MLContext() 31 | 32 | // load the data 33 | let dataView = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = ',') 34 | 35 | // split into a training and test partition 36 | let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2) 37 | 38 | // set up a learning pipeline 39 | let pipeline = 40 | EstimatorChain() 41 | 42 | // one-hot encode all text features 43 | .Append(context.Transforms.Categorical.OneHotEncoding("VendorId")) 44 | .Append(context.Transforms.Categorical.OneHotEncoding("RateCode")) 45 | .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType")) 46 | 47 | // combine all input features into a single column 48 | .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance")) 49 | 50 | // cache the data to speed up training 51 | .AppendCacheCheckpoint(context) 52 | 53 | // use the fast tree learner 54 | .Append(context.Regression.Trainers.FastTree()) 55 | 56 | // train the model 57 | let model = partitions.TrainSet |> pipeline.Fit 58 | 59 | // get regression metrics to score the model 60 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 61 | 62 | // show the metrics 63 | printfn "Model metrics:" 64 | printfn " RMSE:%f" metrics.RootMeanSquaredError 65 | printfn " MSE: %f" metrics.MeanSquaredError 66 | printfn " MAE: %f" metrics.MeanAbsoluteError 67 | 68 | // create a prediction engine for one single prediction 69 | let engine = context.Model.CreatePredictionEngine model 70 | 71 | let taxiTripSample = { 72 | VendorId = "VTS" 73 | RateCode = "1" 74 | PassengerCount = 1.0f 75 | TripDistance = 3.75f 76 | PaymentType = "CRD" 77 | FareAmount = 0.0f // To predict. Actual/Observed = 15.5 78 | } 79 | 80 | // make the prediction 81 | let prediction = taxiTripSample |> engine.Predict 82 | 83 | // show the prediction 84 | printfn "\r" 85 | printfn "Single prediction:" 86 | printfn " Predicted fare: %f" prediction.FareAmount 87 | 88 | 0 // return value -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict taxi fares in New York 2 | 3 | In this assignment you're going to build an app that can predict taxi fares in New York. 4 | 5 | The first thing you'll need is a data file with transcripts of New York taxi rides. The [NYC Taxi & Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) provides yearly TLC Trip Record Data files which have exactly what you need. 6 | 7 | Download the [Yellow Taxi Trip Records from December 2018](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-12.csv) and save it as **yellow_tripdata_2018-12.csv**. 
8 | 9 | This is a CSV file with 8,173,233 records that looks like this: 10 |  11 | 12 | ![Data File](./assets/data.png) 13 | 14 | 15 | There are a lot of columns with interesting information in this data file, but you will only train on the following: 16 | 17 | * Column 0: The data provider vendor ID 18 | * Column 3: Number of passengers 19 | * Column 4: Trip distance 20 | * Column 5: The rate code (standard, JFK, Newark, …) 21 | * Column 9: Payment type (credit card, cash, …) 22 | * Column 10: Fare amount 23 | 24 | You are going to build a machine learning model in F# that will use columns 0, 3, 4, 5, and 9 as input, and use them to predict the taxi fare for every trip. Then you’ll compare the predicted fares with the actual taxi fares in column 10, and evaluate the accuracy of your model. 25 | 26 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 27 | 28 | ```bash 29 | $ dotnet new console --language F# --output PricePrediction 30 | $ cd PricePrediction 31 | ``` 32 | 33 | Now install the following packages: 34 | 35 | ```bash 36 | $ dotnet add package Microsoft.ML 37 | $ dotnet add package Microsoft.ML.FastTree 38 | ``` 39 | 40 | Now you are ready to add some classes. You’ll need one to hold a taxi trip, and one to hold your model predictions. 41 | 42 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 43 | 44 | ```fsharp 45 | /// The TaxiTrip class represents a single taxi trip. 46 | [<CLIMutable>] 47 | type TaxiTrip = { 48 | [<LoadColumn(0)>] VendorId : string 49 | [<LoadColumn(5)>] RateCode : string 50 | [<LoadColumn(3)>] PassengerCount : float32 51 | [<LoadColumn(4)>] TripDistance : float32 52 | [<LoadColumn(9)>] PaymentType : string 53 | [<LoadColumn(10)>] [<ColumnName("Label")>] FareAmount : float32 54 | } 55 | 56 | /// The TaxiTripFarePrediction class represents a single fare prediction. 57 | [<CLIMutable>] 58 | type TaxiTripFarePrediction = { 59 | [<ColumnName("Score")>] FareAmount : float32 60 | } 61 | 62 | // the rest of the code goes here... 63 | ``` 64 | 65 | The **TaxiTrip** type holds one single taxi trip. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 66 | 67 | You're also declaring a **TaxiTripFarePrediction** type which will hold a single fare prediction. 68 | 69 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 70 | 71 | Also note the **ColumnName** attributes on the two **FareAmount** fields. In the input type the attribute renames the fare column to **Label**, which is the column name ML.NET expects for the value to predict. In the prediction type it maps the **Score** column, where ML.NET stores the predicted value, onto the **FareAmount** field. 72 | 73 | We're loading all data columns as **float32**, except **VendorId**, **RateCode** and **PaymentType**. These columns hold numeric values but you will load them as string fields.
74 | 75 | The reason you need to do this is that RateCode is an enumeration with the following values: 76 | 77 | * 1 = standard 78 | * 2 = JFK 79 | * 3 = Newark 80 | * 4 = Nassau 81 | * 5 = negotiated 82 | * 6 = group 83 | 84 | And PaymentType is defined as follows: 85 | 86 | * 1 = Credit card 87 | * 2 = Cash 88 | * 3 = No charge 89 | * 4 = Dispute 90 | * 5 = Unknown 91 | * 6 = Voided trip 92 | 93 | These actual numbers don’t mean anything in this context. And we certainly don’t want the machine learning model to start believing that a trip to Newark is three times as important as a standard fare. 94 | 95 | So converting these values to strings is a perfect trick to show the model that **VendorId**, **RateCode** and **PaymentType** are just labels, and the underlying numbers don’t mean anything. 96 | 97 | Now you need to load the training data into memory: 98 | 99 | ```fsharp 100 | // file paths to data files (assumes os = windows!) 101 | let dataPath = sprintf "%s\\yellow_tripdata_2018-12_small.csv" Environment.CurrentDirectory 102 | 103 | /// The main application entry point. 104 | [<EntryPoint>] 105 | let main argv = 106 | 107 | // create the machine learning context 108 | let context = new MLContext() 109 | 110 | // load the data 111 | let dataView = context.Data.LoadFromTextFile<TaxiTrip>(dataPath, hasHeader = true, separatorChar = ',') 112 | 113 | // split into a training and test partition 114 | let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2) 115 | 116 | // the rest of the code goes here... 117 | 118 | 0 // return value 119 | ``` 120 | 121 | This code calls **LoadFromTextFile** to load the CSV data into memory. Note the **TaxiTrip** type parameter that tells the method which class to use to load the data. 122 | 123 | There is only one single data file, so you need to call **TrainTestSplit** to set up a training partition with 80% of the data and a test partition with the remaining 20% of the data. 124 | 125 | You often see this 80/20 split in data science; it’s a very common approach to train and test a model. 126 | 127 | Now you’re ready to start building the machine learning model: 128 | 129 | ```fsharp 130 | // set up a learning pipeline 131 | let pipeline = 132 | EstimatorChain() 133 | 134 | // one-hot encode all text features 135 | .Append(context.Transforms.Categorical.OneHotEncoding("VendorId")) 136 | .Append(context.Transforms.Categorical.OneHotEncoding("RateCode")) 137 | .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType")) 138 | 139 | // combine all input features into a single column 140 | .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance")) 141 | 142 | // cache the data to speed up training 143 | .AppendCacheCheckpoint(context) 144 | 145 | // use the fast tree learner 146 | .Append(context.Regression.Trainers.FastTree()) 147 | 148 | // train the model 149 | let model = partitions.TrainSet |> pipeline.Fit 150 | 151 | // the rest of the code goes here... 152 | ``` 153 | 154 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 155 | 156 | This pipeline has the following components: 157 | 158 | * A group of three **OneHotEncoding** transforms to perform one-hot encoding on the three columns that contain enumerative data: VendorId, RateCode, and PaymentType. This is a required step because we don't want the machine learning model to treat the enumerative data as numeric values.
159 | * **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column. 160 | * **AppendCacheCheckpoint** which caches all data in memory to speed up the training process. 161 | * A final **FastTree** regression learner which will train the model to make accurate predictions. 162 | 163 | The **FastTreeRegressionTrainer** is a very nice training algorithm that uses gradient boosting, a machine learning technique for regression problems. 164 | 165 | A gradient boosting algorithm builds up a collection of weak regression models. It starts out with a weak model that tries to predict the taxi fare. Then it adds a second model that attempts to correct the error in the first model. And then it adds a third model, and so on. 166 | 167 | The result is a fairly strong prediction model that is actually just an ensemble of weaker prediction models stacked on top of each other. 168 | 169 | We will explore Gradient Boosting in detail in a later section. 170 | 171 | With the pipeline fully assembled, you can train the model on the training partition by piping the **TrainSet** into the **pipeline.Fit** function. 172 | 173 | You now have a fully-trained model. So next, you'll have to grab the validation data, predict the taxi fare for each trip, and calculate the accuracy of your model: 174 | 175 | ```fsharp 176 | // get regression metrics to score the model 177 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 178 | 179 | // show the metrics 180 | printfn "Model metrics:" 181 | printfn " RMSE:%f" metrics.RootMeanSquaredError 182 | printfn " MSE: %f" metrics.MeanSquaredError 183 | printfn " MAE: %f" metrics.MeanAbsoluteError 184 | 185 | // the rest of the code goes here... 186 | ``` 187 | 188 | This code pipes the **TestSet** into the **model.Transform** function to generate predictions for every single taxi trip in the test partition. We then pipe these predictions into the **Evaluate** function, which compares them to the actual taxi fares and automatically calculates these metrics: 189 | 190 | * **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 191 | * **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in dollars. 192 | * **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE. 193 | 194 | To wrap up, let’s use the model to make a prediction. 195 | 196 | Imagine that I'm going to take a standard taxi trip, I cover a distance of 3.75 miles, I am the only passenger, and I pay by credit card. What would my fare be? 197 | 198 | Here’s how to make that prediction: 199 | 200 | ```fsharp 201 | // create a prediction engine for one single prediction 202 | let engine = context.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(model) 203 | 204 | let taxiTripSample = { 205 | VendorId = "VTS" 206 | RateCode = "1" 207 | PassengerCount = 1.0f 208 | TripDistance = 3.75f 209 | PaymentType = "CRD" 210 | FareAmount = 0.0f // To predict.
Actual/Observed = 15.5 211 | } 212 | 213 | // make the prediction 214 | let prediction = taxiTripSample |> engine.Predict 215 | 216 | // show the prediction 217 | printfn "\r" 218 | printfn "Single prediction:" 219 | printfn " Predicted fare: %f" prediction.FareAmount 220 | ``` 221 | 222 | You use the **CreatePredictionEngine** method to set up a prediction engine. This is a type that can make predictions for individual data records. 223 | 224 | Next, you set up a sample with all the details of my taxi trip and pipe it into the **Predict** function to make a single prediction. 225 | 226 | The trip should cost anywhere between $13.50 and $18.50, depending on the trip duration (which depends on the time of day). Will the model predict a fare in this range? 227 | 228 | Let's find out. Go to your terminal and run your code: 229 | 230 | ```bash 231 | $ dotnet run 232 | ``` 233 | 234 | What results do you get? What are your RMSE and MAE values? Is this a good result? 235 | 236 | And how much does your model predict I have to pay for my taxi ride? Is the prediction in the range of acceptable values for this trip? 237 | 238 | Now make some changes to my trip. Change the vendor ID, or the distance, or the manner of payment. How does this affect the final fare prediction? And what do you think this means? 239 | 240 | Think about the code in this assignment. How could you improve the accuracy of the model? What's your best RMSE value? 241 | 242 | Share your results in our group! 243 | -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/TaxiFarePrediction.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/TaxiFarePrediction/assets/data.png -------------------------------------------------------------------------------- /assets/DSC-FS.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/assets/DSC-FS.jpg --------------------------------------------------------------------------------