├── .github └── FUNDING.yml ├── .gitignore ├── BinaryClassification ├── DiabetesDetection │ ├── README.md │ ├── assets │ │ └── data.png │ └── diabetes.csv ├── FraudDetection │ ├── README.md │ └── assets │ │ └── data.png ├── HeartDiseasePrediction │ ├── Heart.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ │ └── data.png │ └── processed.cleveland.data.csv ├── SpamDetection │ ├── Program.fs │ ├── README.md │ ├── SpamDetection.fsproj │ ├── assets │ │ └── data.png │ └── spam.tsv └── TitanicPrediction │ ├── Program.fs │ ├── README.md │ ├── TitanicPrediction.fsproj │ ├── assets │ ├── data.jpg │ └── titanic.jpeg │ ├── test_data.csv │ └── train_data.csv ├── Clustering └── IrisFlower │ ├── IrisFlower.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── flowers.png │ └── iris-data.csv ├── LoadingData └── CaliforniaHousing │ ├── CaliforniaHousing.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── plot.png │ └── california_housing.csv ├── MulticlassClassification ├── DigitRecognition │ ├── Mnist.fsproj │ ├── Program.fs │ ├── README.md │ └── assets │ │ ├── datafile.png │ │ ├── mnist.png │ │ └── mnist_hard.png └── FlagToxicComments │ ├── README.md │ └── assets │ └── data.png ├── README.md ├── Recommendation └── MovieRecommender │ ├── MovieRecommender.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ ├── data.png │ └── movies.png │ ├── recommendation-movies.csv │ ├── recommendation-ratings-test.csv │ └── recommendation-ratings-train.csv ├── Regression ├── BikeDemandPrediction │ ├── BikeDemand.fsproj │ ├── Program.fs │ ├── README.md │ ├── assets │ │ ├── bikesharing.jpeg │ │ └── data.png │ └── bikedemand.csv ├── HousePricePrediction │ ├── README.md │ ├── assets │ │ └── data.png │ └── data.csv └── TaxiFarePrediction │ ├── Program.fs │ ├── README.md │ ├── TaxiFarePrediction.fsproj │ ├── assets │ └── data.png │ └── yellow_tripdata_2018-12_small.csv └── assets └── DSC-FS.jpg /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [mdfarragher] 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | BinaryClassification/HeartDiseasePrediction/bin/ 2 | BinaryClassification/HeartDiseasePrediction/obj/ 3 | BinaryClassification/SpamDetection/bin/ 4 | BinaryClassification/SpamDetection/obj/ 5 | BinaryClassification/TitanicPrediction/bin/ 6 | BinaryClassification/TitanicPrediction/obj/ 7 | Clustering/IrisFlower/obj/ 8 | MulticlassClassification/DigitRecognition/bin/ 9 | MulticlassClassification/DigitRecognition/obj/ 10 | Regression/BikeDemandPrediction/bin/ 11 | Regression/BikeDemandPrediction/obj/ 12 | Regression/TaxiFarePrediction/bin/ 13 | Regression/TaxiFarePrediction/obj/ 14 | MulticlassClassification/DigitRecognition/mnist_test.csv 15 | MulticlassClassification/DigitRecognition/mnist_train.csv 16 | Clustering/IrisFlower/bin/ 17 | Recommendation/MovieRecommender/bin/ 18 | Recommendation/MovieRecommender/obj/ 19 | LoadingData/CaliforniaHousing/bin/ 20 | LoadingData/CaliforniaHousing/obj/ 21 | Regression/TaxiFarePrediction/yellow_tripdata_2018-12.csv 22 | -------------------------------------------------------------------------------- /BinaryClassification/DiabetesDetection/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | The Pima are a tribe of North 
American Indians who traditionally lived along the Gila and Salt rivers in Arizona, U.S., in what was the core area of the prehistoric Hohokam culture. They speak a Uto-Aztecan language, call themselves the River People, and are usually considered to be the descendants of the Hohokam. 4 | 5 | But there's a weird thing about the Pima: they have the highest reported prevalence of diabetes of any population in the world. Their diabetes is exclusively type 2 diabetes, with no evidence of type 1 diabetes, even in very young children with an early onset of the disease. 6 | 7 | This suggests that the Pima carry a specific gene mutation that makes them extremely susceptible to diabetes. The tribe has been the focus of many medical studies over the years. 8 | 9 | In this case study, you're going to participate in one of these medical studies. You will build an app that loads a dataset of Pima medical records and tries to predict from the data who has diabetes and who does not. 10 | 11 | How accurate will your app be? Do you think you will be able to correctly predict every single diabetes case? 12 | 13 | That's for you to find out! 14 | 15 | # The dataset 16 | 17 | ![The dataset](./assets/data.png) 18 | 19 | In this case study you'll be working with a dataset containing the medical records of 768 Pima women. 20 | 21 | There is a single file in the dataset: 22 | * [diabetes.csv](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) which contains 768 records, 8 input features, and 1 output label. You will use this file to train and test your model. 23 | 24 | You'll need to [download the dataset](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) and save it in your project folder to get started. 25 | 26 | Here's a description of all columns in the file: 27 | * **Pregnancies**: the number of times the woman got pregnant 28 | * **Glucose**: the plasma glucose concentration at 2 hours in an oral glucose tolerance test 29 | * **BloodPressure**: the diastolic blood pressure (mm Hg) 30 | * **SkinThickness**: the triceps skin fold thickness (mm) 31 | * **Insulin**: the 2-hour serum insulin concentration (mu U/ml) 32 | * **BMI**: the body mass index (weight in kg/(height in m)^2) 33 | * **DiabetesPedigreeFunction**: the diabetes pedigree function 34 | * **Age**: the age (years) 35 | * **Outcome**: the label you need to predict - 1 if the woman has diabetes, 0 if she does not 36 | 37 | 38 | # Getting started 39 | Go to the console and set up a new console application: 40 | 41 | ```bash 42 | $ dotnet new console --language F# --output DiabetesDetection 43 | $ cd DiabetesDetection 44 | ``` 45 | 46 | Then install the ML.NET NuGet packages: 47 | 48 | ```bash 49 | $ dotnet add package Microsoft.ML 50 | $ dotnet add package Microsoft.ML.FastTree 51 | ``` 52 | 53 | And launch the Visual Studio Code editor: 54 | 55 | ```bash 56 | $ code . 57 | ``` 58 | 59 | The rest is up to you! 60 | 61 | # Your assignment 62 | I want you to build an app that reads the data file and splits it for training and testing. Reserve 80% of all records for training and 20% for testing. 63 | 64 | Process the data and train a binary classifier on the training partition. Then use the fully-trained model to generate predictions for the records in the testing partition. 65 | 66 | Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group.
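If you get stuck, here is a minimal, hedged sketch of the core train/evaluate flow. It assumes you've defined a `DiabetesData` record with **LoadColumn** attributes for the nine columns above and assembled a training `pipeline`, in the same style as the other assignments in this repository — all names here are suggestions, not prescribed:

```fsharp
// a minimal sketch of the train/evaluate flow (DiabetesData and pipeline are assumed)
let context = MLContext()
let data = context.Data.LoadFromTextFile<DiabetesData>("diabetes.csv", hasHeader = true, separatorChar = ',')
let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)   // 80% train, 20% test
let model = partitions.TrainSet |> pipeline.Fit
let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
printfn "AUC: %f" metrics.AreaUnderRocCurve
```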
67 | 68 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 69 | 70 | Good luck! -------------------------------------------------------------------------------- /BinaryClassification/DiabetesDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/DiabetesDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/FraudDetection/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | It is very important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 4 | 5 | Credit card fraud happens a lot. During two days in September 2013 in Europe, credit card networks recorded at least 492 fraud cases out of a total of 284,807 transactions. That's 246 fraud cases per day! 6 | 7 | In this case study, you're going to help credit card companies detect fraud in real time. You will build an app and train it on detected fraud cases, and then test your predictions on a new set of transactions. 8 | 9 | How accurate will your app be? Do you think you will be able to detect financial fraud in real time? 10 | 11 | That's for you to find out! 12 | 13 | # The dataset 14 | 15 | ![The dataset](./assets/data.png) 16 | 17 | In this case study you'll be working with a dataset containing transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. 18 | 19 | Note that the dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. 20 | 21 | The dataset contains 285k records, 30 feature columns, and a single label indicating if the transaction is fraudulent or not. You can use any combination of features you like to generate your fraud predictions. 22 | 23 | There is a single file in the dataset: 24 | * [creditcard.csv](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) which contains 285k records, 30 input features, and one output label. You will use this file to train and test your model. 25 | 26 | The file is about 150 MB in size. You'll need to [download it from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet.
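Once you've downloaded the file, you'll need a record type that tells ML.NET how to load each column (the full column description follows right below). Here's a hedged sketch — the type and field names are my own suggestions:

```fsharp
open Microsoft.ML.Data

/// One possible input record for creditcard.csv (a sketch, not prescribed)
[<CLIMutable>]
type Transaction = {
    [<LoadColumn(0)>] Time : float32
    [<LoadColumn(1, 28)>] [<VectorType(28)>] Features : float32[]   // V1-V28 loaded as one vector
    [<LoadColumn(29)>] Amount : float32
    [<LoadColumn(30)>] Class : float32   // 1 = fraud, 0 = legitimate
}
```

Loading V1-V28 as a single vector saves you 28 separate fields, and you can map the numeric **Class** column to a boolean label with a **CustomMapping**, just like the other assignments in this section do.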
27 | 28 | Here's a description of all 31 columns in the data file: 29 | * Time: Number of seconds elapsed between this transaction and the first transaction in the dataset 30 | * V1-V28: A feature of the transaction, processed to a number to protect user identities and sensitive information 31 | * Amount: Transaction amount 32 | * Class: 1 for fraudulent transactions, 0 otherwise 33 | 34 | # Getting started 35 | Go to the console and set up a new console application: 36 | 37 | ```bash 38 | $ dotnet new console --language F# --output FraudDetection 39 | $ cd FraudDetection 40 | ``` 41 | 42 | Then install the ML.NET NuGet packages: 43 | 44 | ```bash 45 | $ dotnet add package Microsoft.ML 46 | $ dotnet add package Microsoft.ML.FastTree 47 | ``` 48 | 49 | And launch the Visual Studio Code editor: 50 | 51 | ```bash 52 | $ code . 53 | ``` 54 | 55 | The rest is up to you! 56 | 57 | # Your assignment 58 | I want you to build an app that reads the data file into memory and splits it. Use 80% for training and 20% for testing. 59 | 60 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns. 61 | 62 | Process the selected input features, train a binary classifier on the data, and generate predictions for the transactions in the testing partition. 63 | 64 | Use the trained model to make fraud predictions on the test data. Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group. 65 | 66 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 67 | 68 | Good luck! -------------------------------------------------------------------------------- /BinaryClassification/FraudDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/FraudDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/Heart.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | 6 | /// The HeartData record holds one single heart data record. 7 | [<CLIMutable>] 8 | type HeartData = { 9 | [<LoadColumn(0)>] Age : float32 10 | [<LoadColumn(1)>] Sex : float32 11 | [<LoadColumn(2)>] Cp : float32 12 | [<LoadColumn(3)>] TrestBps : float32 13 | [<LoadColumn(4)>] Chol : float32 14 | [<LoadColumn(5)>] Fbs : float32 15 | [<LoadColumn(6)>] RestEcg : float32 16 | [<LoadColumn(7)>] Thalac : float32 17 | [<LoadColumn(8)>] Exang : float32 18 | [<LoadColumn(9)>] OldPeak : float32 19 | [<LoadColumn(10)>] Slope : float32 20 | [<LoadColumn(11)>] Ca : float32 21 | [<LoadColumn(12)>] Thal : float32 22 | [<LoadColumn(13)>] Diagnosis : float32 23 | } 24 | 25 | /// The HeartPrediction class contains a single heart data prediction. 26 | [<CLIMutable>] 27 | type HeartPrediction = { 28 | [<ColumnName("PredictedLabel")>] Prediction : bool 29 | Probability : float32 30 | Score : float32 31 | } 32 | 33 | /// The ToLabel class is a helper class for a column transformation.
34 | [<CLIMutable>] 35 | type ToLabel = { 36 | mutable Label : bool 37 | } 38 | 39 | /// file paths to data files (assumes os = windows!) 40 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory 41 | 42 | /// The main application entry point. 43 | [<EntryPoint>] 44 | let main argv = 45 | 46 | // set up a machine learning context 47 | let context = new MLContext() 48 | 49 | // load training and test data 50 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',') 51 | 52 | // split the data into a training and test partition 53 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 54 | 55 | // set up a training pipeline 56 | let pipeline = 57 | EstimatorChain() 58 | 59 | // step 1: convert the label value to a boolean 60 | .Append( 61 | context.Transforms.CustomMapping( 62 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f), 63 | "LabelMapping")) 64 | 65 | // step 2: concatenate all feature columns 66 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal")) 67 | 68 | // step 3: set up a fast tree learner 69 | .Append(context.BinaryClassification.Trainers.FastTree()) 70 | 71 | // train the model 72 | let model = partitions.TrainSet |> pipeline.Fit 73 | 74 | // make predictions and compare with the ground truth 75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 76 | 77 | // report the results 78 | printfn "Model metrics:" 79 | printfn "  Accuracy:          %f" metrics.Accuracy 80 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 81 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 82 | printfn "  F1Score:           %f" metrics.F1Score 83 | printfn "  LogLoss:           %f" metrics.LogLoss 84 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 85 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 86 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 87 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 88 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 89 | 90 | // set up a prediction engine 91 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model 92 | 93 | // create a sample patient 94 | let sample = { 95 | Age = 36.0f 96 | Sex = 1.0f 97 | Cp = 4.0f 98 | TrestBps = 145.0f 99 | Chol = 210.0f 100 | Fbs = 0.0f 101 | RestEcg = 2.0f 102 | Thalac = 148.0f 103 | Exang = 1.0f 104 | OldPeak = 1.9f 105 | Slope = 2.0f 106 | Ca = 1.0f 107 | Thal = 7.0f 108 | Diagnosis = 0.0f // unused 109 | } 110 | 111 | // make the prediction 112 | let prediction = sample |> predictionEngine.Predict 113 | 114 | // report the results 115 | printfn "\r" 116 | printfn "Single prediction:" 117 | printfn "  Prediction:  %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk") 118 | printfn "  Probability: %f" prediction.Probability 119 | 120 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict heart disease risk 2 | 3 | In this assignment you're going to build an app that can predict the heart disease risk in a group of patients. 4 | 5 | The first thing you will need for your app is a data file with patients, their medical info, and their heart disease risk assessment.
We're going to use the famous [UCI Heart Disease Dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) which has real-life data from 303 patients. 6 | 7 | Download the [Processed Cleveland Data](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data) file and save it as **processed.cleveland.data.csv**. 8 | 9 | The data file looks like this: 10 | 11 | ![Processed Cleveland Data](./assets/data.png) 12 | 13 | It’s a CSV file with 14 columns of information: 14 | 15 | * Age 16 | * Sex: 1 = male, 0 = female 17 | * Chest Pain Type: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic 18 | * Resting blood pressure in mm Hg on admission to the hospital 19 | * Serum cholesterol in mg/dl 20 | * Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false 21 | * Resting EKG results: 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria 22 | * Maximum heart rate achieved 23 | * Exercise induced angina: 1 = yes; 0 = no 24 | * ST depression induced by exercise relative to rest 25 | * Slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping 26 | * Number of major vessels (0–3) colored by fluoroscopy 27 | * Thallium heart scan results: 3 = normal, 6 = fixed defect, 7 = reversible defect 28 | * Diagnosis of heart disease: 0 = normal risk, 1-4 = elevated risk 29 | 30 | The first 13 columns are patient diagnostic information, and the last column is the diagnosis: 0 means a healthy patient, and values 1-4 mean an elevated risk of heart disease. 31 | 32 | You are going to build a binary classification machine learning model that reads in all 13 columns of patient information, and then makes a prediction for the heart disease risk. 33 | 34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 35 | 36 | ```bash 37 | $ dotnet new console --language F# --output Heart 38 | $ cd Heart 39 | ``` 40 | 41 | Now install the following ML.NET packages: 42 | 43 | ```bash 44 | $ dotnet add package Microsoft.ML 45 | $ dotnet add package Microsoft.ML.FastTree 46 | ``` 47 | 48 | Now you are ready to add some types. You’ll need one to hold patient info, and one to hold your model predictions. 49 | 50 | Replace the contents of the Program.fs file with this: 51 | 52 | ```fsharp 53 | open System 54 | open System.IO 55 | open Microsoft.ML 56 | open Microsoft.ML.Data 57 | 58 | /// The HeartData record holds one single heart data record. 59 | [<CLIMutable>] 60 | type HeartData = { 61 | [<LoadColumn(0)>] Age : float32 62 | [<LoadColumn(1)>] Sex : float32 63 | [<LoadColumn(2)>] Cp : float32 64 | [<LoadColumn(3)>] TrestBps : float32 65 | [<LoadColumn(4)>] Chol : float32 66 | [<LoadColumn(5)>] Fbs : float32 67 | [<LoadColumn(6)>] RestEcg : float32 68 | [<LoadColumn(7)>] Thalac : float32 69 | [<LoadColumn(8)>] Exang : float32 70 | [<LoadColumn(9)>] OldPeak : float32 71 | [<LoadColumn(10)>] Slope : float32 72 | [<LoadColumn(11)>] Ca : float32 73 | [<LoadColumn(12)>] Thal : float32 74 | [<LoadColumn(13)>] Diagnosis : float32 75 | } 76 | 77 | /// The HeartPrediction class contains a single heart data prediction. 78 | [<CLIMutable>] 79 | type HeartPrediction = { 80 | [<ColumnName("PredictedLabel")>] Prediction : bool 81 | Probability : float32 82 | Score : float32 83 | } 84 | 85 | // the rest of the code goes here.... 86 | ``` 87 | 88 | The **HeartData** class holds one single patient record. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 89 | 90 | There's also a **HeartPrediction** class which will hold a single heart disease prediction.
There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction. 91 | 92 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 93 | 94 | Now look at the final **Diagnosis** column in the data file. Our label is an integer value between 0-4, with 0 meaning 'no risk' and 1-4 meaning 'elevated risk'. 95 | 96 | But you're building a Binary Classifier, which means your model needs to be trained on boolean labels. 97 | 98 | So you'll have to somehow convert the 'raw' numeric label (stored in the **Diagnosis** field) to a boolean value. 99 | 100 | To set that up, you'll need a helper type: 101 | 102 | ```fsharp 103 | /// The ToLabel class is a helper class for a column transformation. 104 | [<CLIMutable>] 105 | type ToLabel = { 106 | mutable Label : bool 107 | } 108 | 109 | // the rest of the code goes here.... 110 | ``` 111 | 112 | The **ToLabel** type contains the label converted to a boolean value. We'll set up that conversion in a minute. 113 | 114 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 115 | 116 | Now you're going to load the training data into memory: 117 | 118 | ```fsharp 119 | /// file paths to data files (assumes os = windows!) 120 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory 121 | 122 | /// The main application entry point. 123 | [<EntryPoint>] 124 | let main argv = 125 | 126 | // set up a machine learning context 127 | let context = new MLContext() 128 | 129 | // load training and test data 130 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',') 131 | 132 | // split the data into a training and test partition 133 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 134 | 135 | // the rest of the code goes here.... 136 | 137 | 0 // return value 138 | ``` 139 | 140 | This code uses the method **LoadFromTextFile** to load the CSV data directly into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **HeartData** class. 141 | 142 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data.
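Before moving on, you can sanity-check the split if you like. This is an optional sketch of my own (it uses ML.NET's **CreateEnumerable** helper, which materializes an **IDataView** as a sequence):

```fsharp
// optional: count the rows in each partition to verify the 80/20 split
let countRows (view : IDataView) =
    context.Data.CreateEnumerable<HeartData>(view, reuseRowObject = false) |> Seq.length

printfn "Training rows: %i, test rows: %i" (countRows partitions.TrainSet) (countRows partitions.TestSet)
```

Note that **TrainTestSplit** samples randomly, so the counts will only be approximately 80/20.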
143 | 144 | Now you’re ready to start building the machine learning model: 145 | 146 | ```fsharp 147 | // set up a training pipeline 148 | let pipeline = 149 | EstimatorChain() 150 | 151 | // step 1: convert the label value to a boolean 152 | .Append( 153 | context.Transforms.CustomMapping( 154 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f), 155 | "LabelMapping")) 156 | 157 | // step 2: concatenate all feature columns 158 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal")) 159 | 160 | // step 3: set up a fast tree learner 161 | .Append(context.BinaryClassification.Trainers.FastTree()) 162 | 163 | // train the model 164 | let model = partitions.TrainSet |> pipeline.Fit 165 | 166 | // the rest of the code goes here.... 167 | ``` 168 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 169 | 170 | This pipeline has the following components: 171 | 172 | * A **CustomMapping** that transforms the numeric label to a boolean value. We define 0 values as healthy, and anything above 0 as an elevated risk. 173 | * **Concatenate** which combines all input data columns into a single column called 'Features'. This is a required step because ML.NET can only train on a single input column. 174 | * A **FastTree** classification learner which will train the model to make accurate predictions. 175 | 176 | The **FastTree** binary classification trainer is a very nice training algorithm that uses gradient boosting, a machine learning technique for classification problems. 177 | 178 | With the pipeline fully assembled, we can train the model by piping the **TrainSet** into the **Fit** function. 179 | 180 | You now have a fully-trained model. So now it's time to take the test partition, predict the diagnosis for each patient, and calculate the accuracy metrics of the model: 181 | 182 | ```fsharp 183 | // make predictions and compare with the ground truth 184 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 185 | 186 | // report the results 187 | printfn "Model metrics:" 188 | printfn "  Accuracy:          %f" metrics.Accuracy 189 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 190 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 191 | printfn "  F1Score:           %f" metrics.F1Score 192 | printfn "  LogLoss:           %f" metrics.LogLoss 193 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 194 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 195 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 196 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 197 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 198 | 199 | // the rest of the code goes here.... 200 | ``` 201 | 202 | This code pipes the **TestSet** into **model.Transform** to set up a prediction for every patient in the set, and then pipes the predictions into **Evaluate** to compare these predictions to the ground truth and automatically calculate all evaluation metrics: 203 | 204 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 205 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
206 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 207 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 208 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 209 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much better the model’s predictions are than random guessing. 210 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 211 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of all positive cases that the model correctly predicts. This is a good metric to use when the cost of a false negative is high. 212 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 213 | * **NegativeRecall**: this is the fraction of all negative cases that the model correctly predicts. 214 | 215 | When monitoring heart disease, you definitely want to avoid false negatives because you don’t want to be sending high-risk patients home and telling them everything is okay. 216 | 217 | You also want to avoid false positives, but they are a lot better than false negatives because later tests would probably discover that the patient is healthy after all. 218 | 219 | To wrap up, you’re going to create a new patient record and ask the model to make a prediction: 220 | 221 | ```fsharp 222 | // set up a prediction engine 223 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model 224 | 225 | // create a sample patient 226 | let sample = { 227 | Age = 36.0f 228 | Sex = 1.0f 229 | Cp = 4.0f 230 | TrestBps = 145.0f 231 | Chol = 210.0f 232 | Fbs = 0.0f 233 | RestEcg = 2.0f 234 | Thalac = 148.0f 235 | Exang = 1.0f 236 | OldPeak = 1.9f 237 | Slope = 2.0f 238 | Ca = 1.0f 239 | Thal = 7.0f 240 | Diagnosis = 0.0f // unused 241 | } 242 | 243 | // make the prediction 244 | let prediction = sample |> predictionEngine.Predict 245 | 246 | // report the results 247 | printfn "\r" 248 | printfn "Single prediction:" 249 | printfn "  Prediction:  %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk") 250 | printfn "  Probability: %f" prediction.Probability 251 | ``` 252 | 253 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then creates a new patient record for a 36-year-old male with asymptomatic chest pain and a bunch of other medical info. 254 | 255 | We then pipe the patient record into the **Predict** function and display the diagnosis. 256 | 257 | What’s the model going to predict? 258 | 259 | Time to find out. Go to your terminal and run your code: 260 | 261 | ```bash 262 | $ dotnet run 263 | ``` 264 | 265 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value? 266 | 267 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model? 268 | 269 | And what about our patient? What did your model predict? 270 | 271 | Think about the code in this assignment.
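One easy experiment is to configure the **FastTree** trainer explicitly instead of accepting its defaults. A hedged sketch — these particular values are just a starting guess, not a recommended answer:

```fsharp
// step 3 (variant): a fast tree learner with explicit hyperparameters
let trainer =
    context.BinaryClassification.Trainers.FastTree(
        numberOfLeaves = 30,             // more leaves = more complex trees
        numberOfTrees = 200,             // more trees = a stronger ensemble, slower training
        minimumExampleCountPerLeaf = 5,  // smaller = more sensitive to rare patterns
        learningRate = 0.1)              // smaller = slower but steadier learning
```

Swap this trainer into step 3 of the pipeline and compare the metrics you get.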
How could you improve the accuracy of the model? What are your best AUC and AUCPRC values? 272 | 273 | Share your results in our group! 274 | -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/HeartDiseasePrediction/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/HeartDiseasePrediction/processed.cleveland.data.csv: -------------------------------------------------------------------------------- 1 | 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0 2 | 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2 3 | 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1 4 | 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0 5 | 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0 6 | 56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0 7 | 62.0,0.0,4.0,140.0,268.0,0.0,2.0,160.0,0.0,3.6,3.0,2.0,3.0,3 8 | 57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0 9 | 63.0,1.0,4.0,130.0,254.0,0.0,2.0,147.0,0.0,1.4,2.0,1.0,7.0,2 10 | 53.0,1.0,4.0,140.0,203.0,1.0,2.0,155.0,1.0,3.1,3.0,0.0,7.0,1 11 | 57.0,1.0,4.0,140.0,192.0,0.0,0.0,148.0,0.0,0.4,2.0,0.0,6.0,0 12 | 56.0,0.0,2.0,140.0,294.0,0.0,2.0,153.0,0.0,1.3,2.0,0.0,3.0,0 13 | 56.0,1.0,3.0,130.0,256.0,1.0,2.0,142.0,1.0,0.6,2.0,1.0,6.0,2 14 | 44.0,1.0,2.0,120.0,263.0,0.0,0.0,173.0,0.0,0.0,1.0,0.0,7.0,0 15 | 52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,0.0,7.0,0 16 | 57.0,1.0,3.0,150.0,168.0,0.0,0.0,174.0,0.0,1.6,1.0,0.0,3.0,0 17 | 48.0,1.0,2.0,110.0,229.0,0.0,0.0,168.0,0.0,1.0,3.0,0.0,7.0,1 18 | 54.0,1.0,4.0,140.0,239.0,0.0,0.0,160.0,0.0,1.2,1.0,0.0,3.0,0 19 | 48.0,0.0,3.0,130.0,275.0,0.0,0.0,139.0,0.0,0.2,1.0,0.0,3.0,0 20 | 49.0,1.0,2.0,130.0,266.0,0.0,0.0,171.0,0.0,0.6,1.0,0.0,3.0,0 21 | 64.0,1.0,1.0,110.0,211.0,0.0,2.0,144.0,1.0,1.8,2.0,0.0,3.0,0 22 | 58.0,0.0,1.0,150.0,283.0,1.0,2.0,162.0,0.0,1.0,1.0,0.0,3.0,0 23 | 58.0,1.0,2.0,120.0,284.0,0.0,2.0,160.0,0.0,1.8,2.0,0.0,3.0,1 24 | 58.0,1.0,3.0,132.0,224.0,0.0,2.0,173.0,0.0,3.2,1.0,2.0,7.0,3 25 | 60.0,1.0,4.0,130.0,206.0,0.0,2.0,132.0,1.0,2.4,2.0,2.0,7.0,4 26 | 50.0,0.0,3.0,120.0,219.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0 27 | 58.0,0.0,3.0,120.0,340.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0 28 | 66.0,0.0,1.0,150.0,226.0,0.0,0.0,114.0,0.0,2.6,3.0,0.0,3.0,0 29 | 43.0,1.0,4.0,150.0,247.0,0.0,0.0,171.0,0.0,1.5,1.0,0.0,3.0,0 30 | 40.0,1.0,4.0,110.0,167.0,0.0,2.0,114.0,1.0,2.0,2.0,0.0,7.0,3 31 | 69.0,0.0,1.0,140.0,239.0,0.0,0.0,151.0,0.0,1.8,1.0,2.0,3.0,0 32 | 60.0,1.0,4.0,117.0,230.0,1.0,0.0,160.0,1.0,1.4,1.0,2.0,7.0,2 33 | 64.0,1.0,3.0,140.0,335.0,0.0,0.0,158.0,0.0,0.0,1.0,0.0,3.0,1 34 | 59.0,1.0,4.0,135.0,234.0,0.0,0.0,161.0,0.0,0.5,2.0,0.0,7.0,0 35 | 44.0,1.0,3.0,130.0,233.0,0.0,0.0,179.0,1.0,0.4,1.0,0.0,3.0,0 36 | 42.0,1.0,4.0,140.0,226.0,0.0,0.0,178.0,0.0,0.0,1.0,0.0,3.0,0 37 | 43.0,1.0,4.0,120.0,177.0,0.0,2.0,120.0,1.0,2.5,2.0,0.0,7.0,3 38 | 57.0,1.0,4.0,150.0,276.0,0.0,2.0,112.0,1.0,0.6,2.0,1.0,6.0,1 39 | 55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0,3 40 | 61.0,1.0,3.0,150.0,243.0,1.0,0.0,137.0,1.0,1.0,2.0,0.0,3.0,0 41 | 65.0,0.0,4.0,150.0,225.0,0.0,2.0,114.0,0.0,1.0,2.0,3.0,7.0,4 42 | 40.0,1.0,1.0,140.0,199.0,0.0,0.0,178.0,1.0,1.4,1.0,0.0,7.0,0 43 | 
71.0,0.0,2.0,160.0,302.0,0.0,0.0,162.0,0.0,0.4,1.0,2.0,3.0,0 44 | 59.0,1.0,3.0,150.0,212.0,1.0,0.0,157.0,0.0,1.6,1.0,0.0,3.0,0 45 | 61.0,0.0,4.0,130.0,330.0,0.0,2.0,169.0,0.0,0.0,1.0,0.0,3.0,1 46 | 58.0,1.0,3.0,112.0,230.0,0.0,2.0,165.0,0.0,2.5,2.0,1.0,7.0,4 47 | 51.0,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0,0.0,3.0,0 48 | 50.0,1.0,4.0,150.0,243.0,0.0,2.0,128.0,0.0,2.6,2.0,0.0,7.0,4 49 | 65.0,0.0,3.0,140.0,417.0,1.0,2.0,157.0,0.0,0.8,1.0,1.0,3.0,0 50 | 53.0,1.0,3.0,130.0,197.0,1.0,2.0,152.0,0.0,1.2,3.0,0.0,3.0,0 51 | 41.0,0.0,2.0,105.0,198.0,0.0,0.0,168.0,0.0,0.0,1.0,1.0,3.0,0 52 | 65.0,1.0,4.0,120.0,177.0,0.0,0.0,140.0,0.0,0.4,1.0,0.0,7.0,0 53 | 44.0,1.0,4.0,112.0,290.0,0.0,2.0,153.0,0.0,0.0,1.0,1.0,3.0,2 54 | 44.0,1.0,2.0,130.0,219.0,0.0,2.0,188.0,0.0,0.0,1.0,0.0,3.0,0 55 | 60.0,1.0,4.0,130.0,253.0,0.0,0.0,144.0,1.0,1.4,1.0,1.0,7.0,1 56 | 54.0,1.0,4.0,124.0,266.0,0.0,2.0,109.0,1.0,2.2,2.0,1.0,7.0,1 57 | 50.0,1.0,3.0,140.0,233.0,0.0,0.0,163.0,0.0,0.6,2.0,1.0,7.0,1 58 | 41.0,1.0,4.0,110.0,172.0,0.0,2.0,158.0,0.0,0.0,1.0,0.0,7.0,1 59 | 54.0,1.0,3.0,125.0,273.0,0.0,2.0,152.0,0.0,0.5,3.0,1.0,3.0,0 60 | 51.0,1.0,1.0,125.0,213.0,0.0,2.0,125.0,1.0,1.4,1.0,1.0,3.0,0 61 | 51.0,0.0,4.0,130.0,305.0,0.0,0.0,142.0,1.0,1.2,2.0,0.0,7.0,2 62 | 46.0,0.0,3.0,142.0,177.0,0.0,2.0,160.0,1.0,1.4,3.0,0.0,3.0,0 63 | 58.0,1.0,4.0,128.0,216.0,0.0,2.0,131.0,1.0,2.2,2.0,3.0,7.0,1 64 | 54.0,0.0,3.0,135.0,304.0,1.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 65 | 54.0,1.0,4.0,120.0,188.0,0.0,0.0,113.0,0.0,1.4,2.0,1.0,7.0,2 66 | 60.0,1.0,4.0,145.0,282.0,0.0,2.0,142.0,1.0,2.8,2.0,2.0,7.0,2 67 | 60.0,1.0,3.0,140.0,185.0,0.0,2.0,155.0,0.0,3.0,2.0,0.0,3.0,1 68 | 54.0,1.0,3.0,150.0,232.0,0.0,2.0,165.0,0.0,1.6,1.0,0.0,7.0,0 69 | 59.0,1.0,4.0,170.0,326.0,0.0,2.0,140.0,1.0,3.4,3.0,0.0,7.0,2 70 | 46.0,1.0,3.0,150.0,231.0,0.0,0.0,147.0,0.0,3.6,2.0,0.0,3.0,1 71 | 65.0,0.0,3.0,155.0,269.0,0.0,0.0,148.0,0.0,0.8,1.0,0.0,3.0,0 72 | 67.0,1.0,4.0,125.0,254.0,1.0,0.0,163.0,0.0,0.2,2.0,2.0,7.0,3 73 | 62.0,1.0,4.0,120.0,267.0,0.0,0.0,99.0,1.0,1.8,2.0,2.0,7.0,1 74 | 65.0,1.0,4.0,110.0,248.0,0.0,2.0,158.0,0.0,0.6,1.0,2.0,6.0,1 75 | 44.0,1.0,4.0,110.0,197.0,0.0,2.0,177.0,0.0,0.0,1.0,1.0,3.0,1 76 | 65.0,0.0,3.0,160.0,360.0,0.0,2.0,151.0,0.0,0.8,1.0,0.0,3.0,0 77 | 60.0,1.0,4.0,125.0,258.0,0.0,2.0,141.0,1.0,2.8,2.0,1.0,7.0,1 78 | 51.0,0.0,3.0,140.0,308.0,0.0,2.0,142.0,0.0,1.5,1.0,1.0,3.0,0 79 | 48.0,1.0,2.0,130.0,245.0,0.0,2.0,180.0,0.0,0.2,2.0,0.0,3.0,0 80 | 58.0,1.0,4.0,150.0,270.0,0.0,2.0,111.0,1.0,0.8,1.0,0.0,7.0,3 81 | 45.0,1.0,4.0,104.0,208.0,0.0,2.0,148.0,1.0,3.0,2.0,0.0,3.0,0 82 | 53.0,0.0,4.0,130.0,264.0,0.0,2.0,143.0,0.0,0.4,2.0,0.0,3.0,0 83 | 39.0,1.0,3.0,140.0,321.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0 84 | 68.0,1.0,3.0,180.0,274.0,1.0,2.0,150.0,1.0,1.6,2.0,0.0,7.0,3 85 | 52.0,1.0,2.0,120.0,325.0,0.0,0.0,172.0,0.0,0.2,1.0,0.0,3.0,0 86 | 44.0,1.0,3.0,140.0,235.0,0.0,2.0,180.0,0.0,0.0,1.0,0.0,3.0,0 87 | 47.0,1.0,3.0,138.0,257.0,0.0,2.0,156.0,0.0,0.0,1.0,0.0,3.0,0 88 | 53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0 89 | 53.0,0.0,4.0,138.0,234.0,0.0,2.0,160.0,0.0,0.0,1.0,0.0,3.0,0 90 | 51.0,0.0,3.0,130.0,256.0,0.0,2.0,149.0,0.0,0.5,1.0,0.0,3.0,0 91 | 66.0,1.0,4.0,120.0,302.0,0.0,2.0,151.0,0.0,0.4,2.0,0.0,3.0,0 92 | 62.0,0.0,4.0,160.0,164.0,0.0,2.0,145.0,0.0,6.2,3.0,3.0,7.0,3 93 | 62.0,1.0,3.0,130.0,231.0,0.0,0.0,146.0,0.0,1.8,2.0,3.0,7.0,0 94 | 44.0,0.0,3.0,108.0,141.0,0.0,0.0,175.0,0.0,0.6,2.0,0.0,3.0,0 95 | 63.0,0.0,3.0,135.0,252.0,0.0,2.0,172.0,0.0,0.0,1.0,0.0,3.0,0 96 | 
52.0,1.0,4.0,128.0,255.0,0.0,0.0,161.0,1.0,0.0,1.0,1.0,7.0,1 97 | 59.0,1.0,4.0,110.0,239.0,0.0,2.0,142.0,1.0,1.2,2.0,1.0,7.0,2 98 | 60.0,0.0,4.0,150.0,258.0,0.0,2.0,157.0,0.0,2.6,2.0,2.0,7.0,3 99 | 52.0,1.0,2.0,134.0,201.0,0.0,0.0,158.0,0.0,0.8,1.0,1.0,3.0,0 100 | 48.0,1.0,4.0,122.0,222.0,0.0,2.0,186.0,0.0,0.0,1.0,0.0,3.0,0 101 | 45.0,1.0,4.0,115.0,260.0,0.0,2.0,185.0,0.0,0.0,1.0,0.0,3.0,0 102 | 34.0,1.0,1.0,118.0,182.0,0.0,2.0,174.0,0.0,0.0,1.0,0.0,3.0,0 103 | 57.0,0.0,4.0,128.0,303.0,0.0,2.0,159.0,0.0,0.0,1.0,1.0,3.0,0 104 | 71.0,0.0,3.0,110.0,265.0,1.0,2.0,130.0,0.0,0.0,1.0,1.0,3.0,0 105 | 49.0,1.0,3.0,120.0,188.0,0.0,0.0,139.0,0.0,2.0,2.0,3.0,7.0,3 106 | 54.0,1.0,2.0,108.0,309.0,0.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0 107 | 59.0,1.0,4.0,140.0,177.0,0.0,0.0,162.0,1.0,0.0,1.0,1.0,7.0,2 108 | 57.0,1.0,3.0,128.0,229.0,0.0,2.0,150.0,0.0,0.4,2.0,1.0,7.0,1 109 | 61.0,1.0,4.0,120.0,260.0,0.0,0.0,140.0,1.0,3.6,2.0,1.0,7.0,2 110 | 39.0,1.0,4.0,118.0,219.0,0.0,0.0,140.0,0.0,1.2,2.0,0.0,7.0,3 111 | 61.0,0.0,4.0,145.0,307.0,0.0,2.0,146.0,1.0,1.0,2.0,0.0,7.0,1 112 | 56.0,1.0,4.0,125.0,249.0,1.0,2.0,144.0,1.0,1.2,2.0,1.0,3.0,1 113 | 52.0,1.0,1.0,118.0,186.0,0.0,2.0,190.0,0.0,0.0,2.0,0.0,6.0,0 114 | 43.0,0.0,4.0,132.0,341.0,1.0,2.0,136.0,1.0,3.0,2.0,0.0,7.0,2 115 | 62.0,0.0,3.0,130.0,263.0,0.0,0.0,97.0,0.0,1.2,2.0,1.0,7.0,2 116 | 41.0,1.0,2.0,135.0,203.0,0.0,0.0,132.0,0.0,0.0,2.0,0.0,6.0,0 117 | 58.0,1.0,3.0,140.0,211.0,1.0,2.0,165.0,0.0,0.0,1.0,0.0,3.0,0 118 | 35.0,0.0,4.0,138.0,183.0,0.0,0.0,182.0,0.0,1.4,1.0,0.0,3.0,0 119 | 63.0,1.0,4.0,130.0,330.0,1.0,2.0,132.0,1.0,1.8,1.0,3.0,7.0,3 120 | 65.0,1.0,4.0,135.0,254.0,0.0,2.0,127.0,0.0,2.8,2.0,1.0,7.0,2 121 | 48.0,1.0,4.0,130.0,256.0,1.0,2.0,150.0,1.0,0.0,1.0,2.0,7.0,3 122 | 63.0,0.0,4.0,150.0,407.0,0.0,2.0,154.0,0.0,4.0,2.0,3.0,7.0,4 123 | 51.0,1.0,3.0,100.0,222.0,0.0,0.0,143.0,1.0,1.2,2.0,0.0,3.0,0 124 | 55.0,1.0,4.0,140.0,217.0,0.0,0.0,111.0,1.0,5.6,3.0,0.0,7.0,3 125 | 65.0,1.0,1.0,138.0,282.0,1.0,2.0,174.0,0.0,1.4,2.0,1.0,3.0,1 126 | 45.0,0.0,2.0,130.0,234.0,0.0,2.0,175.0,0.0,0.6,2.0,0.0,3.0,0 127 | 56.0,0.0,4.0,200.0,288.0,1.0,2.0,133.0,1.0,4.0,3.0,2.0,7.0,3 128 | 54.0,1.0,4.0,110.0,239.0,0.0,0.0,126.0,1.0,2.8,2.0,1.0,7.0,3 129 | 44.0,1.0,2.0,120.0,220.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 130 | 62.0,0.0,4.0,124.0,209.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 131 | 54.0,1.0,3.0,120.0,258.0,0.0,2.0,147.0,0.0,0.4,2.0,0.0,7.0,0 132 | 51.0,1.0,3.0,94.0,227.0,0.0,0.0,154.0,1.0,0.0,1.0,1.0,7.0,0 133 | 29.0,1.0,2.0,130.0,204.0,0.0,2.0,202.0,0.0,0.0,1.0,0.0,3.0,0 134 | 51.0,1.0,4.0,140.0,261.0,0.0,2.0,186.0,1.0,0.0,1.0,0.0,3.0,0 135 | 43.0,0.0,3.0,122.0,213.0,0.0,0.0,165.0,0.0,0.2,2.0,0.0,3.0,0 136 | 55.0,0.0,2.0,135.0,250.0,0.0,2.0,161.0,0.0,1.4,2.0,0.0,3.0,0 137 | 70.0,1.0,4.0,145.0,174.0,0.0,0.0,125.0,1.0,2.6,3.0,0.0,7.0,4 138 | 62.0,1.0,2.0,120.0,281.0,0.0,2.0,103.0,0.0,1.4,2.0,1.0,7.0,3 139 | 35.0,1.0,4.0,120.0,198.0,0.0,0.0,130.0,1.0,1.6,2.0,0.0,7.0,1 140 | 51.0,1.0,3.0,125.0,245.0,1.0,2.0,166.0,0.0,2.4,2.0,0.0,3.0,0 141 | 59.0,1.0,2.0,140.0,221.0,0.0,0.0,164.0,1.0,0.0,1.0,0.0,3.0,0 142 | 59.0,1.0,1.0,170.0,288.0,0.0,2.0,159.0,0.0,0.2,2.0,0.0,7.0,1 143 | 52.0,1.0,2.0,128.0,205.0,1.0,0.0,184.0,0.0,0.0,1.0,0.0,3.0,0 144 | 64.0,1.0,3.0,125.0,309.0,0.0,0.0,131.0,1.0,1.8,2.0,0.0,7.0,1 145 | 58.0,1.0,3.0,105.0,240.0,0.0,2.0,154.0,1.0,0.6,2.0,0.0,7.0,0 146 | 47.0,1.0,3.0,108.0,243.0,0.0,0.0,152.0,0.0,0.0,1.0,0.0,3.0,1 147 | 57.0,1.0,4.0,165.0,289.0,1.0,2.0,124.0,0.0,1.0,2.0,3.0,7.0,4 148 | 41.0,1.0,3.0,112.0,250.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 149 | 
45.0,1.0,2.0,128.0,308.0,0.0,2.0,170.0,0.0,0.0,1.0,0.0,3.0,0 150 | 60.0,0.0,3.0,102.0,318.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,0 151 | 52.0,1.0,1.0,152.0,298.0,1.0,0.0,178.0,0.0,1.2,2.0,0.0,7.0,0 152 | 42.0,0.0,4.0,102.0,265.0,0.0,2.0,122.0,0.0,0.6,2.0,0.0,3.0,0 153 | 67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,0 154 | 55.0,1.0,4.0,160.0,289.0,0.0,2.0,145.0,1.0,0.8,2.0,1.0,7.0,4 155 | 64.0,1.0,4.0,120.0,246.0,0.0,2.0,96.0,1.0,2.2,3.0,1.0,3.0,3 156 | 70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,1 157 | 51.0,1.0,4.0,140.0,299.0,0.0,0.0,173.0,1.0,1.6,1.0,0.0,7.0,1 158 | 58.0,1.0,4.0,125.0,300.0,0.0,2.0,171.0,0.0,0.0,1.0,2.0,7.0,1 159 | 60.0,1.0,4.0,140.0,293.0,0.0,2.0,170.0,0.0,1.2,2.0,2.0,7.0,2 160 | 68.0,1.0,3.0,118.0,277.0,0.0,0.0,151.0,0.0,1.0,1.0,1.0,7.0,0 161 | 46.0,1.0,2.0,101.0,197.0,1.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0 162 | 77.0,1.0,4.0,125.0,304.0,0.0,2.0,162.0,1.0,0.0,1.0,3.0,3.0,4 163 | 54.0,0.0,3.0,110.0,214.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0 164 | 58.0,0.0,4.0,100.0,248.0,0.0,2.0,122.0,0.0,1.0,2.0,0.0,3.0,0 165 | 48.0,1.0,3.0,124.0,255.0,1.0,0.0,175.0,0.0,0.0,1.0,2.0,3.0,0 166 | 57.0,1.0,4.0,132.0,207.0,0.0,0.0,168.0,1.0,0.0,1.0,0.0,7.0,0 167 | 52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0 168 | 54.0,0.0,2.0,132.0,288.0,1.0,2.0,159.0,1.0,0.0,1.0,1.0,3.0,0 169 | 35.0,1.0,4.0,126.0,282.0,0.0,2.0,156.0,1.0,0.0,1.0,0.0,7.0,1 170 | 45.0,0.0,2.0,112.0,160.0,0.0,0.0,138.0,0.0,0.0,2.0,0.0,3.0,0 171 | 70.0,1.0,3.0,160.0,269.0,0.0,0.0,112.0,1.0,2.9,2.0,1.0,7.0,3 172 | 53.0,1.0,4.0,142.0,226.0,0.0,2.0,111.0,1.0,0.0,1.0,0.0,7.0,0 173 | 59.0,0.0,4.0,174.0,249.0,0.0,0.0,143.0,1.0,0.0,2.0,0.0,3.0,1 174 | 62.0,0.0,4.0,140.0,394.0,0.0,2.0,157.0,0.0,1.2,2.0,0.0,3.0,0 175 | 64.0,1.0,4.0,145.0,212.0,0.0,2.0,132.0,0.0,2.0,2.0,2.0,6.0,4 176 | 57.0,1.0,4.0,152.0,274.0,0.0,0.0,88.0,1.0,1.2,2.0,1.0,7.0,1 177 | 52.0,1.0,4.0,108.0,233.0,1.0,0.0,147.0,0.0,0.1,1.0,3.0,7.0,0 178 | 56.0,1.0,4.0,132.0,184.0,0.0,2.0,105.0,1.0,2.1,2.0,1.0,6.0,1 179 | 43.0,1.0,3.0,130.0,315.0,0.0,0.0,162.0,0.0,1.9,1.0,1.0,3.0,0 180 | 53.0,1.0,3.0,130.0,246.0,1.0,2.0,173.0,0.0,0.0,1.0,3.0,3.0,0 181 | 48.0,1.0,4.0,124.0,274.0,0.0,2.0,166.0,0.0,0.5,2.0,0.0,7.0,3 182 | 56.0,0.0,4.0,134.0,409.0,0.0,2.0,150.0,1.0,1.9,2.0,2.0,7.0,2 183 | 42.0,1.0,1.0,148.0,244.0,0.0,2.0,178.0,0.0,0.8,1.0,2.0,3.0,0 184 | 59.0,1.0,1.0,178.0,270.0,0.0,2.0,145.0,0.0,4.2,3.0,0.0,7.0,0 185 | 60.0,0.0,4.0,158.0,305.0,0.0,2.0,161.0,0.0,0.0,1.0,0.0,3.0,1 186 | 63.0,0.0,2.0,140.0,195.0,0.0,0.0,179.0,0.0,0.0,1.0,2.0,3.0,0 187 | 42.0,1.0,3.0,120.0,240.0,1.0,0.0,194.0,0.0,0.8,3.0,0.0,7.0,0 188 | 66.0,1.0,2.0,160.0,246.0,0.0,0.0,120.0,1.0,0.0,2.0,3.0,6.0,2 189 | 54.0,1.0,2.0,192.0,283.0,0.0,2.0,195.0,0.0,0.0,1.0,1.0,7.0,1 190 | 69.0,1.0,3.0,140.0,254.0,0.0,2.0,146.0,0.0,2.0,2.0,3.0,7.0,2 191 | 50.0,1.0,3.0,129.0,196.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 192 | 51.0,1.0,4.0,140.0,298.0,0.0,0.0,122.0,1.0,4.2,2.0,3.0,7.0,3 193 | 43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1 194 | 62.0,0.0,4.0,138.0,294.0,1.0,0.0,106.0,0.0,1.9,2.0,3.0,3.0,2 195 | 68.0,0.0,3.0,120.0,211.0,0.0,2.0,115.0,0.0,1.5,2.0,0.0,3.0,0 196 | 67.0,1.0,4.0,100.0,299.0,0.0,2.0,125.0,1.0,0.9,2.0,2.0,3.0,3 197 | 69.0,1.0,1.0,160.0,234.0,1.0,2.0,131.0,0.0,0.1,2.0,1.0,3.0,0 198 | 45.0,0.0,4.0,138.0,236.0,0.0,2.0,152.0,1.0,0.2,2.0,0.0,3.0,0 199 | 50.0,0.0,2.0,120.0,244.0,0.0,0.0,162.0,0.0,1.1,1.0,0.0,3.0,0 200 | 59.0,1.0,1.0,160.0,273.0,0.0,2.0,125.0,0.0,0.0,1.0,0.0,3.0,1 201 | 50.0,0.0,4.0,110.0,254.0,0.0,2.0,159.0,0.0,0.0,1.0,0.0,3.0,0 202 | 
64.0,0.0,4.0,180.0,325.0,0.0,0.0,154.0,1.0,0.0,1.0,0.0,3.0,0 203 | 57.0,1.0,3.0,150.0,126.0,1.0,0.0,173.0,0.0,0.2,1.0,1.0,7.0,0 204 | 64.0,0.0,3.0,140.0,313.0,0.0,0.0,133.0,0.0,0.2,1.0,0.0,7.0,0 205 | 43.0,1.0,4.0,110.0,211.0,0.0,0.0,161.0,0.0,0.0,1.0,0.0,7.0,0 206 | 45.0,1.0,4.0,142.0,309.0,0.0,2.0,147.0,1.0,0.0,2.0,3.0,7.0,3 207 | 58.0,1.0,4.0,128.0,259.0,0.0,2.0,130.0,1.0,3.0,2.0,2.0,7.0,3 208 | 50.0,1.0,4.0,144.0,200.0,0.0,2.0,126.0,1.0,0.9,2.0,0.0,7.0,3 209 | 55.0,1.0,2.0,130.0,262.0,0.0,0.0,155.0,0.0,0.0,1.0,0.0,3.0,0 210 | 62.0,0.0,4.0,150.0,244.0,0.0,0.0,154.0,1.0,1.4,2.0,0.0,3.0,1 211 | 37.0,0.0,3.0,120.0,215.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0 212 | 38.0,1.0,1.0,120.0,231.0,0.0,0.0,182.0,1.0,3.8,2.0,0.0,7.0,4 213 | 41.0,1.0,3.0,130.0,214.0,0.0,2.0,168.0,0.0,2.0,2.0,0.0,3.0,0 214 | 66.0,0.0,4.0,178.0,228.0,1.0,0.0,165.0,1.0,1.0,2.0,2.0,7.0,3 215 | 52.0,1.0,4.0,112.0,230.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,1 216 | 56.0,1.0,1.0,120.0,193.0,0.0,2.0,162.0,0.0,1.9,2.0,0.0,7.0,0 217 | 46.0,0.0,2.0,105.0,204.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0 218 | 46.0,0.0,4.0,138.0,243.0,0.0,2.0,152.0,1.0,0.0,2.0,0.0,3.0,0 219 | 64.0,0.0,4.0,130.0,303.0,0.0,0.0,122.0,0.0,2.0,2.0,2.0,3.0,0 220 | 59.0,1.0,4.0,138.0,271.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0 221 | 41.0,0.0,3.0,112.0,268.0,0.0,2.0,172.0,1.0,0.0,1.0,0.0,3.0,0 222 | 54.0,0.0,3.0,108.0,267.0,0.0,2.0,167.0,0.0,0.0,1.0,0.0,3.0,0 223 | 39.0,0.0,3.0,94.0,199.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 224 | 53.0,1.0,4.0,123.0,282.0,0.0,0.0,95.0,1.0,2.0,2.0,2.0,7.0,3 225 | 63.0,0.0,4.0,108.0,269.0,0.0,0.0,169.0,1.0,1.8,2.0,2.0,3.0,1 226 | 34.0,0.0,2.0,118.0,210.0,0.0,0.0,192.0,0.0,0.7,1.0,0.0,3.0,0 227 | 47.0,1.0,4.0,112.0,204.0,0.0,0.0,143.0,0.0,0.1,1.0,0.0,3.0,0 228 | 67.0,0.0,3.0,152.0,277.0,0.0,0.0,172.0,0.0,0.0,1.0,1.0,3.0,0 229 | 54.0,1.0,4.0,110.0,206.0,0.0,2.0,108.0,1.0,0.0,2.0,1.0,3.0,3 230 | 66.0,1.0,4.0,112.0,212.0,0.0,2.0,132.0,1.0,0.1,1.0,1.0,3.0,2 231 | 52.0,0.0,3.0,136.0,196.0,0.0,2.0,169.0,0.0,0.1,2.0,0.0,3.0,0 232 | 55.0,0.0,4.0,180.0,327.0,0.0,1.0,117.0,1.0,3.4,2.0,0.0,3.0,2 233 | 49.0,1.0,3.0,118.0,149.0,0.0,2.0,126.0,0.0,0.8,1.0,3.0,3.0,1 234 | 74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,0 235 | 54.0,0.0,3.0,160.0,201.0,0.0,0.0,163.0,0.0,0.0,1.0,1.0,3.0,0 236 | 54.0,1.0,4.0,122.0,286.0,0.0,2.0,116.0,1.0,3.2,2.0,2.0,3.0,3 237 | 56.0,1.0,4.0,130.0,283.0,1.0,2.0,103.0,1.0,1.6,3.0,0.0,7.0,2 238 | 46.0,1.0,4.0,120.0,249.0,0.0,2.0,144.0,0.0,0.8,1.0,0.0,7.0,1 239 | 49.0,0.0,2.0,134.0,271.0,0.0,0.0,162.0,0.0,0.0,2.0,0.0,3.0,0 240 | 42.0,1.0,2.0,120.0,295.0,0.0,0.0,162.0,0.0,0.0,1.0,0.0,3.0,0 241 | 41.0,1.0,2.0,110.0,235.0,0.0,0.0,153.0,0.0,0.0,1.0,0.0,3.0,0 242 | 41.0,0.0,2.0,126.0,306.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 243 | 49.0,0.0,4.0,130.0,269.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0 244 | 61.0,1.0,1.0,134.0,234.0,0.0,0.0,145.0,0.0,2.6,2.0,2.0,3.0,2 245 | 60.0,0.0,3.0,120.0,178.0,1.0,0.0,96.0,0.0,0.0,1.0,0.0,3.0,0 246 | 67.0,1.0,4.0,120.0,237.0,0.0,0.0,71.0,0.0,1.0,2.0,0.0,3.0,2 247 | 58.0,1.0,4.0,100.0,234.0,0.0,0.0,156.0,0.0,0.1,1.0,1.0,7.0,2 248 | 47.0,1.0,4.0,110.0,275.0,0.0,2.0,118.0,1.0,1.0,2.0,1.0,3.0,1 249 | 52.0,1.0,4.0,125.0,212.0,0.0,0.0,168.0,0.0,1.0,1.0,2.0,7.0,3 250 | 62.0,1.0,2.0,128.0,208.0,1.0,2.0,140.0,0.0,0.0,1.0,0.0,3.0,0 251 | 57.0,1.0,4.0,110.0,201.0,0.0,0.0,126.0,1.0,1.5,2.0,0.0,6.0,0 252 | 58.0,1.0,4.0,146.0,218.0,0.0,0.0,105.0,0.0,2.0,2.0,1.0,7.0,1 253 | 64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,0 254 | 51.0,0.0,3.0,120.0,295.0,0.0,2.0,157.0,0.0,0.6,1.0,0.0,3.0,0 255 | 
43.0,1.0,4.0,115.0,303.0,0.0,0.0,181.0,0.0,1.2,2.0,0.0,3.0,0 256 | 42.0,0.0,3.0,120.0,209.0,0.0,0.0,173.0,0.0,0.0,2.0,0.0,3.0,0 257 | 67.0,0.0,4.0,106.0,223.0,0.0,0.0,142.0,0.0,0.3,1.0,2.0,3.0,0 258 | 76.0,0.0,3.0,140.0,197.0,0.0,1.0,116.0,0.0,1.1,2.0,0.0,3.0,0 259 | 70.0,1.0,2.0,156.0,245.0,0.0,2.0,143.0,0.0,0.0,1.0,0.0,3.0,0 260 | 57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,1 261 | 44.0,0.0,3.0,118.0,242.0,0.0,0.0,149.0,0.0,0.3,2.0,1.0,3.0,0 262 | 58.0,0.0,2.0,136.0,319.0,1.0,2.0,152.0,0.0,0.0,1.0,2.0,3.0,3 263 | 60.0,0.0,1.0,150.0,240.0,0.0,0.0,171.0,0.0,0.9,1.0,0.0,3.0,0 264 | 44.0,1.0,3.0,120.0,226.0,0.0,0.0,169.0,0.0,0.0,1.0,0.0,3.0,0 265 | 61.0,1.0,4.0,138.0,166.0,0.0,2.0,125.0,1.0,3.6,2.0,1.0,3.0,4 266 | 42.0,1.0,4.0,136.0,315.0,0.0,0.0,125.0,1.0,1.8,2.0,0.0,6.0,2 267 | 52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,2 268 | 59.0,1.0,3.0,126.0,218.0,1.0,0.0,134.0,0.0,2.2,2.0,1.0,6.0,2 269 | 40.0,1.0,4.0,152.0,223.0,0.0,0.0,181.0,0.0,0.0,1.0,0.0,7.0,1 270 | 42.0,1.0,3.0,130.0,180.0,0.0,0.0,150.0,0.0,0.0,1.0,0.0,3.0,0 271 | 61.0,1.0,4.0,140.0,207.0,0.0,2.0,138.0,1.0,1.9,1.0,1.0,7.0,1 272 | 66.0,1.0,4.0,160.0,228.0,0.0,2.0,138.0,0.0,2.3,1.0,0.0,6.0,0 273 | 46.0,1.0,4.0,140.0,311.0,0.0,0.0,120.0,1.0,1.8,2.0,2.0,7.0,2 274 | 71.0,0.0,4.0,112.0,149.0,0.0,0.0,125.0,0.0,1.6,2.0,0.0,3.0,0 275 | 59.0,1.0,1.0,134.0,204.0,0.0,0.0,162.0,0.0,0.8,1.0,2.0,3.0,1 276 | 64.0,1.0,1.0,170.0,227.0,0.0,2.0,155.0,0.0,0.6,2.0,0.0,7.0,0 277 | 66.0,0.0,3.0,146.0,278.0,0.0,2.0,152.0,0.0,0.0,2.0,1.0,3.0,0 278 | 39.0,0.0,3.0,138.0,220.0,0.0,0.0,152.0,0.0,0.0,2.0,0.0,3.0,0 279 | 57.0,1.0,2.0,154.0,232.0,0.0,2.0,164.0,0.0,0.0,1.0,1.0,3.0,1 280 | 58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0 281 | 57.0,1.0,4.0,110.0,335.0,0.0,0.0,143.0,1.0,3.0,2.0,1.0,7.0,2 282 | 47.0,1.0,3.0,130.0,253.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0 283 | 55.0,0.0,4.0,128.0,205.0,0.0,1.0,130.0,1.0,2.0,2.0,1.0,7.0,3 284 | 35.0,1.0,2.0,122.0,192.0,0.0,0.0,174.0,0.0,0.0,1.0,0.0,3.0,0 285 | 61.0,1.0,4.0,148.0,203.0,0.0,0.0,161.0,0.0,0.0,1.0,1.0,7.0,2 286 | 58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,4 287 | 58.0,0.0,4.0,170.0,225.0,1.0,2.0,146.0,1.0,2.8,2.0,2.0,6.0,2 288 | 58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0 289 | 56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0 290 | 56.0,1.0,2.0,120.0,240.0,0.0,0.0,169.0,0.0,0.0,3.0,0.0,3.0,0 291 | 67.0,1.0,3.0,152.0,212.0,0.0,2.0,150.0,0.0,0.8,2.0,0.0,7.0,1 292 | 55.0,0.0,2.0,132.0,342.0,0.0,0.0,166.0,0.0,1.2,1.0,0.0,3.0,0 293 | 44.0,1.0,4.0,120.0,169.0,0.0,0.0,144.0,1.0,2.8,3.0,0.0,6.0,2 294 | 63.0,1.0,4.0,140.0,187.0,0.0,2.0,144.0,1.0,4.0,1.0,2.0,7.0,2 295 | 63.0,0.0,4.0,124.0,197.0,0.0,0.0,136.0,1.0,0.0,2.0,0.0,3.0,1 296 | 41.0,1.0,2.0,120.0,157.0,0.0,0.0,182.0,0.0,0.0,1.0,0.0,3.0,0 297 | 59.0,1.0,4.0,164.0,176.0,1.0,2.0,90.0,0.0,1.0,2.0,2.0,6.0,3 298 | 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1 299 | 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1 300 | 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2 301 | 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3 302 | 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1 303 | 38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0 304 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open 
Microsoft.ML.Data 5 | 6 | /// The SpamInput class contains one single message which may be spam or ham. 7 | [<CLIMutable>] 8 | type SpamInput = { 9 | [<LoadColumn(0)>] Verdict : string 10 | [<LoadColumn(1)>] Message : string 11 | } 12 | 13 | /// The SpamPrediction class contains one single spam prediction. 14 | [<CLIMutable>] 15 | type SpamPrediction = { 16 | [<ColumnName("PredictedLabel")>] IsSpam : bool 17 | Score : float32 18 | Probability : float32 19 | } 20 | 21 | /// This class describes what output columns we want to produce. 22 | [<CLIMutable>] 23 | type ToLabel = { 24 | mutable Label : bool 25 | } 26 | 27 | /// Helper function to cast the ML pipeline to an estimator 28 | let castToEstimator (x : IEstimator<_>) = 29 | match x with 30 | | :? IEstimator<ITransformer> as y -> y 31 | | _ -> failwith "Cannot cast pipeline to IEstimator" 32 | 33 | /// file paths to data files (assumes os = windows!) 34 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory 35 | 36 | [<EntryPoint>] 37 | let main argv = 38 | 39 | // set up a machine learning context 40 | let context = new MLContext() 41 | 42 | // load the spam dataset in memory 43 | let data = context.Data.LoadFromTextFile<SpamInput>(dataPath, hasHeader = true, separatorChar = '\t') 44 | 45 | // use 80% for training and 20% for testing 46 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 47 | 48 | // set up a training pipeline 49 | let pipeline = 50 | EstimatorChain() 51 | 52 | // step 1: transform the 'spam' and 'ham' values to true and false 53 | .Append( 54 | context.Transforms.CustomMapping( 55 | Action<SpamInput, ToLabel>(fun input output -> output.Label <- input.Verdict = "spam"), 56 | "MyLambda")) 57 | 58 | // step 2: featurize the input text 59 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message")) 60 | 61 | // step 3: use a stochastic dual coordinate ascent learner 62 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression()) 63 | 64 | // test the full data set by performing k-fold cross validation 65 | printfn "Performing cross validation:" 66 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5) 67 | 68 | // report the results 69 | cvResults |> Seq.iter(fun f -> printfn "  Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve) 70 | 71 | // train the model on the training set 72 | let model = partitions.TrainSet |> pipeline.Fit 73 | 74 | // evaluate the model on the test set 75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 76 | 77 | // report the results 78 | printfn "Model metrics:" 79 | printfn "  Accuracy:          %f" metrics.Accuracy 80 | printfn "  Auc:               %f" metrics.AreaUnderRocCurve 81 | printfn "  Auprc:             %f" metrics.AreaUnderPrecisionRecallCurve 82 | printfn "  F1Score:           %f" metrics.F1Score 83 | printfn "  LogLoss:           %f" metrics.LogLoss 84 | printfn "  LogLossReduction:  %f" metrics.LogLossReduction 85 | printfn "  PositivePrecision: %f" metrics.PositivePrecision 86 | printfn "  PositiveRecall:    %f" metrics.PositiveRecall 87 | printfn "  NegativePrecision: %f" metrics.NegativePrecision 88 | printfn "  NegativeRecall:    %f" metrics.NegativeRecall 89 | 90 | // set up a prediction engine 91 | let engine = context.Model.CreatePredictionEngine<SpamInput, SpamPrediction> model 92 | 93 | // create sample messages 94 | let messages = [ 95 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" } 96 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" } 97 | { Message = "Home in 30 mins.
Need anything from store?"; Verdict = "" } 98 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" } 99 | ] 100 | 101 | // make the predictions 102 | printfn "Model predictions:" 103 | messages |> List.iter(fun m -> 104 | let p = engine.Predict m 105 | printfn "  %f %s" p.Probability m.Message) 106 | 107 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Detect spam SMS messages 2 | 3 | In this assignment you're going to build an app that can automatically detect spam SMS messages. 4 | 5 | The first thing you'll need is a file with lots of SMS messages, correctly labelled as being spam or not spam. You will use a dataset compiled by Caroline Tagg in her [2009 PhD thesis](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf). This dataset has 5574 messages. 6 | 7 | Download the [list of messages](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/SpamDetection/spam.tsv) and save it as **spam.tsv**. 8 | 9 | The data file looks like this: 10 | 11 | ![Spam message list](./assets/data.png) 12 | 13 | It’s a TSV file with only 2 columns of information: 14 | 15 | * Label: ‘spam’ for a spam message and ‘ham’ for a normal message. 16 | * Message: the full text of the SMS message. 17 | 18 | You will build a binary classification model that reads in all messages and then makes a prediction for each message if it is spam or ham. 19 | 20 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 21 | 22 | ```bash 23 | $ dotnet new console --language F# --output SpamDetection 24 | $ cd SpamDetection 25 | ``` 26 | 27 | Now install the following ML.NET package: 28 | 29 | ```bash 30 | $ dotnet add package Microsoft.ML 31 | ``` 32 | 33 | Now you are ready to add some classes. You’ll need one to hold a labelled message, and one to hold the model predictions. 34 | 35 | Replace the contents of the Program.fs file with this: 36 | 37 | ```fsharp 38 | open System 39 | open System.IO 40 | open Microsoft.ML 41 | open Microsoft.ML.Data 42 | 43 | /// The SpamInput class contains one single message which may be spam or ham. 44 | [<CLIMutable>] 45 | type SpamInput = { 46 | [<LoadColumn(0)>] Verdict : string 47 | [<LoadColumn(1)>] Message : string 48 | } 49 | 50 | /// The SpamPrediction class contains one single spam prediction. 51 | [<CLIMutable>] 52 | type SpamPrediction = { 53 | [<ColumnName("PredictedLabel")>] IsSpam : bool 54 | Score : float32 55 | Probability : float32 56 | } 57 | 58 | // the rest of the code goes here.... 59 | ``` 60 | 61 | The **SpamInput** class holds one single message. Note how each field is tagged with a **LoadColumn** attribute that tells the data loading code which column to import data from. 62 | 63 | There's also a **SpamPrediction** class which will hold a single spam prediction. There's a boolean **IsSpam**, a **Probability** value, and the **Score** the model will assign to the prediction. 64 | 65 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 66 | 67 | Now look at the first column in the data file.
Our label is a string with the value 'spam' meaning it's a spam message, and 'ham' meaning it's a normal message. 68 | 69 | But you're building a Binary Classifier which needs to be trained on boolean labels. 70 | 71 | So you'll have to somehow convert the 'raw' text labels (stored in the **Verdict** field) to a boolean value. 72 | 73 | To set that up, you'll need a helper type: 74 | 75 | ```fsharp 76 | /// This class describes what output columns we want to produce. 77 | [] 78 | type ToLabel ={ 79 | mutable Label : bool 80 | } 81 | 82 | // the rest of the code goes here.... 83 | ``` 84 | 85 | Note how the **ToLabel** type contains a **Label** field with the converted boolean label value. We will set up this conversion in a minute. 86 | 87 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 88 | 89 | We need one more helper function before we can load the dataset. Add the following code: 90 | 91 | ```fsharp 92 | /// Helper function to cast the ML pipeline to an estimator 93 | let castToEstimator (x : IEstimator<_>) = 94 | match x with 95 | | :? IEstimator as y -> y 96 | | _ -> failwith "Cannot cast pipeline to IEstimator" 97 | 98 | // the rest of the code goes here 99 | ``` 100 | 101 | The **castToEstimator** function takes an **IEstimator<>** argument and uses pattern matching to cast the value to an **IEstimator\** type. You'll see in a minute why we need this helper function. 102 | 103 | Now you're ready to load the training data in memory: 104 | 105 | ```fsharp 106 | /// file paths to data files (assumes os = windows!) 107 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory 108 | 109 | [] 110 | let main arv = 111 | 112 | // set up a machine learning context 113 | let context = new MLContext() 114 | 115 | // load the spam dataset in memory 116 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = '\t') 117 | 118 | // use 80% for training and 20% for testing 119 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 120 | 121 | 122 | // the rest of the code goes here.... 123 | ``` 124 | 125 | This code uses the **LoadFromTextFile** function to load the TSV data directly into memory. The field annotations in the **SpamInput** type tell the function how to store the loaded data. 126 | 127 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data. 128 | 129 | Now you’re ready to start building the machine learning model: 130 | 131 | ```fsharp 132 | // set up a training pipeline 133 | let pipeline = 134 | EstimatorChain() 135 | 136 | // step 1: transform the 'spam' and 'ham' values to true and false 137 | .Append( 138 | context.Transforms.CustomMapping( 139 | Action(fun input output -> output.Label <- input.Verdict = "spam"), 140 | "MyLambda")) 141 | 142 | // step 2: featureize the input text 143 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message")) 144 | 145 | // step 3: use a stochastic dual coordinate ascent learner 146 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression()) 147 | 148 | // the rest of the code goes here.... 
149 | ``` 150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 151 | 152 | This pipeline has the following components: 153 | 154 | * A **CustomMapping** that transforms the text label to a boolean value. We define 'spam' values as spam and anything else as normal messages. 155 | * **FeaturizeText** which calculates a numerical value for each message. This is a required step because machine learning models cannot handle text data directly. 156 | * A **SdcaLogisticRegression** classification learner which will train the model to make accurate predictions. 157 | 158 | The FeaturizeText component is a very nice solution for handling text input data. The component performs a number of transformations on the text to prepare it for model training: 159 | 160 | * Normalize the text (=remove punctuation, diacritics, switching to lowercase etc.) 161 | * Tokenize each word. 162 | * Remove all stopwords 163 | * Extract Ngrams and skip-grams 164 | * TF-IDF rescaling 165 | * Bag of words conversion 166 | 167 | The result is that each message is converted to a vector of numeric values that can easily be processed by the model. 168 | 169 | Before you start training, you're going to perform a quick check to see if the dataset has enough data to reliably train a binary classification model. 170 | 171 | We have 5574 messages which makes this a very small dataset. We'd prefer to have between 10k-100k records for reliable training. For small datasets like this one, we'll have to perform **K-Fold Cross Validation** to make sure we have enough data to work with. 172 | 173 | Let's set that up right now: 174 | 175 | ```fsharp 176 | // test the full data set by performing k-fold cross validation 177 | printfn "Performing cross validation:" 178 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5) 179 | 180 | // report the results 181 | cvResults |> Seq.iter(fun f -> printfn " Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve) 182 | 183 | // the rest of the code goes here.... 184 | ``` 185 | 186 | This code calls the **CrossValidate** method to perform K-Fold Cross Validation on the training partition using 5 folds. Note how we call **castToEstimator** to cast the pipeline to an **IEstimator\** type. 187 | 188 | We need to do this because the **EstimatorChain** function we use every time to build the machine learning pipeline produces a type that cannot be read directly by **CrossValidate**. And the F# compiler is unable to perform the type cast for us automatically, so we need the helper function to perform the cast explicitly. 189 | 190 | Next, the code reports the individual AUC for each fold. For a well-balanced dataset we expect to see roughly identical AUC values for each fold. Any outliers are hints that the dataset may be unbalanced and too small to train on. 
191 | 192 | Now let's train the model and get some validation metrics: 193 | 194 | ```fsharp 195 | // train the model on the training set 196 | let model = partitions.TrainSet |> pipeline.Fit 197 | 198 | // evaluate the model on the test set 199 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate 200 | 201 | // report the results 202 | printfn "Model metrics:" 203 | printfn " Accuracy: %f" metrics.Accuracy 204 | printfn " Auc: %f" metrics.AreaUnderRocCurve 205 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 206 | printfn " F1Score: %f" metrics.F1Score 207 | printfn " LogLoss: %f" metrics.LogLoss 208 | printfn " LogLossReduction: %f" metrics.LogLossReduction 209 | printfn " PositivePrecision: %f" metrics.PositivePrecision 210 | printfn " PositiveRecall: %f" metrics.PositiveRecall 211 | printfn " NegativePrecision: %f" metrics.NegativePrecision 212 | printfn " NegativeRecall: %f" metrics.NegativeRecall 213 | 214 | // the rest of the code goes here 215 | ``` 216 | 217 | This code trains the model by piping the training data into the **Fit** function. Then it pipes the test data into the **Transform** function to make a prediction for every message in the validation partition. 218 | 219 | The code pipes these predictions into the **Evaluate** function to compare these predictions to the ground truth and calculate the following metrics: 220 | 221 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 222 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good. 223 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 224 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 225 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 226 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 227 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 228 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high. 229 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 230 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases. 231 | 232 | When filtering spam, you definitely want to avoid false positives because you don’t want to be sending important emails into the junk folder. 233 | 234 | You also want to avoid false negatives but they are not as bad as a false positive. Having some spam slipping through the filter is not the end of the world. 
235 | 236 | To wrap up, You’re going to create a couple of messages and ask the model to make a prediction: 237 | 238 | ```fsharp 239 | // set up a prediction engine 240 | let engine = context.Model.CreatePredictionEngine model 241 | 242 | // create sample messages 243 | let messages = [ 244 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" } 245 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" } 246 | { Message = "Home in 30 mins. Need anything from store?"; Verdict = "" } 247 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" } 248 | ] 249 | 250 | // make the predictions 251 | printfn "Model predictions:" 252 | let predictions = messages |> List.iter(fun m -> 253 | let p = engine.Predict m 254 | printfn " %f %s" p.Probability m.Message) 255 | ``` 256 | 257 | This code calls the **CreatePredictionEngine** function to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction. 258 | 259 | The code creates four new test messages and calls **List.iter** to make spam predictions for each message. What’s the result going to be? 260 | 261 | Time to find out. Go to your terminal and run your code: 262 | 263 | ```bash 264 | $ dotnet run 265 | ``` 266 | 267 | What results do you get? What are your five AUC values from K-Fold Cross Validation and the average AUC over all folds? Are there any outliers? Are the five values grouped close together? 268 | 269 | What can you conclude from your cross-validation results? Do we have enough data to make reliable spam predictions? 270 | 271 | Based on the results of cross-validation, would you say this dataset is well-balanced? And what does this say about the metrics you should use to evaluate your model? 272 | 273 | Which metrics did you pick to evaluate the model? And what do the values say about the accuracy of your model? 274 | 275 | And what about the four test messages? Dit the model accurately predict which ones are spam? 276 | 277 | Think about the code in this assignment. How could you improve the accuracy of the model even more? What are your best AUC values after optimization? 278 | 279 | Share your results in our group! 280 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/SpamDetection.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /BinaryClassification/SpamDetection/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/SpamDetection/assets/data.png -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | open Microsoft.ML.Transforms 6 | 7 | /// The Passenger class represents one passenger on the Titanic. 8 | [] 9 | type Passenger = { 10 | [] Label : bool 11 | [] Pclass : float32 12 | [] Sex : string 13 | [] RawAge : string // not a float! 
14 | [] SibSp : float32 15 | [] Parch : float32 16 | [] Ticket : string 17 | [] Fare : float32 18 | [] Cabin : string 19 | [] Embarked : string 20 | } 21 | 22 | /// The PassengerPrediction class represents one model prediction. 23 | [] 24 | type PassengerPrediction = { 25 | [] Prediction : bool 26 | Probability : float32 27 | Score : float32 28 | } 29 | 30 | /// The ToAge class is a helper class for a column transformation. 31 | [] 32 | type ToAge = { 33 | mutable Age : string 34 | } 35 | 36 | /// file path to the train data file (assumes os = windows!) 37 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory 38 | 39 | /// file path to the test data file (assumes os = windows!) 40 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory 41 | 42 | [] 43 | let main argv = 44 | 45 | // set up a machine learning context 46 | let context = new MLContext() 47 | 48 | // load the training and testing data in memory 49 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 50 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 51 | 52 | // set up a training pipeline 53 | let pipeline = 54 | EstimatorChain() 55 | 56 | // step 1: replace missing ages with '?' 57 | .Append( 58 | context.Transforms.CustomMapping( 59 | Action(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge), 60 | "AgeMapping")) 61 | 62 | // step 2: convert string ages to floats 63 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single)) 64 | 65 | // step 3: replace missing age values with the mean age 66 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean)) 67 | 68 | // step 4: replace string columns with one-hot encoded vectors 69 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex")) 70 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket")) 71 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin")) 72 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked")) 73 | 74 | // step 5: concatenate everything into a single feature column 75 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked")) 76 | 77 | // step 6: use a fasttree trainer 78 | .Append(context.BinaryClassification.Trainers.FastTree()) 79 | 80 | // train the model 81 | let model = trainData |> pipeline.Fit 82 | 83 | // make predictions and compare with ground truth 84 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate 85 | 86 | // report the results 87 | printfn "Model metrics:" 88 | printfn " Accuracy: %f" metrics.Accuracy 89 | printfn " Auc: %f" metrics.AreaUnderRocCurve 90 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 91 | printfn " F1Score: %f" metrics.F1Score 92 | printfn " LogLoss: %f" metrics.LogLoss 93 | printfn " LogLossReduction: %f" metrics.LogLossReduction 94 | printfn " PositivePrecision: %f" metrics.PositivePrecision 95 | printfn " PositiveRecall: %f" metrics.PositiveRecall 96 | printfn " NegativePrecision: %f" metrics.NegativePrecision 97 | printfn " NegativeRecall: %f" metrics.NegativeRecall 98 | 99 | // set up a prediction engine 100 | let engine = context.Model.CreatePredictionEngine model 101 | 102 | // create a sample record 103 | let passenger = { 104 | Pclass = 1.0f 105 | Sex = 
"male" 106 | RawAge = "48" 107 | SibSp = 0.0f 108 | Parch = 0.0f 109 | Ticket = "B" 110 | Fare = 70.0f 111 | Cabin = "123" 112 | Embarked = "S" 113 | Label = false // unused! 114 | } 115 | 116 | // make the prediction 117 | let prediction = engine.Predict passenger 118 | 119 | // report the results 120 | printfn "Model prediction:" 121 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished") 122 | printfn " Probability: %f" prediction.Probability 123 | 124 | 0 // return value -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict who survived the Titanic disaster 2 | 3 | The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. 4 | 5 | ![Sinking Titanic](./assets/titanic.jpeg) 6 | 7 | In this assignment you're going to build an app that can predict which Titanic passengers survived the disaster. You will use a decision tree classifier to make your predictions. 8 | 9 | The first thing you will need for your app is the passenger manifest of the Titanic's last voyage. You will use the famous [Kaggle Titanic Dataset](https://github.com/sbaidachni/MLNETTitanic/tree/master/MLNetTitanic) which has data for a subset of 891 passengers. 10 | 11 | Download the [test_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/test_data.csv) and [train_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/train_data.csv) files and save them to your project folder. 12 | 13 | The training data file looks like this: 14 | 15 | ![Training data](./assets/data.jpg) 16 | 17 | It’s a CSV file with 12 columns of information: 18 | 19 | * The passenger identifier 20 | * The label column containing ‘1’ if the passenger survived and ‘0’ if the passenger perished 21 | * The class of travel (1–3) 22 | * The name of the passenger 23 | * The gender of the passenger (‘male’ or ‘female’) 24 | * The age of the passenger, or ‘0’ if the age is unknown 25 | * The number of siblings and/or spouses aboard 26 | * The number of parents and/or children aboard 27 | * The ticket number 28 | * The fare paid 29 | * The cabin number 30 | * The port in which the passenger embarked 31 | 32 | The second column is the label: 0 means the passenger perished, and 1 means the passenger survived. All other columns are input features from the passenger manifest. 33 | 34 | You're gooing to build a binary classification model that reads in all columns and then predicts for each passenger if he or she survived. 35 | 36 | Let’s get started. Here’s how to set up a new console project in NET Core: 37 | 38 | ```bash 39 | $ dotnet new console --language F# --output TitanicPrediction 40 | $ cd TitanicPrediction 41 | ``` 42 | 43 | Next, you need to install the correct NuGet packages: 44 | 45 | ``` 46 | $ dotnet add package Microsoft.ML 47 | $ dotnet add package Microsoft.ML.FastTree 48 | ``` 49 | 50 | Now you are ready to add some classes. You’ll need one to hold passenger data, and one to hold your model predictions. 
51 | 52 | Replace the contents of the Program.fs file with this: 53 | 54 | ```fsharp 55 | open System 56 | open System.IO 57 | open Microsoft.ML 58 | open Microsoft.ML.Data 59 | open Microsoft.ML.Transforms 60 | 61 | /// The Passenger class represents one passenger on the Titanic. 62 | [] 63 | type Passenger = { 64 | [] Label : bool 65 | [] Pclass : float32 66 | [] Sex : string 67 | [] RawAge : string // not a float! 68 | [] SibSp : float32 69 | [] Parch : float32 70 | [] Ticket : string 71 | [] Fare : float32 72 | [] Cabin : string 73 | [] Embarked : string 74 | } 75 | 76 | /// The PassengerPrediction class represents one model prediction. 77 | [] 78 | type PassengerPrediction = { 79 | [] Prediction : bool 80 | Probability : float32 81 | Score : float32 82 | } 83 | 84 | // the rest of the code goes here... 85 | ``` 86 | 87 | The **Passenger** type holds one single passenger record. There's also a **PassengerPrediction** type which will hold a single passenger prediction. There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction. 88 | 89 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 90 | 91 | Now look at the age column in the data file. It's a number, but for some passengers in the manifest the age is not known and the column is empty. 92 | 93 | ML.NET can automatically load and process missing numeric values, but only if they are present in the CSV file as a '?'. 94 | 95 | The Titanic datafile uses an empty string to denote missing values, so we'll have to perform a feature conversion 96 | 97 | Notice how the age is loaded as s string into a Passenger class field called **RawAge**. 98 | 99 | We will process the missing values later in our app. To prepare for this, we'll need an additional helper type: 100 | 101 | ```fsharp 102 | /// The ToAge class is a helper class for a column transformation. 103 | [] 104 | type ToAge = { 105 | mutable Age : string 106 | } 107 | 108 | // the rest of the code goes here... 109 | ``` 110 | 111 | The **ToAge** type will contain the converted age values. We will set up this conversion in a minute. 112 | 113 | Note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 114 | 115 | Now you're going to load the training data in memory: 116 | 117 | ```fsharp 118 | /// file path to the train data file (assumes os = windows!) 119 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory 120 | 121 | /// file path to the test data file (assumes os = windows!) 
122 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory 123 | 124 | [] 125 | let main argv = 126 | 127 | // set up a machine learning context 128 | let context = new MLContext() 129 | 130 | // load the training and testing data in memory 131 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 132 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 133 | 134 | // the rest of the code goes here... 135 | 136 | 0 // return value 137 | ``` 138 | 139 | This code calls the **LoadFromTextFile** function twice to load the training and testing datasets in memory. 140 | 141 | ML.NET expects missing data in CSV files to appear as a ‘?’, but unfortunately the Titanic file uses an empty string to indicate an unknown age. So the first thing you need to do is replace all empty age strings occurrences with ‘?’. 142 | 143 | Add the following code: 144 | 145 | ```fsharp 146 | // set up a training pipeline 147 | let pipeline = 148 | EstimatorChain() 149 | 150 | // step 1: replace missing ages with '?' 151 | .Append( 152 | context.Transforms.CustomMapping( 153 | Action(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge), 154 | "AgeMapping")) 155 | 156 | // the rest of the code goes here... 157 | ``` 158 | 159 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 160 | 161 | The **CustomMapping** component converts empty age strings to ‘?’ values. 162 | 163 | Now ML.NET is happy with the age values. You will now convert the string ages to numeric values and instruct ML.NET to replace any missing values with the mean age over the entire dataset. 164 | 165 | Add the following code, and make sure you match the indentation level of the previous **Append** function exactly. Indentation is significant in F# and the wrong indentation level will lead to compiler errors: 166 | 167 | ```fsharp 168 | // step 2: convert string ages to floats 169 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single)) 170 | 171 | // step 3: replace missing age values with the mean age 172 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean)) 173 | 174 | // the rest of the code goes here... 175 | ``` 176 | 177 | The **ConvertType** component converts the Age column to a single-precision floating point value. And the **ReplaceMissingValues** component replaces any missing values with the mean value of all ages in the entire dataset. 178 | 179 | Now let's process the rest of the data columns. The Sex, Ticket, Cabin, and Embarked columns are enumerations of string values. As you've already learned, you'll need to one-hot encode them: 180 | 181 | ```fsharp 182 | // step 4: replace string columns with one-hot encoded vectors 183 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex")) 184 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket")) 185 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin")) 186 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked")) 187 | 188 | // the rest of the code goes here... 189 | ``` 190 | 191 | The **OneHotEncoding** components take an input column, one-hot encode all values, and produce a new column with the same name holding the one-hot vectors. 
192 | 193 | Now let's wrap up the pipeline: 194 | 195 | ```fsharp 196 | // step 5: concatenate everything into a single feature column 197 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked")) 198 | 199 | // step 6: use a fasttree trainer 200 | .Append(context.BinaryClassification.Trainers.FastTree()) 201 | 202 | // the rest of the code goes here (indented back 2 levels!)... 203 | ``` 204 | 205 | The **Concatenate** component concatenates all remaining feature columns into a single column for training. This is required because ML.NET can only train on a single input column. 206 | 207 | And the **FastTreeBinaryClassificationTrainer** is the algorithm that's going to train the model. You're going to build a decision tree classifier that uses the Fast Tree algorithm to train on the data and configure the tree. 208 | 209 | Note the indentation level of the 'the rest of the code...' comment. Make sure that when you add the remaining code you indent this code back by two levels to match the indentation level of the **main** function. 210 | 211 | Now all you need to do now is train the model on the entire dataset, compare the predictions with the labels, and compute a bunch of metrics that describe how accurate the model is: 212 | 213 | ```fsharp 214 | // train the model 215 | let model = trainData |> pipeline.Fit 216 | 217 | // make predictions and compare with ground truth 218 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate 219 | 220 | // report the results 221 | printfn "Model metrics:" 222 | printfn " Accuracy: %f" metrics.Accuracy 223 | printfn " Auc: %f" metrics.AreaUnderRocCurve 224 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve 225 | printfn " F1Score: %f" metrics.F1Score 226 | printfn " LogLoss: %f" metrics.LogLoss 227 | printfn " LogLossReduction: %f" metrics.LogLossReduction 228 | printfn " PositivePrecision: %f" metrics.PositivePrecision 229 | printfn " PositiveRecall: %f" metrics.PositiveRecall 230 | printfn " NegativePrecision: %f" metrics.NegativePrecision 231 | printfn " NegativeRecall: %f" metrics.NegativeRecall 232 | 233 | // the rest of the code goes here... 234 | ``` 235 | 236 | This code pipes the training data into the **Fit** function to train the model on the entire dataset. 237 | 238 | We then pipe the test data into the **Transform** function to set up a prediction for each passenger, and pipe these predictions into the **Evaluate** function to compare them to the label and automatically calculate evaluation metrics. 239 | 240 | We then display the following metrics: 241 | 242 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions. 243 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good. 244 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive. 245 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive. 246 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. 
A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 247 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 248 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high. 249 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high. 250 | * **NegativePrecision**: this is the fraction of negative predictions that are correct. 251 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases. 252 | 253 | To wrap up, let's have some fun and pretend that I’m going to take a trip on the Titanic too. I will embark in Southampton and pay $70 for a first-class cabin. I travel on my own without parents, children, or my spouse. 254 | 255 | What are my odds of surviving? 256 | 257 | Add the following code: 258 | 259 | ```fsharp 260 | // set up a prediction engine 261 | let engine = context.Model.CreatePredictionEngine model 262 | 263 | // create a sample record 264 | let passenger = { 265 | Pclass = 1.0f 266 | Sex = "male" 267 | RawAge = "48" 268 | SibSp = 0.0f 269 | Parch = 0.0f 270 | Ticket = "B" 271 | Fare = 70.0f 272 | Cabin = "123" 273 | Embarked = "S" 274 | Label = false // unused! 275 | } 276 | 277 | // make the prediction 278 | let prediction = engine.Predict passenger 279 | 280 | // report the results 281 | printfn "Model prediction:" 282 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished") 283 | printfn " Probability: %f" prediction.Probability 284 | ``` 285 | 286 | This code uses the **CreatePredictionEngine** method to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction. 287 | 288 | The code sets up a new passenger record with my information and then calls **Predict** to make a prediction about my survival chances. 289 | 290 | So would I have survived the Titanic disaster? 291 | 292 | Time to find out. Go to your terminal and run your code: 293 | 294 | ```bash 295 | $ dotnet run 296 | ``` 297 | 298 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value? 299 | 300 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model? 301 | 302 | And what about me? Did I survive the disaster? 303 | 304 | Do you think a decision tree is a good choice to predict Titanic survivors? Which other algorithms could you use instead? Do they give a better result? 305 | 306 | Share your results in our group! 
307 | -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/TitanicPrediction.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/assets/data.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/data.jpg -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/assets/titanic.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/titanic.jpeg -------------------------------------------------------------------------------- /BinaryClassification/TitanicPrediction/test_data.csv: -------------------------------------------------------------------------------- 1 | "PassengerId","Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked" 2 | 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)","female","38",1,0,"PC 17599","71.2833","C85","C" 3 | 3,1,3,"Heikkinen, Miss. Laina","female","26",0,0,"STON/O2. 3101282","7.925","","S" 4 | 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","female","27",0,2,"347742","11.1333","","S" 5 | 11,1,3,"Sandstrom, Miss. Marguerite Rut","female","4",1,1,"PP 9549","16.7","G6","S" 6 | 18,1,2,"Williams, Mr. Charles Eugene","male","",0,0,"244373","13","","S" 7 | 25,0,3,"Palsson, Miss. Torborg Danira","female","8",3,1,"349909","21.075","","S" 8 | 30,0,3,"Todoroff, Mr. Lalio","male","",0,0,"349216","7.8958","","S" 9 | 31,0,1,"Uruchurtu, Don. Manuel E","male","40",0,0,"PC 17601","27.7208","","C" 10 | 34,0,2,"Wheadon, Mr. Edward H","male","66",0,0,"C.A. 24579","10.5","","S" 11 | 38,0,3,"Cann, Mr. Ernest Charles","male","21",0,0,"A./5. 2152","8.05","","S" 12 | 43,0,3,"Kraeff, Mr. Theodor","male","",0,0,"349253","7.8958","","C" 13 | 49,0,3,"Samaan, Mr. Youssef","male","",2,0,"2662","21.6792","","C" 14 | 51,0,3,"Panula, Master. Juha Niilo","male","7",4,1,"3101295","39.6875","","S" 15 | 55,0,1,"Ostby, Mr. Engelhart Cornelius","male","65",0,1,"113509","61.9792","B30","C" 16 | 60,0,3,"Goodwin, Master. William Frederick","male","11",5,2,"CA 2144","46.9","","S" 17 | 64,0,3,"Skoog, Master. Harald","male","4",3,2,"347088","27.9","","S" 18 | 67,1,2,"Nye, Mrs. (Elizabeth Ramell)","female","29",0,0,"C.A. 29395","10.5","F33","S" 19 | 72,0,3,"Goodwin, Miss. Lillian Amy","female","16",5,2,"CA 2144","46.9","","S" 20 | 76,0,3,"Moen, Mr. Sigurd Hansen","male","25",0,0,"348123","7.65","F G73","S" 21 | 78,0,3,"Moutal, Mr. Rahamin Haim","male","",0,0,"374746","8.05","","S" 22 | 81,0,3,"Waelens, Mr. Achille","male","22",0,0,"345767","9","","S" 23 | 85,1,2,"Ilett, Miss. Bertha","female","17",0,0,"SO/C 14885","10.5","","S" 24 | 87,0,3,"Ford, Mr. William Neal","male","16",1,3,"W./C. 6608","34.375","","S" 25 | 93,0,1,"Chaffee, Mr. Herbert Fuller","male","46",1,0,"W.E.P. 5734","61.175","E31","S" 26 | 95,0,3,"Coxon, Mr. Daniel","male","59",0,0,"364500","7.25","","S" 27 | 99,1,2,"Doling, Mrs. 
John T (Ada Julia Bone)","female","34",0,1,"231919","23","","S" 28 | 113,0,3,"Barton, Mr. David John","male","22",0,0,"324669","8.05","","S" 29 | 121,0,2,"Hickman, Mr. Stanley George","male","21",2,0,"S.O.C. 14879","73.5","","S" 30 | 123,0,2,"Nasser, Mr. Nicholas","male","32.5",1,0,"237736","30.0708","","C" 31 | 136,0,2,"Richard, Mr. Emile","male","23",0,0,"SC/PARIS 2133","15.0458","","C" 32 | 140,0,1,"Giglio, Mr. Victor","male","24",0,0,"PC 17593","79.2","B86","C" 33 | 144,0,3,"Burke, Mr. Jeremiah","male","19",0,0,"365222","6.75","","Q" 34 | 146,0,2,"Nicholls, Mr. Joseph Charles","male","19",1,1,"C.A. 33112","36.75","","S" 35 | 148,0,3,"Ford, Miss. Robina Maggie ""Ruby""","female","9",2,2,"W./C. 6608","34.375","","S" 36 | 156,0,1,"Williams, Mr. Charles Duane","male","51",0,1,"PC 17597","61.3792","","C" 37 | 157,1,3,"Gilnagh, Miss. Katherine ""Katie""","female","16",0,0,"35851","7.7333","","Q" 38 | 158,0,3,"Corn, Mr. Harry","male","30",0,0,"SOTON/OQ 392090","8.05","","S" 39 | 166,1,3,"Goldsmith, Master. Frank John William ""Frankie""","male","9",0,2,"363291","20.525","","S" 40 | 167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)","female","",0,1,"113505","55","E33","S" 41 | 168,0,3,"Skoog, Mrs. William (Anna Bernhardina Karlsson)","female","45",1,4,"347088","27.9","","S" 42 | 195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)","female","44",0,0,"PC 17610","27.7208","B4","C" 43 | 201,0,3,"Vande Walle, Mr. Nestor Cyriel","male","28",0,0,"345770","9.5","","S" 44 | 206,0,3,"Strom, Miss. Telma Matilda","female","2",0,1,"347054","10.4625","G6","S" 45 | 210,1,1,"Blank, Mr. Henry","male","40",0,0,"112277","31","A31","C" 46 | 218,0,2,"Jacobsohn, Mr. Sidney Samuel","male","42",1,0,"243847","27","","S" 47 | 223,0,3,"Green, Mr. George Henry","male","51",0,0,"21440","8.05","","S" 48 | 241,0,3,"Zabour, Miss. Thamine","female","",1,0,"2665","14.4542","","C" 49 | 243,0,2,"Coleridge, Mr. Reginald Charles","male","29",0,0,"W./C. 14263","10.5","","S" 50 | 251,0,3,"Reed, Mr. James George","male","",0,0,"362316","7.25","","S" 51 | 255,0,3,"Rosblom, Mrs. Viktor (Helena Wilhelmina)","female","41",0,2,"370129","20.2125","","S" 52 | 265,0,3,"Henry, Miss. Delia","female","",0,0,"382649","7.75","","Q" 53 | 266,0,2,"Reeves, Mr. David","male","36",0,0,"C.A. 17248","10.5","","S" 54 | 271,0,1,"Cairns, Mr. Alexander","male","",0,0,"113798","31","","S" 55 | 279,0,3,"Rice, Master. Eric","male","7",4,1,"382652","29.125","","Q" 56 | 285,0,1,"Smith, Mr. Richard William","male","",0,0,"113056","26","A19","S" 57 | 296,0,1,"Lewy, Mr. Ervin G","male","",0,0,"PC 17612","27.7208","","C" 58 | 305,0,3,"Williams, Mr. Howard Hugh ""Harry""","male","",0,0,"A/5 2466","8.05","","S" 59 | 306,1,1,"Allison, Master. Hudson Trevor","male","0.92",1,2,"113781","151.55","C22 C26","S" 60 | 311,1,1,"Hays, Miss. Margaret Bechstein","female","24",0,0,"11767","83.1583","C54","C" 61 | 314,0,3,"Hendekovic, Mr. Ignjac","male","28",0,0,"349243","7.8958","","S" 62 | 315,0,2,"Hart, Mr. Benjamin","male","43",1,1,"F.C.C. 13529","26.25","","S" 63 | 333,0,1,"Graham, Mr. George Edward","male","38",0,1,"PC 17582","153.4625","C91","S" 64 | 335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinsheimer)","female","",1,0,"PC 17611","133.65","","S" 65 | 337,0,1,"Pears, Mr. Thomas Clinton","male","29",1,0,"113776","66.6","C2","S" 66 | 341,1,2,"Navratil, Master. Edmond Roger","male","2",1,1,"230080","26","F2","S" 67 | 344,0,2,"Sedgwick, Mr. Charles Frederick Waddington","male","25",0,0,"244361","13","","S" 68 | 345,0,2,"Fox, Mr. 
Stanley Hubert","male","36",0,0,"229236","13","","S" 69 | 359,1,3,"McGovern, Miss. Mary","female","",0,0,"330931","7.8792","","Q" 70 | 365,0,3,"O'Brien, Mr. Thomas","male","",1,0,"370365","15.5","","Q" 71 | 366,0,3,"Adahl, Mr. Mauritz Nils Martin","male","30",0,0,"C 7076","7.25","","S" 72 | 367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)","female","60",1,0,"110813","75.25","D37","C" 73 | 374,0,1,"Ringhini, Mr. Sante","male","22",0,0,"PC 17760","135.6333","","C" 74 | 375,0,3,"Palsson, Miss. Stina Viola","female","3",3,1,"349909","21.075","","S" 75 | 376,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)","female","",1,0,"PC 17604","82.1708","","C" 76 | 383,0,3,"Tikkanen, Mr. Juho","male","32",0,0,"STON/O 2. 3101293","7.925","","S" 77 | 387,0,3,"Goodwin, Master. Sidney Leonard","male","1",5,2,"CA 2144","46.9","","S" 78 | 393,0,3,"Gustafsson, Mr. Johan Birger","male","28",2,0,"3101277","7.925","","S" 79 | 396,0,3,"Johansson, Mr. Erik","male","22",0,0,"350052","7.7958","","S" 80 | 401,1,3,"Niskanen, Mr. Juha","male","39",0,0,"STON/O 2. 3101289","7.925","","S" 81 | 407,0,3,"Widegren, Mr. Carl/Charles Peter","male","51",0,0,"347064","7.75","","S" 82 | 408,1,2,"Richards, Master. William Rowe","male","3",1,1,"29106","18.75","","S" 83 | 414,0,2,"Cunningham, Mr. Alfred Fleming","male","",0,0,"239853","0","","S" 84 | 419,0,2,"Matthews, Mr. William John","male","30",0,0,"28228","13","","S" 85 | 422,0,3,"Charters, Mr. David","male","21",0,0,"A/5. 13032","7.7333","","Q" 86 | 423,0,3,"Zimmerman, Mr. Leo","male","29",0,0,"315082","7.875","","S" 87 | 427,1,2,"Clarke, Mrs. Charles V (Ada Maria Winfield)","female","28",1,0,"2003","26","","S" 88 | 428,1,2,"Phillips, Miss. Kate Florence (""Mrs Kate Louise Phillips Marshall"")","female","19",0,0,"250655","26","","S" 89 | 434,0,3,"Kallio, Mr. Nikolai Erland","male","17",0,0,"STON/O 2. 3101274","7.125","","S" 90 | 437,0,3,"Ford, Miss. Doolina Margaret ""Daisy""","female","21",2,2,"W./C. 6608","34.375","","S" 91 | 438,1,2,"Richards, Mrs. Sidney (Emily Hocking)","female","24",2,3,"29106","18.75","","S" 92 | 441,1,2,"Hart, Mrs. Benjamin (Esther Ada Bloomfield)","female","45",1,1,"F.C.C. 13529","26.25","","S" 93 | 446,1,1,"Dodge, Master. Washington","male","4",0,2,"33638","81.8583","A34","S" 94 | 448,1,1,"Seward, Mr. Frederic Kimber","male","34",0,0,"113794","26.55","","S" 95 | 449,1,3,"Baclini, Miss. Marie Catherine","female","5",2,1,"2666","19.2583","","C" 96 | 462,0,3,"Morley, Mr. William","male","34",0,0,"364506","8.05","","S" 97 | 465,0,3,"Maisner, Mr. Simon","male","",0,0,"A/S 2816","8.05","","S" 98 | 483,0,3,"Rouse, Mr. Richard Henry","male","50",0,0,"A/5 3594","8.05","","S" 99 | 493,0,1,"Molson, Mr. Harry Markland","male","55",0,0,"113787","30.5","C30","S" 100 | 495,0,3,"Stanley, Mr. Edward Roland","male","21",0,0,"A/4 45380","8.05","","S" 101 | 497,1,1,"Eustis, Miss. Elizabeth Mussey","female","54",1,0,"36947","78.2667","D20","C" 102 | 507,1,2,"Quick, Mrs. Frederick Charles (Jane Richards)","female","33",0,2,"26360","26","","S" 103 | 508,1,1,"Bradley, Mr. George (""George Arthur Brayton"")","male","",0,0,"111427","26.55","","S" 104 | 512,0,3,"Webber, Mr. James","male","",0,0,"SOTON/OQ 3101316","8.05","","S" 105 | 518,0,3,"Ryan, Mr. Patrick","male","",0,0,"371110","24.15","","Q" 106 | 522,0,3,"Vovk, Mr. Janko","male","22",0,0,"349252","7.8958","","S" 107 | 530,0,2,"Hocking, Mr. Richard George","male","23",2,1,"29104","11.5","","S" 108 | 531,1,2,"Quick, Miss. Phyllis May","female","2",1,1,"26360","26","","S" 109 | 532,0,3,"Toufik, Mr. 
Nakli","male","",0,0,"2641","7.2292","","C" 110 | 538,1,1,"LeRoy, Miss. Bertha","female","30",0,0,"PC 17761","106.425","","C" 111 | 543,0,3,"Andersson, Miss. Sigrid Elisabeth","female","11",4,2,"347082","31.275","","S" 112 | 547,1,2,"Beane, Mrs. Edward (Ethel Clarke)","female","19",1,0,"2908","26","","S" 113 | 551,1,1,"Thayer, Mr. John Borland Jr","male","17",0,2,"17421","110.8833","C70","C" 114 | 558,0,1,"Robbins, Mr. Victor","male","",0,0,"PC 17757","227.525","","C" 115 | 561,0,3,"Morrow, Mr. Thomas Rowan","male","",0,0,"372622","7.75","","Q" 116 | 570,1,3,"Jonsson, Mr. Carl","male","32",0,0,"350417","7.8542","","S" 117 | 574,1,3,"Kelly, Miss. Mary","female","",0,0,"14312","7.75","","Q" 118 | 589,0,3,"Gilinski, Mr. Eliezer","male","22",0,0,"14973","8.05","","S" 119 | 591,0,3,"Rintamaki, Mr. Matti","male","35",0,0,"STON/O 2. 3101273","7.125","","S" 120 | 592,1,1,"Stephenson, Mrs. Walter Bertram (Martha Eustis)","female","52",1,0,"36947","78.2667","D20","C" 121 | 600,1,1,"Duff Gordon, Sir. Cosmo Edmund (""Mr Morgan"")","male","49",1,0,"PC 17485","56.9292","A20","C" 122 | 602,0,3,"Slabenoff, Mr. Petco","male","",0,0,"349214","7.8958","","S" 123 | 609,1,2,"Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)","female","22",1,2,"SC/Paris 2123","41.5792","","C" 124 | 616,1,2,"Herman, Miss. Alice","female","24",1,2,"220845","65","","S" 125 | 619,1,2,"Becker, Miss. Marion Louise","female","4",2,1,"230136","39","F4","S" 126 | 635,0,3,"Skoog, Miss. Mabel","female","9",3,2,"347088","27.9","","S" 127 | 641,0,3,"Jensen, Mr. Hans Peder","male","20",0,0,"350050","7.8542","","S" 128 | 647,0,3,"Cor, Mr. Liudevit","male","19",0,0,"349231","7.8958","","S" 129 | 648,1,1,"Simonius-Blumer, Col. Oberst Alfons","male","56",0,0,"13213","35.5","A26","C" 130 | 650,1,3,"Stanley, Miss. Amy Zillah Elsie","female","23",0,0,"CA. 2314","7.55","","S" 131 | 655,0,3,"Hegarty, Miss. Hanora ""Nora""","female","18",0,0,"365226","6.75","","Q" 132 | 657,0,3,"Radeff, Mr. Alexander","male","",0,0,"349223","7.8958","","S" 133 | 661,1,1,"Frauenthal, Dr. Henry William","male","50",2,0,"PC 17611","133.65","","S" 134 | 664,0,3,"Coleff, Mr. Peju","male","36",0,0,"349210","7.4958","","S" 135 | 673,0,2,"Mitchell, Mr. Henry Michael","male","70",0,0,"C.A. 24580","10.5","","S" 136 | 675,0,2,"Watson, Mr. Ennis Hastings","male","",0,0,"239856","0","","S" 137 | 679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)","female","43",1,6,"CA 2144","46.9","","S" 138 | 688,0,3,"Dakic, Mr. Branko","male","19",0,0,"349228","10.1708","","S" 139 | 698,1,3,"Mullens, Miss. Katherine ""Katie""","female","",0,0,"35852","7.7333","","Q" 140 | 705,0,3,"Hansen, Mr. Henrik Juul","male","26",1,0,"350025","7.8542","","S" 141 | 713,1,1,"Taylor, Mr. Elmer Zebley","male","48",1,0,"19996","52","C126","S" 142 | 720,0,3,"Johnson, Mr. Malkolm Joackim","male","33",0,0,"347062","7.775","","S" 143 | 727,1,2,"Renouf, Mrs. Peter Henry (Lillian Jefferys)","female","30",3,0,"31027","21","","S" 144 | 732,0,3,"Hassan, Mr. Houssein G N","male","11",0,0,"2699","18.7875","","C" 145 | 740,0,3,"Nankoff, Mr. Minko","male","",0,0,"349218","7.8958","","S" 146 | 741,1,1,"Hawksford, Mr. Walter James","male","",0,0,"16988","30","D45","S" 147 | 742,0,1,"Cavendish, Mr. Tyrell William","male","36",1,0,"19877","78.85","C46","S" 148 | 744,0,3,"McNamee, Mr. Neal","male","24",1,0,"376566","16.1","","S" 149 | 748,1,2,"Sinkkonen, Miss. Anna","female","30",0,0,"250648","13","","S" 150 | 751,1,2,"Wells, Miss. Joan","female","4",1,1,"29103","23","","S" 151 | 752,1,3,"Moor, Master. 
Meier","male","6",0,1,"392096","12.475","E121","S" 152 | 762,0,3,"Nirva, Mr. Iisakki Antino Aijo","male","41",0,0,"SOTON/O2 3101272","7.125","","S" 153 | 763,1,3,"Barah, Mr. Hanna Assi","male","20",0,0,"2663","7.2292","","C" 154 | 769,0,3,"Moran, Mr. Daniel J","male","",1,0,"371110","24.15","","Q" 155 | 770,0,3,"Gronnestad, Mr. Daniel Danielsen","male","32",0,0,"8471","8.3625","","S" 156 | 783,0,1,"Long, Mr. Milton Clyde","male","29",0,0,"113501","30","D6","S" 157 | 786,0,3,"Harmer, Mr. Abraham (David Lishin)","male","25",0,0,"374887","7.25","","S" 158 | 792,0,2,"Gaskell, Mr. Alfred","male","16",0,0,"239865","26","","S" 159 | 795,0,3,"Dantcheff, Mr. Ristiu","male","25",0,0,"349203","7.8958","","S" 160 | 797,1,1,"Leader, Dr. Alice (Farnham)","female","49",0,0,"17465","25.9292","D17","S" 161 | 801,0,2,"Ponesell, Mr. Martin","male","34",0,0,"250647","13","","S" 162 | 810,1,1,"Chambers, Mrs. Norman Campbell (Bertha Griggs)","female","33",1,0,"113806","53.1","E8","S" 163 | 812,0,3,"Lester, Mr. James","male","39",0,0,"A/4 48871","24.15","","S" 164 | 815,0,3,"Tomlin, Mr. Ernest Portage","male","30.5",0,0,"364499","8.05","","S" 165 | 821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gregg)","female","52",1,1,"12749","93.5","B69","S" 166 | 829,1,3,"McCormack, Mr. Thomas Joseph","male","",0,0,"367228","7.75","","Q" 167 | 832,1,2,"Richards, Master. George Sibley","male","0.83",1,1,"29106","18.75","","S" 168 | 845,0,3,"Culumovic, Mr. Jeso","male","17",0,0,"315090","8.6625","","S" 169 | 850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)","female","",1,0,"17453","89.1042","C92","C" 170 | 851,0,3,"Andersson, Master. Sigvard Harald Elias","male","4",4,2,"347082","31.275","","S" 171 | 853,0,3,"Boulos, Miss. Nourelain","female","9",1,1,"2678","15.2458","","C" 172 | 857,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)","female","45",1,1,"36928","164.8667","","S" 173 | 858,1,1,"Daly, Mr. Peter Denis ","male","51",0,0,"113055","26.55","E17","S" 174 | 860,0,3,"Razi, Mr. Raihed","male","",0,0,"2629","7.2292","","C" 175 | 865,0,2,"Gill, Mr. John William","male","24",0,0,"233866","13","","S" 176 | 867,1,2,"Duran y More, Miss. Asuncion","female","27",1,0,"SC/PARIS 2149","13.8583","","C" 177 | 874,0,3,"Vander Cruyssen, Mr. Victor","male","47",0,0,"345765","9","","S" 178 | 879,0,3,"Laleff, Mr. Kristo","male","",0,0,"349217","7.8958","","S" 179 | 882,0,3,"Markun, Mr. Johann","male","33",0,0,"349257","7.8958","","S" 180 | 886,0,3,"Rice, Mrs. William (Margaret Norton)","female","39",0,5,"382652","29.125","","Q" 181 | -------------------------------------------------------------------------------- /Clustering/IrisFlower/IrisFlower.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /Clustering/IrisFlower/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | 5 | /// A type that holds a single iris flower. 6 | [] 7 | type IrisData = { 8 | [] SepalLength : float32 9 | [] SepalWidth : float32 10 | [] PetalLength : float32 11 | [] PetalWidth : float32 12 | [] Label : string 13 | } 14 | 15 | /// A type that holds a single model prediction. 16 | [] 17 | type IrisPrediction = { 18 | PredictedLabel : uint32 19 | Score : float32[] 20 | } 21 | 22 | /// file paths to data files (assumes os = windows!) 
23 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory 24 | 25 | [] 26 | let main argv = 27 | 28 | // get the machine learning context 29 | let context = new MLContext(); 30 | 31 | // read the iris flower data from a text file 32 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = false, separatorChar = ',') 33 | 34 | // split the data into a training and testing partition 35 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 36 | 37 | // set up a learning pipeline 38 | let pipeline = 39 | EstimatorChain() 40 | 41 | // step 1: concatenate features into a single column 42 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")) 43 | 44 | // step 2: use k-means clustering to find the iris types 45 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3)) 46 | 47 | // train the model on the training data 48 | let model = partitions.TrainSet |> pipeline.Fit 49 | 50 | // get predictions and compare to ground truth 51 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate 52 | 53 | // show results 54 | printfn "Nodel results" 55 | printfn " Average distance: %f" metrics.AverageDistance 56 | printfn " Davies Bouldin index: %f" metrics.DaviesBouldinIndex 57 | 58 | // set up a prediction engine 59 | let engine = context.Model.CreatePredictionEngine model 60 | 61 | // grab 3 flowers from the dataset 62 | let flowers = context.Data.CreateEnumerable(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq 63 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ] 64 | 65 | // show predictions for the three flowers 66 | printfn "Predictions for the 3 test flowers:" 67 | printfn " Label\t\t\tPredicted\tScores" 68 | testFlowers |> Seq.iter(fun f -> 69 | let p = engine.Predict f 70 | printf " %-15s\t%i\t\t" f.Label p.PredictedLabel 71 | p.Score |> Seq.iter(fun s -> printf "%f\t" s) 72 | printfn "") 73 | 74 | 0 // return value -------------------------------------------------------------------------------- /Clustering/IrisFlower/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Cluster Iris flowers 2 | 3 | In this assignment you are going to build an unsupervised learning app that clusters Iris flowers into discrete groups. 4 | 5 | There are three types of Iris flowers: Versicolor, Setosa, and Virginica. Each flower has two sets of leaves: the inner Petals and the outer Sepals. 6 | 7 | Your goal is to build an app that can identify an Iris flower by its sepal and petal size. 8 | 9 | ![MNIST digits](./assets/flowers.png) 10 | 11 | Your challenge is that you're not going to use the dataset labels. Your app has to recognize patterns in the dataset and cluster the flowers into three groups without any help. 12 | 13 | Clustering is an example of **unsupervised learning** where the data science model has to figure out the labels on its own. 14 | 15 | The first thing you will need for your app is a data file with Iris flower petal and sepal sizes. You can use this [CSV file](https://github.com/mdfarragher/DSC/blob/master/Clustering/IrisFlower/iris-data.csv). Save it as **iris-data.csv** in your project folder. 
16 | 17 | The file looks like this: 18 | 19 | ![Data file](./assets/data.png) 20 | 21 | It’s a CSV file with 5 columns: 22 | 23 | * The length of the Sepal in centimeters 24 | * The width of the Sepal in centimeters 25 | * The length of the Petal in centimeters 26 | * The width of the Petal in centimeters 27 | * The type of Iris flower 28 | 29 | You are going to build a clustering data science model that reads the data and then guesses the label for each flower in the dataset. 30 | 31 | Of course the app won't know the real names of the flowers, so it's just going to number them: 1, 2, and 3. 32 | 33 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new NET Core console project: 34 | 35 | ```bash 36 | $ dotnet new console --language F# --output IrisFlowers 37 | $ cd IrisFlowers 38 | ``` 39 | 40 | Now install the ML.NET package: 41 | 42 | ```bash 43 | $ dotnet add package Microsoft.ML 44 | ``` 45 | 46 | Now you are ready to add some types. You’ll need one to hold a flower and one to hold your model prediction. 47 | 48 | Edit the Program.fs file and replace its contents with this: 49 | 50 | ```fsharp 51 | open System 52 | open Microsoft.ML 53 | open Microsoft.ML.Data 54 | 55 | /// A type that holds a single iris flower. 56 | [] 57 | type IrisData = { 58 | [] SepalLength : float32 59 | [] SepalWidth : float32 60 | [] PetalLength : float32 61 | [] PetalWidth : float32 62 | [] Label : string 63 | } 64 | 65 | /// A type that holds a single model prediction. 66 | [] 67 | type IrisPrediction = { 68 | PredictedLabel : uint32 69 | Score : float32[] 70 | } 71 | 72 | // the rest of the code goes here.... 73 | ``` 74 | 75 | The **IrisData** type holds one single flower. Note how the fields are tagged with the **LoadColumn** attribute that tells ML.NET how to load the data from the data file. 76 | 77 | We are loading the label in the 5th column, but we won't be using the label during training because we want the model to figure out the iris flower types on its own. 78 | 79 | There's also an **IrisPrediction** type which will hold a prediction for a single flower. The prediction consists of the ID of the cluster that the flower belongs to. Clusters are numbered from 1 upwards. And notice how the score field is an array? Each individual score value represents the distance of the flower to one specific cluster. 80 | 81 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 82 | 83 | Next you'll need to load the data in memory: 84 | 85 | ```fsharp 86 | /// file paths to data files (assumes os = windows!) 87 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory 88 | 89 | [] 90 | let main argv = 91 | 92 | // get the machine learning context 93 | let context = new MLContext(); 94 | 95 | // read the iris flower data from a text file 96 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = false, separatorChar = ',') 97 | 98 | // split the data into a training and testing partition 99 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 100 | 101 | // the rest of the code goes here.... 
102 | 103 | 0 // return value 104 | ``` 105 | 106 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory, and then calls **TrainTestSplit** to split the dataset into an 80% training partition and a 20% test partition. 107 | 108 | Now let’s build the data science pipeline: 109 | 110 | ```fsharp 111 | // set up a learning pipeline 112 | let pipeline = 113 | EstimatorChain() 114 | 115 | // step 1: concatenate features into a single column 116 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")) 117 | 118 | // step 2: use k-means clustering to find the iris types 119 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3)) 120 | 121 | // train the model on the training data 122 | let model = partitions.TrainSet |> pipeline.Fit 123 | 124 | // the rest of the code goes here... 125 | ``` 126 | 127 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 128 | 129 | This pipeline has two components: 130 | 131 | * **Concatenate** which combines the four flower measurements (SepalLength, SepalWidth, PetalLength, and PetalWidth) into a single column called Features. This is a required step because ML.NET can only train on a single input column. 132 | * A **KMeans** component which performs K-Means Clustering on the data and tries to find all Iris flower types. 133 | 134 | With the pipeline fully assembled, the code trains the model by piping the training set into the **Fit** function. 135 | 136 | You now have a fully-trained model. So now it's time to take the test set, predict the type of each flower, and calculate the accuracy metrics of the model: 137 | 138 | ```fsharp 139 | // get predictions and compare to ground truth 140 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate 141 | 142 | // show results 143 | printfn "Model results" 144 | printfn "   Average distance:     %f" metrics.AverageDistance 145 | printfn "   Davies Bouldin index: %f" metrics.DaviesBouldinIndex 146 | 147 | // the rest of the code goes here.... 148 | ``` 149 | 150 | This code pipes the test set into the **Transform** function to set up predictions for every flower in the test set. Then it pipes these predictions into the **Evaluate** function to compare each prediction with its label and automatically calculate two metrics: 151 | 152 | * **AverageDistance**: this is the average distance of a flower to the center point of its cluster, averaged over all clusters in the dataset. It is a measure of the 'tightness' of the clusters. Lower values are better and mean more concentrated clusters. 153 | * **DaviesBouldinIndex**: this metric is the average 'similarity' of each cluster with its most similar cluster. Similarity is defined as the ratio of within-cluster distances to between-cluster distances. So in other words, clusters which are farther apart and more concentrated will result in a better score. Low values indicate better clustering. 154 | 155 | So Average Distance measures how concentrated the clusters are in the dataset, and the Davies Bouldin Index measures both concentration and how far apart the clusters are spaced. For both metrics lower values are better, with zero being the perfect score. 156 | 157 | To wrap up, let’s use the model to make predictions. 158 | 159 | You will pick three arbitrary flowers from the test set, run them through the model, and compare the predictions with the labels provided in the data file. 
160 | 161 | Here’s how to do it: 162 | 163 | ```fsharp 164 | // set up a prediction engine 165 | let engine = context.Model.CreatePredictionEngine<IrisData, IrisPrediction> model 166 | 167 | // grab 3 flowers from the dataset 168 | let flowers = context.Data.CreateEnumerable<IrisData>(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq 169 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ] 170 | 171 | // show predictions for the three flowers 172 | printfn "Predictions for the 3 test flowers:" 173 | printfn "  Label\t\t\tPredicted\tScores" 174 | testFlowers |> Seq.iter(fun f -> 175 | let p = engine.Predict f 176 | printf "  %-15s\t%i\t\t" f.Label p.PredictedLabel 177 | p.Score |> Seq.iter(fun s -> printf "%f\t" s) 178 | printfn "") 179 | ``` 180 | 181 | This code calls **CreatePredictionEngine** to set up a prediction engine. This is a type that can generate individual predictions from sample data. 182 | 183 | Then we call the **CreateEnumerable** function to convert the test partition into an array of **IrisData** instances. Note the **Array.ofSeq** function at the end which converts the enumeration to an array. 184 | 185 | Next, we pick three test flowers and pipe them into **Seq.iter**. For each flower, we generate a prediction, print the predicted label (a cluster ID between 1 and 3) and then use a second **Seq.iter** to write the three scores to the console. 186 | 187 | That's it, you're done! 188 | 189 | Go to your terminal and run your code: 190 | 191 | ```bash 192 | $ dotnet run 193 | ``` 194 | 195 | What results do you get? What is your average distance and your Davies Bouldin index? 196 | 197 | What do you think this says about the quality of the clusters? 198 | 199 | What did the 3 flower predictions look like? Does the cluster prediction match the label every time? 200 | 201 | Now change the code and check the predictions for every flower (see the sketch below). How often does the model get it wrong? Which Iris types are the most confusing to the model? 202 | 203 | Share your results in our group. 
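To tally the predictions for every flower, as referenced above, here's a minimal sketch of one possible approach. It reuses the **flowers** array and the **engine** from the code above, and counts how often each combination of true label and predicted cluster occurs. Keep in mind that K-Means numbers its clusters arbitrarily, so the cluster IDs can change from run to run:

```fsharp
// tally how many flowers of each label end up in each cluster
flowers
|> Seq.map(fun f -> (f.Label, (engine.Predict f).PredictedLabel))
|> Seq.countBy id
|> Seq.sortBy fst
|> Seq.iter(fun ((label, cluster), count) ->
    printfn "  %-15s -> cluster %i: %i flowers" label cluster count)
```

If the clustering worked well, each Iris type should end up almost entirely in its own cluster, with most of the remaining confusion between Versicolor and Virginica.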
-------------------------------------------------------------------------------- /Clustering/IrisFlower/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/data.png -------------------------------------------------------------------------------- /Clustering/IrisFlower/assets/flowers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/flowers.png -------------------------------------------------------------------------------- /Clustering/IrisFlower/iris-data.csv: -------------------------------------------------------------------------------- 1 | 5.1,3.5,1.4,0.2,Iris-setosa 2 | 4.9,3.0,1.4,0.2,Iris-setosa 3 | 4.7,3.2,1.3,0.2,Iris-setosa 4 | 4.6,3.1,1.5,0.2,Iris-setosa 5 | 5.0,3.6,1.4,0.2,Iris-setosa 6 | 5.4,3.9,1.7,0.4,Iris-setosa 7 | 4.6,3.4,1.4,0.3,Iris-setosa 8 | 5.0,3.4,1.5,0.2,Iris-setosa 9 | 4.4,2.9,1.4,0.2,Iris-setosa 10 | 4.9,3.1,1.5,0.1,Iris-setosa 11 | 5.4,3.7,1.5,0.2,Iris-setosa 12 | 4.8,3.4,1.6,0.2,Iris-setosa 13 | 4.8,3.0,1.4,0.1,Iris-setosa 14 | 4.3,3.0,1.1,0.1,Iris-setosa 15 | 5.8,4.0,1.2,0.2,Iris-setosa 16 | 5.7,4.4,1.5,0.4,Iris-setosa 17 | 5.4,3.9,1.3,0.4,Iris-setosa 18 | 5.1,3.5,1.4,0.3,Iris-setosa 19 | 5.7,3.8,1.7,0.3,Iris-setosa 20 | 5.1,3.8,1.5,0.3,Iris-setosa 21 | 5.4,3.4,1.7,0.2,Iris-setosa 22 | 5.1,3.7,1.5,0.4,Iris-setosa 23 | 4.6,3.6,1.0,0.2,Iris-setosa 24 | 5.1,3.3,1.7,0.5,Iris-setosa 25 | 4.8,3.4,1.9,0.2,Iris-setosa 26 | 5.0,3.0,1.6,0.2,Iris-setosa 27 | 5.0,3.4,1.6,0.4,Iris-setosa 28 | 5.2,3.5,1.5,0.2,Iris-setosa 29 | 5.2,3.4,1.4,0.2,Iris-setosa 30 | 4.7,3.2,1.6,0.2,Iris-setosa 31 | 4.8,3.1,1.6,0.2,Iris-setosa 32 | 5.4,3.4,1.5,0.4,Iris-setosa 33 | 5.2,4.1,1.5,0.1,Iris-setosa 34 | 5.5,4.2,1.4,0.2,Iris-setosa 35 | 4.9,3.1,1.5,0.1,Iris-setosa 36 | 5.0,3.2,1.2,0.2,Iris-setosa 37 | 5.5,3.5,1.3,0.2,Iris-setosa 38 | 4.9,3.1,1.5,0.1,Iris-setosa 39 | 4.4,3.0,1.3,0.2,Iris-setosa 40 | 5.1,3.4,1.5,0.2,Iris-setosa 41 | 5.0,3.5,1.3,0.3,Iris-setosa 42 | 4.5,2.3,1.3,0.3,Iris-setosa 43 | 4.4,3.2,1.3,0.2,Iris-setosa 44 | 5.0,3.5,1.6,0.6,Iris-setosa 45 | 5.1,3.8,1.9,0.4,Iris-setosa 46 | 4.8,3.0,1.4,0.3,Iris-setosa 47 | 5.1,3.8,1.6,0.2,Iris-setosa 48 | 4.6,3.2,1.4,0.2,Iris-setosa 49 | 5.3,3.7,1.5,0.2,Iris-setosa 50 | 5.0,3.3,1.4,0.2,Iris-setosa 51 | 7.0,3.2,4.7,1.4,Iris-versicolor 52 | 6.4,3.2,4.5,1.5,Iris-versicolor 53 | 6.9,3.1,4.9,1.5,Iris-versicolor 54 | 5.5,2.3,4.0,1.3,Iris-versicolor 55 | 6.5,2.8,4.6,1.5,Iris-versicolor 56 | 5.7,2.8,4.5,1.3,Iris-versicolor 57 | 6.3,3.3,4.7,1.6,Iris-versicolor 58 | 4.9,2.4,3.3,1.0,Iris-versicolor 59 | 6.6,2.9,4.6,1.3,Iris-versicolor 60 | 5.2,2.7,3.9,1.4,Iris-versicolor 61 | 5.0,2.0,3.5,1.0,Iris-versicolor 62 | 5.9,3.0,4.2,1.5,Iris-versicolor 63 | 6.0,2.2,4.0,1.0,Iris-versicolor 64 | 6.1,2.9,4.7,1.4,Iris-versicolor 65 | 5.6,2.9,3.6,1.3,Iris-versicolor 66 | 6.7,3.1,4.4,1.4,Iris-versicolor 67 | 5.6,3.0,4.5,1.5,Iris-versicolor 68 | 5.8,2.7,4.1,1.0,Iris-versicolor 69 | 6.2,2.2,4.5,1.5,Iris-versicolor 70 | 5.6,2.5,3.9,1.1,Iris-versicolor 71 | 5.9,3.2,4.8,1.8,Iris-versicolor 72 | 6.1,2.8,4.0,1.3,Iris-versicolor 73 | 6.3,2.5,4.9,1.5,Iris-versicolor 74 | 6.1,2.8,4.7,1.2,Iris-versicolor 75 | 6.4,2.9,4.3,1.3,Iris-versicolor 76 | 6.6,3.0,4.4,1.4,Iris-versicolor 77 | 6.8,2.8,4.8,1.4,Iris-versicolor 78 | 6.7,3.0,5.0,1.7,Iris-versicolor 79 
| 6.0,2.9,4.5,1.5,Iris-versicolor 80 | 5.7,2.6,3.5,1.0,Iris-versicolor 81 | 5.5,2.4,3.8,1.1,Iris-versicolor 82 | 5.5,2.4,3.7,1.0,Iris-versicolor 83 | 5.8,2.7,3.9,1.2,Iris-versicolor 84 | 6.0,2.7,5.1,1.6,Iris-versicolor 85 | 5.4,3.0,4.5,1.5,Iris-versicolor 86 | 6.0,3.4,4.5,1.6,Iris-versicolor 87 | 6.7,3.1,4.7,1.5,Iris-versicolor 88 | 6.3,2.3,4.4,1.3,Iris-versicolor 89 | 5.6,3.0,4.1,1.3,Iris-versicolor 90 | 5.5,2.5,4.0,1.3,Iris-versicolor 91 | 5.5,2.6,4.4,1.2,Iris-versicolor 92 | 6.1,3.0,4.6,1.4,Iris-versicolor 93 | 5.8,2.6,4.0,1.2,Iris-versicolor 94 | 5.0,2.3,3.3,1.0,Iris-versicolor 95 | 5.6,2.7,4.2,1.3,Iris-versicolor 96 | 5.7,3.0,4.2,1.2,Iris-versicolor 97 | 5.7,2.9,4.2,1.3,Iris-versicolor 98 | 6.2,2.9,4.3,1.3,Iris-versicolor 99 | 5.1,2.5,3.0,1.1,Iris-versicolor 100 | 5.7,2.8,4.1,1.3,Iris-versicolor 101 | 6.3,3.3,6.0,2.5,Iris-virginica 102 | 5.8,2.7,5.1,1.9,Iris-virginica 103 | 7.1,3.0,5.9,2.1,Iris-virginica 104 | 6.3,2.9,5.6,1.8,Iris-virginica 105 | 6.5,3.0,5.8,2.2,Iris-virginica 106 | 7.6,3.0,6.6,2.1,Iris-virginica 107 | 4.9,2.5,4.5,1.7,Iris-virginica 108 | 7.3,2.9,6.3,1.8,Iris-virginica 109 | 6.7,2.5,5.8,1.8,Iris-virginica 110 | 7.2,3.6,6.1,2.5,Iris-virginica 111 | 6.5,3.2,5.1,2.0,Iris-virginica 112 | 6.4,2.7,5.3,1.9,Iris-virginica 113 | 6.8,3.0,5.5,2.1,Iris-virginica 114 | 5.7,2.5,5.0,2.0,Iris-virginica 115 | 5.8,2.8,5.1,2.4,Iris-virginica 116 | 6.4,3.2,5.3,2.3,Iris-virginica 117 | 6.5,3.0,5.5,1.8,Iris-virginica 118 | 7.7,3.8,6.7,2.2,Iris-virginica 119 | 7.7,2.6,6.9,2.3,Iris-virginica 120 | 6.0,2.2,5.0,1.5,Iris-virginica 121 | 6.9,3.2,5.7,2.3,Iris-virginica 122 | 5.6,2.8,4.9,2.0,Iris-virginica 123 | 7.7,2.8,6.7,2.0,Iris-virginica 124 | 6.3,2.7,4.9,1.8,Iris-virginica 125 | 6.7,3.3,5.7,2.1,Iris-virginica 126 | 7.2,3.2,6.0,1.8,Iris-virginica 127 | 6.2,2.8,4.8,1.8,Iris-virginica 128 | 6.1,3.0,4.9,1.8,Iris-virginica 129 | 6.4,2.8,5.6,2.1,Iris-virginica 130 | 7.2,3.0,5.8,1.6,Iris-virginica 131 | 7.4,2.8,6.1,1.9,Iris-virginica 132 | 7.9,3.8,6.4,2.0,Iris-virginica 133 | 6.4,2.8,5.6,2.2,Iris-virginica 134 | 6.3,2.8,5.1,1.5,Iris-virginica 135 | 6.1,2.6,5.6,1.4,Iris-virginica 136 | 7.7,3.0,6.1,2.3,Iris-virginica 137 | 6.3,3.4,5.6,2.4,Iris-virginica 138 | 6.4,3.1,5.5,1.8,Iris-virginica 139 | 6.0,3.0,4.8,1.8,Iris-virginica 140 | 6.9,3.1,5.4,2.1,Iris-virginica 141 | 6.7,3.1,5.6,2.4,Iris-virginica 142 | 6.9,3.1,5.1,2.3,Iris-virginica 143 | 5.8,2.7,5.1,1.9,Iris-virginica 144 | 6.8,3.2,5.9,2.3,Iris-virginica 145 | 6.7,3.3,5.7,2.5,Iris-virginica 146 | 6.7,3.0,5.2,2.3,Iris-virginica 147 | 6.3,2.5,5.0,1.9,Iris-virginica 148 | 6.5,3.0,5.2,2.0,Iris-virginica 149 | 6.2,3.4,5.4,2.3,Iris-virginica 150 | 5.9,3.0,5.1,1.8,Iris-virginica 151 | -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/CaliforniaHousing.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | open FSharp.Plotly 5 | 6 | /// The HouseBlockData class holds one single housing block data record. 
7 | [<CLIMutable>] 8 | type HouseBlockData = { 9 | [<LoadColumn(0)>] Longitude : float32 10 | [<LoadColumn(1)>] Latitude : float32 11 | [<LoadColumn(2)>] HousingMedianAge : float32 12 | [<LoadColumn(3)>] TotalRooms : float32 13 | [<LoadColumn(4)>] TotalBedrooms : float32 14 | [<LoadColumn(5)>] Population : float32 15 | [<LoadColumn(6)>] Households : float32 16 | [<LoadColumn(7)>] MedianIncome : float32 17 | [<LoadColumn(8)>] MedianHouseValue : float32 18 | } 19 | 20 | /// The ToMedianHouseValue class is used in a column data conversion. 21 | [<CLIMutable>] 22 | type ToMedianHouseValue = { 23 | mutable NormalizedMedianHouseValue : float32 24 | } 25 | 26 | /// The ToRoomsPerPerson class is used in a column data conversion. 27 | [<CLIMutable>] 28 | type ToRoomsPerPerson = { 29 | mutable RoomsPerPerson : float32 30 | } 31 | 32 | /// The FromLocation class is used in a column data conversion. 33 | [<CLIMutable>] 34 | type FromLocation = { 35 | EncodedLongitude : float32[] 36 | EncodedLatitude : float32[] 37 | } 38 | 39 | /// The ToLocation class is used in a column data conversion. 40 | [<CLIMutable>] 41 | type ToLocation = { 42 | mutable Location : float32[] 43 | } 44 | 45 | /// file paths to data files (assumes os = windows!) 46 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory 47 | 48 | [<EntryPoint>] 49 | let main argv = 50 | 51 | // create the machine learning context 52 | let context = new MLContext() 53 | 54 | // load the dataset 55 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',') 56 | 57 | // keep only records with a median house value < 500,000 58 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0) 59 | 60 | // get an array of housing data 61 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false) 62 | 63 | // // plot median house value by median income 64 | // Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue))) 65 | // |> Chart.withX_AxisStyle "Median income" 66 | // |> Chart.withY_AxisStyle "Median house value" 67 | // |> Chart.Show 68 | 69 | // build a data loading pipeline 70 | let pipeline = 71 | EstimatorChain() 72 | 73 | // step 1: divide the median house value by 1000 74 | .Append( 75 | context.Transforms.CustomMapping( 76 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f), 77 | "MedianHouseValue")) 78 | 79 | // get a 10-record preview of the transformed data 80 | let model = data |> pipeline.Fit 81 | let preview = (data |> model.Transform).Preview(maxRows = 10) 82 | 83 | // // show the preview 84 | // preview.ColumnView |> Seq.iter(fun c -> 85 | // printf "%-30s|" c.Column.Name 86 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 87 | // printfn "") 88 | 89 | // // plot median house value by longitude 90 | // Chart.Point(houses |> Seq.map(fun h -> (h.Longitude, h.MedianHouseValue))) 91 | // |> Chart.withX_AxisStyle "Longitude" 92 | // |> Chart.withY_AxisStyle "Median house value" 93 | // |> Chart.Show 94 | 95 | // step 2: bin the longitude 96 | let pipeline2 = 97 | pipeline 98 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10)) 99 | 100 | // step 3: bin the latitude 101 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10)) 102 | 103 | // step 4: one-hot encode the longitude 104 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude")) 105 | 106 | // step 5: one-hot encode the latitude 107 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude")) 108 | 109 | .Append( 110 
| context.Transforms.CustomMapping( 111 | Action<FromLocation, ToLocation>(fun input output -> 112 | output.Location <- [| for x in input.EncodedLongitude do 113 | for y in input.EncodedLatitude do 114 | x * y |] ), 115 | "Location")) 116 | 117 | // get a 10-record preview of the transformed data 118 | let model = data |> pipeline2.Fit 119 | let preview = (data |> model.Transform).Preview(maxRows = 10) 120 | 121 | // // show the preview 122 | // preview.ColumnView |> Seq.iter(fun c -> 123 | // printf "%-30s|" c.Column.Name 124 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 125 | // printfn "") 126 | 127 | // show the dense vector 128 | preview.RowView |> Seq.iter(fun r -> 129 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32> 130 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v)) 131 | printfn "") 132 | 133 | 0 // return value -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Load California housing data 2 | 3 | In this assignment you're going to build an app that can load a dataset with the prices of houses in California. The data is not ready for training yet and needs a bit of processing. 4 | 5 | The first thing you'll need is a data file with house prices. The data from the 1990 California census has exactly what we need. 6 | 7 | Download the [California 1990 housing census](https://github.com/mdfarragher/DSC/blob/master/LoadingData/CaliforniaHousing/california_housing.csv) and save it as **california_housing.csv**. 8 | 9 | This is a CSV file with 17,000 records that looks like this: 10 |  11 | ![Data File](./assets/data.png) 12 | 13 | The file contains information on 17k housing blocks all over the state of California: 14 | 15 | * Column 1: The longitude of the housing block 16 | * Column 2: The latitude of the housing block 17 | * Column 3: The median age of all the houses in the block 18 | * Column 4: The total number of rooms in all houses in the block 19 | * Column 5: The total number of bedrooms in all houses in the block 20 | * Column 6: The total number of people living in all houses in the block 21 | * Column 7: The total number of households in all houses in the block 22 | * Column 8: The median income of all people living in all houses in the block 23 | * Column 9: The median house value for all houses in the block 24 | 25 | We can use this data to train an app to predict the value of any house in and outside the state of California. 26 | 27 | Unfortunately we cannot train on this dataset directly. The data needs to be processed first to make it suitable for training. This is what you will do in this assignment. 28 | 29 | Let's get started. 30 | 31 | In these assignments you will not be using the code in GitHub. Instead, you'll be building all the applications 100% from scratch. So please make sure to create a new folder somewhere to hold all of your assignments. 32 | 33 | Now please open a console window. You are going to create a new subfolder for this assignment and set up a blank console application: 34 | 35 | ```bash 36 | $ dotnet new console --language F# --output LoadingData 37 | $ cd LoadingData 38 | ``` 39 | 40 | Also make sure to copy the dataset file(s) into this folder because the code you're going to type next will expect them here. 
41 | 42 | Now install the following packages: 43 | 44 | ```bash 45 | $ dotnet add package Microsoft.ML 46 | $ dotnet add package FSharp.Plotly 47 | ``` 48 | 49 | **Microsoft.ML** is the Microsoft machine learning package. We will use it to build all our applications in this course. And **FSharp.Plotly** is an advanced scientific plotting library. 50 | 51 | Now you are ready to add types. You’ll need one type to hold all the information for a single housing block. 52 | 53 | Edit the Program.fs file with Visual Studio Code and add the following code: 54 | 55 | ```fsharp 56 | open System 57 | open Microsoft.ML 58 | open Microsoft.ML.Data 59 | open FSharp.Plotly 60 | 61 | /// The HouseBlockData class holds one single housing block data record. 62 | [<CLIMutable>] 63 | type HouseBlockData = { 64 | [<LoadColumn(0)>] Longitude : float32 65 | [<LoadColumn(1)>] Latitude : float32 66 | [<LoadColumn(2)>] HousingMedianAge : float32 67 | [<LoadColumn(3)>] TotalRooms : float32 68 | [<LoadColumn(4)>] TotalBedrooms : float32 69 | [<LoadColumn(5)>] Population : float32 70 | [<LoadColumn(6)>] Households : float32 71 | [<LoadColumn(7)>] MedianIncome : float32 72 | [<LoadColumn(8)>] MedianHouseValue : float32 73 | } 74 | ``` 75 | 76 | The **HouseBlockData** class holds all the data for one single housing block. Note that we're loading each column as a 32-bit floating point number, and that every field is tagged with a **LoadColumn** attribute that will tell the CSV data loading code which column to import data from. 77 | 78 | We also need the **CLIMutable** attribute to tell F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 79 | 80 | Next you need to load the data in memory: 81 | 82 | ```fsharp 83 | /// file paths to data files (assumes os = windows!) 84 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory 85 | 86 | [<EntryPoint>] 87 | let main argv = 88 | 89 | // create the machine learning context 90 | let context = new MLContext() 91 | 92 | // load the dataset 93 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',') 94 | 95 | // the rest of the code goes here... 96 | 97 | 0 // return value 98 | ``` 99 | 100 | This code sets up the **main** function which is the main entry point of the application. The code calls the **LoadFromTextFile** method to load the CSV data in memory. Note the **HouseBlockData** type argument that tells the method which class to use to load the data. 101 | 102 | Also note that **dataPath** uses a Windows path separator to access the data file. Change this accordingly if you're using macOS or Linux. 103 | 104 | So now we have the data in memory. Let's plot the median house value as a function of median income and see what happens. 105 | 106 | Add the following code: 107 | 108 | ```fsharp 109 | // get an array of housing data 110 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false) 111 | 112 | // plot median house value by median income 113 | Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue))) 114 | |> Chart.withX_AxisStyle "Median income" 115 | |> Chart.withY_AxisStyle "Median house value" 116 | |> Chart.Show 117 | 118 | // the rest of the code goes here 119 | ``` 120 | 121 | The housing data is stored in memory as a data view, but we want to work with the **HouseBlockData** records directly. 
So we call **CreateEnumerable** to convert the data view to an enumeration of **HouseBlockData** instances. 122 | 123 | The **Chart.Point** method then sets up a scatterplot. We pipe the **houses** enumeration into the **Seq.map** function and project a tuple for every housing block. The tuples contain the median income and median house value for every block, and **Chart.Point** will use these as X and Y coordinates. 124 | 125 | The **Chart.withX_AxisStyle** and **Chart.withY_AxisStyle** functions set the chart axis titles, and **Chart.Show** renders the chart on screen. Your app will open a web browser and display the chart there. 126 | 127 | This is a good moment to save your work ;) 128 | 129 | We're now ready to run the app. Open a PowerShell terminal and make sure you're in the project folder. Then type the following: 130 | 131 | ```bash 132 | $ dotnet build 133 | ``` 134 | 135 | This will build the project and populate the bin folder. 136 | 137 | Then type the following: 138 | 139 | ```bash 140 | $ dotnet run 141 | ``` 142 | 143 | Your app will run and open the chart in a new browser window. It should look like this: 144 | 145 | ![Median house value by median income](./assets/plot.png) 146 | 147 | As the median income level increases, the median house value also increases. There's still a big spread in the house values, but a vague 'cigar' shape is visible which suggests a linear relationship between these two variables. 148 | 149 | But look at the horizontal line at 500,000. What's that all about? 150 | 151 | This is what **clipping** looks like. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 back down to $500,000. We see this appear in the graph as a horizontal line that disrupts the linear 'cigar' shape. 152 | 153 | Let's start by using **data scrubbing** to get rid of these clipped records. Add the following code: 154 | 155 | ```fsharp 156 | // keep only records with a median house value < 500,000 157 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0) 158 | 159 | // the rest of the code goes here... 160 | ``` 161 | 162 | The **FilterRowsByColumn** method will keep only those records with a median house value below $500,000, and remove all other records from the dataset. 163 | 164 | Move your plotting code BELOW this code fragment and run your app again. 165 | 166 | Did this fix the problem? Is the clipping line gone? 167 | 168 | Now let's take a closer look at the CSV file. Notice how all the columns are numbers in the range of 0..3000, but the median house value is in a range of 0..500,000. 169 | 170 | Remember how, when we talked about training data science models, we discussed having all data in a similar range? 171 | 172 | So let's fix that now by using **data scaling**. We're going to divide the median house value by 1,000 to bring it down to a range more in line with the other data columns. 173 | 174 | Start by adding the following type: 175 | 176 | ```fsharp 177 | /// The ToMedianHouseValue class is used in a column data conversion. 
178 | [<CLIMutable>] 179 | type ToMedianHouseValue = { 180 | mutable NormalizedMedianHouseValue : float32 181 | } 182 | ``` 183 | 184 | And then add the following code at the bottom of your **main** function: 185 | 186 | ```fsharp 187 | // build a data loading pipeline 188 | let pipeline = 189 | EstimatorChain() 190 | 191 | // step 1: divide the median house value by 1000 192 | .Append( 193 | context.Transforms.CustomMapping( 194 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f), 195 | "MedianHouseValue")) 196 | 197 | // the rest of the code goes here... 198 | ``` 199 | 200 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 201 | 202 | This pipeline has only one component: 203 | 204 | * **CustomMapping** which takes the median house values, divides them by 1,000, and stores them in a new column called **NormalizedMedianHouseValue**. Note that we need the new **ToMedianHouseValue** type to access this new column in code. 205 | 206 | Also note the **mutable** keyword in the type definition for **ToMedianHouseValue**. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction. 207 | 208 | If we had left out the keyword, the **output.NormalizedMedianHouseValue <- ...** line would fail. 209 | 210 | Now let's see if the conversion worked. Add the following code at the bottom of the **main** function: 211 | 212 | ```fsharp 213 | // get a 10-record preview of the transformed data 214 | let model = data |> pipeline.Fit 215 | let preview = (data |> model.Transform).Preview(maxRows = 10) 216 | 217 | // show the preview 218 | preview.ColumnView |> Seq.iter(fun c -> 219 | printf "%-30s|" c.Column.Name 220 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 221 | printfn "") 222 | 223 | // the rest of the code goes here... 224 | ``` 225 | 226 | The **pipeline.Fit** method sets up the pipeline, creates a data science model and stores it in the **model** variable. The **model.Transform** method then runs the dataset through the pipeline and creates predictions for every housing block. And finally the **Preview** method extracts a 10-row preview from the collection of predictions. 227 | 228 | Next, we use **Seq.iter** to enumerate every column in the preview. We print the column name and then use a second **Seq.iter** to show all the preview values in this column. 229 | 230 | This will print a transposed view of the preview data with the columns stacked vertically and the rows stacked horizontally. Flipping the preview makes it easier to read, despite the very long column names. 231 | 232 | Now run your code. 233 | 234 | Find the MedianHouseValue and NormalizedMedianHouseValue columns in the output. Do they contain the correct values? Does the normalized column contain the original house values divided by 1,000? 235 | 236 | Now let's fix the latitude and longitude. We're reading them in directly, but remember that we discussed how **Geo data should always be binned, one-hot encoded, and crossed?** 237 | 238 | Let's do that now. Add the following types at the top of the file: 239 | 240 | ```fsharp 241 | /// The FromLocation class is used in a column data conversion. 
242 | [<CLIMutable>] 243 | type FromLocation = { 244 | EncodedLongitude : float32[] 245 | EncodedLatitude : float32[] 246 | } 247 | 248 | /// The ToLocation class is used in a column data conversion. 249 | [<CLIMutable>] 250 | type ToLocation = { 251 | mutable Location : float32[] 252 | } 253 | ``` 254 | 255 | Note the **mutable** keyword again, which indicates that we're going to modify the **Location** property of the **ToLocation** type after construction. 256 | 257 | We will use these types in the next code snippet. 258 | 259 | Now scroll down to the bottom of the **main** function and add the following code just before the final line that returns a zero return value: 260 | 261 | ```fsharp 262 | // step 2: bin the longitude 263 | let pipeline2 = 264 | pipeline 265 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10)) 266 | 267 | // step 3: bin the latitude 268 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10)) 269 | 270 | // step 4: one-hot encode the longitude 271 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude")) 272 | 273 | // step 5: one-hot encode the latitude 274 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude")) 275 | 276 | // step 6: cross the longitude and latitude vectors 277 | .Append( 278 | context.Transforms.CustomMapping( 279 | Action<FromLocation, ToLocation>(fun input output -> 280 | output.Location <- [| for x in input.EncodedLongitude do 281 | for y in input.EncodedLatitude do 282 | x * y |] ), 283 | "Location")) 284 | 285 | // the rest of the code goes here... 286 | ``` 287 | 288 | Note how we're extending the data loading pipeline with extra components. The new components are: 289 | 290 | * Two **NormalizeBinning** components that bin the longitude and latitude values into 10 bins 291 | 292 | * Two **OneHotEncoding** components that one-hot encode the longitude and latitude bins 293 | 294 | * One **CustomMapping** component that multiplies (crosses) the longitude and latitude vectors to create a feature cross: a 100-element vector with all zeroes except for a single '1' value. 295 | 296 | Note how the custom mapping uses two nested for-loops inside the **[| ... |]** array brackets. This sets up an inline enumerator that multiplies the two longitude and latitude vectors and produces a 1-dimensional array with 100 elements. 297 | 298 | Let's see if this worked. Add the following code to the bottom of the **main** function: 299 | 300 | ```fsharp 301 | // get a 10-record preview of the transformed data 302 | let model = data |> pipeline2.Fit 303 | let preview = (data |> model.Transform).Preview(maxRows = 10) 304 | 305 | // show the preview 306 | preview.ColumnView |> Seq.iter(fun c -> 307 | printf "%-30s|" c.Column.Name 308 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value) 309 | printfn "") 310 | 311 | // the rest of the code goes here... 312 | ``` 313 | 314 | This is the same code you used previously to create predictions, get a preview, and display the preview on the console. But now you're using **pipeline2** instead. 315 | 316 | Now run your app. 317 | 318 | What does the data look like now? Can you spot the new columns with the binned and one-hot encoded longitude and latitude values? 319 | 320 | And is the new **Location** column present? 321 | 322 | You should see the new **Location** column, but the code can't display its contents properly. This is because **Location** is a vector column, and the **%O** format specifier in the preview code just calls **ToString** on each value instead of expanding the vector into its individual elements. 
323 | 324 | So let's fix that. Add the following code to display all the individual values in the **Location** vector: 325 | 326 | ```fsharp 327 | // show the dense vector 328 | preview.RowView |> Seq.iter(fun r -> 329 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32> 330 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v)) 331 | printfn "") 332 | ``` 333 | 334 | We use **Seq.iter** to enumerate every row in the preview. And note the **:?>** operator which casts the value to a **VBuffer** of floats. With this cast value we can call the **DenseValues** function which returns all the elements in the vector as a sequence of floats. So we pipe that sequence into a second **Seq.iter** to print the values. 335 | 336 | Now run your app. What do you see? Did it work? Are there 100 digits in the **Location** column? And is there only a single '1' digit in each row? 337 | 338 | Post your results in our group. -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/data.png -------------------------------------------------------------------------------- /LoadingData/CaliforniaHousing/assets/plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/plot.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/Mnist.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | open Microsoft.ML.Transforms 6 | 7 | /// The Digit class represents one mnist digit. 8 | [<CLIMutable>] 9 | type Digit = { 10 | [<LoadColumn(0)>] Number : float32 11 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[] 12 | } 13 | 14 | /// The DigitPrediction class represents one digit prediction. 15 | [<CLIMutable>] 16 | type DigitPrediction = { 17 | Score : float32[] 18 | } 19 | 20 | /// file paths to train and test data files (assumes os = windows!) 
21 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory 22 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory 23 | 24 | [<EntryPoint>] 25 | let main argv = 26 | 27 | // create a machine learning context 28 | let context = new MLContext() 29 | 30 | // load the datafiles 31 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',') 32 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',') 33 | 34 | // build a training pipeline 35 | let pipeline = 36 | EstimatorChain() 37 | 38 | // step 1: map the number column to a key value and store in the label column 39 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue)) 40 | 41 | // step 2: concatenate all feature columns 42 | .Append(context.Transforms.Concatenate("Features", "PixelValues")) 43 | 44 | // step 3: cache data to speed up training 45 | .AppendCacheCheckpoint(context) 46 | 47 | // step 4: train the model with SDCA 48 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy()) 49 | 50 | // step 5: map the label key value back to a number 51 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label")) 52 | 53 | // train the model 54 | let model = trainData |> pipeline.Fit 55 | 56 | // get predictions and compare them to the ground truth 57 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate 58 | 59 | // show evaluation metrics 60 | printfn "Evaluation metrics" 61 | printfn "  MicroAccuracy:    %f" metrics.MicroAccuracy 62 | printfn "  MacroAccuracy:    %f" metrics.MacroAccuracy 63 | printfn "  LogLoss:          %f" metrics.LogLoss 64 | printfn "  LogLossReduction: %f" metrics.LogLossReduction 65 | 66 | // grab five digits from the test data 67 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq 68 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ] 69 | 70 | // create a prediction engine 71 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction> model 72 | 73 | // show predictions 74 | printfn "Model predictions:" 75 | printf "  #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn "" 76 | testDigits |> Seq.iter( 77 | fun digit -> 78 | printf "  %i\t" (int digit.Number) 79 | let p = engine.Predict digit 80 | p.Score |> Seq.iter (fun s -> printf "%f\t" s) 81 | printfn "") 82 | 83 | 0 // return value -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Recognize handwritten digits 2 | 3 | In this assignment, you are going to build an app that recognizes handwritten digits from the famous MNIST machine learning dataset: 4 | 5 | ![MNIST digits](./assets/mnist.png) 6 | 7 | Your app must read these images of handwritten digits and correctly predict which digit is visible in each image. 8 | 9 | This may seem like an easy challenge, but look at this: 10 | 11 | ![Difficult MNIST digits](./assets/mnist_hard.png) 12 | 13 | These are a couple of digits from the dataset. Are you able to identify each one? It probably won’t surprise you to hear that the human error rate on this exercise is around 2.5%. 14 | 15 | The first thing you will need for your app is a data file with images of handwritten digits. 
We will not use the original MNIST data because it's stored in a nonstandard binary format. 16 | 17 | Instead, we'll use these excellent [CSV files](https://www.kaggle.com/oddrationale/mnist-in-csv/) prepared by Daniel Dato on Kaggle. 18 | 19 | Create a Kaggle account if you don't have one yet, then download **mnist_train.csv** and **mnist_test.csv** and save them in your project folder. 20 | 21 | There are 60,000 images in the training file and 10,000 in the test file. Each image is monochrome and resized to 28x28 pixels. 22 | 23 | The training file looks like this: 24 | 25 | ![Data file](./assets/datafile.png) 26 | 27 | It’s a CSV file with 785 columns: 28 | 29 | * The first column contains the label. It tells us which one of the 10 possible digits is visible in the image. 30 | * The next 784 columns are the pixel intensity values (0..255) for each pixel in the image, counting from left to right and top to bottom. 31 | 32 | You are going to build a multiclass classification machine learning model that reads in all 785 columns, and then makes a prediction for each digit in the dataset. 33 | 34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 35 | 36 | ```bash 37 | $ dotnet new console --language F# --output Mnist 38 | $ cd Mnist 39 | ``` 40 | 41 | Now install the ML.NET package: 42 | 43 | ```bash 44 | $ dotnet add package Microsoft.ML 45 | ``` 46 | 47 | Now you are ready to add types. You’ll need one to hold a digit, and one to hold your model prediction. 48 | 49 | Replace the contents of the Program.fs file with this: 50 | 51 | ```fsharp 52 | open System 53 | open System.IO 54 | open Microsoft.ML 55 | open Microsoft.ML.Data 56 | open Microsoft.ML.Transforms 57 | 58 | /// The Digit class represents one mnist digit. 59 | [<CLIMutable>] 60 | type Digit = { 61 | [<LoadColumn(0)>] Number : float32 62 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[] 63 | } 64 | 65 | /// The DigitPrediction class represents one digit prediction. 66 | [<CLIMutable>] 67 | type DigitPrediction = { 68 | Score : float32[] 69 | } 70 | ``` 71 | 72 | The **Digit** type holds one single MNIST digit image. Note how the **PixelValues** field is tagged with a **VectorType** attribute. This tells ML.NET to combine the 784 individual pixel columns into a single vector value. 73 | 74 | There's also a **DigitPrediction** type which will hold a single prediction. And notice how the prediction score is actually an array? The model will generate 10 scores, one for every possible digit value. 75 | 76 | Also note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 77 | 78 | Next you'll need to load the data in memory: 79 | 80 | ```fsharp 81 | /// file paths to train and test data files (assumes os = windows!) 
82 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory 83 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory 84 | 85 | [<EntryPoint>] 86 | let main argv = 87 | 88 | // create a machine learning context 89 | let context = new MLContext() 90 | 91 | // load the datafiles 92 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',') 93 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',') 94 | 95 | // the rest of the code goes here.... 96 | 97 | 0 // return value 98 | ``` 99 | 100 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory. We call this function twice to load the training and testing datasets separately. 101 | 102 | Now let’s build the machine learning pipeline: 103 | 104 | ```fsharp 105 | // build a training pipeline 106 | let pipeline = 107 | EstimatorChain() 108 | 109 | // step 1: map the number column to a key value and store in the label column 110 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue)) 111 | 112 | // step 2: concatenate all feature columns 113 | .Append(context.Transforms.Concatenate("Features", "PixelValues")) 114 | 115 | // step 3: cache data to speed up training 116 | .AppendCacheCheckpoint(context) 117 | 118 | // step 4: train the model with SDCA 119 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy()) 120 | 121 | // step 5: map the label key value back to a number 122 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label")) 123 | 124 | // train the model 125 | let model = trainData |> pipeline.Fit 126 | 127 | // the rest of the code goes here.... 128 | ``` 129 | 130 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 131 | 132 | This pipeline has the following components: 133 | 134 | * **MapValueToKey** which reads the **Number** column and builds a dictionary of unique values. It then produces an output column called **Label** which contains the dictionary key for each number value. We need this step because we can only train a multiclass classifier on keys. 135 | * **Concatenate** which converts the PixelValues vector into a single column called Features. This is a required step because ML.NET can only train on a single input column. 136 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm which comes next. 137 | * A **SdcaMaximumEntropy** classification learner which will train the model to make accurate predictions. 138 | * A final **MapKeyToValue** step which converts the keys in the **Label** column back to the original number values. We need this step to show the numbers when making predictions. 139 | 140 | With the pipeline fully assembled, we can train the model by piping the training data into the **Fit** function. 141 | 142 | You now have a fully-trained model. 
So now it's time to take the test set, predict the number for each digit image, and calculate the accuracy metrics of the model: 143 | 144 | ```fsharp 145 | // get predictions and compare them to the ground truth 146 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate 147 | 148 | // show evaluation metrics 149 | printfn "Evaluation metrics" 150 | printfn "  MicroAccuracy:    %f" metrics.MicroAccuracy 151 | printfn "  MacroAccuracy:    %f" metrics.MacroAccuracy 152 | printfn "  LogLoss:          %f" metrics.LogLoss 153 | printfn "  LogLossReduction: %f" metrics.LogLossReduction 154 | 155 | // the rest of the code goes here.... 156 | ``` 157 | 158 | This code pipes the test data into the **Transform** function to set up predictions for every single image in the test set. Then it pipes these predictions into the **Evaluate** function to compare these predictions to the actual labels and automatically calculate four metrics: 159 | 160 | * **MicroAccuracy**: this is the average accuracy (=the number of correct predictions divided by the total number of predictions) for every digit in the dataset. 161 | * **MacroAccuracy**: this is calculated by first calculating the average accuracy for each unique prediction value, and then taking the average of those averages. 162 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes. 163 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance. 164 | 165 | We can compare the micro- and macro accuracy to discover if the dataset is biased. In an unbiased set each unique label value will appear roughly the same number of times, and the micro- and macro accuracy values will be close together. 166 | 167 | If the values are far apart, this suggests that there is some kind of bias in the data that we need to deal with. 168 | 169 | To wrap up, let’s use the model to make a prediction. 170 | 171 | You will pick five arbitrary digits from the test set, run them through the model, and make a prediction for each one. 172 | 173 | Here’s how to do it: 174 | 175 | ```fsharp 176 | // grab five digits from the test data 177 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq 178 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ] 179 | 180 | // create a prediction engine 181 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction> model 182 | 183 | // show predictions 184 | printfn "Model predictions:" 185 | printf "  #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn "" 186 | testDigits |> Seq.iter( 187 | fun digit -> 188 | printf "  %i\t" (int digit.Number) 189 | let p = engine.Predict digit 190 | p.Score |> Seq.iter (fun s -> printf "%f\t" s) 191 | printfn "") 192 | ``` 193 | 194 | This code calls the **CreateEnumerable** function to convert the test dataview to an array of **Digit** instances. Then it picks five arbitrary digits for testing. 195 | 196 | We then call the **CreatePredictionEngine** function to set up a prediction engine. 197 | 198 | The code then calls **Seq.iter** to print column headings for each of the 10 possible digit values. 
We then pipe the 5 test digits into another **Seq.iter**, make a prediction for each test digit, and then use a third **Seq.iter** to display the 10 prediction scores. 199 | 200 | This will produce a table with 5 rows of test digits, and 10 columns of prediction scores. The column with the highest score represents the prediction for that particular test digit. 201 | 202 | That's it, you're done! 203 | 204 | Go to your terminal and run your code: 205 | 206 | ```bash 207 | $ dotnet run 208 | ``` 209 | 210 | What results do you get? What are your micro- and macro accuracy values? Which logloss and logloss reduction did you get? 211 | 212 | Do you think the dataset is biased? 213 | 214 | What can you say about the accuracy? Is this a good model? How far away are you from the human accuracy rate? Is this a superhuman or subhuman AI? 215 | 216 | What did the 5 digit predictions look like? Do you understand why the model gets confused sometimes? 217 | 218 | Think about the code in this assignment. How could you improve the accuracy of the model even further? 219 | 220 | Share your results in our group! 221 | -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/datafile.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/datafile.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/mnist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist.png -------------------------------------------------------------------------------- /MulticlassClassification/DigitRecognition/assets/mnist_hard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist_hard.png -------------------------------------------------------------------------------- /MulticlassClassification/FlagToxicComments/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | Online discussions about things you care about can be difficult. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions. Many platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. 4 | 5 | The Conversation AI team is a research initiative founded by Jigsaw and Google. It is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments that are rude, disrespectful or likely to make someone leave a discussion. 6 | 7 | The team has built a range of public tools to detect toxicity. But the current apps still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding. 8 | 9 | In this case study, you’re going to build an app that is capable of detecting different types of toxicity like threats, obscenity, insults, and hate. 
You’ll be using a dataset of comments from Wikipedia’s talk page edits. 10 | 11 | How accurate will your app be? Do you think you will be able to flag every toxic comment? 12 | 13 | That's for you to find out! 14 | 15 | # The dataset 16 | 17 | ![The dataset](./assets/data.png) 18 | 19 | In this case study you'll be working with a dataset containing over 313,000 comments from Wikipedia talk pages. 20 | 21 | There are two files in the dataset: 22 | * [train.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/train.csv) which contains 160k records, 2 input features, and 6 output labels. You will use this file to train your model. 23 | * [test.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/test.csv) which contains 153k records and 2 input features. You will use this file to test your model. 24 | 25 | You'll need to [download the dataset from Kaggle](https://www.kaggle.com/c/8076/download-all) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet. 26 | 27 | Here's a description of all columns in the training file: 28 | * **id**: the identifier of the comment 29 | * **comment_text**: the text of the comment 30 | * **toxic**: 1 if the comment is toxic, 0 if it is not 31 | * **severe_toxic**: 1 if the comment is severely toxic, 0 if it is not 32 | * **obscene**: 1 if the comment is obscene, 0 if it is not 33 | * **threat**: 1 if the comment is threatening, 0 if it is not 34 | * **insult**: 1 if the comment is insulting, 0 if it is not 35 | * **identity_hate**: 1 if the comment expresses identity hatred, 0 if it does not 36 | 37 | # Getting started 38 | Go to the console and set up a new console application: 39 | 40 | ```bash 41 | $ dotnet new console --language F# --output FlagToxicComments 42 | $ cd FlagToxicComments 43 | ``` 44 | 45 | Then install the ML.NET NuGet package: 46 | 47 | ```bash 48 | $ dotnet add package Microsoft.ML 49 | $ dotnet add package Microsoft.ML.FastTree 50 | ``` 51 | 52 | And launch the Visual Studio Code editor: 53 | 54 | ```bash 55 | $ code . 56 | ``` 57 | 58 | The rest is up to you! 59 | 60 | # Hint 61 | To process text data, you'll need to add a **FeaturizeText** component to your machine learning pipeline. 62 | 63 | Your code should look something like this: 64 | 65 | ```fsharp 66 | // Assume we have a partial pipeline in the variable 'partialPipe' 67 | // This line adds a text featurizer to the pipeline. It reads the 'CommentText' column and 68 | // transforms it to a numeric vector and stores it in the 'Features' column 69 | let completePipe = partialPipe.Append(context.Transforms.Text.FeaturizeText("Features", "CommentText")) 70 | ``` 71 | 72 | FeaturizeText is a handy all-in-one component that can read text columns, process them, and convert them to numeric vectors 73 | that are ready for model training. 74 | 75 | # Your assignment 76 | I want you to build an app that reads the training and testing files in memory and featurizes the comments to prepare them for analysis. 77 | 78 | Then train a multiclass classifier on the training data and generate predictions for the comments in the testing file. 79 | 80 | Measure the micro- and macro accuracy. Report your best values in our group. 81 | 82 | See if you can get the accuracies as close to 1 as possible. Share in our group how you did it. Which learning algorithm did you select, and how did you configure your model? 83 | 84 | Good luck! 
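Oh, and one more hint. The comment texts in this dataset can contain commas, quotes, and line breaks, so make sure to enable quoting when you load the file. Here's a minimal sketch of what the data loading could look like. Note that the **CommentRecord** type and its column indices below are assumptions based on the dataset description above, so please verify them against the actual file:

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

/// A hypothetical type for one comment record. The column indices are assumed
/// from the dataset description above; check them against the actual file.
[<CLIMutable>]
type CommentRecord = {
    [<LoadColumn(1)>] CommentText : string
    [<LoadColumn(2)>] Toxic : float32
    // ...load the other five label columns in the same way...
}

let context = new MLContext()

// the comments contain commas and quoted strings, so allowQuoting is essential
let trainData =
    context.Data.LoadFromTextFile<CommentRecord>(
        "train.csv", hasHeader = true, separatorChar = ',', allowQuoting = true)
```

From here you can **Append** the **FeaturizeText** component from the hint above, followed by a learning algorithm of your choice.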
-------------------------------------------------------------------------------- /MulticlassClassification/FlagToxicComments/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/FlagToxicComments/assets/data.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science with F# and ML.NET 2 | 3 | ![Data Science with F# and ML.NET](./assets/DSC-FS.jpg) 4 | 5 | This repository contains all course assignments of my **Data Science with F# and ML.NET** course and will get you up to speed with Microsoft's new ML.NET library. 6 | 7 | By working through the code examples, you will learn how to design, train, and evaluate complex AI models with simple F# code. I'll provide you with all the code, libraries, and data sets you need to get started. 8 | 9 | Please note that this repository only contains code examples with no additional support. 10 | 11 | If you prefer a full-featured e-learning experience with live coaching, please check out my online course here: 12 | 13 | https://www.machinelearningadvantage.com/datascience-with-fsharp 14 | 15 | 16 | # Table of contents 17 | 18 | Transforming data: [Processing California housing data](./LoadingData/CaliforniaHousing) 19 | 20 | Regression: [Predict taxi fares in New York](./Regression/TaxiFarePrediction) 21 | 22 | Case study: [Predict house prices in Iowa](./Regression/HousePricePrediction) 23 | 24 | Binary classification: [Predict heart disease in Ohio](./BinaryClassification/HeartDiseasePrediction) 25 | 26 | Case study: [Detect credit card fraud in Europe](./BinaryClassification/FraudDetection) 27 | 28 | Multiclass classification: [Recognize handwriting](./MulticlassClassification/DigitRecognition) 29 | 30 | Evaluating models: [Detect SMS spam messages](./BinaryClassification/SpamDetection) 31 | 32 | Case study: [Flag toxic comments on Wikipedia](./MulticlassClassification/FlagToxicComments) 33 | 34 | Decision trees: [Predict Titanic survivors](./BinaryClassification/TitanicPrediction) 35 | 36 | Case study: [Predict Diabetes in Pima Indians](./BinaryClassification/DiabetesDetection) 37 | 38 | Ensembles: [Predict bike demand in Washington DC](./Regression/BikeDemandPrediction) 39 | 40 | Clustering: [Classify Iris flowers](./Clustering/IrisFlower) 41 | 42 | Recommendation: [Build a movie recommender](./Recommendation/MovieRecommender) 43 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/MovieRecommender.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Trainers 4 | open Microsoft.ML.Data 5 | 6 | /// The MovieRating class holds a single movie rating. 7 | [<CLIMutable>] 8 | type MovieRating = { 9 | [<LoadColumn(0)>] UserID : float32 10 | [<LoadColumn(1)>] MovieID : float32 11 | [<LoadColumn(2)>] Label : float32 12 | } 13 | 14 | /// The MovieRatingPrediction class holds a single movie prediction. 
15 | [] 16 | type MovieRatingPrediction = { 17 | Label : float32 18 | Score : float32 19 | } 20 | 21 | /// The MovieTitle class holds a single movie title. 22 | [] 23 | type MovieTitle = { 24 | [] MovieID : float32 25 | [] Title : string 26 | [] Genres: string 27 | } 28 | 29 | // file paths to data files (assumes os = windows!) 30 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory 31 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory 32 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory 33 | 34 | [] 35 | let main argv = 36 | 37 | // set up a new machine learning context 38 | let context = new MLContext() 39 | 40 | // load training and test data 41 | let trainData = context.Data.LoadFromTextFile(trainDataPath, hasHeader = true, separatorChar = ',') 42 | let testData = context.Data.LoadFromTextFile(testDataPath, hasHeader = true, separatorChar = ',') 43 | 44 | // prepare matrix factorization options 45 | let options = 46 | MatrixFactorizationTrainer.Options( 47 | MatrixColumnIndexColumnName = "UserIDEncoded", 48 | MatrixRowIndexColumnName = "MovieIDEncoded", 49 | LabelColumnName = "Label", 50 | NumberOfIterations = 20, 51 | ApproximationRank = 100) 52 | 53 | // set up a training pipeline 54 | let pipeline = 55 | EstimatorChain() 56 | 57 | // step 1: map userId and movieId to keys 58 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID")) 59 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID")) 60 | 61 | // step 2: find recommendations using matrix factorization 62 | .Append(context.Recommendation().Trainers.MatrixFactorization(options)) 63 | 64 | // train the model 65 | let model = trainData |> pipeline.Fit 66 | 67 | // calculate predictions and compare them to the ground truth 68 | let metrics = testData |> model.Transform |> context.Regression.Evaluate 69 | 70 | // show model metrics 71 | printfn "Model metrics:" 72 | printfn " RMSE: %f" metrics.RootMeanSquaredError 73 | printfn " MAE: %f" metrics.MeanAbsoluteError 74 | printfn " MSE: %f" metrics.MeanSquaredError 75 | 76 | // set up a prediction engine 77 | let engine = context.Model.CreatePredictionEngine model 78 | 79 | // check if Mark likes 'GoldenEye' 80 | printfn "Does Mark like GoldenEye?" 81 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f } 82 | printfn " Score: %f" p.Score 83 | 84 | // load all movie titles 85 | let movieData = context.Data.LoadFromTextFile(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 86 | let movies = context.Data.CreateEnumerable(movieData, reuseRowObject = false) 87 | 88 | // find Mark's top 5 movies 89 | let marksMovies = 90 | movies |> Seq.map(fun m -> 91 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f } 92 | (m.Title, p2.Score)) 93 | |> Seq.sortByDescending(fun t -> snd t) 94 | 95 | // print the results 96 | printfn "What are Mark's top-5 movies?" 
97 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t)) 98 | 99 | 0 // return value 100 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Recommend new movies to film fans 2 | 3 | In this assignment you're going to build a movie recommendation system that can recommend new movies to film fans. 4 | 5 | The first thing you'll need is a data file with thousands of movies rated by many different users. The [MovieLens Project](https://movielens.org) has exactly what you need. 6 | 7 | Download the [movie ratings for training](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-train.csv), [movie ratings for testing](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-test.csv), and the [movie dictionary](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-movies.csv) and save these files in your project folder. You now have 100,000 movie ratings with 99,980 set aside for training and 20 for testing. 8 | 9 | The training and testing files are in CSV format and look like this: 10 |  11 | 12 | ![Data File](./assets/data.png) 13 | 14 | There are only four columns of data: 15 | 16 | * The ID of the user 17 | * The ID of the movie 18 | * The movie rating on a scale from 1–5 19 | * The timestamp of the rating 20 | 21 | There's also a movie dictionary in CSV format with all the movie IDs and titles: 22 | 23 | 24 | ![Data File](./assets/movies.png) 25 | 26 | You are going to build a data science model that reads in each user ID, movie ID, and rating, and then predicts the ratings each user would give for every movie in the dataset. 27 | 28 | Once you have a fully trained model, you can easily add a new user with a couple of favorite movies and then ask the model to generate predictions for any of the other movies in the dataset. 29 | 30 | And in fact this is exactly how the recommendation systems on Netflix and Amazon work. 31 | 32 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 33 | 34 | ```bash 35 | $ dotnet new console --language F# --output MovieRecommender 36 | $ cd MovieRecommender 37 | ``` 38 | 39 | Now install the following packages: 40 | 41 | ```bash 42 | $ dotnet add package Microsoft.ML 43 | $ dotnet add package Microsoft.ML.Recommender 44 | ``` 45 | 46 | Now you're ready to add some types. You will need one type to hold a movie rating, and one to hold your model’s predictions. 47 | 48 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 49 | 50 | ```fsharp 51 | open System 52 | open Microsoft.ML 53 | open Microsoft.ML.Trainers 54 | open Microsoft.ML.Data 55 | 56 | /// The MovieRating class holds a single movie rating. 57 | [<CLIMutable>] 58 | type MovieRating = { 59 | [<LoadColumn(0)>] UserID : float32 60 | [<LoadColumn(1)>] MovieID : float32 61 | [<LoadColumn(2)>] Label : float32 62 | } 63 | 64 | /// The MovieRatingPrediction class holds a single movie prediction. 65 | [<CLIMutable>] 66 | type MovieRatingPrediction = { 67 | Label : float32 68 | Score : float32 69 | } 70 | 71 | // the rest of the code goes here... 72 | ``` 73 | 74 | The **MovieRating** type holds one single movie rating.
Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 75 | 76 | You're also declaring a **MovieRatingPrediction** type which will hold a single movie rating prediction. 77 | 78 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 79 | 80 | Before we continue, we need to set up a third type that will hold our movie dictionary: 81 | 82 | ```fsharp 83 | /// The MovieTitle class holds a single movie title. 84 | [<CLIMutable>] 85 | type MovieTitle = { 86 | [<LoadColumn(0)>] MovieID : float32 87 | [<LoadColumn(1)>] Title : string 88 | [<LoadColumn(2)>] Genres: string 89 | } 90 | 91 | // the rest of the code goes here 92 | ``` 93 | 94 | This **MovieTitle** type contains a movie ID value and its corresponding title and genres. We will use this type later in our code to map movie IDs to their corresponding titles. 95 | 96 | Now you need to load the dataset into memory: 97 | 98 | ```fsharp 99 | // file paths to data files (assumes os = windows!) 100 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory 101 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory 102 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory 103 | 104 | [<EntryPoint>] 105 | let main argv = 106 | 107 | // set up a new machine learning context 108 | let context = new MLContext() 109 | 110 | // load training and test data 111 | let trainData = context.Data.LoadFromTextFile<MovieRating>(trainDataPath, hasHeader = true, separatorChar = ',') 112 | let testData = context.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader = true, separatorChar = ',') 113 | 114 | // the rest of the code goes here... 115 | 116 | 0 // return value 117 | ``` 118 | 119 | This code calls the **LoadFromTextFile** function twice to load the training and testing CSV data into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **MovieRating** class. 120 | 121 | Now you're ready to start building the machine learning model: 122 | 123 | ```fsharp 124 | // prepare matrix factorization options 125 | let options = 126 | MatrixFactorizationTrainer.Options( 127 | MatrixColumnIndexColumnName = "UserIDEncoded", 128 | MatrixRowIndexColumnName = "MovieIDEncoded", 129 | LabelColumnName = "Label", 130 | NumberOfIterations = 20, 131 | ApproximationRank = 100) 132 | 133 | // set up a training pipeline 134 | let pipeline = 135 | EstimatorChain() 136 | 137 | // step 1: map userId and movieId to keys 138 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID")) 139 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID")) 140 | 141 | // step 2: find recommendations using matrix factorization 142 | .Append(context.Recommendation().Trainers.MatrixFactorization(options)) 143 | 144 | // train the model 145 | let model = trainData |> pipeline.Fit 146 | 147 | // the rest of the code goes here... 148 | ``` 149 | 150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
151 | 152 | This pipeline has the following components: 153 | 154 | * **MapValueToKey** which reads the UserID column and builds a dictionary of unique ID values. It then produces an output column called UserIDEncoded containing an encoding for each ID. This step converts the IDs to numbers that the model can work with. 155 | * Another **MapValueToKey** which reads the MovieID column, encodes it, and stores the encodings in an output column called MovieIDEncoded. 156 | * A **MatrixFactorization** component that performs matrix factorization on the encoded ID columns and the ratings. This step calculates the movie rating predictions for every user and movie. 157 | 158 | With the pipeline fully assembled, you train the model by piping the training data into the **Fit** function. 159 | 160 | You now have a fully-trained model. So now you need to load the validation data, predict the rating for each user and movie, and calculate the accuracy metrics of the model: 161 | 162 | ```fsharp 163 | // calculate predictions and compare them to the ground truth 164 | let metrics = testData |> model.Transform |> context.Regression.Evaluate 165 | 166 | // show model metrics 167 | printfn "Model metrics:" 168 | printfn " RMSE: %f" metrics.RootMeanSquaredError 169 | printfn " MAE: %f" metrics.MeanAbsoluteError 170 | printfn " MSE: %f" metrics.MeanSquaredError 171 | 172 | // the rest of the code goes here... 173 | ``` 174 | 175 | This code pipes the test data into the **Transform** function to make predictions for every user and movie in the test dataset. It then pipes these predictions into the **Evaluate** function to compare them to the actual ratings. 176 | 177 | The **Evaluate** function calculates the following three metrics: 178 | 179 | * **RootMeanSquaredError**: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 180 | * **MeanAbsoluteError**: this is the mean absolute prediction error, expressed as a rating. 181 | * **MeanSquaredError**: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE. 182 | 183 | To wrap up, let’s use the model to make a prediction about me. Here are 6 movies I like: 184 | 185 | * Blade Runner 186 | * True Lies 187 | * Speed 188 | * Twelve Monkeys 189 | * Things to do in Denver when you're dead 190 | * Cloud Atlas 191 | 192 | And 6 more movies I really didn't like at all: 193 | 194 | * Ace Ventura: when nature calls 195 | * Naked Gun 33 1/3 196 | * Highlander II 197 | * Throw momma from the train 198 | * Jingle all the way 199 | * Dude, where's my car? 200 | 201 | You'll find my ratings at the very end of the training file. I added myself as user 999. 202 | 203 | So based on this list, do you think I would enjoy the James Bond movie ‘GoldenEye’? 204 | 205 | Let's write some code to find out: 206 | 207 | ```fsharp 208 | // set up a prediction engine 209 | let engine = context.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(model) 210 | 211 | // check if Mark likes 'GoldenEye' 212 | printfn "Does Mark like GoldenEye?" 213 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f } 214 | printfn " Score: %f" p.Score 215 | 216 | // the rest of the code goes here...
217 | ``` 218 | 219 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then calls **Predict** to create a prediction for user 999 (me) and movie 10 (GoldenEye). 220 | 221 | Let’s do one more thing and ask the model to predict my top-5 favorite movies. 222 | 223 | We can ask the model to predict my favorite movies, but it will just produce movie ID values. So now's the time to load that movie dictionary that will help us convert movie IDs to their corresponding titles: 224 | 225 | ```fsharp 226 | // load all movie titles 227 | let movieData = context.Data.LoadFromTextFile<MovieTitle>(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true) 228 | let movies = context.Data.CreateEnumerable<MovieTitle>(movieData, reuseRowObject = false) 229 | 230 | // the rest of the code goes here... 231 | ``` 232 | 233 | This code calls **LoadFromTextFile** to load the movie dictionary into memory, and then calls **CreateEnumerable** to create an enumeration of **MovieTitle** instances. 234 | 235 | We can now find my favorite movies like this: 236 | 237 | ```fsharp 238 | // find Mark's top 5 movies 239 | let marksMovies = 240 | movies |> Seq.map(fun m -> 241 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f } 242 | (m.Title, p2.Score)) 243 | |> Seq.sortByDescending(fun t -> snd t) 244 | 245 | // print the results 246 | printfn "What are Mark's top-5 movies?" 247 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t)) 248 | ``` 249 | 250 | The code pipes the movie dictionary into **Seq.map** to create an enumeration of tuples. The first tuple element is the movie title and the second element is the rating the model thinks I would give to that movie. 251 | 252 | The code then pipes the enumeration of tuples into **Seq.sortByDescending** to sort the list by rating. This will put my favorite movies at the top of the list. 253 | 254 | Finally, the code pipes the rated movie list into **Seq.take** to grab the top-5, and then prints out the title and corresponding rating. 255 | 256 | That's it, your code is done. Go to your terminal and run the app: 257 | 258 | ```bash 259 | $ dotnet run 260 | ``` 261 | 262 | Which training and validation metrics did you get? What are your RMSE and MAE values? Now look at how the data has been partitioned into training and validation sets. Do you think this is a good result? What could you improve? 263 | 264 | What rating did the model predict I would give to the movie GoldenEye? And what are my 5 favorite movies according to the model?
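One thing worth noting: the test file holds only 20 ratings, so the evaluation metrics will be quite noisy. As an experiment, you could ignore the test file and carve a larger test partition out of the training data instead. Here is a minimal sketch of that idea, reusing the **context**, **trainData**, and **pipeline** values from the code above (the 20% fraction is just an assumption to play with):

```fsharp
// a minimal sketch: split the training data 80/20 and evaluate on the
// larger 20% partition instead of the 20-record test file
let partitions = context.Data.TrainTestSplit(trainData, testFraction = 0.2)
let model2 = partitions.TrainSet |> pipeline.Fit
let metrics2 = partitions.TestSet |> model2.Transform |> context.Regression.Evaluate
printfn " RMSE on the 20%% partition: %f" metrics2.RootMeanSquaredError
```

Compare the RMSE you get this way with the RMSE from the 20-record test file and see how much the numbers move.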
265 | 266 | Share your results in our group and then ask me if the predictions are correct ;) 267 | -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/data.png -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/assets/movies.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/movies.png -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/recommendation-movies.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/recommendation-movies.csv -------------------------------------------------------------------------------- /Recommendation/MovieRecommender/recommendation-ratings-test.csv: -------------------------------------------------------------------------------- 1 | userId,movieId,rating,timestamp 2 | 1,1097,5,964981680 3 | 1,1127,4,964982513 4 | 1,1136,5,964981327 5 | 1,1196,5,964981827 6 | 1,1197,5,964981872 7 | 1,1198,5,964981827 8 | 1,1206,5,964983737 9 | 1,1208,4,964983250 10 | 1,1210,5,964980499 11 | 1,1213,5,964982951 12 | 1,1214,4,964981855 13 | 2,114060,2,1445715276 14 | 2,115713,3.5,1445714854 15 | 2,122882,5,1445715272 16 | 2,131724,5,1445714851 17 | 3,2105,2,1306463559 18 | 3,2288,4,1306463631 19 | 3,2851,5,1306463925 20 | 3,2424,0.5,1306464293 21 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/BikeDemand.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open System.IO 3 | open Microsoft.ML 4 | open Microsoft.ML.Data 5 | 6 | /// The DemandObservation class holds one single bike demand observation record. 7 | [] 8 | type DemandObservation = { 9 | [] Season : float32 10 | [] Year : float32 11 | [] Month : float32 12 | [] Hour : float32 13 | [] Holiday : float32 14 | [] Weekday : float32 15 | [] WorkingDay : float32 16 | [] Weather : float32 17 | [] Temperature : float32 18 | [] NormalizedTemperature : float32 19 | [] Humidity : float32 20 | [] Windspeed : float32 21 | [] [] Count : float32 22 | } 23 | 24 | /// The DemandPrediction class holds one single bike demand prediction. 25 | [] 26 | type DemandPrediction = { 27 | [] PredictedCount : float32; 28 | } 29 | 30 | // file paths to data files (assumes os = windows!) 31 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory 32 | 33 | /// The main application entry point. 
34 | [] 35 | let main argv = 36 | 37 | // create the machine learning context 38 | let context = new MLContext(); 39 | 40 | // load the dataset 41 | let data = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = ',') 42 | 43 | // split the dataset into 80% training and 20% testing 44 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 45 | 46 | // build a training pipeline 47 | let pipeline = 48 | EstimatorChain() 49 | 50 | // step 1: concatenate all feature columns 51 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed")) 52 | 53 | // step 2: cache the data to speed up training 54 | .AppendCacheCheckpoint(context) 55 | 56 | // step 3: use a fast forest learner 57 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10)) 58 | 59 | // train the model 60 | let model = partitions.TrainSet |> pipeline.Fit 61 | 62 | // evaluate the model 63 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 64 | 65 | // show evaluation metrics 66 | printfn "Model metrics:" 67 | printfn " RMSE:%f" metrics.RootMeanSquaredError 68 | printfn " MSE: %f" metrics.MeanSquaredError 69 | printfn " MAE: %f" metrics.MeanAbsoluteError 70 | 71 | // set up a sample observation 72 | let sample ={ 73 | Season = 3.0f 74 | Year = 1.0f 75 | Month = 8.0f 76 | Hour = 10.0f 77 | Holiday = 0.0f 78 | Weekday = 4.0f 79 | WorkingDay = 1.0f 80 | Weather = 1.0f 81 | Temperature = 0.8f 82 | NormalizedTemperature = 0.7576f 83 | Humidity = 0.55f 84 | Windspeed = 0.2239f 85 | Count = 0.0f // the field to predict 86 | } 87 | 88 | // create a prediction engine 89 | let engine = context.Model.CreatePredictionEngine model 90 | 91 | // make the prediction 92 | let prediction = sample |> engine.Predict 93 | 94 | // show the prediction 95 | printfn "\r" 96 | printfn "Single prediction:" 97 | printfn " Predicted bike count: %f" prediction.PredictedCount 98 | 99 | 0 // return value -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict bike sharing demand in Washington DC 2 | 3 | In this assignment you're going to build an app that can predict bike sharing demand in Washington DC. 4 | 5 | A bike-sharing system is a service in which bicycles are made available to individuals on a short term. Users borrow a bike from a dock and return it at another dock belonging to the same system. Docks are bike racks that lock the bike, and only release it by computer control. 6 | 7 | You’ve probably seen docks around town, they look like this: 8 | 9 | ![Bike sharing rack](./assets/bikesharing.jpeg) 10 | 11 | Bike sharing companies try to even out supply by manually distributing bikes across town, but they need to know how many bikes will be in demand at any given time in the city. 12 | 13 | So let’s give them a hand with a machine learning model! 14 | 15 | You are going to train a forest of regression decision trees on a dataset of bike sharing demand. Then you’ll use the fully-trained model to make a prediction for a given date and time. 16 | 17 | The first thing you will need is a data file with lots of bike sharing demand numbers. 
We are going to use the [UCI Bike Sharing Dataset](http://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) from [Capital Bikeshare](https://www.capitalbikeshare.com/) in Metro DC. This dataset has 17,380 bike sharing records that span a 2-year period. 18 | 19 | [Download the dataset](https://github.com/mdfarragher/DSC/blob/master/Regression/BikeDemandPrediction/bikedemand.csv) and save it in your project folder as **bikedemand.csv**. 20 | 21 | The file looks like this: 22 | 23 | ![Data File](./assets/data.png) 24 | 25 | It’s a comma-separated file with 17 columns: 26 | 27 | * Instant: the record index 28 | * Date: the date of the observation 29 | * Season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter) 30 | * Year: the year of the observation (0 = 2011, 1 = 2012) 31 | * Month: the month of the observation (1 to 12) 32 | * Hour: the hour of the observation (0 to 23) 33 | * Holiday: if the date is a holiday or not 34 | * Weekday: the day of the week of the observation 35 | * WorkingDay: if the date is a working day 36 | * Weather: the weather during the observation (1 = clear, 2 = mist, 3 = light snow/rain, 4 = heavy rain) 37 | * Temperature: the normalized temperature in Celsius 38 | * ATemperature: the normalized feeling temperature in Celsius 39 | * Humidity: the normalized humidity 40 | * Windspeed: the normalized wind speed 41 | * Casual: the number of casual bike users at the time 42 | * Registered: the number of registered bike users at the time 43 | * Count: the total number of rental bikes in operation at the time 44 | 45 | You can ignore the record index, the date, and the number of casual and registered bikes, and use everything else as input features. The final column **Count** is the label you're trying to predict. 46 | 47 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 48 | 49 | ```bash 50 | $ dotnet new console --language F# --output BikeDemand 51 | $ cd BikeDemand 52 | ``` 53 | 54 | Now install the following packages: 55 | 56 | ```bash 57 | $ dotnet add package Microsoft.ML 58 | $ dotnet add package Microsoft.ML.FastTree 59 | ``` 60 | 61 | Now you are ready to add some types. You’ll need one to hold a bike demand record, and one to hold your model predictions. 62 | 63 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 64 | 65 | ```fsharp 66 | open System 67 | open System.IO 68 | open Microsoft.ML 69 | open Microsoft.ML.Data 70 | 71 | /// The DemandObservation class holds one single bike demand observation record. 72 | [<CLIMutable>] 73 | type DemandObservation = { 74 | [<LoadColumn(2)>] Season : float32 75 | [<LoadColumn(3)>] Year : float32 76 | [<LoadColumn(4)>] Month : float32 77 | [<LoadColumn(5)>] Hour : float32 78 | [<LoadColumn(6)>] Holiday : float32 79 | [<LoadColumn(7)>] Weekday : float32 80 | [<LoadColumn(8)>] WorkingDay : float32 81 | [<LoadColumn(9)>] Weather : float32 82 | [<LoadColumn(10)>] Temperature : float32 83 | [<LoadColumn(11)>] NormalizedTemperature : float32 84 | [<LoadColumn(12)>] Humidity : float32 85 | [<LoadColumn(13)>] Windspeed : float32 86 | [<LoadColumn(16)>] [<ColumnName("Label")>] Count : float32 87 | } 88 | 89 | /// The DemandPrediction class holds one single bike demand prediction. 90 | [<CLIMutable>] 91 | type DemandPrediction = { 92 | [<ColumnName("Score")>] PredictedCount : float32; 93 | } 94 | 95 | // the rest of the code goes here... 96 | ``` 97 | 98 | The **DemandObservation** type holds one single bike demand record. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.
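By the way, once the dataset has been loaded (you'll see the loading code in a moment), you can sanity-check this column mapping by materializing a few rows with **CreateEnumerable**. This is just an optional debugging sketch; it assumes the **context** and **data** values from the loading code below:

```fsharp
// optional sanity check: materialize the first three rows and print a few
// fields to verify that the LoadColumn mapping points at the right columns
context.Data.CreateEnumerable<DemandObservation>(data, reuseRowObject = false)
|> Seq.truncate 3
|> Seq.iter(fun row -> printfn "Season=%f Hour=%f Count=%f" row.Season row.Hour row.Count)
```

If the printed values don't match the first rows of **bikedemand.csv**, one of the column indices is off.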
99 | 100 | You're also declaring a **DemandPrediction** type which will hold a single bike demand prediction. 101 | 102 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 103 | 104 | Now you need to load the training data into memory: 105 | 106 | ```fsharp 107 | // file paths to data files (assumes os = windows!) 108 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory 109 | 110 | /// The main application entry point. 111 | [<EntryPoint>] 112 | let main argv = 113 | 114 | // create the machine learning context 115 | let context = new MLContext(); 116 | 117 | // load the dataset 118 | let data = context.Data.LoadFromTextFile<DemandObservation>(dataPath, hasHeader = true, separatorChar = ',') 119 | 120 | // split the dataset into 80% training and 20% testing 121 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2) 122 | 123 | // the rest of the code goes here... 124 | 125 | 0 // return value 126 | ``` 127 | 128 | This code uses the method **LoadFromTextFile** to load the data directly into memory. The field annotations we set up earlier tell the method how to store the loaded data in the **DemandObservation** class. 129 | 130 | The code then calls **TrainTestSplit** to reserve 80% of the data for training and 20% for testing. 131 | 132 | Now let’s build the machine learning pipeline: 133 | 134 | ```fsharp 135 | // build a training pipeline 136 | let pipeline = 137 | EstimatorChain() 138 | 139 | // step 1: concatenate all feature columns 140 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed")) 141 | 142 | // step 2: cache the data to speed up training 143 | .AppendCacheCheckpoint(context) 144 | 145 | // step 3: use a fast forest learner 146 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10)) 147 | 148 | // train the model 149 | let model = partitions.TrainSet |> pipeline.Fit 150 | 151 | // the rest of the code goes here... 152 | ``` 153 | 154 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 155 | 156 | This pipeline has the following components: 157 | 158 | * **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column. 159 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm. 160 | * A final **FastForest** regression learner which will train the model to make accurate predictions using a forest of decision trees. 161 | 162 | The **FastForest** learner is a very nice training algorithm that builds a random forest of decision trees. 163 | 164 | A random forest trains a large number of decision trees in parallel, each on a random sample of the training data. To make a prediction, it runs the input through every tree in the forest and averages the individual predictions. This is different from gradient boosting (used by the **FastTree** learner), which stacks trees so that each new tree corrects the errors of the one before it.
165 | 166 | The result is a fairly strong prediction model that is much more stable than any single decision tree, because the averaging step cancels out most of the overfitting of the individual trees. 167 | 168 | Note the use of hyperparameters to configure the learner: 169 | 170 | * **NumberOfLeaves** is the maximum number of leaf nodes each decision tree will have. In this forest each tree will have at most 20 leaf nodes. 171 | * **NumberOfTrees** is the total number of decision trees to create in the forest. This forest will hold 100 trees. 172 | * **MinimumExampleCountPerLeaf** is the minimum number of data points required to form a leaf node. In this model a node is only split when each resulting leaf holds at least 10 qualifying data points. 173 | 174 | These hyperparameters are the default for the **FastForest** learner, but you can tweak them if you want. 175 | 176 | With the pipeline fully assembled, you can pipe the training data into the **Fit** function to train the model. 177 | 178 | You now have a fully-trained model. So next, you'll have to load the test data, predict the bike demand, and calculate the accuracy of your model: 179 | 180 | ```fsharp 181 | // evaluate the model 182 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 183 | 184 | // show evaluation metrics 185 | printfn "Model metrics:" 186 | printfn " RMSE:%f" metrics.RootMeanSquaredError 187 | printfn " MSE: %f" metrics.MeanSquaredError 188 | printfn " MAE: %f" metrics.MeanAbsoluteError 189 | 190 | // the rest of the code goes here... 191 | ``` 192 | 193 | This code pipes the test data into the **Transform** function to set up predictions for every single bike demand record in the test partition. The code then pipes these predictions into the **Evaluate** function, which compares them to the actual bike demand and automatically calculates these metrics: 194 | 195 | * **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 196 | * **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE. 197 | * **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in number of bikes. 198 | 199 | To wrap up, let’s use the model to make a prediction. 200 | 201 | I want to rent a bike in the fall of 2012, on a Thursday in August at 10am, in clear weather. What will the bike demand be on that day?
202 | 203 | Here’s how to make that prediction: 204 | 205 | ```fsharp 206 | // set up a sample observation 207 | let sample = { 208 | Season = 3.0f 209 | Year = 1.0f 210 | Month = 8.0f 211 | Hour = 10.0f 212 | Holiday = 0.0f 213 | Weekday = 4.0f 214 | WorkingDay = 1.0f 215 | Weather = 1.0f 216 | Temperature = 0.8f 217 | NormalizedTemperature = 0.7576f 218 | Humidity = 0.55f 219 | Windspeed = 0.2239f 220 | Count = 0.0f // the field to predict 221 | } 222 | 223 | // create a prediction engine 224 | let engine = context.Model.CreatePredictionEngine<DemandObservation, DemandPrediction>(model) 225 | 226 | // make the prediction 227 | let prediction = sample |> engine.Predict 228 | 229 | // show the prediction 230 | printfn "\r" 231 | printfn "Single prediction:" 232 | printfn " Predicted bike count: %f" prediction.PredictedCount 233 | ``` 234 | 235 | This code sets up a new bike demand observation, and then uses the **CreatePredictionEngine** function to set up a prediction engine and calls **Predict** to make a demand prediction. 236 | 237 | What will the model prediction be? 238 | 239 | Time to find out. Go to your terminal and run your code: 240 | 241 | ```bash 242 | $ dotnet run 243 | ``` 244 | 245 | What results do you get? What are your RMSE and MAE values? Is this a good result? 246 | 247 | And what bike demand does your model predict on the day I wanted to take my bike ride? 248 | 249 | Now take a look at the hyperparameters. Try to change the behavior of the fast forest learner and see what happens to the accuracy of your model. Did your model improve or get worse? 250 | 251 | Share your results in our group! 252 | -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/assets/bikesharing.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/bikesharing.jpeg -------------------------------------------------------------------------------- /Regression/BikeDemandPrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/data.png -------------------------------------------------------------------------------- /Regression/HousePricePrediction/README.md: -------------------------------------------------------------------------------- 1 | # The case 2 | 3 | Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But a detailed analysis of houses and sales prices actually proves that these metrics have a much greater influence on price negotiations than the number of bedrooms or a white-picket fence. 4 | 5 | In this case study, you're going to answer the age-old question: what exactly determines the sales price of a house? 6 | 7 | And once you have your fully-trained app up and running, you can use it to predict the sales price of any house. Just plug in the relevant numbers and your app will generate a sales price prediction. 8 | 9 | But how accurate will these predictions be? Can you actually use this app in a realtor business? 10 | 11 | That's for you to find out! 12 | 13 | # The dataset 14 | 15 | ![The dataset](./assets/data.png) 16 | 17 | In this case study you'll be working with the Iowa House Price dataset.
This data set describes the sale of individual residential property in Ames, Iowa from 2006 to 2010. 18 | 19 | The data set contains 1460 records and a large number of feature columns involved in assessing home values. You can use any combination of features you like to generate your house price predictions. 20 | 21 | There is 1 file in the dataset: 22 | * [data.csv](https://github.com/mdfarragher/DSC/blob/master/Regression/HousePricePrediction/data.csv) which contains 1460 records, 80 input features, and one output label. You will use this file to train and evaluate your model. 23 | 24 | Download the file and save it in your project folder. 25 | 26 | Here's a description of all 81 columns in the training file: 27 | * SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict. 28 | * MSSubClass: The building class 29 | * MSZoning: The general zoning classification 30 | * LotFrontage: Linear feet of street connected to property 31 | * LotArea: Lot size in square feet 32 | * Street: Type of road access 33 | * Alley: Type of alley access 34 | * LotShape: General shape of property 35 | * LandContour: Flatness of the property 36 | * Utilities: Type of utilities available 37 | * LotConfig: Lot configuration 38 | * LandSlope: Slope of property 39 | * Neighborhood: Physical locations within Ames city limits 40 | * Condition1: Proximity to main road or railroad 41 | * Condition2: Proximity to main road or railroad (if a second is present) 42 | * BldgType: Type of dwelling 43 | * HouseStyle: Style of dwelling 44 | * OverallQual: Overall material and finish quality 45 | * OverallCond: Overall condition rating 46 | * YearBuilt: Original construction date 47 | * YearRemodAdd: Remodel date 48 | * RoofStyle: Type of roof 49 | * RoofMatl: Roof material 50 | * Exterior1st: Exterior covering on house 51 | * Exterior2nd: Exterior covering on house (if more than one material) 52 | * MasVnrType: Masonry veneer type 53 | * MasVnrArea: Masonry veneer area in square feet 54 | * ExterQual: Exterior material quality 55 | * ExterCond: Present condition of the material on the exterior 56 | * Foundation: Type of foundation 57 | * BsmtQual: Height of the basement 58 | * BsmtCond: General condition of the basement 59 | * BsmtExposure: Walkout or garden level basement walls 60 | * BsmtFinType1: Quality of basement finished area 61 | * BsmtFinSF1: Type 1 finished square feet 62 | * BsmtFinType2: Quality of second finished area (if present) 63 | * BsmtFinSF2: Type 2 finished square feet 64 | * BsmtUnfSF: Unfinished square feet of basement area 65 | * TotalBsmtSF: Total square feet of basement area 66 | * Heating: Type of heating 67 | * HeatingQC: Heating quality and condition 68 | * CentralAir: Central air conditioning 69 | * Electrical: Electrical system 70 | * 1stFlrSF: First Floor square feet 71 | * 2ndFlrSF: Second floor square feet 72 | * LowQualFinSF: Low quality finished square feet (all floors) 73 | * GrLivArea: Above grade (ground) living area square feet 74 | * BsmtFullBath: Basement full bathrooms 75 | * BsmtHalfBath: Basement half bathrooms 76 | * FullBath: Full bathrooms above grade 77 | * HalfBath: Half baths above grade 78 | * Bedroom: Number of bedrooms above basement level 79 | * Kitchen: Number of kitchens 80 | * KitchenQual: Kitchen quality 81 | * TotRmsAbvGrd: Total rooms above grade (does not include * bathrooms) 82 | * Functional: Home functionality rating 83 | * Fireplaces: Number of fireplaces 84 | * FireplaceQu: Fireplace quality 85 | * GarageType: 
Garage location 86 | * GarageYrBlt: Year garage was built 87 | * GarageFinish: Interior finish of the garage 88 | * GarageCars: Size of garage in car capacity 89 | * GarageArea: Size of garage in square feet 90 | * GarageQual: Garage quality 91 | * GarageCond: Garage condition 92 | * PavedDrive: Paved driveway 93 | * WoodDeckSF: Wood deck area in square feet 94 | * OpenPorchSF: Open porch area in square feet 95 | * EnclosedPorch: Enclosed porch area in square feet 96 | * 3SsnPorch: Three season porch area in square feet 97 | * ScreenPorch: Screen porch area in square feet 98 | * PoolArea: Pool area in square feet 99 | * PoolQC: Pool quality 100 | * Fence: Fence quality 101 | * MiscFeature: Miscellaneous feature not covered in other categories 102 | * MiscVal: $Value of miscellaneous feature 103 | * MoSold: Month Sold 104 | * YrSold: Year Sold 105 | * SaleType: Type of sale 106 | * SaleCondition: Condition of sale 107 | 108 | # Getting started 109 | Go to the console and set up a new console application: 110 | 111 | ```bash 112 | $ dotnet new console --language F# --output HousePricePrediction 113 | $ cd HousePricePrediction 114 | ``` 115 | 116 | Then install the ML.NET NuGet package: 117 | 118 | ```bash 119 | $ dotnet add package Microsoft.ML 120 | $ dotnet add package Microsoft.ML.FastTree 121 | ``` 122 | 123 | And launch the Visual Studio Code editor: 124 | 125 | ```bash 126 | $ code . 127 | ``` 128 | 129 | The rest is up to you! 130 | 131 | # Your assignment 132 | I want you to build an app that reads the data file, processes it, and then trains a linear regression model on the data. 133 | 134 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns. 135 | 136 | Partition the data and use the trained model to make house price predictions on all the houses in the test partition. Calculate the best possible **RMSE** and **MAE** and share it in our group. 137 | 138 | See if you can get the RMSE as low as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model? 139 | 140 | Good luck! -------------------------------------------------------------------------------- /Regression/HousePricePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/HousePricePrediction/assets/data.png -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/Program.fs: -------------------------------------------------------------------------------- 1 | open System 2 | open Microsoft.ML 3 | open Microsoft.ML.Data 4 | 5 | /// The TaxiTrip class represents a single taxi trip. 6 | [] 7 | type TaxiTrip = { 8 | [] VendorId : string 9 | [] RateCode : string 10 | [] PassengerCount : float32 11 | [] TripDistance : float32 12 | [] PaymentType : string 13 | [] [] FareAmount : float32 14 | } 15 | 16 | /// The TaxiTripFarePrediction class represents a single far prediction. 17 | [] 18 | type TaxiTripFarePrediction = { 19 | [] FareAmount : float32 20 | } 21 | 22 | // file paths to data files (assumes os = windows!) 23 | let dataPath = sprintf "%s\\yellow_tripdata_2018-12.csv" Environment.CurrentDirectory 24 | 25 | /// The main application entry point. 
26 | [] 27 | let main argv = 28 | 29 | // create the machine learning context 30 | let context = new MLContext() 31 | 32 | // load the data 33 | let dataView = context.Data.LoadFromTextFile(dataPath, hasHeader = true, separatorChar = ',') 34 | 35 | // split into a training and test partition 36 | let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2) 37 | 38 | // set up a learning pipeline 39 | let pipeline = 40 | EstimatorChain() 41 | 42 | // one-hot encode all text features 43 | .Append(context.Transforms.Categorical.OneHotEncoding("VendorId")) 44 | .Append(context.Transforms.Categorical.OneHotEncoding("RateCode")) 45 | .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType")) 46 | 47 | // combine all input features into a single column 48 | .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance")) 49 | 50 | // cache the data to speed up training 51 | .AppendCacheCheckpoint(context) 52 | 53 | // use the fast tree learner 54 | .Append(context.Regression.Trainers.FastTree()) 55 | 56 | // train the model 57 | let model = partitions.TrainSet |> pipeline.Fit 58 | 59 | // get regression metrics to score the model 60 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 61 | 62 | // show the metrics 63 | printfn "Model metrics:" 64 | printfn " RMSE:%f" metrics.RootMeanSquaredError 65 | printfn " MSE: %f" metrics.MeanSquaredError 66 | printfn " MAE: %f" metrics.MeanAbsoluteError 67 | 68 | // create a prediction engine for one single prediction 69 | let engine = context.Model.CreatePredictionEngine model 70 | 71 | let taxiTripSample = { 72 | VendorId = "VTS" 73 | RateCode = "1" 74 | PassengerCount = 1.0f 75 | TripDistance = 3.75f 76 | PaymentType = "CRD" 77 | FareAmount = 0.0f // To predict. Actual/Observed = 15.5 78 | } 79 | 80 | // make the prediction 81 | let prediction = taxiTripSample |> engine.Predict 82 | 83 | // show the prediction 84 | printfn "\r" 85 | printfn "Single prediction:" 86 | printfn " Predicted fare: %f" prediction.FareAmount 87 | 88 | 0 // return value -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/README.md: -------------------------------------------------------------------------------- 1 | # Assignment: Predict taxi fares in New York 2 | 3 | In this assignment you're going to build an app that can predict taxi fares in New York. 4 | 5 | The first thing you'll need is a data file with transcripts of New York taxi rides. The [NYC Taxi & Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) provides yearly TLC Trip Record Data files which have exactly what you need. 6 | 7 | Download the [Yellow Taxi Trip Records from December 2018](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-12.csv) and save it as **yellow_tripdata_2018-12.csv**. 
8 | 9 | This is a CSV file with 8,173,233 records that looks like this: 10 |  11 | 12 | ![Data File](./assets/data.png) 13 | 14 | 15 | There are a lot of columns with interesting information in this data file, but you will only train on the following: 16 | 17 | * Column 0: The data provider vendor ID 18 | * Column 3: Number of passengers 19 | * Column 4: Trip distance 20 | * Column 5: The rate code (standard, JFK, Newark, …) 21 | * Column 9: Payment type (credit card, cash, …) 22 | * Column 10: Fare amount 23 | 24 | You are going to build a machine learning model in F# that will use columns 0, 3, 4, 5, and 9 as input, and use them to predict the taxi fare for every trip. Then you’ll compare the predicted fares with the actual taxi fares in column 10, and evaluate the accuracy of your model. 25 | 26 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project: 27 | 28 | ```bash 29 | $ dotnet new console --language F# --output PricePrediction 30 | $ cd PricePrediction 31 | ``` 32 | 33 | Now install the following packages: 34 | 35 | ```bash 36 | $ dotnet add package Microsoft.ML 37 | $ dotnet add package Microsoft.ML.FastTree 38 | ``` 39 | 40 | Now you are ready to add some classes. You’ll need one to hold a taxi trip, and one to hold your model predictions. 41 | 42 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code: 43 | 44 | ```fsharp 45 | /// The TaxiTrip class represents a single taxi trip. 46 | [<CLIMutable>] 47 | type TaxiTrip = { 48 | [<LoadColumn(0)>] VendorId : string 49 | [<LoadColumn(5)>] RateCode : string 50 | [<LoadColumn(3)>] PassengerCount : float32 51 | [<LoadColumn(4)>] TripDistance : float32 52 | [<LoadColumn(9)>] PaymentType : string 53 | [<LoadColumn(10)>] [<ColumnName("Label")>] FareAmount : float32 54 | } 55 | 56 | /// The TaxiTripFarePrediction class represents a single fare prediction. 57 | [<CLIMutable>] 58 | type TaxiTripFarePrediction = { 59 | [<ColumnName("Score")>] FareAmount : float32 60 | } 61 | 62 | // the rest of the code goes here... 63 | ``` 64 | 65 | The **TaxiTrip** type holds one single taxi trip. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from. 66 | 67 | You're also declaring a **TaxiTripFarePrediction** type which will hold a single fare prediction. 68 | 69 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes. 70 | 71 | Also note the **ColumnName** attributes on the two **FareAmount** fields. In the input type the attribute renames the fare column to **Label**, which is the column name ML.NET expects for the value to predict. In the prediction type it maps the **Score** column, where ML.NET stores the predicted value, onto the **FareAmount** field. 72 | 73 | We're loading all data columns as **float32**, except **VendorId**, **RateCode** and **PaymentType**. These columns hold numeric values but you will load them as string fields.
74 | 75 | The reason you need to do this is that RateCode is an enumeration with the following values: 76 | 77 | * 1 = standard 78 | * 2 = JFK 79 | * 3 = Newark 80 | * 4 = Nassau 81 | * 5 = negotiated 82 | * 6 = group 83 | 84 | And PaymentType is defined as follows: 85 | 86 | * 1 = Credit card 87 | * 2 = Cash 88 | * 3 = No charge 89 | * 4 = Dispute 90 | * 5 = Unknown 91 | * 6 = Voided trip 92 | 93 | These actual numbers don’t mean anything in this context. And we certainly don’t want the machine learning model to start believing that a trip to Newark is three times as important as a standard fare. 94 | 95 | So converting these values to strings is a perfect trick to show the model that **VendorId**, **RateCode** and **PaymentType** are just labels, and the underlying numbers don’t mean anything. 96 | 97 | Now you need to load the training data into memory: 98 | 99 | ```fsharp 100 | // file paths to data files (assumes os = windows!) 101 | let dataPath = sprintf "%s\\yellow_tripdata_2018-12_small.csv" Environment.CurrentDirectory 102 | 103 | /// The main application entry point. 104 | [<EntryPoint>] 105 | let main argv = 106 | 107 | // create the machine learning context 108 | let context = new MLContext() 109 | 110 | // load the data 111 | let dataView = context.Data.LoadFromTextFile<TaxiTrip>(dataPath, hasHeader = true, separatorChar = ',') 112 | 113 | // split into a training and test partition 114 | let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2) 115 | 116 | // the rest of the code goes here... 117 | 118 | 0 // return value 119 | ``` 120 | 121 | This code calls **LoadFromTextFile** to load the CSV data into memory. Note the **TaxiTrip** type parameter that tells the method which class to use to load the data. 122 | 123 | There is only one single data file, so you need to call **TrainTestSplit** to set up a training partition with 80% of the data and a test partition with the remaining 20% of the data. 124 | 125 | You often see this 80/20 split in data science; it’s a very common approach to train and test a model. 126 | 127 | Now you’re ready to start building the machine learning model: 128 | 129 | ```fsharp 130 | // set up a learning pipeline 131 | let pipeline = 132 | EstimatorChain() 133 | 134 | // one-hot encode all text features 135 | .Append(context.Transforms.Categorical.OneHotEncoding("VendorId")) 136 | .Append(context.Transforms.Categorical.OneHotEncoding("RateCode")) 137 | .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType")) 138 | 139 | // combine all input features into a single column 140 | .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance")) 141 | 142 | // cache the data to speed up training 143 | .AppendCacheCheckpoint(context) 144 | 145 | // use the fast tree learner 146 | .Append(context.Regression.Trainers.FastTree()) 147 | 148 | // train the model 149 | let model = partitions.TrainSet |> pipeline.Fit 150 | 151 | // the rest of the code goes here... 152 | ``` 153 | 154 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components. 155 | 156 | This pipeline has the following components: 157 | 158 | * A group of three **OneHotEncoding** transforms to perform one-hot encoding on the three columns that contain enumerative data: VendorId, RateCode, and PaymentType. This is a required step because we don't want the machine learning model to treat the enumerative data as numeric values.
159 | * **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column. 160 | * **AppendCacheCheckpoint** which caches all data in memory to speed up the training process. 161 | * A final **FastTree** regression learner which will train the model to make accurate predictions. 162 | 163 | The **FastTreeRegressionTrainer** is a very nice training algorithm that uses gradient boosting, a machine learning technique for regression problems. 164 | 165 | A gradient boosting algorithm builds up a collection of weak regression models. It starts out with a weak model that tries to predict the taxi fare. Then it adds a second model that attempts to correct the error in the first model. And then it adds a third model, and so on. 166 | 167 | The result is a fairly strong prediction model that is actually just an ensemble of weaker prediction models stacked on top of each other. 168 | 169 | We will explore Gradient Boosting in detail in a later section. 170 | 171 | With the pipeline fully assembled, you can train the model on the training partition by piping the **TrainSet** into the **pipeline.Fit** function. 172 | 173 | You now have a fully-trained model. So next, you'll have to grab the validation data, predict the taxi fare for each trip, and calculate the accuracy of your model: 174 | 175 | ```fsharp 176 | // get regression metrics to score the model 177 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate 178 | 179 | // show the metrics 180 | printfn "Model metrics:" 181 | printfn " RMSE:%f" metrics.RootMeanSquaredError 182 | printfn " MSE: %f" metrics.MeanSquaredError 183 | printfn " MAE: %f" metrics.MeanAbsoluteError 184 | 185 | // the rest of the code goes here... 186 | ``` 187 | 188 | This code pipes the **TestSet** into the **model.Transform** function to generate predictions for every single taxi trip in the test partition. We then pipe these predictions into the **Evaluate** function, which compares them to the actual taxi fares and automatically calculates these metrics: 189 | 190 | * **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction. 191 | * **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in dollars. 192 | * **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE. 193 | 194 | To wrap up, let’s use the model to make a prediction. 195 | 196 | Imagine that I'm going to take a standard taxi trip, I cover a distance of 3.75 miles, I am the only passenger, and I pay by credit card. What would my fare be? 197 | 198 | Here’s how to make that prediction: 199 | 200 | ```fsharp 201 | // create a prediction engine for one single prediction 202 | let engine = context.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(model) 203 | 204 | let taxiTripSample = { 205 | VendorId = "VTS" 206 | RateCode = "1" 207 | PassengerCount = 1.0f 208 | TripDistance = 3.75f 209 | PaymentType = "CRD" 210 | FareAmount = 0.0f // To predict.
Actual/Observed = 15.5 211 | } 212 | 213 | // make the prediction 214 | let prediction = taxiTripSample |> engine.Predict 215 | 216 | // show the prediction 217 | printfn "\r" 218 | printfn "Single prediction:" 219 | printfn " Predicted fare: %f" prediction.FareAmount 220 | ``` 221 | 222 | You use the **CreatePredictionEngine** method to set up a prediction engine. This is a type that can make predictions for individual data records. 223 | 224 | Next, you set up a sample with all the details of my taxi trip and pipe it into the **Predict** function to make a single prediction. 225 | 226 | The trip should cost anywhere between $13.50 and $18.50, depending on the trip duration (which depends on the time of day). Will the model predict a fare in this range? 227 | 228 | Let's find out. Go to your terminal and run your code: 229 | 230 | ```bash 231 | $ dotnet run 232 | ``` 233 | 234 | What results do you get? What are your RMSE and MAE values? Is this a good result? 235 | 236 | And how much does your model predict I have to pay for my taxi ride? Is the prediction in the range of acceptable values for this trip? 237 | 238 | Now make some changes to my trip. Change the vendor ID, or the distance, or the manner of payment. How does this affect the final fare prediction? And what do you think this means? 239 | 240 | Think about the code in this assignment. How could you improve the accuracy of the model? What's your best RMSE value? 241 | 242 | Share your results in our group! 243 | -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/TaxiFarePrediction.fsproj: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | Exe 5 | netcoreapp3.1 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /Regression/TaxiFarePrediction/assets/data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/TaxiFarePrediction/assets/data.png -------------------------------------------------------------------------------- /assets/DSC-FS.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/assets/DSC-FS.jpg --------------------------------------------------------------------------------