├── .github
│   └── FUNDING.yml
├── .gitignore
├── BinaryClassification
│   ├── DiabetesDetection
│   │   ├── README.md
│   │   ├── assets
│   │   │   └── data.png
│   │   └── diabetes.csv
│   ├── FraudDetection
│   │   ├── README.md
│   │   └── assets
│   │       └── data.png
│   ├── HeartDiseasePrediction
│   │   ├── Heart.fsproj
│   │   ├── Program.fs
│   │   ├── README.md
│   │   ├── assets
│   │   │   └── data.png
│   │   └── processed.cleveland.data.csv
│   ├── SpamDetection
│   │   ├── Program.fs
│   │   ├── README.md
│   │   ├── SpamDetection.fsproj
│   │   ├── assets
│   │   │   └── data.png
│   │   └── spam.tsv
│   └── TitanicPrediction
│       ├── Program.fs
│       ├── README.md
│       ├── TitanicPrediction.fsproj
│       ├── assets
│       │   ├── data.jpg
│       │   └── titanic.jpeg
│       ├── test_data.csv
│       └── train_data.csv
├── Clustering
│   └── IrisFlower
│       ├── IrisFlower.fsproj
│       ├── Program.fs
│       ├── README.md
│       ├── assets
│       │   ├── data.png
│       │   └── flowers.png
│       └── iris-data.csv
├── LoadingData
│   └── CaliforniaHousing
│       ├── CaliforniaHousing.fsproj
│       ├── Program.fs
│       ├── README.md
│       ├── assets
│       │   ├── data.png
│       │   └── plot.png
│       └── california_housing.csv
├── MulticlassClassification
│   ├── DigitRecognition
│   │   ├── Mnist.fsproj
│   │   ├── Program.fs
│   │   ├── README.md
│   │   └── assets
│   │       ├── datafile.png
│   │       ├── mnist.png
│   │       └── mnist_hard.png
│   └── FlagToxicComments
│       ├── README.md
│       └── assets
│           └── data.png
├── README.md
├── Recommendation
│   └── MovieRecommender
│       ├── MovieRecommender.fsproj
│       ├── Program.fs
│       ├── README.md
│       ├── assets
│       │   ├── data.png
│       │   └── movies.png
│       ├── recommendation-movies.csv
│       ├── recommendation-ratings-test.csv
│       └── recommendation-ratings-train.csv
├── Regression
│   ├── BikeDemandPrediction
│   │   ├── BikeDemand.fsproj
│   │   ├── Program.fs
│   │   ├── README.md
│   │   ├── assets
│   │   │   ├── bikesharing.jpeg
│   │   │   └── data.png
│   │   └── bikedemand.csv
│   ├── HousePricePrediction
│   │   ├── README.md
│   │   ├── assets
│   │   │   └── data.png
│   │   └── data.csv
│   └── TaxiFarePrediction
│       ├── Program.fs
│       ├── README.md
│       ├── TaxiFarePrediction.fsproj
│       ├── assets
│       │   └── data.png
│       └── yellow_tripdata_2018-12_small.csv
└── assets
    └── DSC-FS.jpg
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | github: [mdfarragher]
4 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | BinaryClassification/HeartDiseasePrediction/bin/
2 | BinaryClassification/HeartDiseasePrediction/obj/
3 | BinaryClassification/SpamDetection/bin/
4 | BinaryClassification/SpamDetection/obj/
5 | BinaryClassification/TitanicPrediction/bin/
6 | BinaryClassification/TitanicPrediction/obj/
7 | Clustering/IrisFlower/obj/
8 | MulticlassClassification/DigitRecognition/bin/
9 | MulticlassClassification/DigitRecognition/obj/
10 | Regression/BikeDemandPrediction/bin/
11 | Regression/BikeDemandPrediction/obj/
12 | Regression/TaxiFarePrediction/bin/
13 | Regression/TaxiFarePrediction/obj/
14 | MulticlassClassification/DigitRecognition/mnist_test.csv
15 | MulticlassClassification/DigitRecognition/mnist_train.csv
16 | Clustering/IrisFlower/bin/
17 | Recommendation/MovieRecommender/bin/
18 | Recommendation/MovieRecommender/obj/
19 | LoadingData/CaliforniaHousing/bin/
20 | LoadingData/CaliforniaHousing/obj/
21 | Regression/TaxiFarePrediction/yellow_tripdata_2018-12.csv
22 |
--------------------------------------------------------------------------------
/BinaryClassification/DiabetesDetection/README.md:
--------------------------------------------------------------------------------
1 | # The case
2 |
3 | The Pima are a tribe of North American Indians who traditionally lived along the Gila and Salt rivers in Arizona, U.S., in what was the core area of the prehistoric Hohokam culture. They speak a Uto-Aztecan language and call themselves the River People and are usually considered to be the descendants of the Hohokam.
4 |
5 | But there's a weird thing about the Pima: they have the highest reported prevalence of diabetes of any population in the world. Their diabetes is exclusively type 2 diabetes, with no evidence of type 1 diabetes, even in very young children with an early onset of the disease.
6 |
7 | This suggests that the Pima carry a specific gene mutation that makes them extremely susceptible to diabetes. The tribe has been the focus of many medical studies over the years.
8 |
9 | In this case study, you're going to participate in one of these medical studies. You will build an app that loads a dataset of Pima medical records and tries to predict from the data who has diabetes and who does not.
10 |
11 | How accurate will your app be? Do you think you will be able to correctly predict every single diabetes case?
12 |
13 | That's for you to find out!
14 |
15 | # The dataset
16 |
17 | ![The dataset](./assets/data.png)
18 |
19 | In this case study you'll be working with a dataset containing the medical records of 768 Pima women.
20 |
21 | There is a single file in the dataset:
22 | * [diabetes.csv](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) which contains 768 records, 8 input features, and 1 output label. You will use this file to train and test your model.
23 |
24 | You'll need to [download the dataset](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/DiabetesDetection/diabetes.csv) and save it in your project folder to get started.
25 |
26 | Here's a description of all columns in the file:
27 | * **Pregnancies**: the number of times the woman got pregnant
28 | * **Glucose**: the plasma glucose concentration at 2 hours in an oral glucose tolerance test
29 | * **BloodPressure**: the diastolic blood pressure (mm Hg)
30 | * **SkinThickness**: the triceps skin fold thickness (mm)
31 | * **Insulin**: the 2-hour serum insulin concentration (mu U/ml)
32 | * **BMI**: the body mass index (weight in kg/(height in m)^2)
33 | * **DiabetesPedigreeFunction**: the diabetes pedigree function
34 | * **Age**: the age (years)
35 | * **Outcome**: the label you need to predict - 1 if the woman has diabetes, 0 if she has not
36 |
37 |
38 | # Getting started
39 | Go to the console and set up a new console application:
40 |
41 | ```bash
42 | $ dotnet new console --language F# --output DiabetesDetection
43 | $ cd DiabetesDetection
44 | ```
45 |
46 | Then install the ML.NET NuGet package:
47 |
48 | ```bash
49 | $ dotnet add package Microsoft.ML
50 | $ dotnet add package Microsoft.ML.FastTree
51 | ```
52 |
53 | And launch the Visual Studio Code editor:
54 |
55 | ```bash
56 | $ code .
57 | ```
58 |
59 | The rest is up to you!
60 |
61 | # Your assignment
62 | I want you to build an app that reads the data file and splits it for training and testing. Reserve 80% of all records for training and 20% for testing.
63 |
64 | Process the data and train a binary classifier on the training partition. Then use the fully-trained model to generate predictions for the records in the testing partition.
65 |
66 | Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group.
67 |
68 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model?
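
One possible starting point, as a minimal sketch: an input type plus a simple pipeline. The **DiabetesData** type, its field names, and the **FastTree** learner are placeholder choices of mine, and the sketch assumes the text loader parses the 0/1 **Outcome** values as booleans:

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

// hypothetical input type matching the columns described above
[<CLIMutable>]
type DiabetesData = {
    [<LoadColumn(0)>] Pregnancies : float32
    [<LoadColumn(1)>] Glucose : float32
    [<LoadColumn(2)>] BloodPressure : float32
    [<LoadColumn(3)>] SkinThickness : float32
    [<LoadColumn(4)>] Insulin : float32
    [<LoadColumn(5)>] BMI : float32
    [<LoadColumn(6)>] DiabetesPedigreeFunction : float32
    [<LoadColumn(7)>] Age : float32
    [<LoadColumn(8)>] Label : bool    // Outcome: 1 = diabetes, 0 = no diabetes
}

let context = MLContext()

// load the data and reserve 20% of all records for testing
let data = context.Data.LoadFromTextFile<DiabetesData>("diabetes.csv", hasHeader = true, separatorChar = ',')
let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)

// combine all feature columns and train a binary classifier
let pipeline =
    EstimatorChain()
        .Append(context.Transforms.Concatenate("Features", "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"))
        .Append(context.BinaryClassification.Trainers.FastTree())

// train on the training partition, evaluate on the test partition
let model = partitions.TrainSet |> pipeline.Fit
let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
printfn "AUC: %f" metrics.AreaUnderRocCurve
```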
69 |
70 | Good luck!
--------------------------------------------------------------------------------
/BinaryClassification/DiabetesDetection/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/DiabetesDetection/assets/data.png
--------------------------------------------------------------------------------
/BinaryClassification/FraudDetection/README.md:
--------------------------------------------------------------------------------
1 | # The case
2 |
3 | It is very important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
4 |
5 | Credit card fraud happens a lot. During two days in September 2013 in Europe, credit card networks recorded at least 492 fraud cases out of a total of 284,807 transactions. That's 246 fraud cases per day!
6 |
7 | In this case study, you're going to help credit card companies detect fraud in real time. You will build an app and train it on detected fraud cases, and then test your predictions on a new set of transactions.
8 |
9 | How accurate will your app be? Do you think you will be able to detect financial fraud in real time?
10 |
11 | That's for you to find out!
12 |
13 | # The dataset
14 |
15 | ![The dataset](./assets/data.png)
16 |
17 | In this case study you'll be working with a dataset containing transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions.
18 |
19 | Note that the dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.
20 |
21 | The data set contains 285k records, 30 feature columns, and a single label indicating if the transaction is fraudulent or not. You can use any combination of features you like to generate your fraud predictions.
22 |
23 | There is a single file in the dataset:
24 | * [creditcard.csv](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) which contains 285k records, 30 input features, and one output label. You will use this file to train and test your model.
25 |
26 | The file is about 150 MB in size. You'll need to [download it from Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud/downloads/creditcard.csv/3) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet.
27 |
28 | Here's a description of all 31 columns in the data file:
29 | * Time: Number of seconds elapsed between this transaction and the first transaction in the dataset
30 | * V1-V28: A feature of the transaction, processed to a number to protect user identities and sensitive information
31 | * Amount: Transaction amount
32 | * Class: 1 for fraudulent transactions, 0 otherwise
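
Because V1-V28 are 28 consecutive columns, you don't have to declare 28 separate fields: ML.NET can load a column range straight into a vector. A minimal sketch (the **TransactionData** type and its field names are hypothetical):

```fsharp
open Microsoft.ML.Data

[<CLIMutable>]
type TransactionData = {
    [<LoadColumn(0)>] Time : float32
    [<LoadColumn(1, 28); VectorType(28)>] Features : float32[]    // V1-V28 as one vector
    [<LoadColumn(29)>] Amount : float32
    [<LoadColumn(30)>] Label : bool    // Class: 1 = fraud, 0 = legitimate
}
```

A **Concatenate** transform can later combine the **Time**, **Features**, and **Amount** columns into a single input column for the learner.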
33 |
34 | # Getting started
35 | Go to the console and set up a new console application:
36 |
37 | ```bash
38 | $ dotnet new console --language F# --output FraudDetection
39 | $ cd FraudDetection
40 | ```
41 |
42 | Then install the ML.NET NuGet package:
43 |
44 | ```bash
45 | $ dotnet add package Microsoft.ML
46 | $ dotnet add package Microsoft.ML.FastTree
47 | ```
48 |
49 | And launch the Visual Studio Code editor:
50 |
51 | ```bash
52 | $ code .
53 | ```
54 |
55 | The rest is up to you!
56 |
57 | # Your assignment
58 | I want you to build an app that reads the data file into memory and splits it. Use 80% for training and 20% for testing.
59 |
60 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns.
61 |
62 | Process the selected input features, then train a binary classifier on the data and generate predictions for the transactions in the testing partition.
63 |
64 | Use the trained model to make fraud predictions on the test data. Decide which metrics you're going to use to evaluate your model, but make sure to include the **AUC** too. Report your best values in our group.
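
Given how unbalanced this dataset is, the area under the precision-recall curve is worth reporting alongside the AUC. Both come out of the evaluation step; a minimal sketch, assuming **metrics** holds the result of **context.BinaryClassification.Evaluate**:

```fsharp
// for a heavily imbalanced dataset, AUPRC says more than AUC
printfn "AUC:   %f" metrics.AreaUnderRocCurve
printfn "AUPRC: %f" metrics.AreaUnderPrecisionRecallCurve
```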
65 |
66 | See if you can get the AUC as close to 1 as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model?
67 |
68 | Good luck!
--------------------------------------------------------------------------------
/BinaryClassification/FraudDetection/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/FraudDetection/assets/data.png
--------------------------------------------------------------------------------
/BinaryClassification/HeartDiseasePrediction/Heart.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" Version="..." />
14 |     <PackageReference Include="Microsoft.ML.FastTree" Version="..." />
15 |   </ItemGroup>
16 |
17 | </Project>
18 |
--------------------------------------------------------------------------------
/BinaryClassification/HeartDiseasePrediction/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open System.IO
3 | open Microsoft.ML
4 | open Microsoft.ML.Data
5 |
6 | /// The HeartData record holds one single heart data record.
7 | [<CLIMutable>]
8 | type HeartData = {
9 | [<LoadColumn(0)>] Age : float32
10 | [<LoadColumn(1)>] Sex : float32
11 | [<LoadColumn(2)>] Cp : float32
12 | [<LoadColumn(3)>] TrestBps : float32
13 | [<LoadColumn(4)>] Chol : float32
14 | [<LoadColumn(5)>] Fbs : float32
15 | [<LoadColumn(6)>] RestEcg : float32
16 | [<LoadColumn(7)>] Thalac : float32
17 | [<LoadColumn(8)>] Exang : float32
18 | [<LoadColumn(9)>] OldPeak : float32
19 | [<LoadColumn(10)>] Slope : float32
20 | [<LoadColumn(11)>] Ca : float32
21 | [<LoadColumn(12)>] Thal : float32
22 | [<LoadColumn(13)>] Diagnosis : float32
23 | }
24 |
25 | /// The HeartPrediction class contains a single heart data prediction.
26 | [<CLIMutable>]
27 | type HeartPrediction = {
28 | [<ColumnName("PredictedLabel")>] Prediction : bool
29 | Probability : float32
30 | Score : float32
31 | }
32 |
33 | /// The ToLabel class is a helper class for a column transformation.
34 | [<CLIMutable>]
35 | type ToLabel = {
36 | mutable Label : bool
37 | }
38 |
39 | /// file paths to data files (assumes os = windows!)
40 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory
41 |
42 | /// The main application entry point.
43 | [<EntryPoint>]
44 | let main argv =
45 |
46 | // set up a machine learning context
47 | let context = new MLContext()
48 |
49 | // load training and test data
50 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',')
51 |
52 | // split the data into a training and test partition
53 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
54 |
55 | // set up a training pipeline
56 | let pipeline =
57 | EstimatorChain()
58 |
59 | // step 1: convert the label value to a boolean
60 | .Append(
61 | context.Transforms.CustomMapping(
62 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f),
63 | "LabelMapping"))
64 |
65 | // step 2: concatenate all feature columns
66 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal"))
67 |
68 | // step 3: set up a fast tree learner
69 | .Append(context.BinaryClassification.Trainers.FastTree())
70 |
71 | // train the model
72 | let model = partitions.TrainSet |> pipeline.Fit
73 |
74 | // make predictions and compare with the ground truth
75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
76 |
77 | // report the results
78 | printfn "Model metrics:"
79 | printfn " Accuracy: %f" metrics.Accuracy
80 | printfn " Auc: %f" metrics.AreaUnderRocCurve
81 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
82 | printfn " F1Score: %f" metrics.F1Score
83 | printfn " LogLoss: %f" metrics.LogLoss
84 | printfn " LogLossReduction: %f" metrics.LogLossReduction
85 | printfn " PositivePrecision: %f" metrics.PositivePrecision
86 | printfn " PositiveRecall: %f" metrics.PositiveRecall
87 | printfn " NegativePrecision: %f" metrics.NegativePrecision
88 | printfn " NegativeRecall: %f" metrics.NegativeRecall
89 |
90 | // set up a prediction engine
91 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model
92 |
93 | // create a sample patient
94 | let sample = {
95 | Age = 36.0f
96 | Sex = 1.0f
97 | Cp = 4.0f
98 | TrestBps = 145.0f
99 | Chol = 210.0f
100 | Fbs = 0.0f
101 | RestEcg = 2.0f
102 | Thalac = 148.0f
103 | Exang = 1.0f
104 | OldPeak = 1.9f
105 | Slope = 2.0f
106 | Ca = 1.0f
107 | Thal = 7.0f
108 | Diagnosis = 0.0f // unused
109 | }
110 |
111 | // make the prediction
112 | let prediction = sample |> predictionEngine.Predict
113 |
114 | // report the results
115 | printfn "\r"
116 | printfn "Single prediction:"
117 | printfn " Prediction: %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk")
118 | printfn " Probability: %f" prediction.Probability
119 |
120 | 0 // return value
--------------------------------------------------------------------------------
/BinaryClassification/HeartDiseasePrediction/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Predict heart disease risk
2 |
3 | In this assignment you're going to build an app that can predict the heart disease risk in a group of patients.
4 |
5 | The first thing you will need for your app is a data file with patients, their medical info, and their heart disease risk assessment. We're going to use the famous [UCI Heart Disease Dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) which has real-life data from 303 patients.
6 |
7 | Download the [Processed Cleveland Data](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data) file and save it as **processed.cleveland.data.csv**.
8 |
9 | The data file looks like this:
10 |
11 | ![Data file](./assets/data.png)
12 |
13 | It’s a CSV file with 14 columns of information:
14 |
15 | * Age
16 | * Sex: 1 = male, 0 = female
17 | * Chest Pain Type: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
18 | * Resting blood pressure in mm Hg on admission to the hospital
19 | * Serum cholesterol in mg/dl
20 | * Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false
21 | * Resting EKG results: 0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria
22 | * Maximum heart rate achieved
23 | * Exercise induced angina: 1 = yes; 0 = no
24 | * ST depression induced by exercise relative to rest
25 | * Slope of the peak exercise ST segment: 1 = up-sloping, 2 = flat, 3 = down-sloping
26 | * Number of major vessels (0–3) colored by fluoroscopy
27 | * Thallium heart scan results: 3 = normal, 6 = fixed defect, 7 = reversible defect
28 | * Diagnosis of heart disease: 0 = normal risk, 1-4 = elevated risk
29 |
30 | The first 13 columns are patient diagnostic information, and the last column is the diagnosis: 0 means a healthy patient, and values 1-4 mean an elevated risk of heart disease.
31 |
32 | You are going to build a binary classification machine learning model that reads in all 13 columns of patient information, and then makes a prediction for the heart disease risk.
33 |
34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
35 |
36 | ```bash
37 | $ dotnet new console --language F# --output Heart
38 | $ cd Heart
39 | ```
40 |
41 | Now install the following ML.NET packages:
42 |
43 | ```bash
44 | $ dotnet add package Microsoft.ML
45 | $ dotnet add package Microsoft.ML.FastTree
46 | ```
47 |
48 | Now you are ready to add some types. You’ll need one to hold patient info, and one to hold your model predictions.
49 |
50 | Replace the contents of the Program.fs file with this:
51 |
52 | ```fsharp
53 | open System
54 | open System.IO
55 | open Microsoft.ML
56 | open Microsoft.ML.Data
57 |
58 | /// The HeartData record holds one single heart data record.
59 | [<CLIMutable>]
60 | type HeartData = {
61 | [<LoadColumn(0)>] Age : float32
62 | [<LoadColumn(1)>] Sex : float32
63 | [<LoadColumn(2)>] Cp : float32
64 | [<LoadColumn(3)>] TrestBps : float32
65 | [<LoadColumn(4)>] Chol : float32
66 | [<LoadColumn(5)>] Fbs : float32
67 | [<LoadColumn(6)>] RestEcg : float32
68 | [<LoadColumn(7)>] Thalac : float32
69 | [<LoadColumn(8)>] Exang : float32
70 | [<LoadColumn(9)>] OldPeak : float32
71 | [<LoadColumn(10)>] Slope : float32
72 | [<LoadColumn(11)>] Ca : float32
73 | [<LoadColumn(12)>] Thal : float32
74 | [<LoadColumn(13)>] Diagnosis : float32
75 | }
76 |
77 | /// The HeartPrediction class contains a single heart data prediction.
78 | [<CLIMutable>]
79 | type HeartPrediction = {
80 | [<ColumnName("PredictedLabel")>] Prediction : bool
81 | Probability : float32
82 | Score : float32
83 | }
84 |
85 | // the rest of the code goes here....
86 | ```
87 |
88 | The **HeartData** class holds one single patient record. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.
89 |
90 | There's also a **HeartPrediction** class which will hold a single heart disease prediction. There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction.
91 |
92 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
93 |
94 | Now look at the final **Diagnosis** column in the data file. Our label is an integer value between 0-4, with 0 meaning 'no risk' and 1-4 meaning 'elevated risk'.
95 |
96 | But you're building a Binary Classifier which means your model needs to be trained on boolean labels.
97 |
98 | So you'll have to somehow convert the 'raw' numeric label (stored in the **Diagnosis** field) to a boolean value.
99 |
100 | To set that up, you'll need a helper type:
101 |
102 | ```fsharp
103 | /// The ToLabel class is a helper class for a column transformation.
104 | [<CLIMutable>]
105 | type ToLabel = {
106 | mutable Label : bool
107 | }
108 |
109 | // the rest of the code goes here....
110 | ```
111 |
112 | The **ToLabel** type contains the label converted to a boolean value. We'll set up that conversion in a minute.
113 |
114 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction.
115 |
116 | Now you're going to load the training data in memory:
117 |
118 | ```fsharp
119 | /// file paths to data files (assumes os = windows!)
120 | let dataPath = sprintf "%s\\processed.cleveland.data.csv" Environment.CurrentDirectory
121 |
122 | /// The main application entry point.
123 | [<EntryPoint>]
124 | let main argv =
125 |
126 | // set up a machine learning context
127 | let context = new MLContext()
128 |
129 | // load training and test data
130 | let data = context.Data.LoadFromTextFile<HeartData>(dataPath, hasHeader = false, separatorChar = ',')
131 |
132 | // split the data into a training and test partition
133 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
134 |
135 | // the rest of the code goes here....
136 |
137 | 0 // return value
138 | ```
139 |
140 | This code uses the method **LoadFromTextFile** to load the CSV data directly into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **HeartData** class.
141 |
142 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data.
143 |
144 | Now you’re ready to start building the machine learning model:
145 |
146 | ```fsharp
147 | // set up a training pipeline
148 | let pipeline =
149 | EstimatorChain()
150 |
151 | // step 1: convert the label value to a boolean
152 | .Append(
153 | context.Transforms.CustomMapping(
154 | Action<HeartData, ToLabel>(fun input output -> output.Label <- input.Diagnosis > 0.0f),
155 | "LabelMapping"))
156 |
157 | // step 2: concatenate all feature columns
158 | .Append(context.Transforms.Concatenate("Features", "Age", "Sex", "Cp", "TrestBps", "Chol", "Fbs", "RestEcg", "Thalac", "Exang", "OldPeak", "Slope", "Ca", "Thal"))
159 |
160 | // step 3: set up a fast tree learner
161 | .Append(context.BinaryClassification.Trainers.FastTree())
162 |
163 | // train the model
164 | let model = partitions.TrainSet |> pipeline.Fit
165 |
166 | // the rest of the code goes here....
167 | ```
168 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
169 |
170 | This pipeline has the following components:
171 |
172 | * A **CustomMapping** that transforms the numeric label to a boolean value. We define 0 values as healthy, and anything above 0 as an elevated risk.
173 | * **Concatenate** which combines all input data columns into a single column called 'Features'. This is a required step because ML.NET can only train on a single input column.
174 | * A **FastTree** classification learner which will train the model to make accurate predictions.
175 |
176 | The **FastTreeBinaryClassificationTrainer** is a very nice training algorithm that uses gradient boosting, a machine learning technique for classification problems.
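
If you want to experiment, the trainer also accepts optional hyperparameters. Here's a sketch with example values (not tuned for this dataset) that you could plug into step 3 instead of the default trainer:

```fsharp
// a FastTree learner with explicit hyperparameters (example values, not tuned)
let trainer =
    context.BinaryClassification.Trainers.FastTree(
        numberOfLeaves = 20,              // maximum leaves per decision tree
        numberOfTrees = 100,              // total trees in the ensemble
        minimumExampleCountPerLeaf = 10,  // minimum records needed to form a leaf
        learningRate = 0.2)               // how aggressively boosting corrects errors
```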
177 |
178 | With the pipeline fully assembled, we can train the model by piping the **TrainSet** into the **Fit** function.
179 |
180 | You now have a fully-trained model. So now it's time to take the test partition, predict the diagnosis for each patient, and calculate the accuracy metrics of the model:
181 |
182 | ```fsharp
183 | // make predictions and compare with the ground truth
184 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
185 |
186 | // report the results
187 | printfn "Model metrics:"
188 | printfn " Accuracy: %f" metrics.Accuracy
189 | printfn " Auc: %f" metrics.AreaUnderRocCurve
190 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
191 | printfn " F1Score: %f" metrics.F1Score
192 | printfn " LogLoss: %f" metrics.LogLoss
193 | printfn " LogLossReduction: %f" metrics.LogLossReduction
194 | printfn " PositivePrecision: %f" metrics.PositivePrecision
195 | printfn " PositiveRecall: %f" metrics.PositiveRecall
196 | printfn " NegativePrecision: %f" metrics.NegativePrecision
197 | printfn " NegativeRecall: %f" metrics.NegativeRecall
198 |
199 | // the rest of the code goes here....
200 | ```
201 |
202 | This code pipes the **TestSet** into **model.Transform** to set up a prediction for every patient in the set, and then pipes the predictions into **Evaluate** to compare these predictions to the ground truth and automatically calculate all evaluation metrics:
203 |
204 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions.
205 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
206 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive.
207 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive.
208 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
209 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much the model’s predictions improve on random chance.
210 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
211 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high.
212 | * **NegativePrecision**: this is the fraction of negative predictions that are correct.
213 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases.
214 |
215 | When monitoring heart disease, you definitely want to avoid false negatives because you don’t want to be sending high-risk patients home and telling them everything is okay.
216 |
217 | You also want to avoid false positives, but they are a lot better than a false negative because later tests would probably discover that the patient is healthy after all.
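
You can inspect the raw false positive and false negative counts by printing the confusion matrix that comes with the metrics object:

```fsharp
// print the confusion matrix to see the raw false positive and false negative counts
printfn "%s" (metrics.ConfusionMatrix.GetFormattedConfusionTable())
```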
218 |
219 | To wrap up, you’re going to create a new patient record and ask the model to make a prediction:
220 |
221 | ```fsharp
222 | // set up a prediction engine
223 | let predictionEngine = context.Model.CreatePredictionEngine<HeartData, HeartPrediction> model
224 |
225 | // create a sample patient
226 | let sample = {
227 | Age = 36.0f
228 | Sex = 1.0f
229 | Cp = 4.0f
230 | TrestBps = 145.0f
231 | Chol = 210.0f
232 | Fbs = 0.0f
233 | RestEcg = 2.0f
234 | Thalac = 148.0f
235 | Exang = 1.0f
236 | OldPeak = 1.9f
237 | Slope = 2.0f
238 | Ca = 1.0f
239 | Thal = 7.0f
240 | Diagnosis = 0.0f // unused
241 | }
242 |
243 | // make the prediction
244 | let prediction = sample |> predictionEngine.Predict
245 |
246 | // report the results
247 | printfn "\r"
248 | printfn "Single prediction:"
249 | printfn " Prediction: %s" (if prediction.Prediction then "Elevated heart disease risk" else "Normal heart disease risk")
250 | printfn " Probability: %f" prediction.Probability
251 | ```
252 |
253 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then creates a new patient record for a 36-year old male with asymptomatic chest pain and a bunch of other medical info.
254 |
255 | We then pipe the patient record into the **Predict** function and display the diagnosis.
256 |
257 | What’s the model going to predict?
258 |
259 | Time to find out. Go to your terminal and run your code:
260 |
261 | ```bash
262 | $ dotnet run
263 | ```
264 |
265 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value?
266 |
267 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model?
268 |
269 | And what about our patient? What did your model predict?
270 |
271 | Think about the code in this assignment. How could you improve the accuracy of the model? What are your best AUC and AUCPRC values?
272 |
273 | Share your results in our group!
274 |
--------------------------------------------------------------------------------
/BinaryClassification/HeartDiseasePrediction/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/HeartDiseasePrediction/assets/data.png
--------------------------------------------------------------------------------
/BinaryClassification/HeartDiseasePrediction/processed.cleveland.data.csv:
--------------------------------------------------------------------------------
1 | 63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
2 | 67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
3 | 67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
4 | 37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
5 | 41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
6 | 56.0,1.0,2.0,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0
7 | 62.0,0.0,4.0,140.0,268.0,0.0,2.0,160.0,0.0,3.6,3.0,2.0,3.0,3
8 | 57.0,0.0,4.0,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0
9 | 63.0,1.0,4.0,130.0,254.0,0.0,2.0,147.0,0.0,1.4,2.0,1.0,7.0,2
10 | 53.0,1.0,4.0,140.0,203.0,1.0,2.0,155.0,1.0,3.1,3.0,0.0,7.0,1
11 | 57.0,1.0,4.0,140.0,192.0,0.0,0.0,148.0,0.0,0.4,2.0,0.0,6.0,0
12 | 56.0,0.0,2.0,140.0,294.0,0.0,2.0,153.0,0.0,1.3,2.0,0.0,3.0,0
13 | 56.0,1.0,3.0,130.0,256.0,1.0,2.0,142.0,1.0,0.6,2.0,1.0,6.0,2
14 | 44.0,1.0,2.0,120.0,263.0,0.0,0.0,173.0,0.0,0.0,1.0,0.0,7.0,0
15 | 52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,0.0,7.0,0
16 | 57.0,1.0,3.0,150.0,168.0,0.0,0.0,174.0,0.0,1.6,1.0,0.0,3.0,0
17 | 48.0,1.0,2.0,110.0,229.0,0.0,0.0,168.0,0.0,1.0,3.0,0.0,7.0,1
18 | 54.0,1.0,4.0,140.0,239.0,0.0,0.0,160.0,0.0,1.2,1.0,0.0,3.0,0
19 | 48.0,0.0,3.0,130.0,275.0,0.0,0.0,139.0,0.0,0.2,1.0,0.0,3.0,0
20 | 49.0,1.0,2.0,130.0,266.0,0.0,0.0,171.0,0.0,0.6,1.0,0.0,3.0,0
21 | 64.0,1.0,1.0,110.0,211.0,0.0,2.0,144.0,1.0,1.8,2.0,0.0,3.0,0
22 | 58.0,0.0,1.0,150.0,283.0,1.0,2.0,162.0,0.0,1.0,1.0,0.0,3.0,0
23 | 58.0,1.0,2.0,120.0,284.0,0.0,2.0,160.0,0.0,1.8,2.0,0.0,3.0,1
24 | 58.0,1.0,3.0,132.0,224.0,0.0,2.0,173.0,0.0,3.2,1.0,2.0,7.0,3
25 | 60.0,1.0,4.0,130.0,206.0,0.0,2.0,132.0,1.0,2.4,2.0,2.0,7.0,4
26 | 50.0,0.0,3.0,120.0,219.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0
27 | 58.0,0.0,3.0,120.0,340.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0
28 | 66.0,0.0,1.0,150.0,226.0,0.0,0.0,114.0,0.0,2.6,3.0,0.0,3.0,0
29 | 43.0,1.0,4.0,150.0,247.0,0.0,0.0,171.0,0.0,1.5,1.0,0.0,3.0,0
30 | 40.0,1.0,4.0,110.0,167.0,0.0,2.0,114.0,1.0,2.0,2.0,0.0,7.0,3
31 | 69.0,0.0,1.0,140.0,239.0,0.0,0.0,151.0,0.0,1.8,1.0,2.0,3.0,0
32 | 60.0,1.0,4.0,117.0,230.0,1.0,0.0,160.0,1.0,1.4,1.0,2.0,7.0,2
33 | 64.0,1.0,3.0,140.0,335.0,0.0,0.0,158.0,0.0,0.0,1.0,0.0,3.0,1
34 | 59.0,1.0,4.0,135.0,234.0,0.0,0.0,161.0,0.0,0.5,2.0,0.0,7.0,0
35 | 44.0,1.0,3.0,130.0,233.0,0.0,0.0,179.0,1.0,0.4,1.0,0.0,3.0,0
36 | 42.0,1.0,4.0,140.0,226.0,0.0,0.0,178.0,0.0,0.0,1.0,0.0,3.0,0
37 | 43.0,1.0,4.0,120.0,177.0,0.0,2.0,120.0,1.0,2.5,2.0,0.0,7.0,3
38 | 57.0,1.0,4.0,150.0,276.0,0.0,2.0,112.0,1.0,0.6,2.0,1.0,6.0,1
39 | 55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0,3
40 | 61.0,1.0,3.0,150.0,243.0,1.0,0.0,137.0,1.0,1.0,2.0,0.0,3.0,0
41 | 65.0,0.0,4.0,150.0,225.0,0.0,2.0,114.0,0.0,1.0,2.0,3.0,7.0,4
42 | 40.0,1.0,1.0,140.0,199.0,0.0,0.0,178.0,1.0,1.4,1.0,0.0,7.0,0
43 | 71.0,0.0,2.0,160.0,302.0,0.0,0.0,162.0,0.0,0.4,1.0,2.0,3.0,0
44 | 59.0,1.0,3.0,150.0,212.0,1.0,0.0,157.0,0.0,1.6,1.0,0.0,3.0,0
45 | 61.0,0.0,4.0,130.0,330.0,0.0,2.0,169.0,0.0,0.0,1.0,0.0,3.0,1
46 | 58.0,1.0,3.0,112.0,230.0,0.0,2.0,165.0,0.0,2.5,2.0,1.0,7.0,4
47 | 51.0,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0,0.0,3.0,0
48 | 50.0,1.0,4.0,150.0,243.0,0.0,2.0,128.0,0.0,2.6,2.0,0.0,7.0,4
49 | 65.0,0.0,3.0,140.0,417.0,1.0,2.0,157.0,0.0,0.8,1.0,1.0,3.0,0
50 | 53.0,1.0,3.0,130.0,197.0,1.0,2.0,152.0,0.0,1.2,3.0,0.0,3.0,0
51 | 41.0,0.0,2.0,105.0,198.0,0.0,0.0,168.0,0.0,0.0,1.0,1.0,3.0,0
52 | 65.0,1.0,4.0,120.0,177.0,0.0,0.0,140.0,0.0,0.4,1.0,0.0,7.0,0
53 | 44.0,1.0,4.0,112.0,290.0,0.0,2.0,153.0,0.0,0.0,1.0,1.0,3.0,2
54 | 44.0,1.0,2.0,130.0,219.0,0.0,2.0,188.0,0.0,0.0,1.0,0.0,3.0,0
55 | 60.0,1.0,4.0,130.0,253.0,0.0,0.0,144.0,1.0,1.4,1.0,1.0,7.0,1
56 | 54.0,1.0,4.0,124.0,266.0,0.0,2.0,109.0,1.0,2.2,2.0,1.0,7.0,1
57 | 50.0,1.0,3.0,140.0,233.0,0.0,0.0,163.0,0.0,0.6,2.0,1.0,7.0,1
58 | 41.0,1.0,4.0,110.0,172.0,0.0,2.0,158.0,0.0,0.0,1.0,0.0,7.0,1
59 | 54.0,1.0,3.0,125.0,273.0,0.0,2.0,152.0,0.0,0.5,3.0,1.0,3.0,0
60 | 51.0,1.0,1.0,125.0,213.0,0.0,2.0,125.0,1.0,1.4,1.0,1.0,3.0,0
61 | 51.0,0.0,4.0,130.0,305.0,0.0,0.0,142.0,1.0,1.2,2.0,0.0,7.0,2
62 | 46.0,0.0,3.0,142.0,177.0,0.0,2.0,160.0,1.0,1.4,3.0,0.0,3.0,0
63 | 58.0,1.0,4.0,128.0,216.0,0.0,2.0,131.0,1.0,2.2,2.0,3.0,7.0,1
64 | 54.0,0.0,3.0,135.0,304.0,1.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0
65 | 54.0,1.0,4.0,120.0,188.0,0.0,0.0,113.0,0.0,1.4,2.0,1.0,7.0,2
66 | 60.0,1.0,4.0,145.0,282.0,0.0,2.0,142.0,1.0,2.8,2.0,2.0,7.0,2
67 | 60.0,1.0,3.0,140.0,185.0,0.0,2.0,155.0,0.0,3.0,2.0,0.0,3.0,1
68 | 54.0,1.0,3.0,150.0,232.0,0.0,2.0,165.0,0.0,1.6,1.0,0.0,7.0,0
69 | 59.0,1.0,4.0,170.0,326.0,0.0,2.0,140.0,1.0,3.4,3.0,0.0,7.0,2
70 | 46.0,1.0,3.0,150.0,231.0,0.0,0.0,147.0,0.0,3.6,2.0,0.0,3.0,1
71 | 65.0,0.0,3.0,155.0,269.0,0.0,0.0,148.0,0.0,0.8,1.0,0.0,3.0,0
72 | 67.0,1.0,4.0,125.0,254.0,1.0,0.0,163.0,0.0,0.2,2.0,2.0,7.0,3
73 | 62.0,1.0,4.0,120.0,267.0,0.0,0.0,99.0,1.0,1.8,2.0,2.0,7.0,1
74 | 65.0,1.0,4.0,110.0,248.0,0.0,2.0,158.0,0.0,0.6,1.0,2.0,6.0,1
75 | 44.0,1.0,4.0,110.0,197.0,0.0,2.0,177.0,0.0,0.0,1.0,1.0,3.0,1
76 | 65.0,0.0,3.0,160.0,360.0,0.0,2.0,151.0,0.0,0.8,1.0,0.0,3.0,0
77 | 60.0,1.0,4.0,125.0,258.0,0.0,2.0,141.0,1.0,2.8,2.0,1.0,7.0,1
78 | 51.0,0.0,3.0,140.0,308.0,0.0,2.0,142.0,0.0,1.5,1.0,1.0,3.0,0
79 | 48.0,1.0,2.0,130.0,245.0,0.0,2.0,180.0,0.0,0.2,2.0,0.0,3.0,0
80 | 58.0,1.0,4.0,150.0,270.0,0.0,2.0,111.0,1.0,0.8,1.0,0.0,7.0,3
81 | 45.0,1.0,4.0,104.0,208.0,0.0,2.0,148.0,1.0,3.0,2.0,0.0,3.0,0
82 | 53.0,0.0,4.0,130.0,264.0,0.0,2.0,143.0,0.0,0.4,2.0,0.0,3.0,0
83 | 39.0,1.0,3.0,140.0,321.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0
84 | 68.0,1.0,3.0,180.0,274.0,1.0,2.0,150.0,1.0,1.6,2.0,0.0,7.0,3
85 | 52.0,1.0,2.0,120.0,325.0,0.0,0.0,172.0,0.0,0.2,1.0,0.0,3.0,0
86 | 44.0,1.0,3.0,140.0,235.0,0.0,2.0,180.0,0.0,0.0,1.0,0.0,3.0,0
87 | 47.0,1.0,3.0,138.0,257.0,0.0,2.0,156.0,0.0,0.0,1.0,0.0,3.0,0
88 | 53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
89 | 53.0,0.0,4.0,138.0,234.0,0.0,2.0,160.0,0.0,0.0,1.0,0.0,3.0,0
90 | 51.0,0.0,3.0,130.0,256.0,0.0,2.0,149.0,0.0,0.5,1.0,0.0,3.0,0
91 | 66.0,1.0,4.0,120.0,302.0,0.0,2.0,151.0,0.0,0.4,2.0,0.0,3.0,0
92 | 62.0,0.0,4.0,160.0,164.0,0.0,2.0,145.0,0.0,6.2,3.0,3.0,7.0,3
93 | 62.0,1.0,3.0,130.0,231.0,0.0,0.0,146.0,0.0,1.8,2.0,3.0,7.0,0
94 | 44.0,0.0,3.0,108.0,141.0,0.0,0.0,175.0,0.0,0.6,2.0,0.0,3.0,0
95 | 63.0,0.0,3.0,135.0,252.0,0.0,2.0,172.0,0.0,0.0,1.0,0.0,3.0,0
96 | 52.0,1.0,4.0,128.0,255.0,0.0,0.0,161.0,1.0,0.0,1.0,1.0,7.0,1
97 | 59.0,1.0,4.0,110.0,239.0,0.0,2.0,142.0,1.0,1.2,2.0,1.0,7.0,2
98 | 60.0,0.0,4.0,150.0,258.0,0.0,2.0,157.0,0.0,2.6,2.0,2.0,7.0,3
99 | 52.0,1.0,2.0,134.0,201.0,0.0,0.0,158.0,0.0,0.8,1.0,1.0,3.0,0
100 | 48.0,1.0,4.0,122.0,222.0,0.0,2.0,186.0,0.0,0.0,1.0,0.0,3.0,0
101 | 45.0,1.0,4.0,115.0,260.0,0.0,2.0,185.0,0.0,0.0,1.0,0.0,3.0,0
102 | 34.0,1.0,1.0,118.0,182.0,0.0,2.0,174.0,0.0,0.0,1.0,0.0,3.0,0
103 | 57.0,0.0,4.0,128.0,303.0,0.0,2.0,159.0,0.0,0.0,1.0,1.0,3.0,0
104 | 71.0,0.0,3.0,110.0,265.0,1.0,2.0,130.0,0.0,0.0,1.0,1.0,3.0,0
105 | 49.0,1.0,3.0,120.0,188.0,0.0,0.0,139.0,0.0,2.0,2.0,3.0,7.0,3
106 | 54.0,1.0,2.0,108.0,309.0,0.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0
107 | 59.0,1.0,4.0,140.0,177.0,0.0,0.0,162.0,1.0,0.0,1.0,1.0,7.0,2
108 | 57.0,1.0,3.0,128.0,229.0,0.0,2.0,150.0,0.0,0.4,2.0,1.0,7.0,1
109 | 61.0,1.0,4.0,120.0,260.0,0.0,0.0,140.0,1.0,3.6,2.0,1.0,7.0,2
110 | 39.0,1.0,4.0,118.0,219.0,0.0,0.0,140.0,0.0,1.2,2.0,0.0,7.0,3
111 | 61.0,0.0,4.0,145.0,307.0,0.0,2.0,146.0,1.0,1.0,2.0,0.0,7.0,1
112 | 56.0,1.0,4.0,125.0,249.0,1.0,2.0,144.0,1.0,1.2,2.0,1.0,3.0,1
113 | 52.0,1.0,1.0,118.0,186.0,0.0,2.0,190.0,0.0,0.0,2.0,0.0,6.0,0
114 | 43.0,0.0,4.0,132.0,341.0,1.0,2.0,136.0,1.0,3.0,2.0,0.0,7.0,2
115 | 62.0,0.0,3.0,130.0,263.0,0.0,0.0,97.0,0.0,1.2,2.0,1.0,7.0,2
116 | 41.0,1.0,2.0,135.0,203.0,0.0,0.0,132.0,0.0,0.0,2.0,0.0,6.0,0
117 | 58.0,1.0,3.0,140.0,211.0,1.0,2.0,165.0,0.0,0.0,1.0,0.0,3.0,0
118 | 35.0,0.0,4.0,138.0,183.0,0.0,0.0,182.0,0.0,1.4,1.0,0.0,3.0,0
119 | 63.0,1.0,4.0,130.0,330.0,1.0,2.0,132.0,1.0,1.8,1.0,3.0,7.0,3
120 | 65.0,1.0,4.0,135.0,254.0,0.0,2.0,127.0,0.0,2.8,2.0,1.0,7.0,2
121 | 48.0,1.0,4.0,130.0,256.0,1.0,2.0,150.0,1.0,0.0,1.0,2.0,7.0,3
122 | 63.0,0.0,4.0,150.0,407.0,0.0,2.0,154.0,0.0,4.0,2.0,3.0,7.0,4
123 | 51.0,1.0,3.0,100.0,222.0,0.0,0.0,143.0,1.0,1.2,2.0,0.0,3.0,0
124 | 55.0,1.0,4.0,140.0,217.0,0.0,0.0,111.0,1.0,5.6,3.0,0.0,7.0,3
125 | 65.0,1.0,1.0,138.0,282.0,1.0,2.0,174.0,0.0,1.4,2.0,1.0,3.0,1
126 | 45.0,0.0,2.0,130.0,234.0,0.0,2.0,175.0,0.0,0.6,2.0,0.0,3.0,0
127 | 56.0,0.0,4.0,200.0,288.0,1.0,2.0,133.0,1.0,4.0,3.0,2.0,7.0,3
128 | 54.0,1.0,4.0,110.0,239.0,0.0,0.0,126.0,1.0,2.8,2.0,1.0,7.0,3
129 | 44.0,1.0,2.0,120.0,220.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0
130 | 62.0,0.0,4.0,124.0,209.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0
131 | 54.0,1.0,3.0,120.0,258.0,0.0,2.0,147.0,0.0,0.4,2.0,0.0,7.0,0
132 | 51.0,1.0,3.0,94.0,227.0,0.0,0.0,154.0,1.0,0.0,1.0,1.0,7.0,0
133 | 29.0,1.0,2.0,130.0,204.0,0.0,2.0,202.0,0.0,0.0,1.0,0.0,3.0,0
134 | 51.0,1.0,4.0,140.0,261.0,0.0,2.0,186.0,1.0,0.0,1.0,0.0,3.0,0
135 | 43.0,0.0,3.0,122.0,213.0,0.0,0.0,165.0,0.0,0.2,2.0,0.0,3.0,0
136 | 55.0,0.0,2.0,135.0,250.0,0.0,2.0,161.0,0.0,1.4,2.0,0.0,3.0,0
137 | 70.0,1.0,4.0,145.0,174.0,0.0,0.0,125.0,1.0,2.6,3.0,0.0,7.0,4
138 | 62.0,1.0,2.0,120.0,281.0,0.0,2.0,103.0,0.0,1.4,2.0,1.0,7.0,3
139 | 35.0,1.0,4.0,120.0,198.0,0.0,0.0,130.0,1.0,1.6,2.0,0.0,7.0,1
140 | 51.0,1.0,3.0,125.0,245.0,1.0,2.0,166.0,0.0,2.4,2.0,0.0,3.0,0
141 | 59.0,1.0,2.0,140.0,221.0,0.0,0.0,164.0,1.0,0.0,1.0,0.0,3.0,0
142 | 59.0,1.0,1.0,170.0,288.0,0.0,2.0,159.0,0.0,0.2,2.0,0.0,7.0,1
143 | 52.0,1.0,2.0,128.0,205.0,1.0,0.0,184.0,0.0,0.0,1.0,0.0,3.0,0
144 | 64.0,1.0,3.0,125.0,309.0,0.0,0.0,131.0,1.0,1.8,2.0,0.0,7.0,1
145 | 58.0,1.0,3.0,105.0,240.0,0.0,2.0,154.0,1.0,0.6,2.0,0.0,7.0,0
146 | 47.0,1.0,3.0,108.0,243.0,0.0,0.0,152.0,0.0,0.0,1.0,0.0,3.0,1
147 | 57.0,1.0,4.0,165.0,289.0,1.0,2.0,124.0,0.0,1.0,2.0,3.0,7.0,4
148 | 41.0,1.0,3.0,112.0,250.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0
149 | 45.0,1.0,2.0,128.0,308.0,0.0,2.0,170.0,0.0,0.0,1.0,0.0,3.0,0
150 | 60.0,0.0,3.0,102.0,318.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,0
151 | 52.0,1.0,1.0,152.0,298.0,1.0,0.0,178.0,0.0,1.2,2.0,0.0,7.0,0
152 | 42.0,0.0,4.0,102.0,265.0,0.0,2.0,122.0,0.0,0.6,2.0,0.0,3.0,0
153 | 67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,0
154 | 55.0,1.0,4.0,160.0,289.0,0.0,2.0,145.0,1.0,0.8,2.0,1.0,7.0,4
155 | 64.0,1.0,4.0,120.0,246.0,0.0,2.0,96.0,1.0,2.2,3.0,1.0,3.0,3
156 | 70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,1
157 | 51.0,1.0,4.0,140.0,299.0,0.0,0.0,173.0,1.0,1.6,1.0,0.0,7.0,1
158 | 58.0,1.0,4.0,125.0,300.0,0.0,2.0,171.0,0.0,0.0,1.0,2.0,7.0,1
159 | 60.0,1.0,4.0,140.0,293.0,0.0,2.0,170.0,0.0,1.2,2.0,2.0,7.0,2
160 | 68.0,1.0,3.0,118.0,277.0,0.0,0.0,151.0,0.0,1.0,1.0,1.0,7.0,0
161 | 46.0,1.0,2.0,101.0,197.0,1.0,0.0,156.0,0.0,0.0,1.0,0.0,7.0,0
162 | 77.0,1.0,4.0,125.0,304.0,0.0,2.0,162.0,1.0,0.0,1.0,3.0,3.0,4
163 | 54.0,0.0,3.0,110.0,214.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,3.0,0
164 | 58.0,0.0,4.0,100.0,248.0,0.0,2.0,122.0,0.0,1.0,2.0,0.0,3.0,0
165 | 48.0,1.0,3.0,124.0,255.0,1.0,0.0,175.0,0.0,0.0,1.0,2.0,3.0,0
166 | 57.0,1.0,4.0,132.0,207.0,0.0,0.0,168.0,1.0,0.0,1.0,0.0,7.0,0
167 | 52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0
168 | 54.0,0.0,2.0,132.0,288.0,1.0,2.0,159.0,1.0,0.0,1.0,1.0,3.0,0
169 | 35.0,1.0,4.0,126.0,282.0,0.0,2.0,156.0,1.0,0.0,1.0,0.0,7.0,1
170 | 45.0,0.0,2.0,112.0,160.0,0.0,0.0,138.0,0.0,0.0,2.0,0.0,3.0,0
171 | 70.0,1.0,3.0,160.0,269.0,0.0,0.0,112.0,1.0,2.9,2.0,1.0,7.0,3
172 | 53.0,1.0,4.0,142.0,226.0,0.0,2.0,111.0,1.0,0.0,1.0,0.0,7.0,0
173 | 59.0,0.0,4.0,174.0,249.0,0.0,0.0,143.0,1.0,0.0,2.0,0.0,3.0,1
174 | 62.0,0.0,4.0,140.0,394.0,0.0,2.0,157.0,0.0,1.2,2.0,0.0,3.0,0
175 | 64.0,1.0,4.0,145.0,212.0,0.0,2.0,132.0,0.0,2.0,2.0,2.0,6.0,4
176 | 57.0,1.0,4.0,152.0,274.0,0.0,0.0,88.0,1.0,1.2,2.0,1.0,7.0,1
177 | 52.0,1.0,4.0,108.0,233.0,1.0,0.0,147.0,0.0,0.1,1.0,3.0,7.0,0
178 | 56.0,1.0,4.0,132.0,184.0,0.0,2.0,105.0,1.0,2.1,2.0,1.0,6.0,1
179 | 43.0,1.0,3.0,130.0,315.0,0.0,0.0,162.0,0.0,1.9,1.0,1.0,3.0,0
180 | 53.0,1.0,3.0,130.0,246.0,1.0,2.0,173.0,0.0,0.0,1.0,3.0,3.0,0
181 | 48.0,1.0,4.0,124.0,274.0,0.0,2.0,166.0,0.0,0.5,2.0,0.0,7.0,3
182 | 56.0,0.0,4.0,134.0,409.0,0.0,2.0,150.0,1.0,1.9,2.0,2.0,7.0,2
183 | 42.0,1.0,1.0,148.0,244.0,0.0,2.0,178.0,0.0,0.8,1.0,2.0,3.0,0
184 | 59.0,1.0,1.0,178.0,270.0,0.0,2.0,145.0,0.0,4.2,3.0,0.0,7.0,0
185 | 60.0,0.0,4.0,158.0,305.0,0.0,2.0,161.0,0.0,0.0,1.0,0.0,3.0,1
186 | 63.0,0.0,2.0,140.0,195.0,0.0,0.0,179.0,0.0,0.0,1.0,2.0,3.0,0
187 | 42.0,1.0,3.0,120.0,240.0,1.0,0.0,194.0,0.0,0.8,3.0,0.0,7.0,0
188 | 66.0,1.0,2.0,160.0,246.0,0.0,0.0,120.0,1.0,0.0,2.0,3.0,6.0,2
189 | 54.0,1.0,2.0,192.0,283.0,0.0,2.0,195.0,0.0,0.0,1.0,1.0,7.0,1
190 | 69.0,1.0,3.0,140.0,254.0,0.0,2.0,146.0,0.0,2.0,2.0,3.0,7.0,2
191 | 50.0,1.0,3.0,129.0,196.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0
192 | 51.0,1.0,4.0,140.0,298.0,0.0,0.0,122.0,1.0,4.2,2.0,3.0,7.0,3
193 | 43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1
194 | 62.0,0.0,4.0,138.0,294.0,1.0,0.0,106.0,0.0,1.9,2.0,3.0,3.0,2
195 | 68.0,0.0,3.0,120.0,211.0,0.0,2.0,115.0,0.0,1.5,2.0,0.0,3.0,0
196 | 67.0,1.0,4.0,100.0,299.0,0.0,2.0,125.0,1.0,0.9,2.0,2.0,3.0,3
197 | 69.0,1.0,1.0,160.0,234.0,1.0,2.0,131.0,0.0,0.1,2.0,1.0,3.0,0
198 | 45.0,0.0,4.0,138.0,236.0,0.0,2.0,152.0,1.0,0.2,2.0,0.0,3.0,0
199 | 50.0,0.0,2.0,120.0,244.0,0.0,0.0,162.0,0.0,1.1,1.0,0.0,3.0,0
200 | 59.0,1.0,1.0,160.0,273.0,0.0,2.0,125.0,0.0,0.0,1.0,0.0,3.0,1
201 | 50.0,0.0,4.0,110.0,254.0,0.0,2.0,159.0,0.0,0.0,1.0,0.0,3.0,0
202 | 64.0,0.0,4.0,180.0,325.0,0.0,0.0,154.0,1.0,0.0,1.0,0.0,3.0,0
203 | 57.0,1.0,3.0,150.0,126.0,1.0,0.0,173.0,0.0,0.2,1.0,1.0,7.0,0
204 | 64.0,0.0,3.0,140.0,313.0,0.0,0.0,133.0,0.0,0.2,1.0,0.0,7.0,0
205 | 43.0,1.0,4.0,110.0,211.0,0.0,0.0,161.0,0.0,0.0,1.0,0.0,7.0,0
206 | 45.0,1.0,4.0,142.0,309.0,0.0,2.0,147.0,1.0,0.0,2.0,3.0,7.0,3
207 | 58.0,1.0,4.0,128.0,259.0,0.0,2.0,130.0,1.0,3.0,2.0,2.0,7.0,3
208 | 50.0,1.0,4.0,144.0,200.0,0.0,2.0,126.0,1.0,0.9,2.0,0.0,7.0,3
209 | 55.0,1.0,2.0,130.0,262.0,0.0,0.0,155.0,0.0,0.0,1.0,0.0,3.0,0
210 | 62.0,0.0,4.0,150.0,244.0,0.0,0.0,154.0,1.0,1.4,2.0,0.0,3.0,1
211 | 37.0,0.0,3.0,120.0,215.0,0.0,0.0,170.0,0.0,0.0,1.0,0.0,3.0,0
212 | 38.0,1.0,1.0,120.0,231.0,0.0,0.0,182.0,1.0,3.8,2.0,0.0,7.0,4
213 | 41.0,1.0,3.0,130.0,214.0,0.0,2.0,168.0,0.0,2.0,2.0,0.0,3.0,0
214 | 66.0,0.0,4.0,178.0,228.0,1.0,0.0,165.0,1.0,1.0,2.0,2.0,7.0,3
215 | 52.0,1.0,4.0,112.0,230.0,0.0,0.0,160.0,0.0,0.0,1.0,1.0,3.0,1
216 | 56.0,1.0,1.0,120.0,193.0,0.0,2.0,162.0,0.0,1.9,2.0,0.0,7.0,0
217 | 46.0,0.0,2.0,105.0,204.0,0.0,0.0,172.0,0.0,0.0,1.0,0.0,3.0,0
218 | 46.0,0.0,4.0,138.0,243.0,0.0,2.0,152.0,1.0,0.0,2.0,0.0,3.0,0
219 | 64.0,0.0,4.0,130.0,303.0,0.0,0.0,122.0,0.0,2.0,2.0,2.0,3.0,0
220 | 59.0,1.0,4.0,138.0,271.0,0.0,2.0,182.0,0.0,0.0,1.0,0.0,3.0,0
221 | 41.0,0.0,3.0,112.0,268.0,0.0,2.0,172.0,1.0,0.0,1.0,0.0,3.0,0
222 | 54.0,0.0,3.0,108.0,267.0,0.0,2.0,167.0,0.0,0.0,1.0,0.0,3.0,0
223 | 39.0,0.0,3.0,94.0,199.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0
224 | 53.0,1.0,4.0,123.0,282.0,0.0,0.0,95.0,1.0,2.0,2.0,2.0,7.0,3
225 | 63.0,0.0,4.0,108.0,269.0,0.0,0.0,169.0,1.0,1.8,2.0,2.0,3.0,1
226 | 34.0,0.0,2.0,118.0,210.0,0.0,0.0,192.0,0.0,0.7,1.0,0.0,3.0,0
227 | 47.0,1.0,4.0,112.0,204.0,0.0,0.0,143.0,0.0,0.1,1.0,0.0,3.0,0
228 | 67.0,0.0,3.0,152.0,277.0,0.0,0.0,172.0,0.0,0.0,1.0,1.0,3.0,0
229 | 54.0,1.0,4.0,110.0,206.0,0.0,2.0,108.0,1.0,0.0,2.0,1.0,3.0,3
230 | 66.0,1.0,4.0,112.0,212.0,0.0,2.0,132.0,1.0,0.1,1.0,1.0,3.0,2
231 | 52.0,0.0,3.0,136.0,196.0,0.0,2.0,169.0,0.0,0.1,2.0,0.0,3.0,0
232 | 55.0,0.0,4.0,180.0,327.0,0.0,1.0,117.0,1.0,3.4,2.0,0.0,3.0,2
233 | 49.0,1.0,3.0,118.0,149.0,0.0,2.0,126.0,0.0,0.8,1.0,3.0,3.0,1
234 | 74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,0
235 | 54.0,0.0,3.0,160.0,201.0,0.0,0.0,163.0,0.0,0.0,1.0,1.0,3.0,0
236 | 54.0,1.0,4.0,122.0,286.0,0.0,2.0,116.0,1.0,3.2,2.0,2.0,3.0,3
237 | 56.0,1.0,4.0,130.0,283.0,1.0,2.0,103.0,1.0,1.6,3.0,0.0,7.0,2
238 | 46.0,1.0,4.0,120.0,249.0,0.0,2.0,144.0,0.0,0.8,1.0,0.0,7.0,1
239 | 49.0,0.0,2.0,134.0,271.0,0.0,0.0,162.0,0.0,0.0,2.0,0.0,3.0,0
240 | 42.0,1.0,2.0,120.0,295.0,0.0,0.0,162.0,0.0,0.0,1.0,0.0,3.0,0
241 | 41.0,1.0,2.0,110.0,235.0,0.0,0.0,153.0,0.0,0.0,1.0,0.0,3.0,0
242 | 41.0,0.0,2.0,126.0,306.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0
243 | 49.0,0.0,4.0,130.0,269.0,0.0,0.0,163.0,0.0,0.0,1.0,0.0,3.0,0
244 | 61.0,1.0,1.0,134.0,234.0,0.0,0.0,145.0,0.0,2.6,2.0,2.0,3.0,2
245 | 60.0,0.0,3.0,120.0,178.0,1.0,0.0,96.0,0.0,0.0,1.0,0.0,3.0,0
246 | 67.0,1.0,4.0,120.0,237.0,0.0,0.0,71.0,0.0,1.0,2.0,0.0,3.0,2
247 | 58.0,1.0,4.0,100.0,234.0,0.0,0.0,156.0,0.0,0.1,1.0,1.0,7.0,2
248 | 47.0,1.0,4.0,110.0,275.0,0.0,2.0,118.0,1.0,1.0,2.0,1.0,3.0,1
249 | 52.0,1.0,4.0,125.0,212.0,0.0,0.0,168.0,0.0,1.0,1.0,2.0,7.0,3
250 | 62.0,1.0,2.0,128.0,208.0,1.0,2.0,140.0,0.0,0.0,1.0,0.0,3.0,0
251 | 57.0,1.0,4.0,110.0,201.0,0.0,0.0,126.0,1.0,1.5,2.0,0.0,6.0,0
252 | 58.0,1.0,4.0,146.0,218.0,0.0,0.0,105.0,0.0,2.0,2.0,1.0,7.0,1
253 | 64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,0
254 | 51.0,0.0,3.0,120.0,295.0,0.0,2.0,157.0,0.0,0.6,1.0,0.0,3.0,0
255 | 43.0,1.0,4.0,115.0,303.0,0.0,0.0,181.0,0.0,1.2,2.0,0.0,3.0,0
256 | 42.0,0.0,3.0,120.0,209.0,0.0,0.0,173.0,0.0,0.0,2.0,0.0,3.0,0
257 | 67.0,0.0,4.0,106.0,223.0,0.0,0.0,142.0,0.0,0.3,1.0,2.0,3.0,0
258 | 76.0,0.0,3.0,140.0,197.0,0.0,1.0,116.0,0.0,1.1,2.0,0.0,3.0,0
259 | 70.0,1.0,2.0,156.0,245.0,0.0,2.0,143.0,0.0,0.0,1.0,0.0,3.0,0
260 | 57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,1
261 | 44.0,0.0,3.0,118.0,242.0,0.0,0.0,149.0,0.0,0.3,2.0,1.0,3.0,0
262 | 58.0,0.0,2.0,136.0,319.0,1.0,2.0,152.0,0.0,0.0,1.0,2.0,3.0,3
263 | 60.0,0.0,1.0,150.0,240.0,0.0,0.0,171.0,0.0,0.9,1.0,0.0,3.0,0
264 | 44.0,1.0,3.0,120.0,226.0,0.0,0.0,169.0,0.0,0.0,1.0,0.0,3.0,0
265 | 61.0,1.0,4.0,138.0,166.0,0.0,2.0,125.0,1.0,3.6,2.0,1.0,3.0,4
266 | 42.0,1.0,4.0,136.0,315.0,0.0,0.0,125.0,1.0,1.8,2.0,0.0,6.0,2
267 | 52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,2
268 | 59.0,1.0,3.0,126.0,218.0,1.0,0.0,134.0,0.0,2.2,2.0,1.0,6.0,2
269 | 40.0,1.0,4.0,152.0,223.0,0.0,0.0,181.0,0.0,0.0,1.0,0.0,7.0,1
270 | 42.0,1.0,3.0,130.0,180.0,0.0,0.0,150.0,0.0,0.0,1.0,0.0,3.0,0
271 | 61.0,1.0,4.0,140.0,207.0,0.0,2.0,138.0,1.0,1.9,1.0,1.0,7.0,1
272 | 66.0,1.0,4.0,160.0,228.0,0.0,2.0,138.0,0.0,2.3,1.0,0.0,6.0,0
273 | 46.0,1.0,4.0,140.0,311.0,0.0,0.0,120.0,1.0,1.8,2.0,2.0,7.0,2
274 | 71.0,0.0,4.0,112.0,149.0,0.0,0.0,125.0,0.0,1.6,2.0,0.0,3.0,0
275 | 59.0,1.0,1.0,134.0,204.0,0.0,0.0,162.0,0.0,0.8,1.0,2.0,3.0,1
276 | 64.0,1.0,1.0,170.0,227.0,0.0,2.0,155.0,0.0,0.6,2.0,0.0,7.0,0
277 | 66.0,0.0,3.0,146.0,278.0,0.0,2.0,152.0,0.0,0.0,2.0,1.0,3.0,0
278 | 39.0,0.0,3.0,138.0,220.0,0.0,0.0,152.0,0.0,0.0,2.0,0.0,3.0,0
279 | 57.0,1.0,2.0,154.0,232.0,0.0,2.0,164.0,0.0,0.0,1.0,1.0,3.0,1
280 | 58.0,0.0,4.0,130.0,197.0,0.0,0.0,131.0,0.0,0.6,2.0,0.0,3.0,0
281 | 57.0,1.0,4.0,110.0,335.0,0.0,0.0,143.0,1.0,3.0,2.0,1.0,7.0,2
282 | 47.0,1.0,3.0,130.0,253.0,0.0,0.0,179.0,0.0,0.0,1.0,0.0,3.0,0
283 | 55.0,0.0,4.0,128.0,205.0,0.0,1.0,130.0,1.0,2.0,2.0,1.0,7.0,3
284 | 35.0,1.0,2.0,122.0,192.0,0.0,0.0,174.0,0.0,0.0,1.0,0.0,3.0,0
285 | 61.0,1.0,4.0,148.0,203.0,0.0,0.0,161.0,0.0,0.0,1.0,1.0,7.0,2
286 | 58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,4
287 | 58.0,0.0,4.0,170.0,225.0,1.0,2.0,146.0,1.0,2.8,2.0,2.0,6.0,2
288 | 58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
289 | 56.0,1.0,2.0,130.0,221.0,0.0,2.0,163.0,0.0,0.0,1.0,0.0,7.0,0
290 | 56.0,1.0,2.0,120.0,240.0,0.0,0.0,169.0,0.0,0.0,3.0,0.0,3.0,0
291 | 67.0,1.0,3.0,152.0,212.0,0.0,2.0,150.0,0.0,0.8,2.0,0.0,7.0,1
292 | 55.0,0.0,2.0,132.0,342.0,0.0,0.0,166.0,0.0,1.2,1.0,0.0,3.0,0
293 | 44.0,1.0,4.0,120.0,169.0,0.0,0.0,144.0,1.0,2.8,3.0,0.0,6.0,2
294 | 63.0,1.0,4.0,140.0,187.0,0.0,2.0,144.0,1.0,4.0,1.0,2.0,7.0,2
295 | 63.0,0.0,4.0,124.0,197.0,0.0,0.0,136.0,1.0,0.0,2.0,0.0,3.0,1
296 | 41.0,1.0,2.0,120.0,157.0,0.0,0.0,182.0,0.0,0.0,1.0,0.0,3.0,0
297 | 59.0,1.0,4.0,164.0,176.0,1.0,2.0,90.0,0.0,1.0,2.0,2.0,6.0,3
298 | 57.0,0.0,4.0,140.0,241.0,0.0,0.0,123.0,1.0,0.2,2.0,0.0,7.0,1
299 | 45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1
300 | 68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2
301 | 57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3
302 | 57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1
303 | 38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0
304 |
--------------------------------------------------------------------------------
/BinaryClassification/SpamDetection/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open System.IO
3 | open Microsoft.ML
4 | open Microsoft.ML.Data
5 |
6 | /// The SpamInput class contains one single message which may be spam or ham.
7 | [<CLIMutable>]
8 | type SpamInput = {
9 | [<LoadColumn(0)>] Verdict : string
10 | [<LoadColumn(1)>] Message : string
11 | }
12 |
13 | /// The SpamPrediction class contains one single spam prediction.
14 | [<CLIMutable>]
15 | type SpamPrediction = {
16 | [<ColumnName("PredictedLabel")>] IsSpam : bool
17 | Score : float32
18 | Probability : float32
19 | }
20 |
21 | /// This class describes what output columns we want to produce.
22 | [<CLIMutable>]
23 | type ToLabel = {
24 | mutable Label : bool
25 | }
26 |
27 | /// Helper function to cast the ML pipeline to an estimator
28 | let castToEstimator (x : IEstimator<_>) =
29 | match x with
30 | | :? IEstimator<ITransformer> as y -> y
31 | | _ -> failwith "Cannot cast pipeline to IEstimator"
32 |
33 | /// file paths to data files (assumes os = windows!)
34 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory
35 |
36 | [<EntryPoint>]
37 | let main argv =
38 |
39 | // set up a machine learning context
40 | let context = new MLContext()
41 |
42 | // load the spam dataset in memory
43 | let data = context.Data.LoadFromTextFile<SpamInput>(dataPath, hasHeader = true, separatorChar = '\t')
44 |
45 | // use 80% for training and 20% for testing
46 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
47 |
48 | // set up a training pipeline
49 | let pipeline =
50 | EstimatorChain()
51 |
52 | // step 1: transform the 'spam' and 'ham' values to true and false
53 | .Append(
54 | context.Transforms.CustomMapping(
55 | Action<SpamInput, ToLabel>(fun input output -> output.Label <- input.Verdict = "spam"),
56 | "MyLambda"))
57 |
58 | // step 2: featurize the input text
59 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message"))
60 |
61 | // step 3: use a stochastic dual coordinate ascent learner
62 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression())
63 |
64 | // test the full data set by performing k-fold cross validation
65 | printfn "Performing cross validation:"
66 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5)
67 |
68 | // report the results
69 | cvResults |> Seq.iter(fun f -> printfn " Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve)
70 |
71 | // train the model on the training set
72 | let model = partitions.TrainSet |> pipeline.Fit
73 |
74 | // evaluate the model on the test set
75 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
76 |
77 | // report the results
78 | printfn "Model metrics:"
79 | printfn " Accuracy: %f" metrics.Accuracy
80 | printfn " Auc: %f" metrics.AreaUnderRocCurve
81 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
82 | printfn " F1Score: %f" metrics.F1Score
83 | printfn " LogLoss: %f" metrics.LogLoss
84 | printfn " LogLossReduction: %f" metrics.LogLossReduction
85 | printfn " PositivePrecision: %f" metrics.PositivePrecision
86 | printfn " PositiveRecall: %f" metrics.PositiveRecall
87 | printfn " NegativePrecision: %f" metrics.NegativePrecision
88 | printfn " NegativeRecall: %f" metrics.NegativeRecall
89 |
90 | // set up a prediction engine
91 | let engine = context.Model.CreatePredictionEngine<SpamInput, SpamPrediction> model
92 |
93 | // create sample messages
94 | let messages = [
95 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" }
96 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" }
97 | { Message = "Home in 30 mins. Need anything from store?"; Verdict = "" }
98 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" }
99 | ]
100 |
101 | // make the predictions
102 | printfn "Model predictions:"
103 | messages |> List.iter(fun m ->
104 | let p = engine.Predict m
105 | printfn " %f %s" p.Probability m.Message)
106 |
107 | 0 // return value
--------------------------------------------------------------------------------
/BinaryClassification/SpamDetection/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Detect spam SMS messages
2 |
3 | In this assignment you're going to build an app that can automatically detect spam SMS messages.
4 |
5 | The first thing you'll need is a file with lots of SMS messages, correctly labelled as being spam or not spam. You will use a dataset compiled by Caroline Tagg in her [2009 PhD thesis](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf). This dataset has 5574 messages.
6 |
7 | Download the [list of messages](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/SpamDetection/spam.tsv) and save it as **spam.tsv**.
8 |
9 | The data file looks like this:
10 |
11 | ![Data file](./assets/data.png)
12 |
13 | It’s a TSV file with only 2 columns of information:
14 |
15 | * Label: ‘spam’ for a spam message and ‘ham’ for a normal message.
16 | * Message: the full text of the SMS message.
17 |
18 | You will build a binary classification model that reads in all messages and then makes a prediction for each message if it is spam or ham.
19 |
20 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
21 |
22 | ```bash
23 | $ dotnet new console --language F# --output SpamDetection
24 | $ cd SpamDetection
25 | ```
26 |
27 | Now install the following ML.NET packages:
28 |
29 | ```bash
30 | $ dotnet add package Microsoft.ML
31 | ```
32 |
33 | Now you are ready to add some classes. You’ll need one to hold a labelled message, and one to hold the model predictions.
34 |
35 | Replace the contents of the Program.fs file with this:
36 |
37 | ```fsharp
38 | open System
39 | open System.IO
40 | open Microsoft.ML
41 | open Microsoft.ML.Data
42 |
43 | /// The SpamInput class contains one single message which may be spam or ham.
44 | [<CLIMutable>]
45 | type SpamInput = {
46 |     [<LoadColumn(0)>] Verdict : string
47 |     [<LoadColumn(1)>] Message : string
48 | }
49 |
50 | /// The SpamPrediction class contains one single spam prediction.
51 | [<CLIMutable>]
52 | type SpamPrediction = {
53 |     [<ColumnName("PredictedLabel")>] IsSpam : bool
54 |     Score : float32
55 |     Probability : float32
56 | }
57 |
58 | // the rest of the code goes here....
59 | ```
60 |
61 | The **SpamInput** class holds one single message. Note how each field is tagged with a **LoadColumn** attribute that tells the data loading code which column to import data from.
62 |
63 | There's also a **SpamPrediction** class which will hold a single spam prediction. There's a boolean **IsSpam**, a **Probability** value, and the **Score** the model will assign to the prediction.
64 |
65 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
66 |
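To make this concrete, here's a minimal sketch (the **Example** type below is hypothetical, purely for illustration):

```fsharp
// hypothetical record, only to show what CLIMutable changes
[<CLIMutable>]
type Example = {
    Text : string   // compiled as a settable property, plus a default constructor
}

// from F# code the record is still used as if it were immutable
let e = { Text = "hello" }
```

The attribute only changes the compiled representation so ML.NET can instantiate and populate the type; your own F# code doesn't change.
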
67 | Now look at the first column in the data file. Our label is a string with the value 'spam' meaning it's a spam message, and 'ham' meaning it's a normal message.
68 |
69 | But you're building a Binary Classifier which needs to be trained on boolean labels.
70 |
71 | So you'll have to somehow convert the 'raw' text labels (stored in the **Verdict** field) to a boolean value.
72 |
73 | To set that up, you'll need a helper type:
74 |
75 | ```fsharp
76 | /// This class describes what output columns we want to produce.
77 | [<CLIMutable>]
78 | type ToLabel = {
79 |     mutable Label : bool
80 | }
81 |
82 | // the rest of the code goes here....
83 | ```
84 |
85 | Note how the **ToLabel** type contains a **Label** field with the converted boolean label value. We will set up this conversion in a minute.
86 |
87 | Also note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction.
88 |
89 | We need one more helper function before we can load the dataset. Add the following code:
90 |
91 | ```fsharp
92 | /// Helper function to cast the ML pipeline to an estimator
93 | let castToEstimator (x : IEstimator<_>) =
94 |     match x with
95 |     | :? IEstimator<ITransformer> as y -> y
96 |     | _ -> failwith "Cannot cast pipeline to IEstimator<ITransformer>"
97 |
98 | // the rest of the code goes here
99 | ```
100 |
101 | The **castToEstimator** function takes a generic **IEstimator<_>** argument and uses pattern matching to cast the value to an **IEstimator\<ITransformer\>** type. You'll see in a minute why we need this helper function.
102 |
103 | Now you're ready to load the training data in memory:
104 |
105 | ```fsharp
106 | /// file paths to data files (assumes os = windows!)
107 | let dataPath = sprintf "%s\\spam.tsv" Environment.CurrentDirectory
108 |
109 | [<EntryPoint>]
110 | let main argv =
111 |
112 | // set up a machine learning context
113 | let context = new MLContext()
114 |
115 | // load the spam dataset in memory
116 | let data = context.Data.LoadFromTextFile<SpamInput>(dataPath, hasHeader = true, separatorChar = '\t')
117 |
118 | // use 80% for training and 20% for testing
119 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
120 |
121 |
122 | // the rest of the code goes here....
123 | ```
124 |
125 | This code uses the **LoadFromTextFile** function to load the TSV data directly into memory. The field annotations in the **SpamInput** type tell the function how to store the loaded data.
126 |
127 | The **TrainTestSplit** function then splits the data into a training partition with 80% of the data and a test partition with 20% of the data.
128 |
129 | Now you’re ready to start building the machine learning model:
130 |
131 | ```fsharp
132 | // set up a training pipeline
133 | let pipeline =
134 | EstimatorChain()
135 |
136 | // step 1: transform the 'spam' and 'ham' values to true and false
137 | .Append(
138 | context.Transforms.CustomMapping(
139 | Action<SpamInput, ToLabel>(fun input output -> output.Label <- input.Verdict = "spam"),
140 | "MyLambda"))
141 |
142 | // step 2: featurize the input text
143 | .Append(context.Transforms.Text.FeaturizeText("Features", "Message"))
144 |
145 | // step 3: use a stochastic dual coordinate ascent learner
146 | .Append(context.BinaryClassification.Trainers.SdcaLogisticRegression())
147 |
148 | // the rest of the code goes here....
149 | ```
150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
151 |
152 | This pipeline has the following components:
153 |
154 | * A **CustomMapping** that transforms the text label to a boolean value. We define 'spam' values as spam and anything else as normal messages.
155 | * **FeaturizeText** which converts each message into a vector of numerical features. This is a required step because machine learning models cannot handle text data directly.
156 | * A **SdcaLogisticRegression** classification learner which will train the model to make accurate predictions.
157 |
158 | The FeaturizeText component is a very nice solution for handling text input data. The component performs a number of transformations on the text to prepare it for model training:
159 |
160 | * Normalize the text (remove punctuation and diacritics, convert everything to lowercase, etc.)
161 | * Tokenize the text into individual words.
162 | * Remove all stopwords
163 | * Extract Ngrams and skip-grams
164 | * TF-IDF rescaling
165 | * Bag of words conversion
166 |
167 | The result is that each message is converted to a vector of numeric values that can easily be processed by the model.
168 |
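The assignment uses **FeaturizeText** with its default settings. If you want more control over the individual steps, the component also accepts an options object. Here's a hedged sketch of what that could look like (the **textOptions** value and the property settings are examples, not something this assignment requires):

```fsharp
open Microsoft.ML.Transforms.Text

// configure the featurization steps explicitly; these values are examples only
let textOptions =
    TextFeaturizingEstimator.Options(
        CaseMode = TextNormalizingEstimator.CaseMode.Lower,  // lowercase everything
        KeepPunctuations = false,                            // strip punctuation
        KeepNumbers = true)                                  // keep digits in the text

// use the options instead of the defaults
let featurizer = context.Transforms.Text.FeaturizeText("Features", textOptions, "Message")
```
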
169 | Before you start training, you're going to perform a quick check to see if the dataset has enough data to reliably train a binary classification model.
170 |
171 | We have 5574 messages, which makes this a very small dataset. We'd prefer to have between 10k and 100k records for reliable training. For small datasets like this one, we'll have to perform **K-Fold Cross Validation** to make sure we have enough data to work with.
172 |
173 | Let's set that up right now:
174 |
175 | ```fsharp
176 | // test the full data set by performing k-fold cross validation
177 | printfn "Performing cross validation:"
178 | let cvResults = context.BinaryClassification.CrossValidate(data = data, estimator = castToEstimator pipeline, numberOfFolds = 5)
179 |
180 | // report the results
181 | cvResults |> Seq.iter(fun f -> printfn " Fold: %i, AUC: %f" f.Fold f.Metrics.AreaUnderRocCurve)
182 |
183 | // the rest of the code goes here....
184 | ```
185 |
186 | This code calls the **CrossValidate** method to perform K-Fold Cross Validation on the full dataset using 5 folds. Note how we call **castToEstimator** to cast the pipeline to an **IEstimator\<ITransformer\>** type.
187 |
188 | We need to do this because the **EstimatorChain** we use to build the machine learning pipeline produces a strongly-typed chain that **CrossValidate** cannot accept directly. The F# compiler is unable to perform the upcast to **IEstimator\<ITransformer\>** automatically, so we need the helper function to perform the cast explicitly.
189 |
190 | Next, the code reports the individual AUC for each fold. For a well-balanced dataset we expect to see roughly identical AUC values for each fold. Any outliers are hints that the dataset may be unbalanced and too small to train on.
191 |
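If you also want the average AUC over all folds as a single summary number, you can compute it from the same results. This is a small addition, not part of the original code:

```fsharp
// average the per-fold AUC values into one summary metric
let meanAuc = cvResults |> Seq.averageBy (fun f -> f.Metrics.AreaUnderRocCurve)
printfn " Mean AUC: %f" meanAuc
```
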
192 | Now let's train the model and get some validation metrics:
193 |
194 | ```fsharp
195 | // train the model on the training set
196 | let model = partitions.TrainSet |> pipeline.Fit
197 |
198 | // evaluate the model on the test set
199 | let metrics = partitions.TestSet |> model.Transform |> context.BinaryClassification.Evaluate
200 |
201 | // report the results
202 | printfn "Model metrics:"
203 | printfn " Accuracy: %f" metrics.Accuracy
204 | printfn " Auc: %f" metrics.AreaUnderRocCurve
205 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
206 | printfn " F1Score: %f" metrics.F1Score
207 | printfn " LogLoss: %f" metrics.LogLoss
208 | printfn " LogLossReduction: %f" metrics.LogLossReduction
209 | printfn " PositivePrecision: %f" metrics.PositivePrecision
210 | printfn " PositiveRecall: %f" metrics.PositiveRecall
211 | printfn " NegativePrecision: %f" metrics.NegativePrecision
212 | printfn " NegativeRecall: %f" metrics.NegativeRecall
213 |
214 | // the rest of the code goes here
215 | ```
216 |
217 | This code trains the model by piping the training data into the **Fit** function. Then it pipes the test data into the **Transform** function to make a prediction for every message in the validation partition.
218 |
219 | The code pipes these predictions into the **Evaluate** function to compare these predictions to the ground truth and calculate the following metrics:
220 |
221 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions.
222 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
223 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive.
224 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive.
225 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
226 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much the model reduces the log loss compared to a baseline that always predicts the prior label distribution.
227 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
228 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high.
229 | * **NegativePrecision**: this is the fraction of negative predictions that are correct.
230 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases.
231 |
232 | When filtering spam, you definitely want to avoid false positives because you don’t want to be sending important messages to the junk folder.
233 |
234 | You also want to avoid false negatives but they are not as bad as a false positive. Having some spam slipping through the filter is not the end of the world.
235 |
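If you want to see the raw counts of false positives and false negatives, you can print the confusion matrix from the same **metrics** value. This is an optional addition to the code above:

```fsharp
// the confusion matrix shows the true/false positive and negative counts
printfn "%s" (metrics.ConfusionMatrix.GetFormattedConfusionTable())
```
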
236 | To wrap up, you’re going to create a couple of messages and ask the model to make a prediction:
237 |
238 | ```fsharp
239 | // set up a prediction engine
240 | let engine = context.Model.CreatePredictionEngine<SpamInput, SpamPrediction> model
241 |
242 | // create sample messages
243 | let messages = [
244 | { Message = "Hi, wanna grab lunch together today?"; Verdict = "" }
245 | { Message = "Win a Nokia, PSP, or €25 every week. Txt YEAHIWANNA now to join"; Verdict = "" }
246 | { Message = "Home in 30 mins. Need anything from store?"; Verdict = "" }
247 | { Message = "CONGRATS U WON LOTERY CLAIM UR 1 MILIONN DOLARS PRIZE"; Verdict = "" }
248 | ]
249 |
250 | // make the predictions
251 | printfn "Model predictions:"
252 | messages |> List.iter(fun m ->
253 | let p = engine.Predict m
254 | printfn " %f %s" p.Probability m.Message)
255 | ```
256 |
257 | This code calls the **CreatePredictionEngine** function to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction.
258 |
259 | The code creates four new test messages and calls **List.iter** to make spam predictions for each message. What’s the result going to be?
260 |
261 | Time to find out. Go to your terminal and run your code:
262 |
263 | ```bash
264 | $ dotnet run
265 | ```
266 |
267 | What results do you get? What are your five AUC values from K-Fold Cross Validation and the average AUC over all folds? Are there any outliers? Are the five values grouped close together?
268 |
269 | What can you conclude from your cross-validation results? Do we have enough data to make reliable spam predictions?
270 |
271 | Based on the results of cross-validation, would you say this dataset is well-balanced? And what does this say about the metrics you should use to evaluate your model?
272 |
273 | Which metrics did you pick to evaluate the model? And what do the values say about the accuracy of your model?
274 |
275 | And what about the four test messages? Did the model accurately predict which ones are spam?
276 |
277 | Think about the code in this assignment. How could you improve the accuracy of the model even more? What are your best AUC values after optimization?
278 |
279 | Share your results in our group!
280 |
--------------------------------------------------------------------------------
/BinaryClassification/SpamDetection/SpamDetection.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" />
14 |   </ItemGroup>
15 |
16 | </Project>
17 |
--------------------------------------------------------------------------------
/BinaryClassification/SpamDetection/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/SpamDetection/assets/data.png
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open System.IO
3 | open Microsoft.ML
4 | open Microsoft.ML.Data
5 | open Microsoft.ML.Transforms
6 |
7 | /// The Passenger class represents one passenger on the Titanic.
8 | [<CLIMutable>]
9 | type Passenger = {
10 |     [<LoadColumn(1)>] Label : bool
11 |     [<LoadColumn(2)>] Pclass : float32
12 |     [<LoadColumn(4)>] Sex : string
13 |     [<LoadColumn(5)>] RawAge : string // not a float!
14 |     [<LoadColumn(6)>] SibSp : float32
15 |     [<LoadColumn(7)>] Parch : float32
16 |     [<LoadColumn(8)>] Ticket : string
17 |     [<LoadColumn(9)>] Fare : float32
18 |     [<LoadColumn(10)>] Cabin : string
19 |     [<LoadColumn(11)>] Embarked : string
20 | }
21 |
22 | /// The PassengerPrediction class represents one model prediction.
23 | [<CLIMutable>]
24 | type PassengerPrediction = {
25 |     [<ColumnName("PredictedLabel")>] Prediction : bool
26 |     Probability : float32
27 |     Score : float32
28 | }
29 |
30 | /// The ToAge class is a helper class for a column transformation.
31 | [<CLIMutable>]
32 | type ToAge = {
33 |     mutable Age : string
34 | }
35 |
36 | /// file path to the train data file (assumes os = windows!)
37 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory
38 |
39 | /// file path to the test data file (assumes os = windows!)
40 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory
41 |
42 | [<EntryPoint>]
43 | let main argv =
44 |
45 | // set up a machine learning context
46 | let context = new MLContext()
47 |
48 | // load the training and testing data in memory
49 | let trainData = context.Data.LoadFromTextFile<Passenger>(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
50 | let testData = context.Data.LoadFromTextFile<Passenger>(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
51 |
52 | // set up a training pipeline
53 | let pipeline =
54 | EstimatorChain()
55 |
56 | // step 1: replace missing ages with '?'
57 | .Append(
58 | context.Transforms.CustomMapping(
59 | Action<Passenger, ToAge>(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge),
60 | "AgeMapping"))
61 |
62 | // step 2: convert string ages to floats
63 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single))
64 |
65 | // step 3: replace missing age values with the mean age
66 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean))
67 |
68 | // step 4: replace string columns with one-hot encoded vectors
69 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex"))
70 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket"))
71 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin"))
72 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked"))
73 |
74 | // step 5: concatenate everything into a single feature column
75 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked"))
76 |
77 | // step 6: use a fasttree trainer
78 | .Append(context.BinaryClassification.Trainers.FastTree())
79 |
80 | // train the model
81 | let model = trainData |> pipeline.Fit
82 |
83 | // make predictions and compare with ground truth
84 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate
85 |
86 | // report the results
87 | printfn "Model metrics:"
88 | printfn " Accuracy: %f" metrics.Accuracy
89 | printfn " Auc: %f" metrics.AreaUnderRocCurve
90 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
91 | printfn " F1Score: %f" metrics.F1Score
92 | printfn " LogLoss: %f" metrics.LogLoss
93 | printfn " LogLossReduction: %f" metrics.LogLossReduction
94 | printfn " PositivePrecision: %f" metrics.PositivePrecision
95 | printfn " PositiveRecall: %f" metrics.PositiveRecall
96 | printfn " NegativePrecision: %f" metrics.NegativePrecision
97 | printfn " NegativeRecall: %f" metrics.NegativeRecall
98 |
99 | // set up a prediction engine
100 | let engine = context.Model.CreatePredictionEngine<Passenger, PassengerPrediction> model
101 |
102 | // create a sample record
103 | let passenger = {
104 | Pclass = 1.0f
105 | Sex = "male"
106 | RawAge = "48"
107 | SibSp = 0.0f
108 | Parch = 0.0f
109 | Ticket = "B"
110 | Fare = 70.0f
111 | Cabin = "123"
112 | Embarked = "S"
113 | Label = false // unused!
114 | }
115 |
116 | // make the prediction
117 | let prediction = engine.Predict passenger
118 |
119 | // report the results
120 | printfn "Model prediction:"
121 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished")
122 | printfn " Probability: %f" prediction.Probability
123 |
124 | 0 // return value
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Predict who survived the Titanic disaster
2 |
3 | The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
4 |
5 | ![The Titanic](./assets/titanic.jpeg)
6 |
7 | In this assignment you're going to build an app that can predict which Titanic passengers survived the disaster. You will use a decision tree classifier to make your predictions.
8 |
9 | The first thing you will need for your app is the passenger manifest of the Titanic's last voyage. You will use the famous [Kaggle Titanic Dataset](https://github.com/sbaidachni/MLNETTitanic/tree/master/MLNetTitanic) which has data for a subset of 891 passengers.
10 |
11 | Download the [test_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/test_data.csv) and [train_data](https://github.com/mdfarragher/DSC/blob/master/BinaryClassification/TitanicPrediction/train_data.csv) files and save them to your project folder.
12 |
13 | The training data file looks like this:
14 |
15 | ![The training data file](./assets/data.jpg)
16 |
17 | It’s a CSV file with 12 columns of information:
18 |
19 | * The passenger identifier
20 | * The label column containing ‘1’ if the passenger survived and ‘0’ if the passenger perished
21 | * The class of travel (1–3)
22 | * The name of the passenger
23 | * The gender of the passenger (‘male’ or ‘female’)
24 | * The age of the passenger, or ‘0’ if the age is unknown
25 | * The number of siblings and/or spouses aboard
26 | * The number of parents and/or children aboard
27 | * The ticket number
28 | * The fare paid
29 | * The cabin number
30 | * The port in which the passenger embarked
31 |
32 | The second column is the label: 0 means the passenger perished, and 1 means the passenger survived. All other columns are input features from the passenger manifest.
33 |
34 | You're going to build a binary classification model that reads in all columns and then predicts for each passenger if he or she survived.
35 |
36 | Let’s get started. Here’s how to set up a new console project in .NET Core:
37 |
38 | ```bash
39 | $ dotnet new console --language F# --output TitanicPrediction
40 | $ cd TitanicPrediction
41 | ```
42 |
43 | Next, you need to install the correct NuGet packages:
44 |
45 | ```bash
46 | $ dotnet add package Microsoft.ML
47 | $ dotnet add package Microsoft.ML.FastTree
48 | ```
49 |
50 | Now you are ready to add some classes. You’ll need one to hold passenger data, and one to hold your model predictions.
51 |
52 | Replace the contents of the Program.fs file with this:
53 |
54 | ```fsharp
55 | open System
56 | open System.IO
57 | open Microsoft.ML
58 | open Microsoft.ML.Data
59 | open Microsoft.ML.Transforms
60 |
61 | /// The Passenger class represents one passenger on the Titanic.
62 | [<CLIMutable>]
63 | type Passenger = {
64 |     [<LoadColumn(1)>] Label : bool
65 |     [<LoadColumn(2)>] Pclass : float32
66 |     [<LoadColumn(4)>] Sex : string
67 |     [<LoadColumn(5)>] RawAge : string // not a float!
68 |     [<LoadColumn(6)>] SibSp : float32
69 |     [<LoadColumn(7)>] Parch : float32
70 |     [<LoadColumn(8)>] Ticket : string
71 |     [<LoadColumn(9)>] Fare : float32
72 |     [<LoadColumn(10)>] Cabin : string
73 |     [<LoadColumn(11)>] Embarked : string
74 | }
75 |
76 | /// The PassengerPrediction class represents one model prediction.
77 | [<CLIMutable>]
78 | type PassengerPrediction = {
79 |     [<ColumnName("PredictedLabel")>] Prediction : bool
80 |     Probability : float32
81 |     Score : float32
82 | }
83 |
84 | // the rest of the code goes here...
85 | ```
86 |
87 | The **Passenger** type holds one single passenger record. There's also a **PassengerPrediction** type which will hold a single passenger prediction. There's a boolean **Prediction**, a **Probability** value, and the **Score** the model will assign to the prediction.
88 |
89 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
90 |
91 | Now look at the age column in the data file. It's a number, but for some passengers in the manifest the age is not known and the column is empty.
92 |
93 | ML.NET can automatically load and process missing numeric values, but only if they are present in the CSV file as a '?'.
94 |
95 | The Titanic datafile uses an empty string to denote missing values, so we'll have to perform a feature conversion.
96 |
97 | Notice how the age is loaded as a string into a Passenger class field called **RawAge**.
98 |
99 | We will process the missing values later in our app. To prepare for this, we'll need an additional helper type:
100 |
101 | ```fsharp
102 | /// The ToAge class is a helper class for a column transformation.
103 | [<CLIMutable>]
104 | type ToAge = {
105 |     mutable Age : string
106 | }
107 |
108 | // the rest of the code goes here...
109 | ```
110 |
111 | The **ToAge** type will contain the converted age values. We will set up this conversion in a minute.
112 |
113 | Note the **mutable** keyword. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction.
114 |
115 | Now you're going to load the training data in memory:
116 |
117 | ```fsharp
118 | /// file path to the train data file (assumes os = windows!)
119 | let trainDataPath = sprintf "%s\\train_data.csv" Environment.CurrentDirectory
120 |
121 | /// file path to the test data file (assumes os = windows!)
122 | let testDataPath = sprintf "%s\\test_data.csv" Environment.CurrentDirectory
123 |
124 | [<EntryPoint>]
125 | let main argv =
126 |
127 | // set up a machine learning context
128 | let context = new MLContext()
129 |
130 | // load the training and testing data in memory
131 | let trainData = context.Data.LoadFromTextFile<Passenger>(trainDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
132 | let testData = context.Data.LoadFromTextFile<Passenger>(testDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
133 |
134 | // the rest of the code goes here...
135 |
136 | 0 // return value
137 | ```
138 |
139 | This code calls the **LoadFromTextFile** function twice to load the training and testing datasets in memory.
140 |
141 | ML.NET expects missing data in CSV files to appear as a ‘?’, but unfortunately the Titanic file uses an empty string to indicate an unknown age. So the first thing you need to do is replace every empty age string with a ‘?’.
142 |
143 | Add the following code:
144 |
145 | ```fsharp
146 | // set up a training pipeline
147 | let pipeline =
148 | EstimatorChain()
149 |
150 | // step 1: replace missing ages with '?'
151 | .Append(
152 | context.Transforms.CustomMapping(
153 | Action<Passenger, ToAge>(fun input output -> output.Age <- if String.IsNullOrEmpty(input.RawAge) then "?" else input.RawAge),
154 | "AgeMapping"))
155 |
156 | // the rest of the code goes here...
157 | ```
158 |
159 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
160 |
161 | The **CustomMapping** component converts empty age strings to ‘?’ values.
162 |
163 | Now ML.NET is happy with the age values. Next, you'll convert the string ages to numeric values and instruct ML.NET to replace any missing values with the mean age over the entire dataset.
164 |
165 | Add the following code, and make sure you match the indentation level of the previous **Append** function exactly. Indentation is significant in F# and the wrong indentation level will lead to compiler errors:
166 |
167 | ```fsharp
168 | // step 2: convert string ages to floats
169 | .Append(context.Transforms.Conversion.ConvertType("Age", outputKind = DataKind.Single))
170 |
171 | // step 3: replace missing age values with the mean age
172 | .Append(context.Transforms.ReplaceMissingValues("Age", replacementMode = MissingValueReplacingEstimator.ReplacementMode.Mean))
173 |
174 | // the rest of the code goes here...
175 | ```
176 |
177 | The **ConvertType** component converts the Age column to a single-precision floating point value. And the **ReplaceMissingValues** component replaces any missing values with the mean value of all ages in the entire dataset.
178 |
179 | Now let's process the rest of the data columns. The Sex, Ticket, Cabin, and Embarked columns are enumerations of string values. As you've already learned, you'll need to one-hot encode them:
180 |
181 | ```fsharp
182 | // step 4: replace string columns with one-hot encoded vectors
183 | .Append(context.Transforms.Categorical.OneHotEncoding("Sex"))
184 | .Append(context.Transforms.Categorical.OneHotEncoding("Ticket"))
185 | .Append(context.Transforms.Categorical.OneHotEncoding("Cabin"))
186 | .Append(context.Transforms.Categorical.OneHotEncoding("Embarked"))
187 |
188 | // the rest of the code goes here...
189 | ```
190 |
191 | The **OneHotEncoding** components take an input column, one-hot encode all values, and produce a new column with the same name holding the one-hot vectors.
192 |
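One thing to keep in mind: Ticket and Cabin contain hundreds of unique values, so one-hot encoding them produces very wide feature vectors. An alternative worth experimenting with (not used in this assignment) is hash-based encoding, which keeps the vector size bounded:

```fsharp
// sketch: a drop-in alternative for the Ticket line in step 4
.Append(context.Transforms.Categorical.OneHotHashEncoding("Ticket"))
```
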
193 | Now let's wrap up the pipeline:
194 |
195 | ```fsharp
196 | // step 5: concatenate everything into a single feature column
197 | .Append(context.Transforms.Concatenate("Features", "Age", "Pclass", "SibSp", "Parch", "Sex", "Embarked"))
198 |
199 | // step 6: use a fasttree trainer
200 | .Append(context.BinaryClassification.Trainers.FastTree())
201 |
202 | // the rest of the code goes here (indented back 2 levels!)...
203 | ```
204 |
205 | The **Concatenate** component concatenates all remaining feature columns into a single column for training. This is required because ML.NET can only train on a single input column.
206 |
207 | And the **FastTree** trainer is the learning algorithm that's going to train the model. You're going to build a decision tree classifier that uses the Fast Tree algorithm to fit decision trees to the data.
208 |
209 | Note the indentation level of the 'the rest of the code...' comment. Make sure that when you add the remaining code you indent this code back by two levels to match the indentation level of the **main** function.
210 |
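The **FastTree** trainer also accepts optional hyperparameters that control the size and number of trees. Here's a sketch of step 6 with explicit values; these are the trainer's optional parameters filled in with example values, not settings this assignment prescribes:

```fsharp
// step 6 with explicit hyperparameters; the values are examples only
.Append(
    context.BinaryClassification.Trainers.FastTree(
        numberOfLeaves = 20,
        numberOfTrees = 100,
        minimumExampleCountPerLeaf = 10,
        learningRate = 0.2))
```
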
211 | All you need to do now is train the model on the training set, compare the predictions with the labels, and compute a bunch of metrics that describe how accurate the model is:
212 |
213 | ```fsharp
214 | // train the model
215 | let model = trainData |> pipeline.Fit
216 |
217 | // make predictions and compare with ground truth
218 | let metrics = testData |> model.Transform |> context.BinaryClassification.Evaluate
219 |
220 | // report the results
221 | printfn "Model metrics:"
222 | printfn " Accuracy: %f" metrics.Accuracy
223 | printfn " Auc: %f" metrics.AreaUnderRocCurve
224 | printfn " Auprc: %f" metrics.AreaUnderPrecisionRecallCurve
225 | printfn " F1Score: %f" metrics.F1Score
226 | printfn " LogLoss: %f" metrics.LogLoss
227 | printfn " LogLossReduction: %f" metrics.LogLossReduction
228 | printfn " PositivePrecision: %f" metrics.PositivePrecision
229 | printfn " PositiveRecall: %f" metrics.PositiveRecall
230 | printfn " NegativePrecision: %f" metrics.NegativePrecision
231 | printfn " NegativeRecall: %f" metrics.NegativeRecall
232 |
233 | // the rest of the code goes here...
234 | ```
235 |
236 | This code pipes the training data into the **Fit** function to train the model on the entire dataset.
237 |
238 | We then pipe the test data into the **Transform** function to set up a prediction for each passenger, and pipe these predictions into the **Evaluate** function to compare them to the label and automatically calculate evaluation metrics.
239 |
240 | We then display the following metrics:
241 |
242 | * **Accuracy**: this is the number of correct predictions divided by the total number of predictions.
243 | * **AreaUnderRocCurve**: a metric that indicates how accurate the model is: 0 = the model is wrong all the time, 0.5 = the model produces random output, 1 = the model is correct all the time. An AUC of 0.8 or higher is considered good.
244 | * **AreaUnderPrecisionRecallCurve**: an alternate AUC metric that performs better for heavily imbalanced datasets with many more negative results than positive.
245 | * **F1Score**: this is a metric that strikes a balance between Precision and Recall. It’s useful for imbalanced datasets with many more negative results than positive.
246 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
247 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses how much the model reduces the log loss compared to a baseline that always predicts the prior label distribution.
248 | * **PositivePrecision**: also called ‘Precision’, this is the fraction of positive predictions that are correct. This is a good metric to use when the cost of a false positive prediction is high.
249 | * **PositiveRecall**: also called ‘Recall’, this is the fraction of positive predictions out of all positive cases. This is a good metric to use when the cost of a false negative is high.
250 | * **NegativePrecision**: this is the fraction of negative predictions that are correct.
251 | * **NegativeRecall**: this is the fraction of negative predictions out of all negative cases.
252 |
253 | To wrap up, let's have some fun and pretend that I’m going to take a trip on the Titanic too. I will embark in Southampton and pay $70 for a first-class cabin. I travel on my own without parents, children, or my spouse.
254 |
255 | What are my odds of surviving?
256 |
257 | Add the following code:
258 |
259 | ```fsharp
260 | // set up a prediction engine
261 | let engine = context.Model.CreatePredictionEngine<Passenger, PassengerPrediction> model
262 |
263 | // create a sample record
264 | let passenger = {
265 | Pclass = 1.0f
266 | Sex = "male"
267 | RawAge = "48"
268 | SibSp = 0.0f
269 | Parch = 0.0f
270 | Ticket = "B"
271 | Fare = 70.0f
272 | Cabin = "123"
273 | Embarked = "S"
274 | Label = false // unused!
275 | }
276 |
277 | // make the prediction
278 | let prediction = engine.Predict passenger
279 |
280 | // report the results
281 | printfn "Model prediction:"
282 | printfn " Prediction: %s" (if prediction.Prediction then "survived" else "perished")
283 | printfn " Probability: %f" prediction.Probability
284 | ```
285 |
286 | This code uses the **CreatePredictionEngine** method to create a prediction engine. With the prediction engine set up, you can simply call **Predict** to make a single prediction.
287 |
288 | The code sets up a new passenger record with my information and then calls **Predict** to make a prediction about my survival chances.
289 |
290 | So would I have survived the Titanic disaster?
291 |
292 | Time to find out. Go to your terminal and run your code:
293 |
294 | ```bash
295 | $ dotnet run
296 | ```
297 |
298 | What results do you get? What is your accuracy, precision, recall, AUC, AUCPRC, and F1 value?
299 |
300 | Is this dataset balanced? Which metrics should you use to evaluate your model? And what do the values say about the accuracy of your model?
301 |
302 | And what about me? Did I survive the disaster?
303 |
304 | Do you think a decision tree is a good choice to predict Titanic survivors? Which other algorithms could you use instead? Do they give a better result?
305 |
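As a starting point for that last question: swapping the learner only requires changing step 6 of the pipeline; everything else stays the same. For example, a logistic regression learner would look like this (just a sketch of one alternative, not a recommendation):

```fsharp
// step 6 with a logistic regression learner instead of FastTree
.Append(context.BinaryClassification.Trainers.LbfgsLogisticRegression())
```
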
306 | Share your results in our group!
307 |
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/TitanicPrediction.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" />
14 |     <PackageReference Include="Microsoft.ML.FastTree" />
15 |   </ItemGroup>
16 |
17 | </Project>
18 |
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/assets/data.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/data.jpg
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/assets/titanic.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/BinaryClassification/TitanicPrediction/assets/titanic.jpeg
--------------------------------------------------------------------------------
/BinaryClassification/TitanicPrediction/test_data.csv:
--------------------------------------------------------------------------------
1 | "PassengerId","Survived","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked"
2 | 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)","female","38",1,0,"PC 17599","71.2833","C85","C"
3 | 3,1,3,"Heikkinen, Miss. Laina","female","26",0,0,"STON/O2. 3101282","7.925","","S"
4 | 9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","female","27",0,2,"347742","11.1333","","S"
5 | 11,1,3,"Sandstrom, Miss. Marguerite Rut","female","4",1,1,"PP 9549","16.7","G6","S"
6 | 18,1,2,"Williams, Mr. Charles Eugene","male","",0,0,"244373","13","","S"
7 | 25,0,3,"Palsson, Miss. Torborg Danira","female","8",3,1,"349909","21.075","","S"
8 | 30,0,3,"Todoroff, Mr. Lalio","male","",0,0,"349216","7.8958","","S"
9 | 31,0,1,"Uruchurtu, Don. Manuel E","male","40",0,0,"PC 17601","27.7208","","C"
10 | 34,0,2,"Wheadon, Mr. Edward H","male","66",0,0,"C.A. 24579","10.5","","S"
11 | 38,0,3,"Cann, Mr. Ernest Charles","male","21",0,0,"A./5. 2152","8.05","","S"
12 | 43,0,3,"Kraeff, Mr. Theodor","male","",0,0,"349253","7.8958","","C"
13 | 49,0,3,"Samaan, Mr. Youssef","male","",2,0,"2662","21.6792","","C"
14 | 51,0,3,"Panula, Master. Juha Niilo","male","7",4,1,"3101295","39.6875","","S"
15 | 55,0,1,"Ostby, Mr. Engelhart Cornelius","male","65",0,1,"113509","61.9792","B30","C"
16 | 60,0,3,"Goodwin, Master. William Frederick","male","11",5,2,"CA 2144","46.9","","S"
17 | 64,0,3,"Skoog, Master. Harald","male","4",3,2,"347088","27.9","","S"
18 | 67,1,2,"Nye, Mrs. (Elizabeth Ramell)","female","29",0,0,"C.A. 29395","10.5","F33","S"
19 | 72,0,3,"Goodwin, Miss. Lillian Amy","female","16",5,2,"CA 2144","46.9","","S"
20 | 76,0,3,"Moen, Mr. Sigurd Hansen","male","25",0,0,"348123","7.65","F G73","S"
21 | 78,0,3,"Moutal, Mr. Rahamin Haim","male","",0,0,"374746","8.05","","S"
22 | 81,0,3,"Waelens, Mr. Achille","male","22",0,0,"345767","9","","S"
23 | 85,1,2,"Ilett, Miss. Bertha","female","17",0,0,"SO/C 14885","10.5","","S"
24 | 87,0,3,"Ford, Mr. William Neal","male","16",1,3,"W./C. 6608","34.375","","S"
25 | 93,0,1,"Chaffee, Mr. Herbert Fuller","male","46",1,0,"W.E.P. 5734","61.175","E31","S"
26 | 95,0,3,"Coxon, Mr. Daniel","male","59",0,0,"364500","7.25","","S"
27 | 99,1,2,"Doling, Mrs. John T (Ada Julia Bone)","female","34",0,1,"231919","23","","S"
28 | 113,0,3,"Barton, Mr. David John","male","22",0,0,"324669","8.05","","S"
29 | 121,0,2,"Hickman, Mr. Stanley George","male","21",2,0,"S.O.C. 14879","73.5","","S"
30 | 123,0,2,"Nasser, Mr. Nicholas","male","32.5",1,0,"237736","30.0708","","C"
31 | 136,0,2,"Richard, Mr. Emile","male","23",0,0,"SC/PARIS 2133","15.0458","","C"
32 | 140,0,1,"Giglio, Mr. Victor","male","24",0,0,"PC 17593","79.2","B86","C"
33 | 144,0,3,"Burke, Mr. Jeremiah","male","19",0,0,"365222","6.75","","Q"
34 | 146,0,2,"Nicholls, Mr. Joseph Charles","male","19",1,1,"C.A. 33112","36.75","","S"
35 | 148,0,3,"Ford, Miss. Robina Maggie ""Ruby""","female","9",2,2,"W./C. 6608","34.375","","S"
36 | 156,0,1,"Williams, Mr. Charles Duane","male","51",0,1,"PC 17597","61.3792","","C"
37 | 157,1,3,"Gilnagh, Miss. Katherine ""Katie""","female","16",0,0,"35851","7.7333","","Q"
38 | 158,0,3,"Corn, Mr. Harry","male","30",0,0,"SOTON/OQ 392090","8.05","","S"
39 | 166,1,3,"Goldsmith, Master. Frank John William ""Frankie""","male","9",0,2,"363291","20.525","","S"
40 | 167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)","female","",0,1,"113505","55","E33","S"
41 | 168,0,3,"Skoog, Mrs. William (Anna Bernhardina Karlsson)","female","45",1,4,"347088","27.9","","S"
42 | 195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)","female","44",0,0,"PC 17610","27.7208","B4","C"
43 | 201,0,3,"Vande Walle, Mr. Nestor Cyriel","male","28",0,0,"345770","9.5","","S"
44 | 206,0,3,"Strom, Miss. Telma Matilda","female","2",0,1,"347054","10.4625","G6","S"
45 | 210,1,1,"Blank, Mr. Henry","male","40",0,0,"112277","31","A31","C"
46 | 218,0,2,"Jacobsohn, Mr. Sidney Samuel","male","42",1,0,"243847","27","","S"
47 | 223,0,3,"Green, Mr. George Henry","male","51",0,0,"21440","8.05","","S"
48 | 241,0,3,"Zabour, Miss. Thamine","female","",1,0,"2665","14.4542","","C"
49 | 243,0,2,"Coleridge, Mr. Reginald Charles","male","29",0,0,"W./C. 14263","10.5","","S"
50 | 251,0,3,"Reed, Mr. James George","male","",0,0,"362316","7.25","","S"
51 | 255,0,3,"Rosblom, Mrs. Viktor (Helena Wilhelmina)","female","41",0,2,"370129","20.2125","","S"
52 | 265,0,3,"Henry, Miss. Delia","female","",0,0,"382649","7.75","","Q"
53 | 266,0,2,"Reeves, Mr. David","male","36",0,0,"C.A. 17248","10.5","","S"
54 | 271,0,1,"Cairns, Mr. Alexander","male","",0,0,"113798","31","","S"
55 | 279,0,3,"Rice, Master. Eric","male","7",4,1,"382652","29.125","","Q"
56 | 285,0,1,"Smith, Mr. Richard William","male","",0,0,"113056","26","A19","S"
57 | 296,0,1,"Lewy, Mr. Ervin G","male","",0,0,"PC 17612","27.7208","","C"
58 | 305,0,3,"Williams, Mr. Howard Hugh ""Harry""","male","",0,0,"A/5 2466","8.05","","S"
59 | 306,1,1,"Allison, Master. Hudson Trevor","male","0.92",1,2,"113781","151.55","C22 C26","S"
60 | 311,1,1,"Hays, Miss. Margaret Bechstein","female","24",0,0,"11767","83.1583","C54","C"
61 | 314,0,3,"Hendekovic, Mr. Ignjac","male","28",0,0,"349243","7.8958","","S"
62 | 315,0,2,"Hart, Mr. Benjamin","male","43",1,1,"F.C.C. 13529","26.25","","S"
63 | 333,0,1,"Graham, Mr. George Edward","male","38",0,1,"PC 17582","153.4625","C91","S"
64 | 335,1,1,"Frauenthal, Mrs. Henry William (Clara Heinsheimer)","female","",1,0,"PC 17611","133.65","","S"
65 | 337,0,1,"Pears, Mr. Thomas Clinton","male","29",1,0,"113776","66.6","C2","S"
66 | 341,1,2,"Navratil, Master. Edmond Roger","male","2",1,1,"230080","26","F2","S"
67 | 344,0,2,"Sedgwick, Mr. Charles Frederick Waddington","male","25",0,0,"244361","13","","S"
68 | 345,0,2,"Fox, Mr. Stanley Hubert","male","36",0,0,"229236","13","","S"
69 | 359,1,3,"McGovern, Miss. Mary","female","",0,0,"330931","7.8792","","Q"
70 | 365,0,3,"O'Brien, Mr. Thomas","male","",1,0,"370365","15.5","","Q"
71 | 366,0,3,"Adahl, Mr. Mauritz Nils Martin","male","30",0,0,"C 7076","7.25","","S"
72 | 367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)","female","60",1,0,"110813","75.25","D37","C"
73 | 374,0,1,"Ringhini, Mr. Sante","male","22",0,0,"PC 17760","135.6333","","C"
74 | 375,0,3,"Palsson, Miss. Stina Viola","female","3",3,1,"349909","21.075","","S"
75 | 376,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)","female","",1,0,"PC 17604","82.1708","","C"
76 | 383,0,3,"Tikkanen, Mr. Juho","male","32",0,0,"STON/O 2. 3101293","7.925","","S"
77 | 387,0,3,"Goodwin, Master. Sidney Leonard","male","1",5,2,"CA 2144","46.9","","S"
78 | 393,0,3,"Gustafsson, Mr. Johan Birger","male","28",2,0,"3101277","7.925","","S"
79 | 396,0,3,"Johansson, Mr. Erik","male","22",0,0,"350052","7.7958","","S"
80 | 401,1,3,"Niskanen, Mr. Juha","male","39",0,0,"STON/O 2. 3101289","7.925","","S"
81 | 407,0,3,"Widegren, Mr. Carl/Charles Peter","male","51",0,0,"347064","7.75","","S"
82 | 408,1,2,"Richards, Master. William Rowe","male","3",1,1,"29106","18.75","","S"
83 | 414,0,2,"Cunningham, Mr. Alfred Fleming","male","",0,0,"239853","0","","S"
84 | 419,0,2,"Matthews, Mr. William John","male","30",0,0,"28228","13","","S"
85 | 422,0,3,"Charters, Mr. David","male","21",0,0,"A/5. 13032","7.7333","","Q"
86 | 423,0,3,"Zimmerman, Mr. Leo","male","29",0,0,"315082","7.875","","S"
87 | 427,1,2,"Clarke, Mrs. Charles V (Ada Maria Winfield)","female","28",1,0,"2003","26","","S"
88 | 428,1,2,"Phillips, Miss. Kate Florence (""Mrs Kate Louise Phillips Marshall"")","female","19",0,0,"250655","26","","S"
89 | 434,0,3,"Kallio, Mr. Nikolai Erland","male","17",0,0,"STON/O 2. 3101274","7.125","","S"
90 | 437,0,3,"Ford, Miss. Doolina Margaret ""Daisy""","female","21",2,2,"W./C. 6608","34.375","","S"
91 | 438,1,2,"Richards, Mrs. Sidney (Emily Hocking)","female","24",2,3,"29106","18.75","","S"
92 | 441,1,2,"Hart, Mrs. Benjamin (Esther Ada Bloomfield)","female","45",1,1,"F.C.C. 13529","26.25","","S"
93 | 446,1,1,"Dodge, Master. Washington","male","4",0,2,"33638","81.8583","A34","S"
94 | 448,1,1,"Seward, Mr. Frederic Kimber","male","34",0,0,"113794","26.55","","S"
95 | 449,1,3,"Baclini, Miss. Marie Catherine","female","5",2,1,"2666","19.2583","","C"
96 | 462,0,3,"Morley, Mr. William","male","34",0,0,"364506","8.05","","S"
97 | 465,0,3,"Maisner, Mr. Simon","male","",0,0,"A/S 2816","8.05","","S"
98 | 483,0,3,"Rouse, Mr. Richard Henry","male","50",0,0,"A/5 3594","8.05","","S"
99 | 493,0,1,"Molson, Mr. Harry Markland","male","55",0,0,"113787","30.5","C30","S"
100 | 495,0,3,"Stanley, Mr. Edward Roland","male","21",0,0,"A/4 45380","8.05","","S"
101 | 497,1,1,"Eustis, Miss. Elizabeth Mussey","female","54",1,0,"36947","78.2667","D20","C"
102 | 507,1,2,"Quick, Mrs. Frederick Charles (Jane Richards)","female","33",0,2,"26360","26","","S"
103 | 508,1,1,"Bradley, Mr. George (""George Arthur Brayton"")","male","",0,0,"111427","26.55","","S"
104 | 512,0,3,"Webber, Mr. James","male","",0,0,"SOTON/OQ 3101316","8.05","","S"
105 | 518,0,3,"Ryan, Mr. Patrick","male","",0,0,"371110","24.15","","Q"
106 | 522,0,3,"Vovk, Mr. Janko","male","22",0,0,"349252","7.8958","","S"
107 | 530,0,2,"Hocking, Mr. Richard George","male","23",2,1,"29104","11.5","","S"
108 | 531,1,2,"Quick, Miss. Phyllis May","female","2",1,1,"26360","26","","S"
109 | 532,0,3,"Toufik, Mr. Nakli","male","",0,0,"2641","7.2292","","C"
110 | 538,1,1,"LeRoy, Miss. Bertha","female","30",0,0,"PC 17761","106.425","","C"
111 | 543,0,3,"Andersson, Miss. Sigrid Elisabeth","female","11",4,2,"347082","31.275","","S"
112 | 547,1,2,"Beane, Mrs. Edward (Ethel Clarke)","female","19",1,0,"2908","26","","S"
113 | 551,1,1,"Thayer, Mr. John Borland Jr","male","17",0,2,"17421","110.8833","C70","C"
114 | 558,0,1,"Robbins, Mr. Victor","male","",0,0,"PC 17757","227.525","","C"
115 | 561,0,3,"Morrow, Mr. Thomas Rowan","male","",0,0,"372622","7.75","","Q"
116 | 570,1,3,"Jonsson, Mr. Carl","male","32",0,0,"350417","7.8542","","S"
117 | 574,1,3,"Kelly, Miss. Mary","female","",0,0,"14312","7.75","","Q"
118 | 589,0,3,"Gilinski, Mr. Eliezer","male","22",0,0,"14973","8.05","","S"
119 | 591,0,3,"Rintamaki, Mr. Matti","male","35",0,0,"STON/O 2. 3101273","7.125","","S"
120 | 592,1,1,"Stephenson, Mrs. Walter Bertram (Martha Eustis)","female","52",1,0,"36947","78.2667","D20","C"
121 | 600,1,1,"Duff Gordon, Sir. Cosmo Edmund (""Mr Morgan"")","male","49",1,0,"PC 17485","56.9292","A20","C"
122 | 602,0,3,"Slabenoff, Mr. Petco","male","",0,0,"349214","7.8958","","S"
123 | 609,1,2,"Laroche, Mrs. Joseph (Juliette Marie Louise Lafargue)","female","22",1,2,"SC/Paris 2123","41.5792","","C"
124 | 616,1,2,"Herman, Miss. Alice","female","24",1,2,"220845","65","","S"
125 | 619,1,2,"Becker, Miss. Marion Louise","female","4",2,1,"230136","39","F4","S"
126 | 635,0,3,"Skoog, Miss. Mabel","female","9",3,2,"347088","27.9","","S"
127 | 641,0,3,"Jensen, Mr. Hans Peder","male","20",0,0,"350050","7.8542","","S"
128 | 647,0,3,"Cor, Mr. Liudevit","male","19",0,0,"349231","7.8958","","S"
129 | 648,1,1,"Simonius-Blumer, Col. Oberst Alfons","male","56",0,0,"13213","35.5","A26","C"
130 | 650,1,3,"Stanley, Miss. Amy Zillah Elsie","female","23",0,0,"CA. 2314","7.55","","S"
131 | 655,0,3,"Hegarty, Miss. Hanora ""Nora""","female","18",0,0,"365226","6.75","","Q"
132 | 657,0,3,"Radeff, Mr. Alexander","male","",0,0,"349223","7.8958","","S"
133 | 661,1,1,"Frauenthal, Dr. Henry William","male","50",2,0,"PC 17611","133.65","","S"
134 | 664,0,3,"Coleff, Mr. Peju","male","36",0,0,"349210","7.4958","","S"
135 | 673,0,2,"Mitchell, Mr. Henry Michael","male","70",0,0,"C.A. 24580","10.5","","S"
136 | 675,0,2,"Watson, Mr. Ennis Hastings","male","",0,0,"239856","0","","S"
137 | 679,0,3,"Goodwin, Mrs. Frederick (Augusta Tyler)","female","43",1,6,"CA 2144","46.9","","S"
138 | 688,0,3,"Dakic, Mr. Branko","male","19",0,0,"349228","10.1708","","S"
139 | 698,1,3,"Mullens, Miss. Katherine ""Katie""","female","",0,0,"35852","7.7333","","Q"
140 | 705,0,3,"Hansen, Mr. Henrik Juul","male","26",1,0,"350025","7.8542","","S"
141 | 713,1,1,"Taylor, Mr. Elmer Zebley","male","48",1,0,"19996","52","C126","S"
142 | 720,0,3,"Johnson, Mr. Malkolm Joackim","male","33",0,0,"347062","7.775","","S"
143 | 727,1,2,"Renouf, Mrs. Peter Henry (Lillian Jefferys)","female","30",3,0,"31027","21","","S"
144 | 732,0,3,"Hassan, Mr. Houssein G N","male","11",0,0,"2699","18.7875","","C"
145 | 740,0,3,"Nankoff, Mr. Minko","male","",0,0,"349218","7.8958","","S"
146 | 741,1,1,"Hawksford, Mr. Walter James","male","",0,0,"16988","30","D45","S"
147 | 742,0,1,"Cavendish, Mr. Tyrell William","male","36",1,0,"19877","78.85","C46","S"
148 | 744,0,3,"McNamee, Mr. Neal","male","24",1,0,"376566","16.1","","S"
149 | 748,1,2,"Sinkkonen, Miss. Anna","female","30",0,0,"250648","13","","S"
150 | 751,1,2,"Wells, Miss. Joan","female","4",1,1,"29103","23","","S"
151 | 752,1,3,"Moor, Master. Meier","male","6",0,1,"392096","12.475","E121","S"
152 | 762,0,3,"Nirva, Mr. Iisakki Antino Aijo","male","41",0,0,"SOTON/O2 3101272","7.125","","S"
153 | 763,1,3,"Barah, Mr. Hanna Assi","male","20",0,0,"2663","7.2292","","C"
154 | 769,0,3,"Moran, Mr. Daniel J","male","",1,0,"371110","24.15","","Q"
155 | 770,0,3,"Gronnestad, Mr. Daniel Danielsen","male","32",0,0,"8471","8.3625","","S"
156 | 783,0,1,"Long, Mr. Milton Clyde","male","29",0,0,"113501","30","D6","S"
157 | 786,0,3,"Harmer, Mr. Abraham (David Lishin)","male","25",0,0,"374887","7.25","","S"
158 | 792,0,2,"Gaskell, Mr. Alfred","male","16",0,0,"239865","26","","S"
159 | 795,0,3,"Dantcheff, Mr. Ristiu","male","25",0,0,"349203","7.8958","","S"
160 | 797,1,1,"Leader, Dr. Alice (Farnham)","female","49",0,0,"17465","25.9292","D17","S"
161 | 801,0,2,"Ponesell, Mr. Martin","male","34",0,0,"250647","13","","S"
162 | 810,1,1,"Chambers, Mrs. Norman Campbell (Bertha Griggs)","female","33",1,0,"113806","53.1","E8","S"
163 | 812,0,3,"Lester, Mr. James","male","39",0,0,"A/4 48871","24.15","","S"
164 | 815,0,3,"Tomlin, Mr. Ernest Portage","male","30.5",0,0,"364499","8.05","","S"
165 | 821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gregg)","female","52",1,1,"12749","93.5","B69","S"
166 | 829,1,3,"McCormack, Mr. Thomas Joseph","male","",0,0,"367228","7.75","","Q"
167 | 832,1,2,"Richards, Master. George Sibley","male","0.83",1,1,"29106","18.75","","S"
168 | 845,0,3,"Culumovic, Mr. Jeso","male","17",0,0,"315090","8.6625","","S"
169 | 850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)","female","",1,0,"17453","89.1042","C92","C"
170 | 851,0,3,"Andersson, Master. Sigvard Harald Elias","male","4",4,2,"347082","31.275","","S"
171 | 853,0,3,"Boulos, Miss. Nourelain","female","9",1,1,"2678","15.2458","","C"
172 | 857,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)","female","45",1,1,"36928","164.8667","","S"
173 | 858,1,1,"Daly, Mr. Peter Denis ","male","51",0,0,"113055","26.55","E17","S"
174 | 860,0,3,"Razi, Mr. Raihed","male","",0,0,"2629","7.2292","","C"
175 | 865,0,2,"Gill, Mr. John William","male","24",0,0,"233866","13","","S"
176 | 867,1,2,"Duran y More, Miss. Asuncion","female","27",1,0,"SC/PARIS 2149","13.8583","","C"
177 | 874,0,3,"Vander Cruyssen, Mr. Victor","male","47",0,0,"345765","9","","S"
178 | 879,0,3,"Laleff, Mr. Kristo","male","",0,0,"349217","7.8958","","S"
179 | 882,0,3,"Markun, Mr. Johann","male","33",0,0,"349257","7.8958","","S"
180 | 886,0,3,"Rice, Mrs. William (Margaret Norton)","female","39",0,5,"382652","29.125","","Q"
181 |
--------------------------------------------------------------------------------
/Clustering/IrisFlower/IrisFlower.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" />
14 |   </ItemGroup>
15 |
16 | </Project>
17 |
--------------------------------------------------------------------------------
/Clustering/IrisFlower/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open Microsoft.ML
3 | open Microsoft.ML.Data
4 |
5 | /// A type that holds a single iris flower.
6 | [<CLIMutable>]
7 | type IrisData = {
8 |     [<LoadColumn(0)>] SepalLength : float32
9 |     [<LoadColumn(1)>] SepalWidth : float32
10 |     [<LoadColumn(2)>] PetalLength : float32
11 |     [<LoadColumn(3)>] PetalWidth : float32
12 |     [<LoadColumn(4)>] Label : string
13 | }
14 |
15 | /// A type that holds a single model prediction.
16 | [<CLIMutable>]
17 | type IrisPrediction = {
18 | PredictedLabel : uint32
19 | Score : float32[]
20 | }
21 |
22 | /// file paths to data files (assumes os = windows!)
23 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory
24 |
25 | [<EntryPoint>]
26 | let main argv =
27 |
28 | // get the machine learning context
29 | let context = new MLContext();
30 |
31 | // read the iris flower data from a text file
32 | let data = context.Data.LoadFromTextFile<IrisData>(dataPath, hasHeader = false, separatorChar = ',')
33 |
34 | // split the data into a training and testing partition
35 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
36 |
37 | // set up a learning pipeline
38 | let pipeline =
39 | EstimatorChain()
40 |
41 | // step 1: concatenate features into a single column
42 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
43 |
44 | // step 2: use k-means clustering to find the iris types
45 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3))
46 |
47 | // train the model on the training data
48 | let model = partitions.TrainSet |> pipeline.Fit
49 |
50 | // get predictions and compare to ground truth
51 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate
52 |
53 | // show results
54 | printfn "Model results:"
55 | printfn " Average distance: %f" metrics.AverageDistance
56 | printfn " Davies Bouldin index: %f" metrics.DaviesBouldinIndex
57 |
58 | // set up a prediction engine
59 | let engine = context.Model.CreatePredictionEngine<IrisData, IrisPrediction> model
60 |
61 | // grab 3 flowers from the dataset
62 | let flowers = context.Data.CreateEnumerable<IrisData>(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq
63 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ]
64 |
65 | // show predictions for the three flowers
66 | printfn "Predictions for the 3 test flowers:"
67 | printfn " Label\t\t\tPredicted\tScores"
68 | testFlowers |> Seq.iter(fun f ->
69 | let p = engine.Predict f
70 | printf " %-15s\t%i\t\t" f.Label p.PredictedLabel
71 | p.Score |> Seq.iter(fun s -> printf "%f\t" s)
72 | printfn "")
73 |
74 | 0 // return value
--------------------------------------------------------------------------------
/Clustering/IrisFlower/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Cluster Iris flowers
2 |
3 | In this assignment you are going to build an unsupervised learning app that clusters Iris flowers into discrete groups.
4 |
5 | There are three types of Iris flowers: Versicolor, Setosa, and Virginica. Each flower has two sets of leaves: the inner Petals and the outer Sepals.
6 |
7 | Your goal is to build an app that can identify an Iris flower by its sepal and petal size.
8 |
9 | ![Iris flowers](./assets/flowers.png)
10 |
11 | Your challenge is that you're not going to use the dataset labels. Your app has to recognize patterns in the dataset and cluster the flowers into three groups without any help.
12 |
13 | Clustering is an example of **unsupervised learning** where the data science model has to figure out the labels on its own.
14 |
15 | The first thing you will need for your app is a data file with Iris flower petal and sepal sizes. You can use this [CSV file](https://github.com/mdfarragher/DSC/blob/master/Clustering/IrisFlower/iris-data.csv). Save it as **iris-data.csv** in your project folder.
16 |
17 | The file looks like this:
18 |
19 | ![The data file](./assets/data.png)
20 |
21 | It’s a CSV file with 5 columns:
22 |
23 | * The length of the Sepal in centimeters
24 | * The width of the Sepal in centimeters
25 | * The length of the Petal in centimeters
26 | * The width of the Petal in centimeters
27 | * The type of Iris flower
28 |
29 | You are going to build a clustering data science model that reads the data and then guesses the label for each flower in the dataset.
30 |
31 | Of course the app won't know the real names of the flowers, so it's just going to number them: 1, 2, and 3.
32 |
33 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
34 |
35 | ```bash
36 | $ dotnet new console --language F# --output IrisFlowers
37 | $ cd IrisFlowers
38 | ```
39 |
40 | Now install the ML.NET package:
41 |
42 | ```bash
43 | $ dotnet add package Microsoft.ML
44 | ```
45 |
46 | Now you are ready to add some types. You’ll need one to hold a flower and one to hold your model prediction.
47 |
48 | Edit the Program.fs file and replace its contents with this:
49 |
50 | ```fsharp
51 | open System
52 | open Microsoft.ML
53 | open Microsoft.ML.Data
54 |
55 | /// A type that holds a single iris flower.
56 | [<CLIMutable>]
57 | type IrisData = {
58 |     [<LoadColumn(0)>] SepalLength : float32
59 |     [<LoadColumn(1)>] SepalWidth : float32
60 |     [<LoadColumn(2)>] PetalLength : float32
61 |     [<LoadColumn(3)>] PetalWidth : float32
62 |     [<LoadColumn(4)>] Label : string
63 | }
64 |
65 | /// A type that holds a single model prediction.
66 | [<CLIMutable>]
67 | type IrisPrediction = {
68 |     PredictedLabel : uint32
69 |     Score : float32[]
70 | }
71 |
72 | // the rest of the code goes here....
73 | ```
74 |
75 | The **IrisData** type holds one single flower. Note how the fields are tagged with the **LoadColumn** attribute that tells ML.NET how to load the data from the data file.
76 |
77 | We are loading the label in the 5th column, but we won't be using the label during training because we want the model to figure out the iris flower types on its own.
78 |
79 | There's also an **IrisPrediction** type which will hold a prediction for a single flower. The prediction consists of the ID of the cluster that the flower belongs to. Clusters are numbered from 1 upwards. And notice how the score field is an array? Each individual score value represents the distance of the flower to one specific cluster.
80 |
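81 | For example, once your app is running you could recover the nearest cluster from the score array yourself. A minimal sketch, assuming **p** is an **IrisPrediction** produced by the model:
82 | 
83 | ```fsharp
84 | // the predicted cluster is the one with the smallest distance score;
85 | // cluster IDs start at 1, so add 1 to the zero-based array index
86 | let nearestCluster =
87 |     p.Score
88 |     |> Seq.mapi (fun i distance -> (uint32 i + 1u, distance))
89 |     |> Seq.minBy snd
90 |     |> fst
91 | 
92 | // nearestCluster should always match p.PredictedLabel
93 | ```
94 | 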
81 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
82 |
83 | Next you'll need to load the data in memory:
84 |
85 | ```fsharp
86 | /// file paths to data files (assumes os = windows!)
87 | let dataPath = sprintf "%s\\iris-data.csv" Environment.CurrentDirectory
88 |
89 | [<EntryPoint>]
90 | let main argv =
91 |
92 | // get the machine learning context
93 | let context = new MLContext()
94 |
95 | // read the iris flower data from a text file
96 | let data = context.Data.LoadFromTextFile<IrisData>(dataPath, hasHeader = false, separatorChar = ',')
97 |
98 | // split the data into a training and testing partition
99 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
100 |
101 | // the rest of the code goes here....
102 |
103 | 0 // return value
104 | ```
105 |
106 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory, and then calls **TrainTestSplit** to split the dataset into an 80% training partition and a 20% test partition.
107 |
108 | Now let’s build the data science pipeline:
109 |
110 | ```fsharp
111 | // set up a learning pipeline
112 | let pipeline =
113 | EstimatorChain()
114 |
115 | // step 1: concatenate features into a single column
116 | .Append(context.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
117 |
118 | // step 2: use k-means clustering to find the iris types
119 | .Append(context.Clustering.Trainers.KMeans(numberOfClusters = 3))
120 |
121 | // train the model on the training data
122 | let model = partitions.TrainSet |> pipeline.Fit
123 |
124 | // the rest of the code goes here...
125 | ```
126 |
127 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
128 |
129 | This pipeline has two components:
130 |
131 | * **Concatenate** which combines the four size columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
132 | * A **KMeans** component which performs K-Means Clustering on the data and tries to find all Iris flower types.
133 |
134 | With the pipeline fully assembled, the code trains the model by piping the training set into the **Fit** function.
135 |
136 | You now have a fully trained model. So now it's time to take the test set, predict the type of each flower, and calculate the accuracy metrics of the model:
137 |
138 | ```fsharp
139 | // get predictions and compare to ground truth
140 | let metrics = partitions.TestSet |> model.Transform |> context.Clustering.Evaluate
141 |
142 | // show results
143 | printfn "Model results"
144 | printfn " Average distance: %f" metrics.AverageDistance
145 | printfn " Davies Bouldin index: %f" metrics.DaviesBouldinIndex
146 |
147 | // the rest of the code goes here....
148 | ```
149 |
150 | This code pipes the test set into the **Transform** function to set up predictions for every flower in the test set. Then it pipes these predictions into the **Evaluate** function to compare each prediction with the label and automatically calculate two metrics:
151 |
152 | * **AverageDistance**: this is the average distance of a flower to the center point of its cluster, averaged over all clusters in the dataset. It is a measure of the 'tightness' of the clusters. Lower values are better and mean more concentrated clusters.
153 | * **DaviesBouldinIndex**: this metric is the average 'similarity' of each cluster with its most similar cluster. Similarity is defined as the ratio of within-cluster distances to between-cluster distances. So in other words, clusters which are farther apart and more concentrated will result in a better score. Low values indicate better clustering.
154 |
155 | So Average Distance measures how concentrated the clusters are in the dataset, and the Davies Bouldin Index measures both concentration and how far apart the clusters are spaced. For both metrics, lower values are better, with zero being the perfect score.
156 |
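157 | To make the Davies Bouldin definition concrete, here is a minimal sketch (not part of the assignment, using made-up numbers) that computes the index for three 1-dimensional toy clusters:
158 | 
159 | ```fsharp
160 | // three tight, well-separated toy clusters
161 | let clusters = [ [1.0; 2.0; 3.0]; [10.0; 11.0; 12.0]; [20.0; 22.0; 24.0] ]
162 | 
163 | // centroid and average distance-to-centroid ('spread') of each cluster
164 | let centroids = clusters |> List.map List.average
165 | let spreads = List.map2 (fun c pts -> pts |> List.averageBy (fun p -> abs (p - c))) centroids clusters
166 | 
167 | // for each cluster, take the worst similarity ratio with any other cluster, then average
168 | let daviesBouldin =
169 |     let worstFor i =
170 |         [ for j in 0 .. clusters.Length - 1 do
171 |             if i <> j then
172 |                 yield (spreads.[i] + spreads.[j]) / abs (centroids.[i] - centroids.[j]) ]
173 |         |> List.max
174 |     [ 0 .. clusters.Length - 1 ] |> List.map worstFor |> List.average
175 | 
176 | printfn "Davies Bouldin index: %f" daviesBouldin   // roughly 0.17: compact clusters, far apart
177 | ```
178 | 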
157 | To wrap up, let’s use the model to make predictions.
158 |
159 | You will pick three arbitrary flowers from the test set, run them through the model, and compare the predictions with the labels provided in the data file.
160 |
161 | Here’s how to do it:
162 |
163 | ```fsharp
164 | // set up a prediction engine
165 | let engine = context.Model.CreatePredictionEngine<IrisData, IrisPrediction>(model)
166 |
167 | // grab 3 flowers from the dataset
168 | let flowers = context.Data.CreateEnumerable<IrisData>(partitions.TestSet, reuseRowObject = false) |> Array.ofSeq
169 | let testFlowers = [ flowers.[0]; flowers.[10]; flowers.[20] ]
170 |
171 | // show predictions for the three flowers
172 | printfn "Predictions for the 3 test flowers:"
173 | printfn " Label\t\t\tPredicted\tScores"
174 | testFlowers |> Seq.iter(fun f ->
175 | let p = engine.Predict f
176 | printf " %-15s\t%i\t\t" f.Label p.PredictedLabel
177 | p.Score |> Seq.iter(fun s -> printf "%f\t" s)
178 | printfn "")
179 | ```
180 |
181 | This code calls **CreatePredictionEngine** to set up a prediction engine. This is a type that can generate individual predictions from sample data.
182 |
183 | Then we call the **CreateEnumerable** function to convert the test partition into an array of **IrisData** instances. Note the **Array.ofSeq** function at the end which converts the enumeration to an array.
184 |
185 | Next, we pick three test flowers and pipe them into **Seq.iter**. For each flower, we generate a prediction, print the predicted label (a cluster ID between 1 and 3) and then use a second **Seq.iter** to write the three scores to the console.
186 |
187 | That's it, you're done!
188 |
189 | Go to your terminal and run your code:
190 |
191 | ```bash
192 | $ dotnet run
193 | ```
194 |
195 | What results do you get? What is your average distance and your Davies Bouldin index?
196 |
197 | What do you think this says about the quality of the clusters?
198 |
199 | What did the 3 flower predictions look like? Does the cluster prediction match the label every time?
200 |
201 | Now change the code and check the predictions for every flower. How often does the model get it wrong? Which Iris types are the most confusing to the model?
202 |
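203 | One possible approach, sketched here with the **flowers** array and **engine** you already have: tally every (label, cluster) combination with **Seq.countBy**:
204 | 
205 | ```fsharp
206 | // predict every flower in the test partition and count label/cluster combinations
207 | flowers
208 | |> Seq.countBy (fun f -> (f.Label, (engine.Predict f).PredictedLabel))
209 | |> Seq.sortBy fst
210 | |> Seq.iter (fun ((label, cluster), count) ->
211 |     printfn " %-15s -> cluster %i: %i flowers" label cluster count)
212 | ```
213 | 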
203 | Share your results in our group.
--------------------------------------------------------------------------------
/Clustering/IrisFlower/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/data.png
--------------------------------------------------------------------------------
/Clustering/IrisFlower/assets/flowers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Clustering/IrisFlower/assets/flowers.png
--------------------------------------------------------------------------------
/Clustering/IrisFlower/iris-data.csv:
--------------------------------------------------------------------------------
1 | 5.1,3.5,1.4,0.2,Iris-setosa
2 | 4.9,3.0,1.4,0.2,Iris-setosa
3 | 4.7,3.2,1.3,0.2,Iris-setosa
4 | 4.6,3.1,1.5,0.2,Iris-setosa
5 | 5.0,3.6,1.4,0.2,Iris-setosa
6 | 5.4,3.9,1.7,0.4,Iris-setosa
7 | 4.6,3.4,1.4,0.3,Iris-setosa
8 | 5.0,3.4,1.5,0.2,Iris-setosa
9 | 4.4,2.9,1.4,0.2,Iris-setosa
10 | 4.9,3.1,1.5,0.1,Iris-setosa
11 | 5.4,3.7,1.5,0.2,Iris-setosa
12 | 4.8,3.4,1.6,0.2,Iris-setosa
13 | 4.8,3.0,1.4,0.1,Iris-setosa
14 | 4.3,3.0,1.1,0.1,Iris-setosa
15 | 5.8,4.0,1.2,0.2,Iris-setosa
16 | 5.7,4.4,1.5,0.4,Iris-setosa
17 | 5.4,3.9,1.3,0.4,Iris-setosa
18 | 5.1,3.5,1.4,0.3,Iris-setosa
19 | 5.7,3.8,1.7,0.3,Iris-setosa
20 | 5.1,3.8,1.5,0.3,Iris-setosa
21 | 5.4,3.4,1.7,0.2,Iris-setosa
22 | 5.1,3.7,1.5,0.4,Iris-setosa
23 | 4.6,3.6,1.0,0.2,Iris-setosa
24 | 5.1,3.3,1.7,0.5,Iris-setosa
25 | 4.8,3.4,1.9,0.2,Iris-setosa
26 | 5.0,3.0,1.6,0.2,Iris-setosa
27 | 5.0,3.4,1.6,0.4,Iris-setosa
28 | 5.2,3.5,1.5,0.2,Iris-setosa
29 | 5.2,3.4,1.4,0.2,Iris-setosa
30 | 4.7,3.2,1.6,0.2,Iris-setosa
31 | 4.8,3.1,1.6,0.2,Iris-setosa
32 | 5.4,3.4,1.5,0.4,Iris-setosa
33 | 5.2,4.1,1.5,0.1,Iris-setosa
34 | 5.5,4.2,1.4,0.2,Iris-setosa
35 | 4.9,3.1,1.5,0.1,Iris-setosa
36 | 5.0,3.2,1.2,0.2,Iris-setosa
37 | 5.5,3.5,1.3,0.2,Iris-setosa
38 | 4.9,3.1,1.5,0.1,Iris-setosa
39 | 4.4,3.0,1.3,0.2,Iris-setosa
40 | 5.1,3.4,1.5,0.2,Iris-setosa
41 | 5.0,3.5,1.3,0.3,Iris-setosa
42 | 4.5,2.3,1.3,0.3,Iris-setosa
43 | 4.4,3.2,1.3,0.2,Iris-setosa
44 | 5.0,3.5,1.6,0.6,Iris-setosa
45 | 5.1,3.8,1.9,0.4,Iris-setosa
46 | 4.8,3.0,1.4,0.3,Iris-setosa
47 | 5.1,3.8,1.6,0.2,Iris-setosa
48 | 4.6,3.2,1.4,0.2,Iris-setosa
49 | 5.3,3.7,1.5,0.2,Iris-setosa
50 | 5.0,3.3,1.4,0.2,Iris-setosa
51 | 7.0,3.2,4.7,1.4,Iris-versicolor
52 | 6.4,3.2,4.5,1.5,Iris-versicolor
53 | 6.9,3.1,4.9,1.5,Iris-versicolor
54 | 5.5,2.3,4.0,1.3,Iris-versicolor
55 | 6.5,2.8,4.6,1.5,Iris-versicolor
56 | 5.7,2.8,4.5,1.3,Iris-versicolor
57 | 6.3,3.3,4.7,1.6,Iris-versicolor
58 | 4.9,2.4,3.3,1.0,Iris-versicolor
59 | 6.6,2.9,4.6,1.3,Iris-versicolor
60 | 5.2,2.7,3.9,1.4,Iris-versicolor
61 | 5.0,2.0,3.5,1.0,Iris-versicolor
62 | 5.9,3.0,4.2,1.5,Iris-versicolor
63 | 6.0,2.2,4.0,1.0,Iris-versicolor
64 | 6.1,2.9,4.7,1.4,Iris-versicolor
65 | 5.6,2.9,3.6,1.3,Iris-versicolor
66 | 6.7,3.1,4.4,1.4,Iris-versicolor
67 | 5.6,3.0,4.5,1.5,Iris-versicolor
68 | 5.8,2.7,4.1,1.0,Iris-versicolor
69 | 6.2,2.2,4.5,1.5,Iris-versicolor
70 | 5.6,2.5,3.9,1.1,Iris-versicolor
71 | 5.9,3.2,4.8,1.8,Iris-versicolor
72 | 6.1,2.8,4.0,1.3,Iris-versicolor
73 | 6.3,2.5,4.9,1.5,Iris-versicolor
74 | 6.1,2.8,4.7,1.2,Iris-versicolor
75 | 6.4,2.9,4.3,1.3,Iris-versicolor
76 | 6.6,3.0,4.4,1.4,Iris-versicolor
77 | 6.8,2.8,4.8,1.4,Iris-versicolor
78 | 6.7,3.0,5.0,1.7,Iris-versicolor
79 | 6.0,2.9,4.5,1.5,Iris-versicolor
80 | 5.7,2.6,3.5,1.0,Iris-versicolor
81 | 5.5,2.4,3.8,1.1,Iris-versicolor
82 | 5.5,2.4,3.7,1.0,Iris-versicolor
83 | 5.8,2.7,3.9,1.2,Iris-versicolor
84 | 6.0,2.7,5.1,1.6,Iris-versicolor
85 | 5.4,3.0,4.5,1.5,Iris-versicolor
86 | 6.0,3.4,4.5,1.6,Iris-versicolor
87 | 6.7,3.1,4.7,1.5,Iris-versicolor
88 | 6.3,2.3,4.4,1.3,Iris-versicolor
89 | 5.6,3.0,4.1,1.3,Iris-versicolor
90 | 5.5,2.5,4.0,1.3,Iris-versicolor
91 | 5.5,2.6,4.4,1.2,Iris-versicolor
92 | 6.1,3.0,4.6,1.4,Iris-versicolor
93 | 5.8,2.6,4.0,1.2,Iris-versicolor
94 | 5.0,2.3,3.3,1.0,Iris-versicolor
95 | 5.6,2.7,4.2,1.3,Iris-versicolor
96 | 5.7,3.0,4.2,1.2,Iris-versicolor
97 | 5.7,2.9,4.2,1.3,Iris-versicolor
98 | 6.2,2.9,4.3,1.3,Iris-versicolor
99 | 5.1,2.5,3.0,1.1,Iris-versicolor
100 | 5.7,2.8,4.1,1.3,Iris-versicolor
101 | 6.3,3.3,6.0,2.5,Iris-virginica
102 | 5.8,2.7,5.1,1.9,Iris-virginica
103 | 7.1,3.0,5.9,2.1,Iris-virginica
104 | 6.3,2.9,5.6,1.8,Iris-virginica
105 | 6.5,3.0,5.8,2.2,Iris-virginica
106 | 7.6,3.0,6.6,2.1,Iris-virginica
107 | 4.9,2.5,4.5,1.7,Iris-virginica
108 | 7.3,2.9,6.3,1.8,Iris-virginica
109 | 6.7,2.5,5.8,1.8,Iris-virginica
110 | 7.2,3.6,6.1,2.5,Iris-virginica
111 | 6.5,3.2,5.1,2.0,Iris-virginica
112 | 6.4,2.7,5.3,1.9,Iris-virginica
113 | 6.8,3.0,5.5,2.1,Iris-virginica
114 | 5.7,2.5,5.0,2.0,Iris-virginica
115 | 5.8,2.8,5.1,2.4,Iris-virginica
116 | 6.4,3.2,5.3,2.3,Iris-virginica
117 | 6.5,3.0,5.5,1.8,Iris-virginica
118 | 7.7,3.8,6.7,2.2,Iris-virginica
119 | 7.7,2.6,6.9,2.3,Iris-virginica
120 | 6.0,2.2,5.0,1.5,Iris-virginica
121 | 6.9,3.2,5.7,2.3,Iris-virginica
122 | 5.6,2.8,4.9,2.0,Iris-virginica
123 | 7.7,2.8,6.7,2.0,Iris-virginica
124 | 6.3,2.7,4.9,1.8,Iris-virginica
125 | 6.7,3.3,5.7,2.1,Iris-virginica
126 | 7.2,3.2,6.0,1.8,Iris-virginica
127 | 6.2,2.8,4.8,1.8,Iris-virginica
128 | 6.1,3.0,4.9,1.8,Iris-virginica
129 | 6.4,2.8,5.6,2.1,Iris-virginica
130 | 7.2,3.0,5.8,1.6,Iris-virginica
131 | 7.4,2.8,6.1,1.9,Iris-virginica
132 | 7.9,3.8,6.4,2.0,Iris-virginica
133 | 6.4,2.8,5.6,2.2,Iris-virginica
134 | 6.3,2.8,5.1,1.5,Iris-virginica
135 | 6.1,2.6,5.6,1.4,Iris-virginica
136 | 7.7,3.0,6.1,2.3,Iris-virginica
137 | 6.3,3.4,5.6,2.4,Iris-virginica
138 | 6.4,3.1,5.5,1.8,Iris-virginica
139 | 6.0,3.0,4.8,1.8,Iris-virginica
140 | 6.9,3.1,5.4,2.1,Iris-virginica
141 | 6.7,3.1,5.6,2.4,Iris-virginica
142 | 6.9,3.1,5.1,2.3,Iris-virginica
143 | 5.8,2.7,5.1,1.9,Iris-virginica
144 | 6.8,3.2,5.9,2.3,Iris-virginica
145 | 6.7,3.3,5.7,2.5,Iris-virginica
146 | 6.7,3.0,5.2,2.3,Iris-virginica
147 | 6.3,2.5,5.0,1.9,Iris-virginica
148 | 6.5,3.0,5.2,2.0,Iris-virginica
149 | 6.2,3.4,5.4,2.3,Iris-virginica
150 | 5.9,3.0,5.1,1.8,Iris-virginica
151 |
--------------------------------------------------------------------------------
/LoadingData/CaliforniaHousing/CaliforniaHousing.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 | 
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 | 
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 | 
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" />
14 |     <PackageReference Include="FSharp.Plotly" />
15 |   </ItemGroup>
16 | 
17 | </Project>
18 | 
--------------------------------------------------------------------------------
/LoadingData/CaliforniaHousing/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open Microsoft.ML
3 | open Microsoft.ML.Data
4 | open FSharp.Plotly
5 |
6 | /// The HouseBlockData class holds one single housing block data record.
7 | [<CLIMutable>]
8 | type HouseBlockData = {
9 | [<LoadColumn(0)>] Longitude : float32
10 | [<LoadColumn(1)>] Latitude : float32
11 | [<LoadColumn(2)>] HousingMedianAge : float32
12 | [<LoadColumn(3)>] TotalRooms : float32
13 | [<LoadColumn(4)>] TotalBedrooms : float32
14 | [<LoadColumn(5)>] Population : float32
15 | [<LoadColumn(6)>] Households : float32
16 | [<LoadColumn(7)>] MedianIncome : float32
17 | [<LoadColumn(8)>] MedianHouseValue : float32
18 | }
19 |
20 | /// The ToMedianHouseValue class is used in a column data conversion.
21 | [<CLIMutable>]
22 | type ToMedianHouseValue = {
23 | mutable NormalizedMedianHouseValue : float32
24 | }
25 |
26 | /// The ToRoomsPerPerson class is used in a column data conversion.
27 | [<CLIMutable>]
28 | type ToRoomsPerPerson = {
29 | mutable RoomsPerPerson : float32
30 | }
31 |
32 | /// The FromLocation class is used in a column data conversion.
33 | [<CLIMutable>]
34 | type FromLocation = {
35 | EncodedLongitude : float32[]
36 | EncodedLatitude : float32[]
37 | }
38 |
39 | /// The ToLocation class is used in a column data conversion.
40 | [<CLIMutable>]
41 | type ToLocation = {
42 | mutable Location : float32[]
43 | }
44 |
45 | /// file paths to data files (assumes os = windows!)
46 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory
47 |
48 | [<EntryPoint>]
49 | let main argv =
50 |
51 | // create the machine learning context
52 | let context = new MLContext()
53 |
54 | // load the dataset
55 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',')
56 |
57 | // keep only records with a median house value < 500,000
58 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0)
59 |
60 | // get an array of housing data
61 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false)
62 |
63 | // // plot median house value by median income
64 | // Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue)))
65 | // |> Chart.withX_AxisStyle "Median income"
66 | // |> Chart.withY_AxisStyle "Median house value"
67 | // |> Chart.Show
68 |
69 | // build a data loading pipeline
70 | let pipeline =
71 | EstimatorChain()
72 |
73 | // step 1: divide the median house value by 1000
74 | .Append(
75 | context.Transforms.CustomMapping(
76 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f),
77 | "MedianHouseValue"))
78 |
79 | // get a 10-record preview of the transformed data
80 | let model = data |> pipeline.Fit
81 | let preview = (data |> model.Transform).Preview(maxRows = 10)
82 |
83 | // // show the preview
84 | // preview.ColumnView |> Seq.iter(fun c ->
85 | // printf "%-30s|" c.Column.Name
86 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value)
87 | // printfn "")
88 |
89 | // // plot median house value by longitude
90 | // Chart.Point(houses |> Seq.map(fun h -> (h.Longitude, h.MedianHouseValue)))
91 | // |> Chart.withX_AxisStyle "Longitude"
92 | // |> Chart.withY_AxisStyle "Median house value"
93 | // |> Chart.Show
94 |
95 | // step 2: bin the longitude
96 | let pipeline2 =
97 | pipeline
98 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10))
99 |
100 | // step 3: bin the latitude
101 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10))
102 |
103 | // step 4: one-hot encode the longitude
104 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude"))
105 |
106 | // step 5: one-hot encode the latitude
107 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude"))
108 |
109 | .Append(
110 | context.Transforms.CustomMapping(
111 | Action<FromLocation, ToLocation>(fun input output ->
112 | output.Location <- [| for x in input.EncodedLongitude do
113 | for y in input.EncodedLatitude do
114 | x * y |] ),
115 | "Location"))
116 |
117 | // get a 10-record preview of the transformed data
118 | let model = data |> pipeline2.Fit
119 | let preview = (data |> model.Transform).Preview(maxRows = 10)
120 |
121 | // // show the preview
122 | // preview.ColumnView |> Seq.iter(fun c ->
123 | // printf "%-30s|" c.Column.Name
124 | // preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value)
125 | // printfn "")
126 |
127 | // show the dense vector
128 | preview.RowView |> Seq.iter(fun r ->
129 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32>
130 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v))
131 | printfn "")
132 |
133 | 0 // return value
--------------------------------------------------------------------------------
/LoadingData/CaliforniaHousing/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Load California housing data
2 |
3 | In this assignment you're going to build an app that can load a dataset with the prices of houses in California. The data is not ready for training yet and needs a bit of processing.
4 |
5 | The first thing you'll need is a data file with house prices. The data from the 1990 California census has exactly what we need.
6 |
7 | Download the [California 1990 housing census](https://github.com/mdfarragher/DSC/blob/master/LoadingData/CaliforniaHousing/california_housing.csv) and save it as **california_housing.csv**.
8 |
9 | This is a CSV file with 17,000 records that looks like this:
10 | 
11 | 
12 |
13 | The file contains information on 17k housing blocks all over the state of California:
14 |
15 | * Column 1: The longitude of the housing block
16 | * Column 2: The latitude of the housing block
17 | * Column 3: The median age of all the houses in the block
18 | * Column 4: The total number of rooms in all houses in the block
19 | * Column 5: The total number of bedrooms in all houses in the block
20 | * Column 6: The total number of people living in all houses in the block
21 | * Column 7: The total number of households in all houses in the block
22 | * Column 8: The median income of all people living in all houses in the block
23 | * Column 9: The median house value for all houses in the block
24 |
25 | We can use this data to train an app to predict the value of any house in and outside the state of California.
26 |
27 | Unfortunately we cannot train on this dataset directly. The data needs to be processed first to make it suitable for training. This is what you will do in this assignment.
28 |
29 | Let's get started.
30 |
31 | In these assignments you will not be using the code on GitHub. Instead, you'll be building all the applications 100% from scratch. So please make sure to create a new folder somewhere to hold all of your assignments.
32 |
33 | Now please open a console window. You are going to create a new subfolder for this assignment and set up a blank console application:
34 |
35 | ```bash
36 | $ dotnet new console --language F# --output LoadingData
37 | $ cd LoadingData
38 | ```
39 |
40 | Also make sure to copy the dataset file(s) into this folder because the code you're going to type next will expect them here.
41 |
42 | Now install the following packages:
43 |
44 | ```bash
45 | $ dotnet add package Microsoft.ML
46 | $ dotnet add package FSharp.Plotly
47 | ```
48 |
49 | **Microsoft.ML** is the Microsoft machine learning package. We will use it to build all our applications in this course. And **FSharp.Plotly** is an advanced scientific plotting library.
50 |
51 | Now you are ready to add types. You’ll need one type to hold all the information for a single housing block.
52 |
53 | Edit the Program.fs file with Visual Studio Code and add the following code:
54 |
55 | ```fsharp
56 | open System
57 | open Microsoft.ML
58 | open Microsoft.ML.Data
59 | open FSharp.Plotly
60 |
61 | /// The HouseBlockData class holds one single housing block data record.
62 | [<CLIMutable>]
63 | type HouseBlockData = {
64 | [<LoadColumn(0)>] Longitude : float32
65 | [<LoadColumn(1)>] Latitude : float32
66 | [<LoadColumn(2)>] HousingMedianAge : float32
67 | [<LoadColumn(3)>] TotalRooms : float32
68 | [<LoadColumn(4)>] TotalBedrooms : float32
69 | [<LoadColumn(5)>] Population : float32
70 | [<LoadColumn(6)>] Households : float32
71 | [<LoadColumn(7)>] MedianIncome : float32
72 | [<LoadColumn(8)>] MedianHouseValue : float32
73 | }
74 | ```
75 |
76 | The **HouseBlockData** class holds all the data for one single housing block. Note that we're loading each column as a 32-bit floating point number, and that every field is tagged with a **LoadColumn** attribute that will tell the CSV data loading code which column to import data from.
77 |
78 | We also need the **CLIMutable** attribute to tell F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
79 |
80 | Next you need to load the data in memory:
81 |
82 | ```fsharp
83 | /// file paths to data files (assumes os = windows!)
84 | let dataPath = sprintf "%s\\california_housing.csv" Environment.CurrentDirectory
85 |
86 | [<EntryPoint>]
87 | let main argv =
88 |
89 | // create the machine learning context
90 | let context = new MLContext()
91 |
92 | // load the dataset
93 | let data = context.Data.LoadFromTextFile<HouseBlockData>(dataPath, hasHeader = true, separatorChar = ',')
94 |
95 | // the rest of the code goes here...
96 |
97 | 0 // return value
98 | ```
99 |
100 | This code sets up the **main** function which is the main entry point of the application. The code calls the **LoadFromTextFile** method to load the CSV data in memory. Note the **HouseBlockData** type argument that tells the method which class to use to load the data.
101 |
102 | Also note that **dataPath** uses a Windows path separator to access the data file. Change this accordingly if you're using OS/X or Linux.
103 |
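104 | For example, a portable alternative (this assumes an extra **open System.IO** at the top of the file) would be:
105 | 
106 | ```fsharp
107 | // Path.Combine picks the correct separator for the current operating system
108 | let dataPath = Path.Combine(Environment.CurrentDirectory, "california_housing.csv")
109 | ```
110 | 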
104 | So now we have the data in memory. Let's plot the median house value as a function of median income and see what happens.
105 |
106 | Add the following code:
107 |
108 | ```fsharp
109 | // get an array of housing data
110 | let houses = context.Data.CreateEnumerable<HouseBlockData>(data, reuseRowObject = false)
111 |
112 | // plot median house value by median income
113 | Chart.Point(houses |> Seq.map(fun h -> (h.MedianIncome, h.MedianHouseValue)))
114 | |> Chart.withX_AxisStyle "Median income"
115 | |> Chart.withY_AxisStyle "Median house value"
116 | |> Chart.Show
117 |
118 | // the rest of the code goes here
119 | ```
120 |
121 | The housing data is stored in memory as a data view, but we want to work with the **HouseBlockData** records directly. So we call **CreateEnumerable** to convert the data view to an enumeration of **HouseBlockData** instances.
122 |
123 | The **Chart.Point** method then sets up a scatterplot. We pipe the **houses** enumeration into the **Seq.map** function and project a tuple for every housing block. The tuples contain the median income and median house value for every block, and **Chart.Point** will use these as X- and Y coordinates.
124 |
125 | The **Chart.withX_AxisStyle** and **Chart.withY_AxisStyle** functions set the chart axis titles, and **Chart.Show** renders the chart on screen. Your app will open a web browser and display the chart there.
126 |
127 | This is a good moment to save your work ;)
128 |
129 | We're now ready to run the app. Open a PowerShell terminal and make sure you're in the project folder. Then type the following:
130 |
131 | ```bash
132 | $ dotnet build
133 | ```
134 |
135 | This will build the project and populate the bin folder.
136 |
137 | Then type the following:
138 |
139 | ```bash
140 | $ dotnet run
141 | ```
142 |
143 | Your app will run and open the chart in a new browser window. It should look like this:
144 |
145 | 
146 |
147 | As the median income level increases, the median house value also increases. There's still a big spread in the house values, but a vague 'cigar' shape is visible which suggests a linear relationship between these two variables.
148 |
149 | But look at the horizontal line at 500,000. What's that all about?
150 |
151 | This is what **clipping** looks like. The creator of this dataset has clipped all housing blocks with a median house value above $500,000 back down to $500,000. We see this appear in the graph as a horizontal line that disrupts the linear 'cigar' shape.
152 |
153 | Let's start by using **data scrubbing** to get rid of these clipped records. Add the following code:
154 |
155 | ```fsharp
156 | // keep only records with a median house value < 500,000
157 | let data = context.Data.FilterRowsByColumn(data, "MedianHouseValue", upperBound = 499999.0)
158 |
159 | // the rest of the code goes here...
160 | ```
161 |
162 | The **FilterRowsByColumn** method will keep only those records with a median house value below 500,000, and remove all other records from the dataset.
163 |
164 | Move your plotting code BELOW this code fragment and run your app again.
165 |
166 | Did this fix the problem? Is the clipping line gone?
167 |
168 | Now let's take a closer look at the CSV file. Notice how all the columns are numbers in the range of 0..3000, but the median house value is in a range of 0..500,000.
169 |
170 | Remember when we talked about training data science models that we discussed having all data in a similar range?
171 |
172 | So let's fix that now by using **data scaling**. We're going to divide the median house value by 1,000 to bring it down to a range more in line with the other data columns.
173 |
174 | Start by adding the following type:
175 |
176 | ```fsharp
177 | /// The ToMedianHouseValue class is used in a column data conversion.
178 | [<CLIMutable>]
179 | type ToMedianHouseValue = {
180 | mutable NormalizedMedianHouseValue : float32
181 | }
182 | ```
183 |
184 | And then add the following code at the bottom of your **main** function:
185 |
186 | ```fsharp
187 | // build a data loading pipeline
188 | let pipeline =
189 | EstimatorChain()
190 |
191 | // step 1: divide the median house value by 1000
192 | .Append(
193 | context.Transforms.CustomMapping(
194 | Action<HouseBlockData, ToMedianHouseValue>(fun input output -> output.NormalizedMedianHouseValue <- input.MedianHouseValue / 1000.0f),
195 | "MedianHouseValue"))
196 |
197 | // the rest of the code goes here...
198 | ```
199 |
200 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
201 |
202 | This pipeline has only one component:
203 |
204 | * **CustomMapping** which takes the median house values, divides them by 1,000 and stores them in a new column called **NormalizedMedianHouseValue**. Note that we need the new **ToMedianHouseValue** type to access this new column in code.
205 |
206 | Also note the **mutable** keyword in the type definition for **ToMedianHouseValue**. By default F# types are immutable and the compiler will prevent us from assigning to any property after the type has been instantiated. The **mutable** keyword tells the compiler to create a mutable type instead and allow property assignments after construction.
207 |
208 | If we had left out the keyword, the **output.NormalizedMedianHouseValue <- ...** line would fail to compile.
209 |
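210 | Here's a tiny standalone illustration of the difference (the type names are made up for this example):
211 | 
212 | ```fsharp
213 | type Readonly = { Value : float32 }            // record fields are immutable by default
214 | type Writable = { mutable Value : float32 }    // this field may be reassigned
215 | 
216 | let w : Writable = { Value = 1.0f }
217 | w.Value <- 2.0f   // compiles; the same assignment on a Readonly record would not
218 | ```
219 | 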
210 | Now let's see if the conversion worked. Add the following code at the bottom of the **main** function:
211 |
212 | ```fsharp
213 | // get a 10-record preview of the transformed data
214 | let model = data |> pipeline.Fit
215 | let preview = (data |> model.Transform).Preview(maxRows = 10)
216 |
217 | // show the preview
218 | preview.ColumnView |> Seq.iter(fun c ->
219 | printf "%-30s|" c.Column.Name
220 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value)
221 | printfn "")
222 |
223 | // the rest of the code goes here...
224 | ```
225 |
226 | The **pipeline.Fit** method sets up the pipeline, creates a data science model and stores it in the **model** variable. The **model.Transform** method then runs the dataset through the pipeline and creates predictions for every housing block. And finally the **Preview** method extracts a 10-row preview from the collection of predictions.
227 |
228 | Next, we use **Seq.iter** to enumerate every column in the preview. We print the column name and then use a second **Seq.iter** to show all the preview values in this column.
229 |
230 | This will print a transposed view of the preview data with the columns stacked vertically and the rows stacked horizontally. Flipping the preview makes it easier to read, despite the very long column names.
231 |
232 | Now run your code.
233 |
234 | Find the MedianHouseValue and NormalizedMedianHouseValue columns in the output. Do they contain the correct values? Does the normalized column contain the original house values divided by 1,000?
235 |
236 | Now let's fix the latitude and longitude. We're reading them in directly, but remember that we discussed how **Geo data should always be binned, one-hot encoded, and crossed?**
237 |
238 | Let's do that now. Add the following types at the top of the file:
239 |
240 | ```fsharp
241 | /// The FromLocation class is used in a column data conversion.
242 | [<CLIMutable>]
243 | type FromLocation = {
244 | EncodedLongitude : float32[]
245 | EncodedLatitude : float32[]
246 | }
247 |
248 | /// The ToLocation class is used in a column data conversion.
249 | [<CLIMutable>]
250 | type ToLocation = {
251 | mutable Location : float32[]
252 | }
253 | ```
254 |
255 | Note the **mutable** keyword again, which indicates that we're going to modify the **Location** property of the **ToLocation** type after construction.
256 |
257 | We will use these types in the next code snippet.
258 |
259 | Now scroll down to the bottom of the **main** function and add the following code just before the final line that returns a zero return value:
260 |
261 | ```fsharp
262 | // step 2: bin, encode, and cross the longitude and latitude
263 | let pipeline2 =
264 | pipeline
265 | .Append(context.Transforms.NormalizeBinning("BinnedLongitude", "Longitude", maximumBinCount = 10))
266 |
267 | // step 3: bin the latitude
268 | .Append(context.Transforms.NormalizeBinning("BinnedLatitude", "Latitude", maximumBinCount = 10))
269 |
270 | // step 4: one-hot encode the longitude
271 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLongitude", "BinnedLongitude"))
272 |
273 | // step 5: one-hot encode the latitude
274 | .Append(context.Transforms.Categorical.OneHotEncoding("EncodedLatitude", "BinnedLatitude"))
275 |
276 | // step 6: cross the longitude and latitude vectors
277 | .Append(
278 | context.Transforms.CustomMapping(
279 | Action<FromLocation, ToLocation>(fun input output ->
280 | output.Location <- [| for x in input.EncodedLongitude do
281 | for y in input.EncodedLatitude do
282 | x * y |] ),
283 | "Location"))
284 |
285 | // the rest of the code goes here...
286 | ```
287 |
288 | Note how we're extending the data loading pipeline with extra components. The new components are:
289 |
290 | * Two **NormalizeBinning** components that bin the longitude and latitude values into 10 bins
291 |
292 | * Two **OneHotEncoding** components that one-hot encode the longitude and latitude bins
293 |
294 | * One **CustomMapping** component that multiplies (crosses) the longitude and latitude vectors to create a feature cross: a 100-element vector with all zeroes except for a single '1' value.
295 |
296 | Note how the custom mapping uses two nested for-loops inside the **[| ... |]** array brackets. This sets up an inline enumerator that multiplies the two longitude and latitude vectors and produces a 1-dimensional array with 100 elements.
297 |
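298 | To see what the cross produces, here's a scaled-down sketch with made-up one-hot vectors of length 3 and 2:
299 | 
300 | ```fsharp
301 | let encodedLongitude = [| 0.0f; 1.0f; 0.0f |]   // longitude fell into bin 2 of 3
302 | let encodedLatitude  = [| 1.0f; 0.0f |]         // latitude fell into bin 1 of 2
303 | 
304 | // the same nested loop as in the pipeline, producing a 3 x 2 = 6 element vector
305 | let location = [| for x in encodedLongitude do
306 |                     for y in encodedLatitude do
307 |                         x * y |]
308 | 
309 | // location is [| 0; 0; 1; 0; 0; 0 |]: a single '1' marks the bin combination
310 | ```
311 | 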
298 | Let's see if this worked. Add the following code to the bottom of the **main** function:
299 |
300 | ```fsharp
301 | // get a 10-record preview of the transformed data
302 | let model = data |> pipeline2.Fit
303 | let preview = (data |> model.Transform).Preview(maxRows = 10)
304 |
305 | // show the preview
306 | preview.ColumnView |> Seq.iter(fun c ->
307 | printf "%-30s|" c.Column.Name
308 | preview.RowView |> Seq.iter(fun r -> printf "%10O|" r.Values.[c.Column.Index].Value)
309 | printfn "")
310 |
311 | // the rest of the code goes here...
312 | ```
313 |
314 | This is the same code you used previously to create predictions, get a preview, and display the preview on the console. But now you're using **pipeline2** instead.
315 |
316 | Now run your app.
317 |
318 | What does the data look like now? Can you spot the new columns with the binned and one-hot encoded longitude and latitude values?
319 |
320 | And is the new **Location** column present?
321 |
322 | You should see the new **Location** column, but the code can't display its contents properly.
323 |
324 | So let's fix that. Add the following code to display all the individual values in the **Location** vector:
325 |
326 | ```fsharp
327 | // show the dense vector
328 | preview.RowView |> Seq.iter(fun r ->
329 | let vector = r.Values.[r.Values.Length-1].Value :?> VBuffer<float32>
330 | vector.DenseValues() |> Seq.iter(fun v -> printf "%i" (int v))
331 | printfn "")
332 | ```
333 |
334 | We use **Seq.iter** to enumerate every row in the preview. And note the **:?>** operator which casts the value to a **VBuffer** of floats. With this cast value we can call the **DenseValues** method, which returns all the elements in the vector as a sequence of floats. So we pipe that sequence into a second **Seq.iter** to print the values.
335 |
336 | Now run your app. What do you see? Did it work? Are there 100 digits in the **Location** column? And is there only a single '1' digit in each row?
337 |
338 | Post your results in our group.
--------------------------------------------------------------------------------
/LoadingData/CaliforniaHousing/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/data.png
--------------------------------------------------------------------------------
/LoadingData/CaliforniaHousing/assets/plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/LoadingData/CaliforniaHousing/assets/plot.png
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/Mnist.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 | 
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 | 
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 | 
12 |   <ItemGroup>
13 |     <PackageReference Include="Microsoft.ML" />
14 |   </ItemGroup>
15 | 
16 | </Project>
17 | 
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open System.IO
3 | open Microsoft.ML
4 | open Microsoft.ML.Data
5 | open Microsoft.ML.Transforms
6 |
7 | /// The Digit class represents one mnist digit.
8 | [<CLIMutable>]
9 | type Digit = {
10 | [<LoadColumn(0)>] Number : float32
11 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[]
12 | }
13 |
14 | /// The DigitPrediction class represents one digit prediction.
15 | [<CLIMutable>]
16 | type DigitPrediction = {
17 | Score : float32[]
18 | }
19 |
20 | /// file paths to train and test data files (assumes os = windows!)
21 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory
22 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory
23 |
24 | [<EntryPoint>]
25 | let main argv =
26 |
27 | // create a machine learning context
28 | let context = new MLContext()
29 |
30 | // load the datafiles
31 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',')
32 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',')
33 |
34 | // build a training pipeline
35 | let pipeline =
36 | EstimatorChain()
37 |
38 | // step 1: map the number column to a key value and store in the label column
39 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue))
40 |
41 | // step 2: concatenate all feature columns
42 | .Append(context.Transforms.Concatenate("Features", "PixelValues"))
43 |
44 | // step 3: cache data to speed up training
45 | .AppendCacheCheckpoint(context)
46 |
47 | // step 4: train the model with SDCA
48 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy())
49 |
50 | // step 5: map the label key value back to a number
51 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label"))
52 |
53 | // train the model
54 | let model = trainData |> pipeline.Fit
55 |
56 | // get predictions and compare them to the ground truth
57 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate
58 |
59 | // show evaluation metrics
60 | printfn "Evaluation metrics"
61 | printfn " MicroAccuracy: %f" metrics.MicroAccuracy
62 | printfn " MacroAccuracy: %f" metrics.MacroAccuracy
63 | printfn " LogLoss: %f" metrics.LogLoss
64 | printfn " LogLossReduction: %f" metrics.LogLossReduction
65 |
66 | // grab five digits from the test data
67 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq
68 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ]
69 |
70 | // create a prediction engine
71 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction>(model)
72 |
73 | // show predictions
74 | printfn "Model predictions:"
75 | printf " #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn ""
76 | testDigits |> Seq.iter(
77 | fun digit ->
78 | printf " %i\t" (int digit.Number)
79 | let p = engine.Predict digit
80 | p.Score |> Seq.iter (fun s -> printf "%f\t" s)
81 | printfn "")
82 |
83 | 0 // return value
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Recognize handwritten digits
2 |
3 | In this assignment, you are going to build an app that recognizes handwritten digits from the famous MNIST machine learning dataset:
4 |
5 | 
6 |
7 | Your app must read these images of handwritten digits and correctly predict which digit is visible in each image.
8 |
9 | This may seem like an easy challenge, but look at this:
10 |
11 | 
12 |
13 | These are a couple of digits from the dataset. Are you able to identify each one? It probably won’t surprise you to hear that the human error rate on this exercise is around 2.5%.
14 |
15 | The first thing you will need for your app is a data file with images of handwritten digits. We will not use the original MNIST data because it's stored in a nonstandard binary format.
16 |
17 | Instead, we'll use these excellent [CSV files](https://www.kaggle.com/oddrationale/mnist-in-csv/) prepared by Daniel Dato on Kaggle.
18 |
19 | Create a Kaggle account if you don't have one yet, then download **mnist_train.csv** and **mnist_test.csv** and save them in your project folder.
20 |
21 | There are 60,000 images in the training file and 10,000 in the test file. Each image is monochrome and resized to 28x28 pixels.
22 |
23 | The training file looks like this:
24 |
25 | 
26 |
27 | It’s a CSV file with 785 columns:
28 |
29 | * The first column contains the label. It tells us which one of the 10 possible digits is visible in the image.
30 | * The next 784 columns are the pixel intensity values (0..255) for each pixel in the image, counting from left to right and top to bottom.
31 |
32 | You are going to build a multiclass classification machine learning model that reads in all 785 columns, and then makes a prediction for each digit in the dataset.
33 |
34 | Let’s get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
35 |
36 | ```bash
37 | $ dotnet new console --language F# --output Mnist
38 | $ cd Mnist
39 | ```
40 |
41 | Now install the ML.NET package:
42 |
43 | ```bash
44 | $ dotnet add package Microsoft.ML
45 | ```
46 |
47 | Now you are ready to add types. You’ll need one to hold a digit, and one to hold your model prediction.
48 |
49 | Replace the contents of the Program.fs file with this:
50 |
51 | ```fsharp
52 | open System
53 | open System.IO
54 | open Microsoft.ML
55 | open Microsoft.ML.Data
56 | open Microsoft.ML.Transforms
57 |
58 | /// The Digit class represents one mnist digit.
59 | [<CLIMutable>]
60 | type Digit = {
61 | [<LoadColumn(0)>] Number : float32
62 | [<LoadColumn(1, 784)>] [<VectorType(784)>] PixelValues : float32[]
63 | }
64 |
65 | /// The DigitPrediction class represents one digit prediction.
66 | [<CLIMutable>]
67 | type DigitPrediction = {
68 | Score : float32[]
69 | }
70 | ```
71 |
72 | The **Digit** type holds one single MNIST digit image. Note how the **PixelValues** field is tagged with a **VectorType** attribute. This tells ML.NET to combine the 784 individual pixel columns into a single vector value.
73 |
74 | There's also a **DigitPrediction** type which will hold a single prediction. And notice how the prediction score is actually an array? The model will generate 10 scores, one for every possible digit value.
75 |
76 | Also note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
77 |
78 | Next you'll need to load the data in memory:
79 |
80 | ```fsharp
81 | /// file paths to train and test data files (assumes os = windows!)
82 | let trainDataPath = sprintf "%s\\mnist_train.csv" Environment.CurrentDirectory
83 | let testDataPath = sprintf "%s\\mnist_test.csv" Environment.CurrentDirectory
84 |
85 | [<EntryPoint>]
86 | let main argv =
87 |
88 | // create a machine learning context
89 | let context = new MLContext()
90 |
91 | // load the datafiles
92 | let trainData = context.Data.LoadFromTextFile<Digit>(trainDataPath, hasHeader = true, separatorChar = ',')
93 | let testData = context.Data.LoadFromTextFile<Digit>(testDataPath, hasHeader = true, separatorChar = ',')
94 |
95 | // the rest of the code goes here....
96 |
97 | 0 // return value
98 | ```
99 |
100 | This code uses the **LoadFromTextFile** function to load the CSV data directly into memory. We call this function twice to load the training and testing datasets separately.
101 |
102 | Now let’s build the machine learning pipeline:
103 |
104 | ```fsharp
105 | // build a training pipeline
106 | let pipeline =
107 | EstimatorChain()
108 |
109 | // step 1: map the number column to a key value and store in the label column
110 | .Append(context.Transforms.Conversion.MapValueToKey("Label", "Number", keyOrdinality = ValueToKeyMappingEstimator.KeyOrdinality.ByValue))
111 |
112 | // step 2: concatenate all feature columns
113 | .Append(context.Transforms.Concatenate("Features", "PixelValues"))
114 |
115 | // step 3: cache data to speed up training
116 | .AppendCacheCheckpoint(context)
117 |
118 | // step 4: train the model with SDCA
119 | .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy())
120 |
121 | // step 5: map the label key value back to a number
122 | .Append(context.Transforms.Conversion.MapKeyToValue("Number", "Label"))
123 |
124 | // train the model
125 | let model = trainData |> pipeline.Fit
126 |
127 | // the rest of the code goes here....
128 | ```
129 |
130 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
131 |
132 | This pipeline has the following components:
133 |
134 | * **MapValueToKey** which reads the **Number** column and builds a dictionary of unique values. It then produces an output column called **Label** which contains the dictionary key for each number value. We need this step because we can only train a multiclass classifier on keys.
135 | * **Concatenate** which converts the PixelValue vector into a single column called Features. This is a required step because ML.NET can only train on a single input column.
136 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm which comes next.
137 | * A **SdcaMaximumEntropy** classification learner which will train the model to make accurate predictions.
138 | * A final **MapKeyToValue** step which converts the keys in the **Label** column back to the original number values. We need this step to show the numbers when making predictions.
139 |
140 | With the pipeline fully assembled, we can train the model by piping the training data into the **Fit** function.
141 |
142 | You now have a fully trained model. So now it's time to take the test set, predict the number for each digit image, and calculate the accuracy metrics of the model:
143 |
144 | ```fsharp
145 | // get predictions and compare them to the ground truth
146 | let metrics = testData |> model.Transform |> context.MulticlassClassification.Evaluate
147 |
148 | // show evaluation metrics
149 | printfn "Evaluation metrics"
150 | printfn " MicroAccuracy: %f" metrics.MicroAccuracy
151 | printfn " MacroAccuracy: %f" metrics.MacroAccuracy
152 | printfn " LogLoss: %f" metrics.LogLoss
153 | printfn " LogLossReduction: %f" metrics.LogLossReduction
154 |
155 | // the rest of the code goes here....
156 | ```
157 |
158 | This code pipes the test data into the **Transform** function to set up predictions for every single image in the test set. Then it pipes these predictions into the **Evaluate** function to compare these predictions to the actual labels and automatically calculate four metrics:
159 |
160 | * **MicroAccuracy**: this is the average accuracy (=the number of correct predictions divided by the total number of predictions) for every digit in the dataset.
161 | * **MacroAccuracy**: this is calculated by first calculating the average accuracy for each unique prediction value, and then taking the averages of those averages.
162 | * **LogLoss**: this is a metric that expresses the size of the error in the predictions the model is making. A logloss of zero means every prediction is correct, and the loss value rises as the model makes more and more mistakes.
163 | * **LogLossReduction**: this metric is also called the Reduction in Information Gain (RIG). It expresses the probability that the model’s predictions are better than random chance.
164 |
165 | We can compare the micro- and macro accuracy to discover if the dataset is biased. In an unbiased set each unique label value will appear roughly the same number of times, and the micro- and macro accuracy values will be close together.
166 |
167 | If the values are far apart, this suggests that there is some kind of bias in the data that we need to deal with.
168 |
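169 | Here's a small worked example (with made-up numbers) of how the two metrics diverge on a biased dataset:
170 | 
171 | ```fsharp
172 | // a made-up confusion: 90 samples of one digit with 80 correct,
173 | // and 10 samples of another digit with only 2 correct
174 | let perClass = [ (90, 80); (10, 2) ]   // (total, correct) per class
175 | 
176 | // micro accuracy: total correct / total samples = 82 / 100 = 0.82
177 | let micro = float (perClass |> List.sumBy snd) / float (perClass |> List.sumBy fst)
178 | 
179 | // macro accuracy: mean of per-class accuracies = (0.889 + 0.200) / 2 = 0.54
180 | let macro = perClass |> List.averageBy (fun (total, correct) -> float correct / float total)
181 | ```
182 | 
183 | The large gap between 0.82 and 0.54 is the tell-tale sign that one class dominates the dataset.
184 | 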
169 | To wrap up, let’s use the model to make a prediction.
170 |
171 | You will pick five arbitrary digits from the test set, run them through the model, and make a prediction for each one.
172 |
173 | Here’s how to do it:
174 |
175 | ```fsharp
176 | // grab five digits from the test data
177 | let digits = context.Data.CreateEnumerable<Digit>(testData, reuseRowObject = false) |> Array.ofSeq
178 | let testDigits = [ digits.[5]; digits.[16]; digits.[28]; digits.[63]; digits.[129] ]
179 |
180 | // create a prediction engine
181 | let engine = context.Model.CreatePredictionEngine<Digit, DigitPrediction>(model)
182 |
183 | // show predictions
184 | printfn "Model predictions:"
185 | printf " #\t\t"; [0..9] |> Seq.iter(fun i -> printf "%i\t\t" i); printfn ""
186 | testDigits |> Seq.iter(
187 | fun digit ->
188 | printf " %i\t" (int digit.Number)
189 | let p = engine.Predict digit
190 | p.Score |> Seq.iter (fun s -> printf "%f\t" s)
191 | printfn "")
192 | ```
193 |
194 | This code calls the **CreateEnumerable** function to convert the test dataview to an array of **Digit** instances. Then it picks five arbitrary digits for testing.
195 |
196 | We then call the **CreatePredictionEngine** function to set up a prediction engine.
197 |
198 | The code then calls **Seq.iter** to print column headings for each of the 10 possible digit values. We then pipe the 5 test digits into another **Seq.iter**, make a prediction for each test digit, and then use a third **Seq.iter** to display the 10 prediction scores.
199 |
200 | This will produce a table with 5 rows of test digits, and 10 columns of prediction scores. The column with the highest score represents the prediction for that particular test digit.
201 |
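202 | If you'd rather print the single predicted digit than ten raw scores, you could reduce each score vector with an argmax. This works because step 1 of the pipeline mapped the keys in value order (0 through 9):
203 | 
204 | ```fsharp
205 | // the index of the highest score is the predicted digit
206 | let predictedDigit (p : DigitPrediction) =
207 |     p.Score |> Seq.mapi (fun digit score -> (digit, score)) |> Seq.maxBy snd |> fst
208 | ```
209 | 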
202 | That's it, you're done!
203 |
204 | Go to your terminal and run your code:
205 |
206 | ```bash
207 | $ dotnet run
208 | ```
209 |
210 | What results do you get? What are your micro- and macro accuracy values? Which logloss and logloss reduction did you get?
211 |
212 | Do you think the dataset is biased?
213 |
214 | What can you say about the accuracy? Is this a good model? How far away are you from the human accuracy rate? Is this a superhuman or subhuman AI?
215 |
216 | What did the 5 digit predictions look like? Do you understand why the model gets confused sometimes?
217 |
218 | Think about the code in this assignment. How could you improve the accuracy of the model even further?
219 |
220 | Share your results in our group!
221 |
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/assets/datafile.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/datafile.png
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/assets/mnist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist.png
--------------------------------------------------------------------------------
/MulticlassClassification/DigitRecognition/assets/mnist_hard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/DigitRecognition/assets/mnist_hard.png
--------------------------------------------------------------------------------
/MulticlassClassification/FlagToxicComments/README.md:
--------------------------------------------------------------------------------
1 | # The case
2 |
3 | Online discussions about things you care about can be difficult. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions. Many platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
4 |
5 | The Conversation AI team is a research initiative founded by Jigsaw and Google. It is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments that are rude, disrespectful or likely to make someone leave a discussion.
6 |
7 | The team has built a range of public tools to detect toxicity. But the current apps still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding.
8 |
9 | In this case study, you’re going to build an app that is capable of detecting different types of toxicity like threats, obscenity, insults, and hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits.
10 |
11 | How accurate will your app be? Do you think you will be able to flag every toxic comment?
12 |
13 | That's for you to find out!
14 |
15 | # The dataset
16 |
17 | 
18 |
19 | In this case study you'll be working with a dataset containing over 313,000 comments from Wikipedia talk pages.
20 |
21 | There are two files in the dataset:
22 | * [train.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/train.csv) which contains 160k records, 2 input features, and 6 output labels. You will use this file to train your model.
23 | * [test.csv](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/test.csv) which contains 153k records and 2 input features. You will use this file to test your model.
24 |
25 | You'll need to [download the dataset from Kaggle](https://www.kaggle.com/c/8076/download-all) to get started. [Create a Kaggle account](https://www.kaggle.com/account/login) if you don't have one yet.
26 |
27 | Here's a description of all columns in the training file:
28 | * **id**: the identifier of the comment
29 | * **comment_text**: the text of the comment
30 | * **toxic**: 1 if the comment is toxic, 0 if it is not
31 | * **severe_toxic**: 1 if the comment is severely toxic, 0 if it is not
32 | * **obscene**: 1 if the comment is obscene, 0 if it is not
33 | * **threat**: 1 if the comment is threatening, 0 if it is not
34 | * **insult**: 1 if the comment is insulting, 0 if it is not
35 | * **identity_hate**: 1 if the comment expresses identity hatred, 0 if it does not
36 |
37 | # Getting started
38 | Go to the console and set up a new console application:
39 |
40 | ```bash
41 | $ dotnet new console --language F# --output FlagToxicComments
42 | $ cd FlagToxicComments
43 | ```
44 |
45 | Then install the ML.NET NuGet packages:
46 |
47 | ```bash
48 | $ dotnet add package Microsoft.ML
49 | $ dotnet add package Microsoft.ML.FastTree
50 | ```
51 |
52 | And launch the Visual Studio Code editor:
53 |
54 | ```bash
55 | $ code .
56 | ```
57 |
58 | The rest is up to you!
59 |
60 | # Hint
61 | To process text data, you'll need to add a **FeaturizeText** component to your machine learning pipeline.
62 |
63 | Your code should look something like this:
64 |
65 | ```fsharp
66 | // Assume we have a partial pipeline in the variable 'partialPipe'
67 | // This line adds a text featurizer to the pipeline. It reads the 'CommentText' column and
68 | // transforms it to a numeric vector and stores it in the 'Features' column
69 | let completePipe = partialPipe.Append(context.Transforms.Text.FeaturizeText("Features", "CommentText"))
70 | ```
71 |
72 | FeaturizeText is a handy all-in-one component that can read text columns, process them, and convert them to numeric vectors that are ready for model training.
74 |
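75 | As a possible starting point (this type is an assumption based on the column list above, not code supplied with the course), you could load the training file like this. Note the **allowQuoting** argument, which matters because comments contain commas and quoted text:
76 | 
77 | ```fsharp
78 | [<CLIMutable>]
79 | type Comment = {
80 |     [<LoadColumn(0)>] Id : string
81 |     [<LoadColumn(1)>] CommentText : string
82 |     [<LoadColumn(2)>] Toxic : float32
83 |     [<LoadColumn(3)>] SevereToxic : float32
84 |     [<LoadColumn(4)>] Obscene : float32
85 |     [<LoadColumn(5)>] Threat : float32
86 |     [<LoadColumn(6)>] Insult : float32
87 |     [<LoadColumn(7)>] IdentityHate : float32
88 | }
89 | 
90 | // allowQuoting lets the loader handle comment text containing separators
91 | let trainData = context.Data.LoadFromTextFile<Comment>("train.csv", hasHeader = true, separatorChar = ',', allowQuoting = true)
92 | ```
93 | 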
75 | # Your assignment
76 | I want you to build an app that reads the training and testing files in memory and featurizes the comments to prepare them for analysis.
77 |
78 | Then train a multiclass classifier on the training data and generate predictions for the comments in the testing file.
79 |
80 | Measure the micro- and macro-accuracy and report your best values in our group.
81 |
82 | See if you can get the accuracies as close to 1 as possible. Share in our group how you did it. Which learning algorithm did you select, and how did you configure your model?
83 |
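Here's a minimal sketch of what your solution could look like. Everything in it is one possible approach rather than the required one: it uses only the **toxic** column as the label, picks the **SdcaMaximumEntropy** learner, and splits the training file into partitions because the testing file contains no labels to score against:

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

// A minimal sketch, not a complete solution. The column indices below are
// assumptions -- verify them against the actual layout of train.csv.
[<CLIMutable>]
type Comment = {
    [<LoadColumn(1)>] CommentText : string
    [<LoadColumn(2)>] Toxic : string
}

[<EntryPoint>]
let main argv =
    let context = new MLContext()

    // load the comments and set aside 20% for testing
    let data = context.Data.LoadFromTextFile<Comment>("train.csv", hasHeader = true, separatorChar = ',', allowQuoting = true)
    let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)

    // featurize the comment text and train a multiclass classifier
    let pipeline =
        EstimatorChain()
            .Append(context.Transforms.Conversion.MapValueToKey("Label", "Toxic"))
            .Append(context.Transforms.Text.FeaturizeText("Features", "CommentText"))
            .Append(context.MulticlassClassification.Trainers.SdcaMaximumEntropy())
    let model = partitions.TrainSet |> pipeline.Fit

    // measure the micro- and macro-accuracy
    let metrics = partitions.TestSet |> model.Transform |> context.MulticlassClassification.Evaluate
    printfn "Micro accuracy: %f" metrics.MicroAccuracy
    printfn "Macro accuracy: %f" metrics.MacroAccuracy

    0 // return value
```
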
84 | Good luck!
--------------------------------------------------------------------------------
/MulticlassClassification/FlagToxicComments/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/MulticlassClassification/FlagToxicComments/assets/data.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Science with F# and ML.NET
2 |
3 | 
4 |
5 | This repository contains all the assignments from my **Data Science with F# and ML.NET** course and will get you up to speed with Microsoft's new ML.NET library.
6 |
7 | By working through the code examples, you will learn how to design, train, and evaluate complex AI models with simple F# code. I'll provide you with all the code, libraries, and data sets you need to get started.
8 |
9 | Please note that this repository contains only code examples, with no additional support.
10 |
11 | If you prefer a full-featured e-learning experience with live coaching, please check out my online course here:
12 |
13 | https://www.machinelearningadvantage.com/datascience-with-fsharp
14 |
15 |
16 | # Table of contents
17 |
18 | Transforming data: [Processing California housing data](./LoadingData/CaliforniaHousing)
19 |
20 | Regression: [Predict taxi fares in New York](./Regression/TaxiFarePrediction)
21 |
22 | Case study: [Predict house prices in Iowa](./Regression/HousePricePrediction)
23 |
24 | Binary classification: [Predict heart disease in Ohio](./BinaryClassification/HeartDiseasePrediction)
25 |
26 | Case study: [Detect credit card fraud in Europe](./BinaryClassification/FraudDetection)
27 |
28 | Multiclass classification: [Recognize handwriting](./MulticlassClassification/DigitRecognition)
29 |
30 | Evaluating models: [Detect SMS spam messages](./BinaryClassification/SpamDetection)
31 |
32 | Case study: [Flag toxic comments on Wikipedia](./MulticlassClassification/FlagToxicComments)
33 |
34 | Decision trees: [Predict Titanic survivors](./BinaryClassification/TitanicPrediction)
35 |
36 | Case study: [Predict diabetes in Pima Indians](./BinaryClassification/DiabetesDetection)
37 |
38 | Ensembles: [Predict bike demand in Washington DC](./Regression/BikeDemandPrediction)
39 |
40 | Clustering: [Classify Iris flowers](./Clustering/IrisFlower)
41 |
42 | Recommendation: [Build a movie recommender](./Recommendation/MovieRecommender)
43 |
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/MovieRecommender.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <!-- package versions omitted; add them with 'dotnet add package' -->
14 |     <PackageReference Include="Microsoft.ML" />
15 |     <PackageReference Include="Microsoft.ML.Recommender" />
16 |   </ItemGroup>
17 |
18 | </Project>
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open Microsoft.ML
3 | open Microsoft.ML.Trainers
4 | open Microsoft.ML.Data
5 |
6 | /// The MovieRating class holds a single movie rating.
7 | [<CLIMutable>]
8 | type MovieRating = {
9 |     [<LoadColumn(0)>] UserID : float32
10 |     [<LoadColumn(1)>] MovieID : float32
11 |     [<LoadColumn(2)>] Label : float32
12 | }
13 |
14 | /// The MovieRatingPrediction class holds a single movie prediction.
15 | [<CLIMutable>]
16 | type MovieRatingPrediction = {
17 | Label : float32
18 | Score : float32
19 | }
20 |
21 | /// The MovieTitle class holds a single movie title.
22 | [<CLIMutable>]
23 | type MovieTitle = {
24 |     [<LoadColumn(0)>] MovieID : float32
25 |     [<LoadColumn(1)>] Title : string
26 |     [<LoadColumn(2)>] Genres : string
27 | }
28 |
29 | // file paths to data files (assumes os = windows!)
30 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory
31 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory
32 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory
33 |
34 | [<EntryPoint>]
35 | let main argv =
36 |
37 | // set up a new machine learning context
38 | let context = new MLContext()
39 |
40 | // load training and test data
41 |     let trainData = context.Data.LoadFromTextFile<MovieRating>(trainDataPath, hasHeader = true, separatorChar = ',')
42 |     let testData = context.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader = true, separatorChar = ',')
43 |
44 | // prepare matrix factorization options
45 | let options =
46 | MatrixFactorizationTrainer.Options(
47 | MatrixColumnIndexColumnName = "UserIDEncoded",
48 | MatrixRowIndexColumnName = "MovieIDEncoded",
49 | LabelColumnName = "Label",
50 | NumberOfIterations = 20,
51 | ApproximationRank = 100)
52 |
53 | // set up a training pipeline
54 | let pipeline =
55 | EstimatorChain()
56 |
57 | // step 1: map userId and movieId to keys
58 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID"))
59 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID"))
60 |
61 | // step 2: find recommendations using matrix factorization
62 | .Append(context.Recommendation().Trainers.MatrixFactorization(options))
63 |
64 | // train the model
65 | let model = trainData |> pipeline.Fit
66 |
67 | // calculate predictions and compare them to the ground truth
68 | let metrics = testData |> model.Transform |> context.Regression.Evaluate
69 |
70 | // show model metrics
71 | printfn "Model metrics:"
72 | printfn " RMSE: %f" metrics.RootMeanSquaredError
73 | printfn " MAE: %f" metrics.MeanAbsoluteError
74 | printfn " MSE: %f" metrics.MeanSquaredError
75 |
76 | // set up a prediction engine
77 |     let engine = context.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction> model
78 |
79 | // check if Mark likes 'GoldenEye'
80 | printfn "Does Mark like GoldenEye?"
81 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f }
82 | printfn " Score: %f" p.Score
83 |
84 | // load all movie titles
85 |     let movieData = context.Data.LoadFromTextFile<MovieTitle>(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
86 |     let movies = context.Data.CreateEnumerable<MovieTitle>(movieData, reuseRowObject = false)
87 |
88 | // find Mark's top 5 movies
89 | let marksMovies =
90 | movies |> Seq.map(fun m ->
91 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f }
92 | (m.Title, p2.Score))
93 | |> Seq.sortByDescending(fun t -> snd t)
94 |
95 | // print the results
96 | printfn "What are Mark's top-5 movies?"
97 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t))
98 |
99 | 0 // return value
100 |
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Recommend new movies to film fans
2 |
3 | In this assignment you're going to build a movie recommendation system that can recommend new movies to film fans.
4 |
5 | The first thing you'll need is a data file with thousands of movies rated by many different users. The [MovieLens Project](https://movielens.org) has exactly what you need.
6 |
7 | Download the [movie ratings for training](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-train.csv), [movie ratings for testing](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-ratings-test.csv), and the [movie dictionary](https://github.com/mdfarragher/DSC/blob/master/Recommendation/MovieRecommender/recommendation-movies.csv) and save these files in your project folder. You now have 100,000 movie ratings with 99,980 set aside for training and 20 for testing.
8 |
9 | The training and testing files are in CSV format and look like this:
10 | 
13 |
14 | There are only four columns of data:
15 |
16 | * The ID of the user
17 | * The ID of the movie
18 | * The movie rating on a scale from 1–5
19 | * The timestamp of the rating
20 |
21 | There's also a movie dictionary in CSV format with all the movie IDs and titles:
22 |
23 |
24 | 
25 |
26 | You are going to build a data science model that reads in each user ID, movie ID, and rating, and then predicts the ratings each user would give for every movie in the dataset.
27 |
28 | Once you have a fully trained model, you can easily add a new user with a couple of favorite movies and then ask the model to generate predictions for any of the other movies in the dataset.
29 |
30 | And in fact this is exactly how the recommendation systems on Netflix and Amazon work.
31 |
32 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
33 |
34 | ```bash
35 | $ dotnet new console --language F# --output MovieRecommender
36 | $ cd MovieRecommender
37 | ```
38 |
39 | Now install the following packages:
40 |
41 | ```bash
42 | $ dotnet add package Microsoft.ML
43 | $ dotnet add package Microsoft.ML.Recommender
44 | ```
45 |
46 | Now you're ready to add some types. You will need one type to hold a movie rating, and one to hold your model’s predictions.
47 |
48 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code:
49 |
50 | ```fsharp
51 | open System
52 | open Microsoft.ML
53 | open Microsoft.ML.Trainers
54 | open Microsoft.ML.Data
55 |
56 | /// The MovieRating class holds a single movie rating.
57 | [<CLIMutable>]
58 | type MovieRating = {
59 |     [<LoadColumn(0)>] UserID : float32
60 |     [<LoadColumn(1)>] MovieID : float32
61 |     [<LoadColumn(2)>] Label : float32
62 | }
63 |
64 | /// The MovieRatingPrediction class holds a single movie prediction.
65 | [<CLIMutable>]
66 | type MovieRatingPrediction = {
67 | Label : float32
68 | Score : float32
69 | }
70 |
71 | // the rest of the code goes here...
72 | ```
73 |
74 | The **MovieRating** type holds one single movie rating. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.
75 |
76 | You're also declaring a **MovieRatingPrediction** type which will hold a single movie rating prediction.
77 |
78 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
79 |
80 | Before we continue, we need to set up a third type that will hold our movie dictionary:
81 |
82 | ```fsharp
83 | /// The MovieTitle class holds a single movie title.
84 | [<CLIMutable>]
85 | type MovieTitle = {
86 |     [<LoadColumn(0)>] MovieID : float32
87 |     [<LoadColumn(1)>] Title : string
88 |     [<LoadColumn(2)>] Genres : string
89 | }
90 |
91 | // the rest of the code goes here
92 | ```
93 |
94 | This **MovieTitle** type contains a movie ID value and its corresponding title and genres. We will use this type later in our code to map movie IDs to their corresponding titles.
95 |
96 | Now you need to load the dataset in memory:
97 |
98 | ```fsharp
99 | // file paths to data files (assumes os = windows!)
100 | let trainDataPath = sprintf "%s\\recommendation-ratings-train.csv" Environment.CurrentDirectory
101 | let testDataPath = sprintf "%s\\recommendation-ratings-test.csv" Environment.CurrentDirectory
102 | let titleDataPath = sprintf "%s\\recommendation-movies.csv" Environment.CurrentDirectory
103 |
104 | [<EntryPoint>]
105 | let main argv =
106 |
107 | // set up a new machine learning context
108 | let context = new MLContext()
109 |
110 | // load training and test data
111 |     let trainData = context.Data.LoadFromTextFile<MovieRating>(trainDataPath, hasHeader = true, separatorChar = ',')
112 |     let testData = context.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader = true, separatorChar = ',')
113 |
114 | // the rest of the code goes here...
115 |
116 | 0 // return value
117 | ```
118 |
119 | This code calls the **LoadFromTextFile** function twice to load the training and testing CSV data into memory. The field annotations we set up earlier tell the function how to store the loaded data in the **MovieRating** class.
120 |
121 | Now you're ready to start building the machine learning model:
122 |
123 | ```fsharp
124 | // prepare matrix factorization options
125 | let options =
126 | MatrixFactorizationTrainer.Options(
127 | MatrixColumnIndexColumnName = "UserIDEncoded",
128 | MatrixRowIndexColumnName = "MovieIDEncoded",
129 | LabelColumnName = "Label",
130 | NumberOfIterations = 20,
131 | ApproximationRank = 100)
132 |
133 | // set up a training pipeline
134 | let pipeline =
135 | EstimatorChain()
136 |
137 | // step 1: map userId and movieId to keys
138 | .Append(context.Transforms.Conversion.MapValueToKey("UserIDEncoded", "UserID"))
139 | .Append(context.Transforms.Conversion.MapValueToKey("MovieIDEncoded", "MovieID"))
140 |
141 | // step 2: find recommendations using matrix factorization
142 | .Append(context.Recommendation().Trainers.MatrixFactorization(options))
143 |
144 | // train the model
145 | let model = trainData |> pipeline.Fit
146 |
147 | // the rest of the code goes here...
148 | ```
149 |
150 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
151 |
152 | This pipeline has the following components:
153 |
154 | * **MapValueToKey** which reads the UserID column and builds a dictionary of unique ID values. It then produces an output column called UserIDEncoded containing an encoding for each ID. This step converts the IDs to numbers that the model can work with.
155 | * Another **MapValueToKey** which reads the MovieID column, encodes it, and stores the encodings in an output column called MovieIDEncoded.
156 | * A **MatrixFactorization** component that performs matrix factorization on the encoded ID columns and the ratings. This step calculates the movie rating predictions for every user and movie.
157 |
158 | With the pipeline fully assembled, you train the model by piping the training data into the **Fit** function.
159 |
160 | You now have a fully-trained model. Now you need to load the validation data, predict the rating for each user and movie, and calculate the accuracy metrics of the model:
161 |
162 | ```fsharp
163 | // calculate predictions and compare them to the ground truth
164 | let metrics = testData |> model.Transform |> context.Regression.Evaluate
165 |
166 | // show model metrics
167 | printfn "Model metrics:"
168 | printfn " RMSE: %f" metrics.RootMeanSquaredError
169 | printfn " MAE: %f" metrics.MeanAbsoluteError
170 | printfn " MSE: %f" metrics.MeanSquaredError
171 |
172 | // the rest of the code goes here...
173 | ```
174 |
175 | This code pipes the test data into the **Transform** function to make predictions for every user and movie in the test dataset. It then pipes these predictions into the **Evaluate** function to compare them to the actual ratings.
176 |
177 | The **Evaluate** function calculates the following three metrics:
178 |
179 | * **RootMeanSquaredError**: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction.
180 | * **MeanAbsoluteError**: this is the mean absolute prediction error, expressed as a rating.
181 | * **MeanSquaredError**: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE.
182 |
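If you're curious how these three metrics relate, here's a tiny F# illustration. The rating pairs below are made up, and ML.NET calculates all of this for you inside **Evaluate**:

```fsharp
// illustration only: made-up (actual, predicted) rating pairs
let errors =
    [ (4.0, 3.5); (2.0, 2.5); (5.0, 4.0) ]
    |> List.map (fun (actual, predicted) -> actual - predicted)

let mse  = errors |> List.averageBy (fun e -> e * e)   // mean squared error
let rmse = sqrt mse                                    // RMSE is the square root of MSE
let mae  = errors |> List.averageBy abs                // mean absolute error
```
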
183 | To wrap up, let’s use the model to make a prediction about me. Here are 6 movies I like:
184 |
185 | * Blade Runner
186 | * True Lies
187 | * Speed
188 | * Twelve Monkeys
189 | * Things to do in Denver when you're dead
190 | * Cloud Atlas
191 |
192 | And 6 more movies I really didn't like at all:
193 |
194 | * Ace Ventura: when nature calls
195 | * Naked Gun 33 1/3
196 | * Highlander II
197 | * Throw momma from the train
198 | * Jingle all the way
199 | * Dude, where's my car?
200 |
201 | You'll find my ratings at the very end of the training file. I added myself as user 999.
202 |
203 | So based on this list, do you think I would enjoy the James Bond movie ‘GoldenEye’?
204 |
205 | Let's write some code to find out:
206 |
207 | ```fsharp
208 | // set up a prediction engine
209 | let engine = context.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction> model
210 |
211 | // check if Mark likes 'GoldenEye'
212 | printfn "Does Mark like GoldenEye?"
213 | let p = engine.Predict { UserID = 999.0f; MovieID = 10.0f; Label = 0.0f }
214 | printfn " Score: %f" p.Score
215 |
216 | // the rest of the code goes here...
217 | ```
218 |
219 | This code uses the **CreatePredictionEngine** method to set up a prediction engine, and then calls **Predict** to create a prediction for user 999 (me) and movie 10 (GoldenEye).
220 |
221 | Let’s do one more thing and ask the model to predict my top-5 favorite movies.
222 |
223 | We can ask the model to predict my favorite movies, but it will just produce movie ID values. So now's the time to load that movie dictionary that will help us convert movie IDs to their corresponding titles:
224 |
225 | ```fsharp
226 | // load all movie titles
227 | let movieData = context.Data.LoadFromTextFile<MovieTitle>(titleDataPath, hasHeader = true, separatorChar = ',', allowQuoting = true)
228 | let movies = context.Data.CreateEnumerable<MovieTitle>(movieData, reuseRowObject = false)
229 |
230 | // the rest of the code goes here...
231 | ```
232 |
233 | This code calls **LoadFromTextFile** to load the movie dictionary in memory, and then calls **CreateEnumerable** to create an enumeration of **MovieTitle** instances.
234 |
235 | We can now find my favorite movies like this:
236 |
237 | ```fsharp
238 | // find Mark's top 5 movies
239 | let marksMovies =
240 | movies |> Seq.map(fun m ->
241 | let p2 = engine.Predict { UserID = 999.0f; MovieID = m.MovieID; Label = 0.0f }
242 | (m.Title, p2.Score))
243 | |> Seq.sortByDescending(fun t -> snd t)
244 |
245 | // print the results
246 | printfn "What are Mark's top-5 movies?"
247 | marksMovies |> Seq.take(5) |> Seq.iter(fun t -> printfn " %f %s" (snd t) (fst t))
248 | ```
249 |
250 | The code pipes the movie dictionary into **Seq.map** to create an enumeration of tuples. The first tuple element is the movie title and the second element is the rating the model thinks I would give to that movie.
251 |
252 | The code then pipes the enumeration of tuples into **Seq.sortByDescending** to sort the list by rating. This will put my favorite movies at the top of the list.
253 |
254 | Finally, the code pipes the rated movie list into **Seq.take** to grab the top-5, and then prints out the title and corresponding rating.
255 |
256 | That's it, your code is done. Go to your terminal and run the app:
257 |
258 | ```bash
259 | $ dotnet run
260 | ```
261 |
262 | Which training and validation metrics did you get? What are your RMSE and MAE values? Now look at how the data has been partitioned into training and validation sets. Do you think this is a good result? What could you improve?
263 |
264 | What rating did the model predict I would give to the movie GoldenEye? And what are my 5 favorite movies according to the model?
265 |
266 | Share your results in our group and then ask me if the predictions are correct ;)
267 |
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/data.png
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/assets/movies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/assets/movies.png
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/recommendation-movies.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Recommendation/MovieRecommender/recommendation-movies.csv
--------------------------------------------------------------------------------
/Recommendation/MovieRecommender/recommendation-ratings-test.csv:
--------------------------------------------------------------------------------
1 | userId,movieId,rating,timestamp
2 | 1,1097,5,964981680
3 | 1,1127,4,964982513
4 | 1,1136,5,964981327
5 | 1,1196,5,964981827
6 | 1,1197,5,964981872
7 | 1,1198,5,964981827
8 | 1,1206,5,964983737
9 | 1,1208,4,964983250
10 | 1,1210,5,964980499
11 | 1,1213,5,964982951
12 | 1,1214,4,964981855
13 | 2,114060,2,1445715276
14 | 2,115713,3.5,1445714854
15 | 2,122882,5,1445715272
16 | 2,131724,5,1445714851
17 | 3,2105,2,1306463559
18 | 3,2288,4,1306463631
19 | 3,2851,5,1306463925
20 | 3,2424,0.5,1306464293
21 |
--------------------------------------------------------------------------------
/Regression/BikeDemandPrediction/BikeDemand.fsproj:
--------------------------------------------------------------------------------
1 | <Project Sdk="Microsoft.NET.Sdk">
2 |
3 |   <PropertyGroup>
4 |     <OutputType>Exe</OutputType>
5 |     <TargetFramework>netcoreapp3.1</TargetFramework>
6 |   </PropertyGroup>
7 |
8 |   <ItemGroup>
9 |     <Compile Include="Program.fs" />
10 |   </ItemGroup>
11 |
12 |   <ItemGroup>
13 |     <!-- package versions omitted; add them with 'dotnet add package' -->
14 |     <PackageReference Include="Microsoft.ML" />
15 |     <PackageReference Include="Microsoft.ML.FastTree" />
16 |   </ItemGroup>
17 |
18 | </Project>
--------------------------------------------------------------------------------
/Regression/BikeDemandPrediction/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open System.IO
3 | open Microsoft.ML
4 | open Microsoft.ML.Data
5 |
6 | /// The DemandObservation class holds one single bike demand observation record.
7 | [<CLIMutable>]
8 | type DemandObservation = {
9 |     [<LoadColumn(2)>] Season : float32
10 |     [<LoadColumn(3)>] Year : float32
11 |     [<LoadColumn(4)>] Month : float32
12 |     [<LoadColumn(5)>] Hour : float32
13 |     [<LoadColumn(6)>] Holiday : float32
14 |     [<LoadColumn(7)>] Weekday : float32
15 |     [<LoadColumn(8)>] WorkingDay : float32
16 |     [<LoadColumn(9)>] Weather : float32
17 |     [<LoadColumn(10)>] Temperature : float32
18 |     [<LoadColumn(11)>] NormalizedTemperature : float32
19 |     [<LoadColumn(12)>] Humidity : float32
20 |     [<LoadColumn(13)>] Windspeed : float32
21 |     [<LoadColumn(16)>] [<ColumnName("Label")>] Count : float32
22 | }
23 |
24 | /// The DemandPrediction class holds one single bike demand prediction.
25 | [<CLIMutable>]
26 | type DemandPrediction = {
27 |     [<ColumnName("Score")>] PredictedCount : float32
28 | }
29 |
30 | // file paths to data files (assumes os = windows!)
31 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory
32 |
33 | /// The main application entry point.
34 | [<EntryPoint>]
35 | let main argv =
36 |
37 | // create the machine learning context
38 | let context = new MLContext();
39 |
40 | // load the dataset
41 |     let data = context.Data.LoadFromTextFile<DemandObservation>(dataPath, hasHeader = true, separatorChar = ',')
42 |
43 | // split the dataset into 80% training and 20% testing
44 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
45 |
46 | // build a training pipeline
47 | let pipeline =
48 | EstimatorChain()
49 |
50 | // step 1: concatenate all feature columns
51 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed"))
52 |
53 | // step 2: cache the data to speed up training
54 | .AppendCacheCheckpoint(context)
55 |
56 | // step 3: use a fast forest learner
57 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10))
58 |
59 | // train the model
60 | let model = partitions.TrainSet |> pipeline.Fit
61 |
62 | // evaluate the model
63 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate
64 |
65 | // show evaluation metrics
66 | printfn "Model metrics:"
67 |     printfn " RMSE: %f" metrics.RootMeanSquaredError
68 | printfn " MSE: %f" metrics.MeanSquaredError
69 | printfn " MAE: %f" metrics.MeanAbsoluteError
70 |
71 | // set up a sample observation
72 |     let sample = {
73 | Season = 3.0f
74 | Year = 1.0f
75 | Month = 8.0f
76 | Hour = 10.0f
77 | Holiday = 0.0f
78 | Weekday = 4.0f
79 | WorkingDay = 1.0f
80 | Weather = 1.0f
81 | Temperature = 0.8f
82 | NormalizedTemperature = 0.7576f
83 | Humidity = 0.55f
84 | Windspeed = 0.2239f
85 | Count = 0.0f // the field to predict
86 | }
87 |
88 | // create a prediction engine
89 |     let engine = context.Model.CreatePredictionEngine<DemandObservation, DemandPrediction> model
90 |
91 | // make the prediction
92 | let prediction = sample |> engine.Predict
93 |
94 | // show the prediction
95 | printfn "\r"
96 | printfn "Single prediction:"
97 | printfn " Predicted bike count: %f" prediction.PredictedCount
98 |
99 | 0 // return value
--------------------------------------------------------------------------------
/Regression/BikeDemandPrediction/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Predict bike sharing demand in Washington DC
2 |
3 | In this assignment you're going to build an app that can predict bike sharing demand in Washington DC.
4 |
5 | A bike-sharing system is a service in which bicycles are made available to individuals for short-term use. Users borrow a bike from a dock and return it at another dock belonging to the same system. Docks are bike racks that lock the bike and only release it under computer control.
6 |
7 | You’ve probably seen these docks around town; they look like this:
8 |
9 | 
10 |
11 | Bike sharing companies try to even out supply by manually distributing bikes across town, but they need to know how many bikes will be in demand at any given time in the city.
12 |
13 | So let’s give them a hand with a machine learning model!
14 |
15 | You are going to train a forest of regression decision trees on a dataset of bike sharing demand. Then you’ll use the fully-trained model to make a prediction for a given date and time.
16 |
17 | The first thing you will need is a data file with lots of bike sharing demand numbers. We are going to use the [UCI Bike Sharing Dataset](http://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset) from [Capital Bikeshare](https://www.capitalbikeshare.com/) in Metro DC. This dataset has 17,380 bike sharing records that span a 2-year period.
18 |
19 | [Download the dataset](https://github.com/mdfarragher/DSC/blob/master/Regression/BikeDemandPrediction/bikedemand.csv) and save it in your project folder as **bikedemand.csv**.
20 |
21 | The file looks like this:
22 |
23 | 
24 |
25 | It’s a comma-separated file with 17 columns:
26 |
27 | * Instant: the record index
28 | * Date: the date of the observation
29 | * Season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
30 | * Year: the year of the observation (0 = 2011, 1 = 2012)
31 | * Month: the month of the observation (1 to 12)
32 | * Hour: the hour of the observation (0 to 23)
33 | * Holiday: if the date is a holiday or not
34 | * Weekday: the day of the week of the observation
35 | * WorkingDay: if the date is a working day
36 | * Weather: the weather during the observation (1 = clear, 2 = mist, 3 = light snow/rain, 4 = heavy rain)
37 | * Temperature : the normalized temperature in Celsius
38 | * ATemperature: the normalized feeling temperature in Celsius
39 | * Humidity: the normalized humidity
40 | * Windspeed: the normalized wind speed
41 | * Casual: the number of casual bike users at the time
42 | * Registered: the number of registered bike users at the time
43 | * Count: the total number of rental bikes in operation at the time
44 |
45 | You can ignore the record index, the date, and the number of casual and registered bikes, and use everything else as input features. The final column **Count** is the label you're trying to predict.
46 |
47 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
48 |
49 | ```bash
50 | $ dotnet new console --language F# --output BikeDemand
51 | $ cd BikeDemand
52 | ```
53 |
54 | Now install the following packages:
55 |
56 | ```bash
57 | $ dotnet add package Microsoft.ML
58 | $ dotnet add package Microsoft.ML.FastTree
59 | ```
60 |
61 | Now you are ready to add some types. You’ll need one to hold a bike demand record, and one to hold your model predictions.
62 |
63 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code:
64 |
65 | ```fsharp
66 | open System
67 | open System.IO
68 | open Microsoft.ML
69 | open Microsoft.ML.Data
70 |
71 | /// The DemandObservation class holds one single bike demand observation record.
72 | [<CLIMutable>]
73 | type DemandObservation = {
74 |     [<LoadColumn(2)>] Season : float32
75 |     [<LoadColumn(3)>] Year : float32
76 |     [<LoadColumn(4)>] Month : float32
77 |     [<LoadColumn(5)>] Hour : float32
78 |     [<LoadColumn(6)>] Holiday : float32
79 |     [<LoadColumn(7)>] Weekday : float32
80 |     [<LoadColumn(8)>] WorkingDay : float32
81 |     [<LoadColumn(9)>] Weather : float32
82 |     [<LoadColumn(10)>] Temperature : float32
83 |     [<LoadColumn(11)>] NormalizedTemperature : float32
84 |     [<LoadColumn(12)>] Humidity : float32
85 |     [<LoadColumn(13)>] Windspeed : float32
86 |     [<LoadColumn(16)>] [<ColumnName("Label")>] Count : float32
87 | }
88 |
89 | /// The DemandPrediction class holds one single bike demand prediction.
90 | [<CLIMutable>]
91 | type DemandPrediction = {
92 |     [<ColumnName("Score")>] PredictedCount : float32
93 | }
94 |
95 | // the rest of the code goes here...
96 | ```
97 |
98 | The **DemandObservation** type holds one single bike demand observation. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.
99 |
100 | You're also declaring a **DemandPrediction** type which will hold a single bike demand prediction.
101 |
102 | Note the **CLIMutable** attribute that tells F# that we want a 'C#-style' class implementation with a default constructor and setter functions for every property. Without this attribute the compiler would generate an F#-style immutable class with read-only properties and no default constructor. The ML.NET library cannot handle immutable classes.
103 |
104 | Now you need to load the training data in memory:
105 |
106 | ```fsharp
107 | // file paths to data files (assumes os = windows!)
108 | let dataPath = sprintf "%s\\bikedemand.csv" Environment.CurrentDirectory
109 |
110 | /// The main application entry point.
111 | [<EntryPoint>]
112 | let main argv =
113 |
114 | // create the machine learning context
115 | let context = new MLContext();
116 |
117 | // load the dataset
118 |     let data = context.Data.LoadFromTextFile<DemandObservation>(dataPath, hasHeader = true, separatorChar = ',')
119 |
120 | // split the dataset into 80% training and 20% testing
121 | let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)
122 |
123 | // the rest of the code goes here...
124 |
125 | 0 // return value
126 | ```
127 |
128 | This code uses the method **LoadFromTextFile** to load the data directly into memory. The field annotations we set up earlier tell the method how to store the loaded data in the **DemandObservation** class.
129 |
130 | The code then calls **TrainTestSplit** to reserve 80% of the data for training and 20% for testing.
131 |
132 | Now let’s build the machine learning pipeline:
133 |
134 | ```fsharp
135 | // build a training pipeline
136 | let pipeline =
137 | EstimatorChain()
138 |
139 | // step 1: concatenate all feature columns
140 | .Append(context.Transforms.Concatenate("Features", "Season", "Year", "Month", "Hour", "Holiday", "Weekday", "WorkingDay", "Weather", "Temperature", "NormalizedTemperature", "Humidity", "Windspeed"))
141 |
142 | // step 2: cache the data to speed up training
143 | .AppendCacheCheckpoint(context)
144 |
145 | // step 3: use a fast forest learner
146 | .Append(context.Regression.Trainers.FastForest(numberOfLeaves = 20, numberOfTrees = 100, minimumExampleCountPerLeaf = 10))
147 |
148 | // train the model
149 | let model = partitions.TrainSet |> pipeline.Fit
150 |
151 | // the rest of the code goes here...
152 | ```
153 |
154 | Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
155 |
156 | This pipeline has the following components:
157 |
158 | * **Concatenate** which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
159 | * **AppendCacheCheckpoint** which caches all training data at this point. This is an optimization step that speeds up the learning algorithm.
160 | * A final **FastForest** regression learner which will train the model to make accurate predictions using a forest of decision trees.
161 |
162 | The **FastForest** learner is a very nice training algorithm that builds a random forest of decision trees.
163 |
164 | A random forest trains many decision trees independently, with each tree seeing a different random sample of the training data. To make a prediction, the forest runs the input through every tree and averages the individual tree predictions.
165 |
166 | The result is a fairly strong prediction model that is much more resistant to overfitting than any single decision tree on its own.
167 |
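To build some intuition for why this works, here's a toy F# illustration of the averaging idea. This is only an analogy, not how ML.NET implements the forest internally:

```fsharp
// toy analogy: each tree is a function from an input to a prediction,
// and the forest simply averages the predictions of all its trees
let forestPredict (trees : (float32 -> float32) list) (input : float32) =
    trees |> List.averageBy (fun tree -> tree input)

// example: a tiny 'forest' of three crude estimators
let demand = forestPredict [ (fun x -> x * 2.0f); (fun x -> x * 2.5f); (fun x -> x * 1.5f) ] 100.0f
```
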
168 | Note the use of hyperparameters to configure the learner:
169 |
170 | * **NumberOfLeaves** is the maximum number of leaf nodes each decision tree will have. In this forest each tree will have at most 20 leaf nodes.
171 | * **NumberOfTrees** is the total number of decision trees to create in the forest. This forest will hold 100 trees.
172 | * **MinimumExampleCountPerLeaf** is the minimum number of data points required to form a new leaf node. In this model every leaf will represent at least 10 data points.
173 |
174 | These hyperparameters are the default for the **FastForest** learner, but you can tweak them if you want.
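
For example, you could grow a larger, more fine-grained forest by changing step 3 of the pipeline. The values below are just a hypothetical variation to experiment with, not a recommended setting:

```fsharp
// a hypothetical variation: more and larger trees -- slower to train, possibly more accurate
.Append(context.Regression.Trainers.FastForest(numberOfLeaves = 50, numberOfTrees = 200, minimumExampleCountPerLeaf = 5))
```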
175 |
176 | With the pipeline fully assembled, you can pipe the training data into the **Fit** function to train the model.
177 |
178 | You now have a fully-trained model. So next, you'll have to load the test data, predict the bike demand, and calculate the accuracy of your model:
179 |
180 | ```fsharp
181 | // evaluate the model
182 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate
183 |
184 | // show evaluation metrics
185 | printfn "Model metrics:"
186 | printfn " RMSE: %f" metrics.RootMeanSquaredError
187 | printfn " MSE: %f" metrics.MeanSquaredError
188 | printfn " MAE: %f" metrics.MeanAbsoluteError
189 |
190 | // the rest of the code goes here...
191 | ```
192 |
193 | This code pipes the test data into the **Transform** function to generate predictions for every single bike demand record in the test partition. The code then pipes these predictions into the **Evaluate** function to compare them to the actual bike demand and automatically calculate these metrics:
194 |
195 | * **RootMeanSquaredError**: this is the root mean squared error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE represents the length of a vector in n-dimensional space, made up of the error in each individual prediction.
196 | * **MeanSquaredError**: this is the mean squared error, or MSE value. Note that RMSE and MSE are related: RMSE is the square root of MSE.
197 | * **MeanAbsoluteError**: this is the mean absolute prediction error or MAE value, expressed in number of bikes.
198 |
199 | To wrap up, let’s use the model to make a prediction.
200 |
201 | I want to rent a bike in the fall of 2012, on a Thursday in August at 10 am, in clear weather. What will the bike demand be on that day?
202 |
203 | Here’s how to make that prediction:
204 |
205 | ```fsharp
206 | // set up a sample observation
207 | let sample = {
208 | Season = 3.0f
209 | Year = 1.0f
210 | Month = 8.0f
211 | Hour = 10.0f
212 | Holiday = 0.0f
213 | Weekday = 4.0f
214 | WorkingDay = 1.0f
215 | Weather = 1.0f
216 | Temperature = 0.8f
217 | NormalizedTemperature = 0.7576f
218 | Humidity = 0.55f
219 | Windspeed = 0.2239f
220 | Count = 0.0f // the field to predict
221 | }
222 |
223 | // create a prediction engine
224 | let engine = context.Model.CreatePredictionEngine<DemandObservation, DemandPrediction> model
225 |
226 | // make the prediction
227 | let prediction = sample |> engine.Predict
228 |
229 | // show the prediction
230 | printfn "\r"
231 | printfn "Single prediction:"
232 | printfn " Predicted bike count: %f" prediction.PredictedCount
233 | ```
234 |
235 | This code sets up a new bike demand observation, and then uses the **CreatePredictionEngine** function to set up a prediction engine and call **Predict** to make a demand prediction.
236 |
237 | What will the model prediction be?
238 |
239 | Time to find out. Go to your terminal and run your code:
240 |
241 | ```bash
242 | $ dotnet run
243 | ```
244 |
245 | What results do you get? What are your RMSE and MAE values? Is this a good result?
246 |
247 | And what bike demand does your model predict on the day I wanted to take my bike ride?
248 |
249 | Now take a look at the hyperparameters. Try to change the behavior of the fast forest learner and see what happens to the accuracy of your model. Did your model improve or get worse?
250 |
251 | Share your results in our group!
252 |
--------------------------------------------------------------------------------
/Regression/BikeDemandPrediction/assets/bikesharing.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/bikesharing.jpeg
--------------------------------------------------------------------------------
/Regression/BikeDemandPrediction/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/BikeDemandPrediction/assets/data.png
--------------------------------------------------------------------------------
/Regression/HousePricePrediction/README.md:
--------------------------------------------------------------------------------
1 | # The case
2 |
3 | Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But a detailed analysis of houses and sales prices actually proves that these metrics have a much greater influence on price negotiations than the number of bedrooms or a white-picket fence.
4 |
5 | In this case study, you're going to answer the age-old question: what exactly determines the sales price of a house?
6 |
7 | And once you have your fully-trained app up and running, you can use it to predict the sales price of any house. Just plug in the relevant numbers and your app will generate a sales price prediction.
8 |
9 | But how accurate will these predictions be? Can you actually use this app in a realtor business?
10 |
11 | That's for you to find out!
12 |
13 | # The dataset
14 |
15 | 
16 |
17 | In this case study you'll be working with the Iowa House Price dataset. This dataset describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010.
18 |
19 | The data set contains 1460 records and a large number of feature columns involved in assessing home values. You can use any combination of features you like to generate your house price predictions.
20 |
21 | There is 1 file in the dataset:
22 | * [data.csv](https://github.com/mdfarragher/DSC/blob/master/Regression/HousePricePrediction/data.csv) which contains 1460 records, 80 input features, and one output label. You will use this file to train and evaluate your model.
23 |
24 | Download the file and save it in your project folder.
25 |
26 | Here's a description of all 81 columns in the training file:
27 | * SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
28 | * MSSubClass: The building class
29 | * MSZoning: The general zoning classification
30 | * LotFrontage: Linear feet of street connected to property
31 | * LotArea: Lot size in square feet
32 | * Street: Type of road access
33 | * Alley: Type of alley access
34 | * LotShape: General shape of property
35 | * LandContour: Flatness of the property
36 | * Utilities: Type of utilities available
37 | * LotConfig: Lot configuration
38 | * LandSlope: Slope of property
39 | * Neighborhood: Physical locations within Ames city limits
40 | * Condition1: Proximity to main road or railroad
41 | * Condition2: Proximity to main road or railroad (if a second is present)
42 | * BldgType: Type of dwelling
43 | * HouseStyle: Style of dwelling
44 | * OverallQual: Overall material and finish quality
45 | * OverallCond: Overall condition rating
46 | * YearBuilt: Original construction date
47 | * YearRemodAdd: Remodel date
48 | * RoofStyle: Type of roof
49 | * RoofMatl: Roof material
50 | * Exterior1st: Exterior covering on house
51 | * Exterior2nd: Exterior covering on house (if more than one material)
52 | * MasVnrType: Masonry veneer type
53 | * MasVnrArea: Masonry veneer area in square feet
54 | * ExterQual: Exterior material quality
55 | * ExterCond: Present condition of the material on the exterior
56 | * Foundation: Type of foundation
57 | * BsmtQual: Height of the basement
58 | * BsmtCond: General condition of the basement
59 | * BsmtExposure: Walkout or garden level basement walls
60 | * BsmtFinType1: Quality of basement finished area
61 | * BsmtFinSF1: Type 1 finished square feet
62 | * BsmtFinType2: Quality of second finished area (if present)
63 | * BsmtFinSF2: Type 2 finished square feet
64 | * BsmtUnfSF: Unfinished square feet of basement area
65 | * TotalBsmtSF: Total square feet of basement area
66 | * Heating: Type of heating
67 | * HeatingQC: Heating quality and condition
68 | * CentralAir: Central air conditioning
69 | * Electrical: Electrical system
70 | * 1stFlrSF: First Floor square feet
71 | * 2ndFlrSF: Second floor square feet
72 | * LowQualFinSF: Low quality finished square feet (all floors)
73 | * GrLivArea: Above grade (ground) living area square feet
74 | * BsmtFullBath: Basement full bathrooms
75 | * BsmtHalfBath: Basement half bathrooms
76 | * FullBath: Full bathrooms above grade
77 | * HalfBath: Half baths above grade
78 | * Bedroom: Number of bedrooms above basement level
79 | * Kitchen: Number of kitchens
80 | * KitchenQual: Kitchen quality
81 | * TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
82 | * Functional: Home functionality rating
83 | * Fireplaces: Number of fireplaces
84 | * FireplaceQu: Fireplace quality
85 | * GarageType: Garage location
86 | * GarageYrBlt: Year garage was built
87 | * GarageFinish: Interior finish of the garage
88 | * GarageCars: Size of garage in car capacity
89 | * GarageArea: Size of garage in square feet
90 | * GarageQual: Garage quality
91 | * GarageCond: Garage condition
92 | * PavedDrive: Paved driveway
93 | * WoodDeckSF: Wood deck area in square feet
94 | * OpenPorchSF: Open porch area in square feet
95 | * EnclosedPorch: Enclosed porch area in square feet
96 | * 3SsnPorch: Three season porch area in square feet
97 | * ScreenPorch: Screen porch area in square feet
98 | * PoolArea: Pool area in square feet
99 | * PoolQC: Pool quality
100 | * Fence: Fence quality
101 | * MiscFeature: Miscellaneous feature not covered in other categories
102 | * MiscVal: Dollar value of miscellaneous feature
103 | * MoSold: Month Sold
104 | * YrSold: Year Sold
105 | * SaleType: Type of sale
106 | * SaleCondition: Condition of sale
107 |
108 | # Getting started
109 | Go to the console and set up a new console application:
110 |
111 | ```bash
112 | $ dotnet new console --language F# --output HousePricePrediction
113 | $ cd HousePricePrediction
114 | ```
115 |
116 | Then install the ML.NET NuGet package:
117 |
118 | ```bash
119 | $ dotnet add package Microsoft.ML
120 | $ dotnet add package Microsoft.ML.FastTree
121 | ```
122 |
123 | And launch the Visual Studio Code editor:
124 |
125 | ```bash
126 | $ code .
127 | ```
128 |
129 | The rest is up to you!
130 |
131 | # Your assignment
132 | I want you to build an app that reads the data file, processes it, and then trains a linear regression model on the data.
133 |
134 | You can select any combination of input features you like, and you can perform any kind of data processing you like on the columns.
135 |
136 | Partition the data and use the trained model to make house price predictions on all the houses in the test partition. Calculate the best possible **RMSE** and **MAE** and share them in our group.
137 |
138 | See if you can get the RMSE as low as possible. Share in our group how you did it. Which features did you select, how did you process them, and how did you configure your model?
139 |
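If you're not sure where to begin, here's a minimal sketch of one possible starting point. Everything in it is an assumption rather than a prescribed solution: the three selected features and their **LoadColumn** indices must be verified against the actual column order in **data.csv**, and **FastTree** is just one learner you could try (use **Sdca** if you want a strictly linear regression model):

```fsharp
open Microsoft.ML
open Microsoft.ML.Data

// a minimal starting point, not a complete solution
[<CLIMutable>]
type HouseData = {
    [<LoadColumn(4)>] LotArea : float32       // assumed column index -- verify in data.csv
    [<LoadColumn(17)>] OverallQual : float32  // assumed column index -- verify in data.csv
    [<LoadColumn(19)>] YearBuilt : float32    // assumed column index -- verify in data.csv
    [<LoadColumn(80)>] [<ColumnName("Label")>] SalePrice : float32  // assumed to be the last column
}

[<EntryPoint>]
let main argv =
    let context = new MLContext()

    // load the data and split it into a training and test partition
    let data = context.Data.LoadFromTextFile<HouseData>("data.csv", hasHeader = true, separatorChar = ',')
    let partitions = context.Data.TrainTestSplit(data, testFraction = 0.2)

    // combine the selected features and train a regression model
    let pipeline =
        EstimatorChain()
            .Append(context.Transforms.Concatenate("Features", "LotArea", "OverallQual", "YearBuilt"))
            .Append(context.Regression.Trainers.FastTree())
    let model = partitions.TrainSet |> pipeline.Fit

    // evaluate the model on the test partition
    let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate
    printfn "RMSE: %f" metrics.RootMeanSquaredError
    printfn "MAE: %f" metrics.MeanAbsoluteError

    0 // return value
```
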
140 | Good luck!
--------------------------------------------------------------------------------
/Regression/HousePricePrediction/assets/data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mdfarragher/DSC-FS/61917efdaa745b0c4488cd9e2f6796b3ea952a62/Regression/HousePricePrediction/assets/data.png
--------------------------------------------------------------------------------
/Regression/TaxiFarePrediction/Program.fs:
--------------------------------------------------------------------------------
1 | open System
2 | open Microsoft.ML
3 | open Microsoft.ML.Data
4 |
5 | /// The TaxiTrip class represents a single taxi trip.
6 | [<CLIMutable>]
7 | type TaxiTrip = {
8 |     [<LoadColumn(0)>] VendorId : string
9 |     [<LoadColumn(5)>] RateCode : string
10 |     [<LoadColumn(3)>] PassengerCount : float32
11 |     [<LoadColumn(4)>] TripDistance : float32
12 |     [<LoadColumn(9)>] PaymentType : string
13 |     [<LoadColumn(10)>] [<ColumnName("Label")>] FareAmount : float32
14 | }
15 |
16 | /// The TaxiTripFarePrediction class represents a single fare prediction.
17 | [<CLIMutable>]
18 | type TaxiTripFarePrediction = {
19 |     [<ColumnName("Score")>] FareAmount : float32
20 | }
21 |
22 | // file paths to data files (assumes os = windows!)
23 | let dataPath = sprintf "%s\\yellow_tripdata_2018-12.csv" Environment.CurrentDirectory
24 |
25 | /// The main application entry point.
26 | [<EntryPoint>]
27 | let main argv =
28 |
29 | // create the machine learning context
30 | let context = new MLContext()
31 |
32 | // load the data
33 |     let dataView = context.Data.LoadFromTextFile<TaxiTrip>(dataPath, hasHeader = true, separatorChar = ',')
34 |
35 | // split into a training and test partition
36 | let partitions = context.Data.TrainTestSplit(dataView, testFraction = 0.2)
37 |
38 | // set up a learning pipeline
39 | let pipeline =
40 | EstimatorChain()
41 |
42 | // one-hot encode all text features
43 | .Append(context.Transforms.Categorical.OneHotEncoding("VendorId"))
44 | .Append(context.Transforms.Categorical.OneHotEncoding("RateCode"))
45 | .Append(context.Transforms.Categorical.OneHotEncoding("PaymentType"))
46 |
47 | // combine all input features into a single column
48 | .Append(context.Transforms.Concatenate("Features", "VendorId", "RateCode", "PaymentType", "PassengerCount", "TripDistance"))
49 |
50 | // cache the data to speed up training
51 | .AppendCacheCheckpoint(context)
52 |
53 | // use the fast tree learner
54 | .Append(context.Regression.Trainers.FastTree())
55 |
56 | // train the model
57 | let model = partitions.TrainSet |> pipeline.Fit
58 |
59 | // get regression metrics to score the model
60 | let metrics = partitions.TestSet |> model.Transform |> context.Regression.Evaluate
61 |
62 | // show the metrics
63 | printfn "Model metrics:"
64 |     printfn " RMSE: %f" metrics.RootMeanSquaredError
65 | printfn " MSE: %f" metrics.MeanSquaredError
66 | printfn " MAE: %f" metrics.MeanAbsoluteError
67 |
68 | // create a prediction engine for one single prediction
69 |     let engine = context.Model.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction> model
70 |
71 | let taxiTripSample = {
72 | VendorId = "VTS"
73 | RateCode = "1"
74 | PassengerCount = 1.0f
75 | TripDistance = 3.75f
76 | PaymentType = "CRD"
77 | FareAmount = 0.0f // To predict. Actual/Observed = 15.5
78 | }
79 |
80 | // make the prediction
81 | let prediction = taxiTripSample |> engine.Predict
82 |
83 | // show the prediction
84 | printfn "\r"
85 | printfn "Single prediction:"
86 | printfn " Predicted fare: %f" prediction.FareAmount
87 |
88 | 0 // return value
--------------------------------------------------------------------------------
/Regression/TaxiFarePrediction/README.md:
--------------------------------------------------------------------------------
1 | # Assignment: Predict taxi fares in New York
2 |
3 | In this assignment you're going to build an app that can predict taxi fares in New York.
4 |
5 | The first thing you'll need is a data file with transcripts of New York taxi rides. The [NYC Taxi & Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) provides yearly TLC Trip Record Data files which have exactly what you need.
6 |
7 | Download the [Yellow Taxi Trip Records from December 2018](https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-12.csv) and save it as **yellow_tripdata_2018-12.csv**.
8 |
9 | This is a CSV file with 8,173,233 records that looks like this:
10 | 
14 |
15 | There are a lot of columns with interesting information in this data file, but you will only train on the following:
16 |
17 | * Column 0: The data provider vendor ID
18 | * Column 3: Number of passengers
19 | * Column 4: Trip distance
20 | * Column 5: The rate code (standard, JFK, Newark, …)
21 | * Column 9: Payment type (credit card, cash, …)
22 | * Column 10: Fare amount
23 |
24 | You are going to build a machine learning model in F# that will use columns 0, 3, 4, 5, and 9 as input, and use them to predict the taxi fare for every trip. Then you’ll compare the predicted fares with the actual taxi fares in column 10, and evaluate the accuracy of your model.
25 |
26 | Let's get started. You need to build a new application from scratch by opening a terminal and creating a new .NET Core console project:
27 |
28 | ```bash
29 | $ dotnet new console --language F# --output PricePrediction
30 | $ cd PricePrediction
31 | ```
32 |
33 | Now install the following packages:
34 |
35 | ```bash
36 | $ dotnet add package Microsoft.ML
37 | $ dotnet add package Microsoft.ML.FastTree
38 | ```
39 |
40 | Now you are ready to add some types. You’ll need one to hold a taxi trip, and one to hold your model predictions.
41 |
42 | Edit the Program.fs file with Visual Studio Code and replace its contents with the following code:
43 |
44 | ```fsharp
45 | /// The TaxiTrip class represents a single taxi trip.
46 | []
47 | type TaxiTrip = {
48 | [