├── R files
    ├── .Rhistory
    ├── bikeUtilities.zip
    ├── outScale.R
    ├── rfModel.R
    ├── firstExecuteRScript.R
    ├── firstScore.R
    ├── addNA.R
    ├── predict.R
    ├── transform3.R
    ├── firstModel.R
    ├── rf_Example.R
    ├── utilities.R
    ├── evaluate.R
    ├── transform.R
    ├── transform2.R
    ├── visualize-prelim.R
    ├── visualize.R
    └── LICENSE
├── Python files
    ├── utilities.pyc
    ├── BikeUtilities-Py.zip
    ├── utilities.py
    ├── linearpredict.py
    ├── linearmodel.py
    ├── filterdata.py
    ├── transform.py
    ├── visualizeresids.py
    └── visualize.py
└── README.md


/R files/.Rhistory:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/Python files/utilities.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Quantia-Analytics/AzureML-Regression-Example/HEAD/Python files/utilities.pyc

--------------------------------------------------------------------------------
/R files/bikeUtilities.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Quantia-Analytics/AzureML-Regression-Example/HEAD/R files/bikeUtilities.zip

--------------------------------------------------------------------------------
/Python files/BikeUtilities-Py.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Quantia-Analytics/AzureML-Regression-Example/HEAD/Python files/BikeUtilities-Py.zip

--------------------------------------------------------------------------------
/Python files/utilities.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 07 18:34:24 2015

@author: Steve
"""

def mnth_cnt(df):
    '''
    Compute the count of months from the start of the time series.
    '''
    yr = df['yr'].tolist()
    mnth = df['mnth'].tolist()
    ## The built-in zip replaces the Python 2-only itertools.izip.
    return [x + 12 * y for x, y in zip(mnth, yr)]

--------------------------------------------------------------------------------
/Python files/linearpredict.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Thu Oct 08 14:57:25 2015

@author: Steve Elston

This code computes scores (predictions) for a scikit-learn
linear model using a DataFrame containing the intercept and
coefficients.
"""

def azureml_main(BikeShare, coefs):
    import numpy as np

    ## Row 0 of coefs holds the intercept; the remaining rows hold the
    ## feature names (column 0) and their coefficients (column 1).
    arr1 = BikeShare[coefs.iloc[1:, 0]].values
    arr2 = coefs.iloc[1:, 1].values

    BikeShare['Scored Label Mean'] = np.dot(arr1, arr2) + coefs.iloc[0, 1]

    return BikeShare

--------------------------------------------------------------------------------
/R files/outScale.R:
--------------------------------------------------------------------------------
## This scales the output of the model prediction to actual values
## from the log scale used in the models.
## This code is intended to run in an Azure ML Execute R
## Script module.

## Read in the dataset
inFrame <- maml.mapInputPort(1)

## Since the model was computed using the log of bike demand
## transform the results to actual counts.
inFrame[, 9] <- exp(inFrame[, 9])

## Select the columns and apply names for output.
outFrame <- inFrame[, c(1, 2, 3, 9)]
colnames(outFrame) <- c('Date', "Month", "Hour", "BikeDemand")

## Output the transformed data frame.
maml.mapOutputPort('outFrame')

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Azure Machine Learning-Regression-Example
==========================
## Data Science in the Cloud with Microsoft Azure Machine Learning: 2015 Update

This repository contains all the code and data necessary to explore non-linear regression using Microsoft's Azure Machine Learning cloud service. There are two branches in this repository: one for R and one for Python, which is currently under construction.

The code in the R branch is described in the O'Reilly Media report Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update by Stephen F Elston. The final experiment discussed in this report can be found in the [Microsoft Azure Machine Learning Gallery](https://gallery.azureml.net/Experiment/57ea80de15004256849bdf74afa94f1a).

A companion O'Reilly Media report and experiment using Python is coming soon.

--------------------------------------------------------------------------------
/R files/rfModel.R:
--------------------------------------------------------------------------------
## This code computes a random forest model.
## This code is intended to run in an Azure ML
## Execute R Script module. By setting the Azure
## variable to FALSE this code can be run in R
## or RStudio.
Azure <- FALSE

if(Azure){
  ## Source the zipped utility file
  source("src/utilities.R")
  ## Read in the dataset.
  BikeShare2 <- maml.mapInputPort(1)
  BikeShare2$dteday <- set.asPOSIXct2(BikeShare2)
}

require(randomForest)
## Fit on BikeShare2, the data frame bound from the input port above.
rf.bike <- randomForest(cnt ~ xformWorkHr + dteday +
                          temp + hum,
                        data = BikeShare2, ntree = 40,
                        importance = TRUE, nodesize = 5)


outFrame <- serList(list(bike.model = rf.bike))

## Output the serialized model data frame.
if(Azure) maml.mapOutputPort('outFrame')

--------------------------------------------------------------------------------
/Python files/linearmodel.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Thu Oct 08 13:56:44 2015

@author: Steve Elston

Code to create a simple linear model for testing purposes.
"""

def azureml_main(BikeShare):
    from sklearn import linear_model
    import pandas as pd

    cols = ['temp', 'hum', 'xformWorkHr', 'dayCount', 'mnth']
    X = BikeShare[cols].values
    Y = BikeShare['cnt'].values
    ## Compute the linear model.
    clf = linear_model.LinearRegression()
    bike_lm = clf.fit(X, Y)

    coef_names = ['intercept'] + cols

    ## Build a DataFrame to output the coefficients,
    ## with the intercept in the first row.
    lm_co = [bike_lm.intercept_] + list(bike_lm.coef_)

    coefs = pd.DataFrame({'coef_names' : coef_names,
                          'coefs' : lm_co})

    return coefs

--------------------------------------------------------------------------------
/R files/firstExecuteRScript.R:
--------------------------------------------------------------------------------
## This file contains the code for the simple filtering
## and plotting of the raw bike rental data.
## This code is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to FALSE the code will run
## in R or RStudio.
Azure <- FALSE

## If we are in Azure, source the utilities from the zip
## file. The next lines of code read in the dataset, either
## in Azure ML or from a csv file for testing purposes.
if(Azure){
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- set.asPOSIXct(BikeShare)
}else{
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- char.toPOSIXct(BikeShare)
}

require(dplyr)
print("Before the subset operation the dimensions are:")
print(dim(BikeShare))
BikeShare <- BikeShare %>% filter(cnt > 100)
print("After the subset operation the dimensions are:")
print(dim(BikeShare))

require(ggplot2)
qplot(dteday, cnt, data=subset(BikeShare, hr == 9), geom="line")

if(Azure) maml.mapOutputPort("BikeShare")

--------------------------------------------------------------------------------
/R files/firstScore.R:
--------------------------------------------------------------------------------
## This file contains the code to score a randomForest
## model in an Azure ML Execute R Script module.

## Some utility functions
set.asPOSIXct <- function(inFrame) {
  dteday <- as.POSIXct(
    as.integer(inFrame$dteday),
    origin = "1970-01-01")

  as.POSIXct(strptime(
    paste(as.character(dteday),
          " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S"))
}

char.toPOSIXct <- function(inFrame) {
  as.POSIXct(strptime(
    paste(inFrame$dteday, " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S")) }

## This code is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## Set the dteday column to a POSIXct type if in Azure ML
## or bind the data to the dataset name.
if(Azure){
  BikeShare <- dataset
  BikeShare$dteday <- set.asPOSIXct(BikeShare)
}

require(randomForest)
scores <- data.frame(prediction = predict(model, newdata = BikeShare))

--------------------------------------------------------------------------------
/R files/addNA.R:
--------------------------------------------------------------------------------
## This file contains the code to create a table
## with some artificial missing data.

## Some utility functions
set.asPOSIXct <- function(inFrame) {
  dteday <- as.POSIXct(
    as.integer(inFrame$dteday),
    origin = "1970-01-01")

  as.POSIXct(strptime(
    paste(as.character(dteday),
          " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S"))
}

char.toPOSIXct <- function(inFrame) {
  as.POSIXct(strptime(
    paste(inFrame$dteday, " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S")) }

## This code is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## Set the dteday column to a POSIXct type if in Azure ML
## or bind the data to the dataset name.
if(Azure){
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- set.asPOSIXct(BikeShare)
}else{
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- char.toPOSIXct(BikeShare)
}

BikeShare$cnt <- ifelse(BikeShare$cnt < 20, NA, BikeShare$cnt)

if(Azure) maml.mapOutputPort("BikeShare")

--------------------------------------------------------------------------------
/R files/predict.R:
--------------------------------------------------------------------------------
## This code will compute predictions from test data
## for R models of various types. This code is
## intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.
Azure <- FALSE

if(Azure){
  ## Source the zipped utility file
  source("src/utilities.R")
  ## Read the data frame containing the serialized
  ## model object.
  modelFrame <- maml.mapInputPort(1)
  ## Read in the dataset.
  BikeShare <- maml.mapInputPort(2)
  BikeShare$dteday <- set.asPOSIXct2(BikeShare)
} else {
  ## comment out the following line if running in Azure ML.
  modelFrame <- outFrame
}


## Extract the model from the serialized input and assign
## to a convenient name.
modelList <- unserList(modelFrame)
bike.model <- modelList$bike.model

## Output a data frame with the actual values and the
## values predicted by the model.
require(gam)
require(randomForest)
require(kernlab)
require(nnet)
outFrame <- data.frame( actual = BikeShare$cnt,
                        predicted =
                          predict(bike.model,
                                  newdata = BikeShare))

## The following line should be executed only when running in
## Azure ML Studio to output the data frame of predictions.
if(Azure) maml.mapOutputPort('outFrame')

--------------------------------------------------------------------------------
/R files/transform3.R:
--------------------------------------------------------------------------------
## This code removes downside outliers from the
## training sample of the bike rental data.
## The value of the Quantile variable can be changed
## to change the trim level.
## This code is intended to run in an Azure ML
## Execute R Script module. By changing the Azure
## variable to FALSE you can run the code in R
## or RStudio.
Azure <- FALSE

if(Azure){
  ## Read in the dataset.
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- as.POSIXct(as.integer(BikeShare$dteday),
                                 origin = "1970-01-01")
}

## Build a dataframe with the quantile by month and
## hour. Parameter Quantile determines the trim point.
Quantile <- 0.10
require(dplyr)
quantByPer <- (
  BikeShare %>%
    group_by(workTime, monthCount) %>%
    summarise(Quant = quantile(cnt,
                               probs = Quantile,
                               na.rm = TRUE))
)

## Join the quantile information with the
## matching rows of the data frame. This
## join uses the columns with common names
## as the keys.
BikeShare2 <- inner_join(BikeShare, quantByPer)

## Filter for the rows we want and remove the
## no longer needed column.
BikeShare2 <- BikeShare2 %>%
  filter(cnt > Quant)
BikeShare2[, "Quant"] <- NULL

## Output the transformed data frame.
if(Azure) maml.mapOutputPort('BikeShare2')

--------------------------------------------------------------------------------
/R files/firstModel.R:
--------------------------------------------------------------------------------
## This file contains the code to create a basic
## randomForest model in an Azure ML Create Model module.

## Some utility functions
set.asPOSIXct <- function(inFrame) {
  dteday <- as.POSIXct(
    as.integer(inFrame$dteday),
    origin = "1970-01-01")

  as.POSIXct(strptime(
    paste(as.character(dteday),
          " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S"))
}

char.toPOSIXct <- function(inFrame) {
  as.POSIXct(strptime(
    paste(inFrame$dteday, " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S")) }

## This code is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## Set the dteday column to a POSIXct type if in Azure ML
## or bind the data to the dataset name.
if(Azure){
  dataset$dteday <- set.asPOSIXct(dataset)
}else{
  dataset <- read.csv("BikeSharing.csv", sep = ",",
                      header = TRUE, stringsAsFactors = FALSE)
  dataset$dteday <- char.toPOSIXct(dataset)
}

require(randomForest)
model <- randomForest(cnt ~ xformWorkHr + dteday +
                        temp + hum,
                      data = dataset, ntree = 40,
                      nodesize = 5)

--------------------------------------------------------------------------------
/R files/rf_Example.R:
--------------------------------------------------------------------------------
## This code demonstrates the use of the importance
## from a randomForest model to select features.

## This code is intended to run in an Azure ML Execute R
## Script module. By changing the following variable
## you can run the code in R or RStudio for testing.
Azure <- FALSE

if(Azure){
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- set.asPOSIXct(BikeShare)
}else{
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- char.toPOSIXct(BikeShare)
  require(dplyr)
  BikeShare <- mutate(BikeShare, casual = NULL,
                      registered = NULL, instant = NULL,
                      atemp = NULL)
}

require(randomForest)
## Fit on BikeShare, the data frame read in above and
## subset for output below.
rf.mod <- randomForest(cnt ~ . - count
                       - mnth
                       - hr
                       - workingday
                       - isWorking
                       - dayWeek
                       - xformHr
                       - workTime
                       - holiday
                       - windspeed
                       - monthCount
                       - weathersit,
                       data = BikeShare,
                       ntree = 100, nodesize = 10,
                       importance = TRUE)

varImpPlot(rf.mod)

out.frame <- BikeShare[, c("cnt", rownames(rf.mod$importance))]

if(Azure) maml.mapOutputPort("out.frame")

--------------------------------------------------------------------------------
/Python files/filterdata.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 07 09:51:37 2015

@author: Steve Elston

This file contains a function for filtering outliers
from the bike rental data. Lower quantiles are computed
based on the month of the year and the working hour
(0-47 hour) for the day. Values less than this quantile
are filtered from the dataset.
"""

def azureml_main(BikeShare):
    import pandas as pd

    ## Save the original names of the DataFrame.
    in_names = list(BikeShare)

    ## Compute the lower quantile of the number of bikes, grouped by
    ## date and time values.
    quantiles = BikeShare.groupby(['yr', 'mnth', 'xformWorkHr']).cnt.quantile(q = 0.2)

    ## Join (merge) quantiles as a DataFrame to BikeShare
    quantiles = pd.DataFrame(quantiles)
    BikeShare = pd.merge(BikeShare, quantiles,
                         left_on = ['yr', 'mnth', 'xformWorkHr'],
                         right_index = True,
                         how = 'inner')

    ## Keep only the rows where the count of bikes exceeds the lower
    ## quantile. (.loc replaces the removed .ix indexer.)
    BikeShare = BikeShare.loc[BikeShare.cnt_x > BikeShare.cnt_y]

    ## Remove the unneeded column and restore the original column names.
    BikeShare = BikeShare.drop('cnt_y', axis = 1)
    BikeShare.columns = in_names

    ## Sort the data frame based on the dayCount.
    ## (sort_values replaces the removed DataFrame.sort method.)
    BikeShare = BikeShare.sort_values('dayCount')

    return BikeShare

--------------------------------------------------------------------------------
/R files/utilities.R:
--------------------------------------------------------------------------------
## This code contains some utility functions used in
## several Execute R Script modules. This file should
## be zipped and uploaded as a dataset into Azure ML
## Studio. Each Execute R Script module that uses
## these utilities imports them using the R source()
## function.

set.asPOSIXct <- function(inFrame) {
  dteday <- as.POSIXct(
    as.integer(inFrame$dteday),
    origin = "1970-01-01")

  as.POSIXct(strptime(
    paste(as.character(dteday),
          " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S"))
}

char.toPOSIXct <- function(inFrame) {
  as.POSIXct(strptime(
    paste(inFrame$dteday, " ",
          as.character(inFrame$hr),
          ":00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S")) }


set.asPOSIXct2 <- function(inFrame) {
  dteday <- as.POSIXct(
    as.integer(inFrame$dteday),
    origin = "1970-01-01")
}


fact.conv <- function(inVec){
  ## Function gives the day variable meaningful
  ## level names.
  outVec <- as.factor(inVec)
  levels(outVec) <- c("Monday", "Tuesday", "Wednesday",
                      "Thursday", "Friday", "Saturday",
                      "Sunday")
  outVec
}

get.date <- function(Date){
  ## Function returns the date as a character
  ## string from a POSIXct datetime object.
  temp <- strftime(Date, format = "%Y-%m-%d %H:%M:%S")
  substr(unlist(temp), 1, 10)
}


POSIX.date <- function(Date,Hour){
  ## Function returns POSIXct time series object
  ## from date and hour arguments.
  as.POSIXct(strptime(paste(Date, " ", as.character(Hour),
                            ":00:00", sep = ""),
                      "%Y-%m-%d %H:%M:%S"))
}

var.log <- function(inFrame, col){
  outVec <- ifelse(inFrame[, col] < 0.1, 1, inFrame[, col])
  log(outVec)
}

month.count <- function(inFrame){
  Dteday <- strftime(inFrame$dteday,
                     format = "%Y-%m-%dT%H:%M:%S")
  yearCount <- as.numeric(unlist(lapply(strsplit(
    Dteday, "-"),
    function(x){x[1]}))) - 2011
  inFrame$monthCount <- 12 * yearCount + inFrame$mnth
  inFrame
}

--------------------------------------------------------------------------------
/Python files/transform.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 21 12:49:06 2015

@author: Steve Elston
"""


## The main function with a single argument, a Pandas data frame
## from the first input port of the Execute Python Script module.
def azureml_main(BikeShare):
    import pandas as pd
    from sklearn import preprocessing
    import utilities as ut
    import numpy as np
    import os

    ## If not in the Azure environment, read the data from a csv
    ## file for testing purposes.
    Azure = False
    if not Azure:
        pathName = "C:/Users/Steve/GIT/Quantia-Analytics/AzureML-Regression-Example/Python files"
        fileName = "BikeSharing.csv"
        filePath = os.path.join(pathName, fileName)
        BikeShare = pd.read_csv(filePath)

    ## Drop the columns we do not need
    BikeShare = BikeShare.drop(['instant',
                                'atemp',
                                'casual',
                                'registered'], axis = 1)

    ## Normalize the numeric columns
    scale_cols = ['temp', 'hum', 'windspeed']
    arry = BikeShare[scale_cols].values
    BikeShare[scale_cols] = preprocessing.scale(arry)

    ## Create a new column to indicate if the day is a working day or not.
    work_day = BikeShare['workingday'].values
    holiday = BikeShare['holiday'].values
    BikeShare['isWorking'] = np.where(np.logical_and(work_day == 1, holiday == 0), 1, 0)

    ## Compute a new column with the count of months from
    ## the start of the series which can be used to model
    ## trend
    BikeShare['monthCount'] = ut.mnth_cnt(BikeShare)

    ## Shift the order of the hour variable so that it is smoothly
    ## "humped" over 24 hours.
    hr = BikeShare.hr.values
    BikeShare['xformHr'] = np.where(hr > 4, hr - 5, hr + 19)

    ## Add a variable with unique values for time of day for working
    ## and non-working days.
    isWorking = BikeShare['isWorking'].values
    BikeShare['xformWorkHr'] = np.where(isWorking,
                                        BikeShare.xformHr,
                                        BikeShare.xformHr + 24.0)

    BikeShare['dayCount'] = pd.Series(range(BikeShare.shape[0]))/24


    return BikeShare

--------------------------------------------------------------------------------
/R files/evaluate.R:
--------------------------------------------------------------------------------
## This code will produce various measures of model
## performance using the actual and predicted values
## from the Bike rental data.
## This code is intended to run in an Azure ML
## Execute R Script module. By changing the Azure
## variable to false you can run in R or
## RStudio.
Azure <- FALSE

if(Azure){
  ## Source the zipped utility file
  source("src/utilities.R")
  ## Read in the dataset if in Azure ML.
  ## The second and third line are for test in RStudio
  ## and should be commented out if running in Azure ML.
  inFrame <- maml.mapInputPort(1)
  refFrame <- maml.mapInputPort(2)
  refFrame$dteday <- set.asPOSIXct2(refFrame)
}else{
  inFrame <- scores
  refFrame <- BikeShare
}

## Another data frame is created from the data produced
## by the Azure Split module. The columns we need are
## added to inFrame
inFrame[, c("cnt", "dteday", "monthCount", "hr", "workTime")] <-
  refFrame[, c("cnt", "dteday", "monthCount", "hr", "workTime")]

## Assign names to the data frame for reference
names(inFrame) <- c("predicted", "cnt", "dteday",
                    "monthCount", "hr", "workTime")

## Since the sampling process randomized the order of
## the rows sort the data by the Time object.
inFrame <- inFrame[order(inFrame$dteday),]

## Time series plots showing actual and predicted values; columns 3 and 4.
library(ggplot2)
times <- c(7, 9, 12, 15, 18, 20, 22)

lapply(times, function(times){
  ggplot() +
    geom_line(data = inFrame[inFrame$hr == times, ],
              aes(x = dteday, y = cnt)) +
    geom_line(data = inFrame[inFrame$hr == times, ],
              aes(x = dteday, y = predicted), color = "red") +
    ylab("Log number of bikes") +
    labs(title = paste("Bike demand at ",
                       as.character(times), ":00", sep ="")) +
    theme(text = element_text(size=20))
})

## Compute the residuals
library(dplyr)
inFrame <- mutate(inFrame, resids = predicted - cnt)

## Plot the residuals. First a histogram and
## a qq plot of the residuals.
ggplot(inFrame, aes(x = resids)) +
  geom_histogram(binwidth = 1, fill = "white", color = "black")

qqnorm(inFrame$resids)
qqline(inFrame$resids)

## Plot the residuals by hour and transformed work hour.
inFrame <- mutate(inFrame, fact.hr = as.factor(hr),
                  fact.workTime = as.factor(workTime))
facts <- c("fact.hr", "fact.workTime")
lapply(facts, function(x){
  ggplot(inFrame, aes_string(x = x, y = "resids")) +
    geom_boxplot( ) +
    ggtitle("Residual of actual versus predicted bike demand by hour")})

--------------------------------------------------------------------------------
/R files/transform.R:
--------------------------------------------------------------------------------
## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## If we are in Azure, source the utilities from the zip
## file. The next lines of code read in the dataset, either
## in Azure ML or from a csv file for testing purposes.
if(Azure){
  source("src/utilities.R")
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- set.asPOSIXct(BikeShare)
}else{
  BikeShare <- read.csv("C:\\Users\\Steve\\GIT\\Quantia-Analytics\\AzureML-Regression-Example\\R files\\BikeSharing.csv",
                        sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)

  ## Select the columns we need
  cols <- c("dteday", "mnth", "hr", "holiday",
            "workingday", "weathersit", "temp",
            "hum", "windspeed", "cnt")
  BikeShare <- BikeShare[, cols]

  ## Transform the date-time object
  BikeShare$dteday <- char.toPOSIXct(BikeShare)

  ## Normalize the numeric predictors
  cols <- c("temp", "hum", "windspeed")
  BikeShare[, cols] <- scale(BikeShare[, cols])
}

## Create a new variable to indicate workday
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                                !BikeShare$holiday, 1, 0)

## Add a column of the count of months which could
## help model trend.
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                                        levels = c("Monday",
                                                   "Tuesday",
                                                   "Wednesday",
                                                   "Thursday",
                                                   "Friday",
                                                   "Saturday",
                                                   "Sunday")))

## Add a variable with unique values for time of day for working and non-working days.
BikeShare$workTime <- ifelse(BikeShare$isWorking,
                             BikeShare$hr,
                             BikeShare$hr + 24)

## Shift the order of the hour variable so that it is smoothly
## "humped" over 24 hours.
BikeShare$xformHr <- ifelse(BikeShare$hr > 4,
                            BikeShare$hr - 5,
                            BikeShare$hr + 19)

## Add a variable with unique values for the transformed time of day
## for working and non-working days.
BikeShare$xformWorkHr <- ifelse(BikeShare$isWorking,
                                BikeShare$xformHr,
                                BikeShare$xformHr + 24)

## Output the transformed data frame if in Azure ML.
if(Azure) maml.mapOutputPort('BikeShare')

--------------------------------------------------------------------------------
/R files/transform2.R:
--------------------------------------------------------------------------------
## This file contains the code for the transformation
## of the raw bike rental data. It is intended to run in an
## Azure ML Execute R Script module. By changing
## the following variable to false the code will run
## in R or RStudio.
Azure <- FALSE

## The next lines of code read in the dataset, either
## in Azure ML or from a csv file for testing purposes.
if(Azure){
  BikeShare <- maml.mapInputPort(1)
  BikeShare$dteday <- as.POSIXct(as.integer(BikeShare$dteday),
                                 origin = "1970-01-01")
}else{
  BikeShare <- read.csv("BikeSharing.csv", sep = ",",
                        header = TRUE, stringsAsFactors = FALSE)
  BikeShare$dteday <- as.POSIXct(strptime(
    paste(BikeShare$dteday, " ",
          "00:00:00",
          sep = ""),
    "%Y-%m-%d %H:%M:%S"))
}

## Select the columns we need
cols <- c("dteday", "mnth", "hr", "holiday",
          "workingday", "weathersit", "temp",
          "hum", "windspeed", "casual",
          "registered", "cnt")
BikeShare <- BikeShare[, cols]

## Normalize the numeric predictors
cols <- c("temp", "hum", "windspeed")
BikeShare[, cols] <- scale(BikeShare[, cols])

## Take the log of response variables. First we
## must ensure there are no zero values. The difference
## between 0 and 1 is inconsequential.
## The logs are assigned back column by column so that
## the rest of the data frame is preserved.
cols <- c("casual", "registered", "cnt")
BikeShare[, cols] <- lapply(cols, function(col) var.log(BikeShare, col))

## Create a new variable to indicate workday
BikeShare$isWorking <- ifelse(BikeShare$workingday &
                                !BikeShare$holiday, 1, 0)

## Add a column of the count of months which could
## help model trend.
BikeShare <- month.count(BikeShare)

## Create an ordered factor for the day of the week
## starting with Monday. Note this factor is then
## converted to an "ordered" numerical value to be
## compatible with Azure ML table data types.
BikeShare$dayWeek <- as.factor(weekdays(BikeShare$dteday))
BikeShare$dayWeek <- as.numeric(ordered(BikeShare$dayWeek,
                                        levels = c("Monday",
                                                   "Tuesday",
                                                   "Wednesday",
                                                   "Thursday",
                                                   "Friday",
                                                   "Saturday",
                                                   "Sunday")))

## Add a variable with unique values for time of day for working and non-working days.
BikeShare$workTime <- ifelse(BikeShare$isWorking,
                             BikeShare$hr,
                             BikeShare$hr + 24)

## Shift the order of the hour variable so that it is smoothly
## "humped" over 24 hours.
BikeShare$xformHr <- ifelse(BikeShare$hr > 4,
                            BikeShare$hr - 5,
                            BikeShare$hr + 19)

## Output the transformed data frame if in Azure ML.
if(Azure) maml.mapOutputPort('BikeShare')

--------------------------------------------------------------------------------
/Python files/visualizeresids.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 10 17:28:01 2015

@author: Steve Elston

Code for visualization of the residuals (errors) of the
regression model.
"""

def azureml_main(BikeShare):
    import matplotlib
    matplotlib.use('agg')  # Set backend
    matplotlib.rcParams.update({'font.size': 20})

    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    Azure = False

    ## Sort the data frame based on the dayCount.
    ## (sort_values replaces the removed DataFrame.sort method.)
    BikeShare = BikeShare.sort_values('dayCount')

    ## Compute the residuals.
    BikeShare['Resids'] = BikeShare['Scored Label Mean'] - BikeShare['cnt']

    ## Plot the residuals vs the label, the count of rented bikes.
    fig = plt.figure(figsize=(8, 6))
    fig.clf()
    ax = fig.gca()
    ## Plot the residuals.
    BikeShare.plot(kind = 'scatter', x = 'cnt', y = 'Resids',
                   alpha = 0.05, color = 'red', ax = ax)
    plt.xlabel("Bike demand")
    plt.ylabel("Residual")
    plt.title("Residuals vs demand")
    plt.show()
    if Azure: fig.savefig('scatter1.png')


    ## Make time series plots of actual bike demand and
    ## predicted demand by times of the day.
43 | times = [7, 9, 12, 15, 18, 20, 22] 44 | for tm in times: 45 | fig = plt.figure(figsize=(8, 6)) 46 | fig.clf() 47 | ax = fig.gca() 48 | BikeShare[BikeShare.hr == tm].plot(kind = 'line', 49 | x = 'dayCount', y = 'cnt', 50 | ax = ax) 51 | BikeShare[BikeShare.hr == tm].plot(kind = 'line', 52 | x = 'dayCount', y = 'Scored Label Mean', 53 | color = 'red', ax = ax) 54 | plt.xlabel("Days from start of plot") 55 | plt.ylabel("Count of bikes rented") 56 | plt.title("Bikes rented by days for hour = " + str(tm)) 57 | plt.show() 58 | if(Azure == True): fig.savefig('tsplot' + str(tm) + '.png') 59 | 60 | ## Box plots of the residuals by hour and transformed hour. 61 | labels = ["Box plots of residuals by hour of the day \n\n", 62 | "Box plots of residuals by transformed hour of the day \n\n"] 63 | xAxes = ["hr", "xformWorkHr"] 64 | for lab, xaxs in zip(labels, xAxes): 65 | fig = plt.figure(figsize=(12, 6)) 66 | fig.clf() 67 | ax = fig.gca() 68 | BikeShare.boxplot(column = ['Resids'], by = [xaxs], ax = ax) 69 | plt.xlabel('') 70 | plt.ylabel('Residuals') 71 | plt.show() 72 | if(Azure == True): fig.savefig('boxplot' + xaxs + '.png') 73 | 74 | ## QQ Normal plot of residuals 75 | fig = plt.figure(figsize = (6,6)) 76 | fig.clf() 77 | ax = fig.gca() 78 | sm.qqplot(BikeShare['Resids'], ax = ax) 79 | ax.set_title('QQ Normal plot of residuals') 80 | if(Azure == True): fig.savefig('QQ.png') 81 | if(Azure == True): fig.savefig('QQ1.png') 82 | 83 | ## Histograms of the residuals 84 | fig = plt.figure(figsize = (8,6)) 85 | fig.clf() 86 | fig.clf() 87 | ax = fig.gca() 88 | ax.hist(BikeShare['Resids'].as_matrix(), bins = 40) 89 | ax.set_xlabel("Residuals") 90 | ax.set_ylabel("Density") 91 | ax.set_title("Histogram of residuals") 92 | if(Azure == True): fig.savefig('hist.png') 93 | 94 | return BikeShare 95 | 96 | 97 | -------------------------------------------------------------------------------- /R files/visualize-prelim.R:
-------------------------------------------------------------------------------- 1 | ## This code will create a series of data visualizations 2 | ## to explore the bike rental dataset. This code is 3 | ## intended to run in an Azure ML Execute R 4 | ## Script module. By changing the following variable 5 | ## you can run the code in R or RStudio for testing. 6 | Azure <- FALSE 7 | 8 | if(Azure){ 9 | ## Source the zipped utility file 10 | source("src/utilities.R") 11 | ## Read in the dataset. 12 | BikeShare <- maml.mapInputPort(1) 13 | BikeShare$dteday <- set.asPOSIXct2(BikeShare) 14 | } 15 | 16 | 17 | ## Look at the correlation between the predictors and 18 | ## between predictors and bike demand. Use a linear 19 | ## time series regression to detrend the demand. 20 | BikeShare$count <- BikeShare$cnt - predict( 21 | lm(cnt ~ dteday, data = BikeShare), newdata = BikeShare) 22 | 23 | cols <- c("mnth", "hr", "holiday", "workingday", 24 | "weathersit", "temp", "hum", "windspeed", 25 | "isWorking", "monthCount", "dayWeek", 26 | "workTime", "xformHr", "count") 27 | methods <- c("pearson", "spearman") #, "kendal") 28 | 29 | cors <- lapply( methods, function(method) 30 | (cor(BikeShare[, cols], method = method))) 31 | 32 | require(lattice) 33 | plot.cors <- function(x, labs){ 34 | diag(x) <- 0.0 35 | plot( levelplot(x, 36 | main = paste("Correlation plot for", labs, "method"), 37 | scales=list(x=list(rot=90), cex=1.0)) ) 38 | } 39 | 40 | Map(plot.cors, cors, methods) 41 | 42 | ## Make time series plots for certain hours of 43 | ## working and non-working days 44 | times <- c(7, 7+24,9, 9+24, 12, 12+24, 15, 15+24, 18, 18+24, 20, 20+24, 22, 22+24) 45 | 46 | tms.plot <- function(times){ 47 | ggplot(BikeShare[BikeShare$workTime == times, ], 48 | aes(x = dteday, y = cnt)) + 49 | geom_line() + 50 | ylab("Number of bikes") + 51 | labs(title = paste("Bike demand at ", 52 | as.character(times), ":00", sep ="")) + 53 | theme(text = element_text(size=20)) 54 | } 55 | require(ggplot2) 56 |
lapply(times, tms.plot) 57 | 58 | ## Convert dayWeek back to an ordered factor so the plot is in 59 | ## time order. 60 | BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek) 61 | 62 | ## This code gives a first look at the predictor values vs the demand for bikes. 63 | labels <- list("Box plots of hourly bike demand", 64 | "Box plots of transformed hourly bike demand", 65 | "Box plots of demand by workTime", 66 | "Box plots of demand by xformWorkHr", 67 | "Box plots of monthly bike demand", 68 | "Box plots of bike demand by weather factor", 69 | "Box plots of bike demand by working day", 70 | "Box plots of bike demand by day of the week") 71 | xAxis <- list("hr", "xformHr", "workTime", "xformWorkHr", 72 | "mnth", "weathersit", "isWorking", "dayWeek") 73 | 74 | plot.boxes <- function(X, label){ 75 | ggplot(BikeShare, aes_string(x = X, 76 | y = "cnt", 77 | group = X)) + 78 | geom_boxplot( ) + ggtitle(label) + 79 | theme(text = element_text(size=18)) } 80 | 81 | Map(plot.boxes, xAxis, labels) 82 | 83 | ## Look at the relationship between predictors and bike demand 84 | labels <- c("Bike demand vs temperature", 85 | "Bike demand vs humidity", 86 | "Bike demand vs windspeed", 87 | "Bike demand vs hr", 88 | "Bike demand vs xformHr", 89 | "Bike demand vs xformWorkHr") 90 | xAxis <- c("temp", "hum", "windspeed", "hr", "xformHr", "xformWorkHr") 91 | 92 | plot.scatter <- function(X, label){ 93 | ggplot(BikeShare, aes_string(x = X, y = "cnt")) + 94 | geom_point(aes_string(colour = "cnt"), alpha = 0.1) + 95 | scale_colour_gradient(low = "green", high = "blue") + 96 | geom_smooth(method = "loess") + 97 | ggtitle(label) + 98 | theme(text = element_text(size=20)) } 99 | 100 | Map(plot.scatter, xAxis, labels) 101 | 102 | 103 | ## Explore the interaction between time of day 104 | ## and working or non-working days. 
105 | labels <- list("Box plots of bike demand at 0900 for \n working and non-working days", 106 | "Box plots of bike demand at 1800 for \n working and non-working days") 107 | Times <- list(8, 17) 108 | 109 | plot.box2 <- function(time, label){ 110 | ggplot(BikeShare[BikeShare$hr == time, ], 111 | aes(x = isWorking, y = cnt, group = isWorking)) + 112 | geom_boxplot( ) + ggtitle(label) + 113 | theme(text = element_text(size=18)) } 114 | 115 | Map(plot.box2, Times, labels) 116 | 117 | -------------------------------------------------------------------------------- /R files/visualize.R: -------------------------------------------------------------------------------- 1 | ## This code will create a series of data visualizations 2 | ## to explore the bike rental dataset. This code is 3 | ## intended to run in an Azure ML Execute R 4 | ## Script module. By changing the following variable 5 | ## you can run the code in R or RStudio for testing. 6 | Azure <- TRUE 7 | 8 | if(Azure){ 9 | ## Source the zipped utility file 10 | source("src/utilities.R") 11 | ## Read in the dataset. 12 | BikeShare <- maml.mapInputPort(1) 13 | BikeShare$dteday <- set.asPOSIXct2(BikeShare) 14 | } 15 | 16 | 17 | ## Look at the correlation between the predictors and 18 | ## between predictors and bike demand. Use a linear 19 | ## time series regression to detrend the demand.
20 | Time <- BikeShare$dteday 21 | BikeShare$count <- BikeShare$cnt - fitted( 22 | lm(BikeShare$cnt ~ Time, data = BikeShare)) 23 | cor.BikeShare.all <- cor(BikeShare[, c("mnth", 24 | "hr", 25 | "weathersit", 26 | "temp", 27 | "hum", 28 | "windspeed", 29 | "isWorking", 30 | "monthCount", 31 | "dayWeek", 32 | "count")]) 33 | 34 | diag(cor.BikeShare.all) <- 0.0 35 | cor.BikeShare.all 36 | require(lattice) 37 | plot( levelplot(cor.BikeShare.all, 38 | main ="Correlation matrix for all bike users", 39 | scales=list(x=list(rot=90), cex=1.0)) ) 40 | 41 | ## Make time series plots for certain hours of the day 42 | require(ggplot2) 43 | times <- c(7, 9, 12, 15, 18, 20, 22) 44 | # BikeShare$Time <- Time 45 | lapply(times, function(times){ 46 | ggplot(BikeShare[BikeShare$hr == times, ], 47 | aes(x = dteday, y = cnt)) + 48 | geom_line() + 49 | ylab("Log number of bikes") + 50 | labs(title = paste("Bike demand at ", 51 | as.character(times), ":00", sep ="")) + 52 | theme(text = element_text(size=20)) 53 | }) 54 | 55 | ## Convert dayWeek back to an ordered factor so the plot is in 56 | ## time order. 57 | BikeShare$dayWeek <- fact.conv(BikeShare$dayWeek) 58 | 59 | ## This code gives a first look at the predictor values vs the demand for bikes. 60 | labels <- list("Box plots of hourly bike demand", 61 | "Box plots of working hour bike demand", 62 | "Box plots of monthly bike demand", 63 | "Box plots of bike demand by weather factor", 64 | "Box plots of bike demand by workday vs.
holiday", 65 | "Box plots of bike demand by day of the week") 66 | xAxis <- list("hr", "xformWorkHr","mnth", "weathersit", 67 | "isWorking", "dayWeek") 68 | Map(function(X, label){ 69 | ggplot(BikeShare, aes_string(x = X, 70 | y = "cnt", 71 | group = X)) + 72 | geom_boxplot( ) + ggtitle(label) + 73 | theme(text = 74 | element_text(size=18)) }, 75 | xAxis, labels) 76 | 77 | ## Look at the relationship between predictors and bike demand 78 | labels <- c("Bike demand vs temperature", 79 | "Bike demand vs humidity", 80 | "Bike demand vs windspeed", 81 | "Bike demand vs hr", 82 | "Bike demand vs xformHr", 83 | "Bike demand vs xformWorkHr") 84 | xAxis <- c("temp", "hum", "windspeed", "hr", 85 | "xformHr", "xformWorkHr") 86 | Map(function(X, label){ 87 | ggplot(BikeShare, aes_string(x = X, y = "cnt")) + 88 | geom_point(aes_string(colour = "cnt"), alpha = 0.1) + 89 | scale_colour_gradient(low = "green", high = "blue") + 90 | geom_smooth(method = "loess") + 91 | ggtitle(label) + 92 | theme(text = element_text(size=20)) }, 93 | xAxis, labels) 94 | 95 | 96 | ## Explore the interaction between time of day 97 | ## and working or non-working days. 98 | labels <- list("Box plots of bike demand at 0900 for \n working and non-working days", 99 | "Box plots of bike demand at 1800 for \n working and non-working days") 100 | Times <- list(8, 17) 101 | Map(function(time, label){ 102 | ggplot(BikeShare[BikeShare$hr == time, ], 103 | aes(x = isWorking, y = cnt, group = isWorking)) + 104 | geom_boxplot( ) + ggtitle(label) + 105 | theme(text = element_text(size=18)) }, 106 | Times, labels) 107 | 108 | ## Explore the interaction between time of day 109 | ## and working or non-working days. 
110 | labels <- list("Box plots of bike demand at 0900 for \n working and non-working days", 111 | "Box plots of bike demand at 1800 for \n working and non-working days") 112 | Times <- list(8, 17) 113 | 114 | plot.box2 <- function(time, label){ 115 | ggplot(BikeShare[BikeShare$hr == time, ], 116 | aes(x = isWorking, y = cnt, group = isWorking)) + 117 | geom_boxplot( ) + ggtitle(label) + 118 | theme(text = element_text(size=18)) } 119 | 120 | Map(plot.box2, Times, labels) 121 | -------------------------------------------------------------------------------- /Python files/visualize.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Sep 11 18:49:43 2015 4 | 5 | @author: Steve 6 | """ 7 | 8 | def set_day(df): 9 | ''' 10 | This function assigns day names to each of the 11 | rows in the data set. The function needs to account 12 | for the fact that some days are missing and there 13 | may be some missing hours as well. 
14 | ''' 15 | ## Assumes the first day of the data set is Saturday 16 | days = ["Sat", "Sun", "Mon", "Tue", "Wed", 17 | "Thr", "Fri"] 18 | temp = ['d']*df.shape[0] 19 | i = 0 20 | indx = 0 21 | cur_day = df.dteday[0] 22 | for day in df.dteday: 23 | if(cur_day != day): 24 | cur_day = day 25 | if(i == 6): i = 0 26 | else: i += 1 27 | temp[indx] = days[i] 28 | indx += 1 29 | df['dayWeek'] = temp 30 | return df 31 | 32 | 33 | def azureml_main(BikeShare): 34 | import matplotlib 35 | matplotlib.use('agg') # Set backend 36 | matplotlib.rcParams.update({'font.size': 20}) 37 | 38 | from sklearn import preprocessing 39 | from sklearn import linear_model 40 | import numpy as np 41 | import matplotlib.pyplot as plt 42 | import statsmodels.graphics.correlation as pltcor 43 | import statsmodels.nonparametric.smoothers_lowess as lw 44 | 45 | Azure = False 46 | 47 | ## Sort the data frame based on the dayCount 48 | BikeShare.sort('dayCount', axis = 0, inplace = True) 49 | 50 | ## De-trend the bike demand with time. 51 | nrow = BikeShare.shape[0] 52 | X = BikeShare.dayCount.as_matrix().reshape((nrow,1)) 53 | Y = BikeShare.cnt.as_matrix() 54 | ## Compute the linear model. 55 | clf = linear_model.LinearRegression() 56 | bike_lm = clf.fit(X, Y) 57 | ## Remove the trend 58 | BikeShare.cnt = BikeShare.cnt - bike_lm.predict(X) 59 | 60 | ## Compute the correlation matrix and set the diagonal 61 | ## elements to 0. 62 | arry = BikeShare.drop('dteday', axis = 1).as_matrix() 63 | arry = preprocessing.scale(arry, axis = 1) 64 | corrs = np.corrcoef(arry, rowvar = 0) 65 | np.fill_diagonal(corrs, 0) 66 | 67 | col_nms = list(BikeShare)[1:] 68 | fig = plt.figure(figsize = (9,9)) 69 | ax = fig.gca() 70 | pltcor.plot_corr(corrs, xnames = col_nms, ax = ax) 71 | plt.show() 72 | if(Azure == True): fig.savefig('cor1.png') 73 | 74 | ## Compute and plot the correlation matrix with 75 | ## a smaller subset of columns. 
76 | cols = ['yr', 'mnth', 'isWorking', 'xformWorkHr', 'dayCount', 77 | 'temp', 'hum', 'windspeed', 'cnt'] 78 | arry = BikeShare[cols].as_matrix() 79 | arry = preprocessing.scale(arry, axis = 1) 80 | corrs = np.corrcoef(arry, rowvar = 0) 81 | np.fill_diagonal(corrs, 0) 82 | 83 | fig = plt.figure(figsize = (9,9)) 84 | ax = fig.gca() 85 | pltcor.plot_corr(corrs, xnames = cols, ax = ax) 86 | plt.show() 87 | if(Azure == True): fig.savefig('cor2.png') 88 | 89 | 90 | ## Make time series plots of bike demand by times of the day. 91 | times = [7, 9, 12, 15, 18, 20, 22] 92 | for tm in times: 93 | fig = plt.figure(figsize=(8, 6)) 94 | fig.clf() 95 | ax = fig.gca() 96 | BikeShare[BikeShare.hr == tm].plot(kind = 'line', 97 | x = 'dayCount', y = 'cnt', 98 | ax = ax) 99 | plt.xlabel("Days from start of plot") 100 | plt.ylabel("Count of bikes rented") 101 | plt.title("Bikes rented by days for hour = " + str(tm)) 102 | plt.show() 103 | if(Azure == True): fig.savefig('tsplot' + str(tm) + '.png') 104 | 105 | ## Box plots of the predictor values vs the demand for bikes. 106 | BikeShare = set_day(BikeShare) 107 | labels = ["Box plots of hourly bike demand", 108 | "Box plots of monthly bike demand", 109 | "Box plots of bike demand by weather factor", 110 | "Box plots of bike demand by workday vs. holiday", 111 | "Box plots of bike demand by day of the week", 112 | "Box plots by transformed work hour of the day"] 113 | xAxes = ["hr", "mnth", "weathersit", 114 | "isWorking", "dayWeek", "xformWorkHr"] 115 | for lab, xaxs in zip(labels, xAxes): 116 | fig = plt.figure(figsize=(10, 6)) 117 | fig.clf() 118 | ax = fig.gca() 119 | BikeShare.boxplot(column = ['cnt'], by = [xaxs], ax = ax) 120 | plt.xlabel('') 121 | plt.ylabel('Number of bikes') 122 | plt.show() 123 | if(Azure == True): fig.savefig('boxplot' + xaxs + '.png') 124 | 125 | ## Make scatter plots of bike demand vs. various features.
126 | 127 | labels = ["Bike demand vs temperature", 128 | "Bike demand vs humidity", 129 | "Bike demand vs windspeed", 130 | "Bike demand vs hr", 131 | "Bike demand vs xformHr", 132 | "Bike demand vs xformWorkHr"] 133 | xAxes = ["temp", "hum", "windspeed", "hr", 134 | "xformHr", "xformWorkHr"] 135 | for lab, xaxs in zip(labels, xAxes): 136 | ## first compute a lowess fit to the data 137 | los = lw.lowess(BikeShare['cnt'], BikeShare[xaxs], frac = 0.2) 138 | 139 | ## Now make the plots 140 | fig = plt.figure(figsize=(8, 6)) 141 | fig.clf() 142 | ax = fig.gca() 143 | BikeShare.plot(kind = 'scatter', x = xaxs, y = 'cnt', ax = ax, alpha = 0.05) 144 | plt.plot(los[:, 0], los[:, 1], axes = ax, color = 'red') 145 | plt.show() 146 | if(Azure == True): fig.savefig('scatterplot' + xaxs + '.png') 147 | 148 | ## Explore bike demand for certain times on working and nonworking days 149 | labels = ["Boxplots of bike demand at 0900 \n\n", 150 | "Boxplots of bike demand at 1800 \n\n"] 151 | times = [8, 17] 152 | for lab, tms in zip(labels, times): 153 | temp = BikeShare[BikeShare.hr == tms] 154 | fig = plt.figure(figsize=(8, 6)) 155 | fig.clf() 156 | ax = fig.gca() 157 | temp.boxplot(column = ['cnt'], by = ['isWorking'], ax = ax) 158 | plt.xlabel('') 159 | plt.ylabel('Number of bikes') 160 | plt.title(lab) 161 | plt.show() 162 | if(Azure == True): fig.savefig('timeplot' + str(tms) + '.png') 163 | 164 | return BikeShare 165 | 166 | 167 | -------------------------------------------------------------------------------- /R files/LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 2, June 1991 3 | 4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc., 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 
8 | 9 | Preamble 10 | 11 | The licenses for most software are designed to take away your 12 | freedom to share and change it. By contrast, the GNU General Public 13 | License is intended to guarantee your freedom to share and change free 14 | software--to make sure the software is free for all its users. This 15 | General Public License applies to most of the Free Software 16 | Foundation's software and to any other program whose authors commit to 17 | using it. (Some other Free Software Foundation software is covered by 18 | the GNU Lesser General Public License instead.) You can apply it to 19 | your programs, too. 20 | 21 | When we speak of free software, we are referring to freedom, not 22 | price. Our General Public Licenses are designed to make sure that you 23 | have the freedom to distribute copies of free software (and charge for 24 | this service if you wish), that you receive source code or can get it 25 | if you want it, that you can change the software or use pieces of it 26 | in new free programs; and that you know you can do these things. 27 | 28 | To protect your rights, we need to make restrictions that forbid 29 | anyone to deny you these rights or to ask you to surrender the rights. 30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 
42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. 
The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 
102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. 
You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 
165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. 
If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. 
If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 
292 | 293 | {description} 294 | Copyright (C) {year} {fullname} 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 
331 | 332 | {signature of Ty Coon}, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | 341 | --------------------------------------------------------------------------------