├── .gitignore ├── LICENSE ├── README.md ├── data └── data_wc2014.xlsx ├── output ├── 2014-06-13_162847_pred.pdf ├── 2014-06-14_113816_pred.pdf ├── 2014-06-15_083259_pred.pdf ├── 2014-06-16_021624_pred.pdf ├── 2014-06-17_160505_pred.pdf ├── 2014-06-18_040416_pred.pdf ├── 2014-06-19_141238_pred.pdf ├── 2014-06-20_063723_pred.pdf ├── 2014-06-21_080007_pred.pdf ├── 2014-06-22_052609_pred.pdf ├── 2014-06-23_to_26_pred.pdf ├── 2014-06-24_065030_pred.pdf ├── 2014-06-25_073743_pred.pdf ├── 2014-06-26_011614_pred.pdf ├── 2014-06-28_111144_pred.pdf ├── 2014-06-29_045202_pred.pdf ├── 2014-06-30_003721_pred.pdf ├── 2014-07-03_232113_pred.pdf ├── 2014-07-05_015503_pred.pdf ├── 2014-07-08_030446_pred.pdf ├── 2014-07-09_020453_pred.pdf ├── 2014-07-12_041544_pred.pdf └── 2014-07-13_023520_pred.pdf ├── predict.R └── wc2014.Rproj /.gitignore: -------------------------------------------------------------------------------- 1 | # History files 2 | .Rhistory 3 | 4 | # Example code in package build process 5 | *-Ex.R 6 | 7 | # R data files from past sessions 8 | .Rdata 9 | .Rproj.user 10 | 11 | .sync.ffs_db 12 | 13 | README.html -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Jo-fai Chow 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ### Introduction 2 | 3 | This repository contains all the data, code and outputs for my World Cup 2014 analysis. It is an experiment to see whether data mining can outperform my friends. So far the data mining approach has the upper hand. It predicted 35/64 (or 54.7%) correct match results (Win/Draw/Lose) and 10/64 (or 15.6%) correct scores. 4 | 5 |
6 | 7 | ### Results Comparison 8 | 9 | Despite the simplicity of the model and lack of in-depth data (e.g. individual players' performance, weather ...), 10 | I think that my model still has a clear competitve edge over my friends. 11 | Especially on predicting correct scores - which is one of the most difficult tasks in sports analytics. 12 | Yet, this comparison is only based on a small sample size of 64 WC matches. 13 | In the future, I am going to improve the model and to further investigate its performance for major football leagues in Europe. 14 | 15 |
16 | ![compare](http://i.imgur.com/CwmQi5E.png) 17 |
18 | 19 | Tasks | Me | Friend A | Friend B | Friend C | Friend D | Friend E 20 | |---|----|----|----|----|----|----| 21 | Correct Results (out of 64)| 35 (54.7%) | 29 (45.3%) | 35 (54.7%) | 25 (39.1%) | 28 (43.8%) | 29 (45.3%) 22 | Correct Scores (out of 64)| 10 (15.6%) | 4 (6.3%) | 6 (9.4%) | 4 (6.3%) | 4 (6.3%) | 2 (3.1%) 23 | 24 | **Source**: A copy of the Google Spreadsheet we used. 25 | 26 |
27 | 28 | ### Data Sources 29 | 30 | 1. http://fivethirtyeight.com/interactives/world-cup/ 31 | 2. http://www.bloomberg.com/visual-data/world-cup/ 32 | 3. https://www.betfair.com/sport 33 | 34 |
35 | 36 | ### Methods 37 | 38 | - The predictors include various information from Bloomberg and FiveThirtyEight such as team performance indicators, probability of Win/Draw/Lose, world ranking as well as latest odds from betfair.com. An Excel spreadsheet in the **data** folder is used for storing the data. 39 | - The predicted results are median values from multiple model predictions. Each model consists of four common regression mini models (Random Forest, SVMs, Cubist and KNN) which are trained with bootstrapped samples and blended for better genearlisation. 40 | - Early on in the tourament, the future match results predicted by Bloomberg had been used as training data as there were not enough actual results. 41 | 42 |
43 | 44 | ### Results 45 | 46 | **Notes**: more detailed results including the distribution of predictions results can be found in the **output** folder. 47 | 48 | Match | Date | Team 1 | Team 2 | Predictions | Results (90 mins) | Correct WDL | Correct Score 49 | ------|------|--------|--------|-------------|---------|--------------|--------------- 50 | 1 | 12/06 | Brazil | Croatia | 4:0 | 3:1 | **Yes** | No 51 | 2 | 13/06 | Mexico | Cameroon | 1:0 | 1:0 | **Yes** | **Yes** 52 | 3 | 13/06 | Spain | Netherlands | 1:0 | 1:5 | No | No 53 | 4 | 13/06 | Chile | Australia | 1:0 | 3:1 | **Yes** | No 54 | 5 | 14/06 | Colombia | Greece | 1:0 | 3:0 | **Yes** | No 55 | 6 | 14/06 | Uruguay | Costa Rica | 2:0 | 1:3 | No | No 56 | 7 | 14/06 | England | Italy | 1:1 | 1:2 | No | No 57 | 8 | 15/06 | Ivory Coast | Japan | 1:1 | 2:1 | No | No 58 | 9 | 15/06 | Switzerland | Ecuador | 1:1 | 2:1 | No | No 59 | 10 | 15/06 | France | Honduras | 2:0 | 3:0 | **Yes** | No 60 | 11 | 15/06 | Argentina | Bosnia | 2:1 | 2:1 | **Yes** | **Yes** 61 | 12 | 16/06 | Germany | Portugal | 1:1 | 4:0 | No | No 62 | 13 | 16/06 | Iran | Nigeria | 1:1 | 0:0 | **Yes** | No 63 | 14 | 16/06 | Ghana | USA | 1:1 | 1:2 | No | No 64 | 15 | 17/06 | Belgium | Algeria | 2:1 | 2:1 | **Yes** | **Yes** 65 | 16 | 17/06 | Brazil | Mexico | 3:1 | 0:0 | No | No 66 | 17 | 17/06 | Russia | S Korea | 1:1 | 1:1 | **Yes** | **Yes** 67 | 18 | 18/06 | Australia | Netherlands | 1:3 | 2:3 | **Yes** | No 68 | 19 | 18/06 | Spain | Chile | 2:3 | 0:2 | **Yes** | No 69 | 20 | 18/06 | Cameroon | Croatia | 1:1 | 0:4 | No | No 70 | 21 | 19/06 | Colombia | Ivory Coast | 2:1 | 2:1 | **Yes** | **Yes** 71 | 22 | 19/06 | Uruguay | England | 2:2 | 2:1 | No | No 72 | 23 | 19/06 | Japan | Greece | 0:0 | 1:1 | **Yes** | No 73 | 24 | 20/06 | Italy | Costa Rica | 2:1 | 0:1 | No | No 74 | 25 | 20/06 | Switzerland | France | 1:2 | 2:5 | **Yes** | No 75 | 26 | 20/06 | Honduras | Ecuador | 1:2 | 1:2 | **Yes** | **Yes** 76 | 27 | 21/06 | Argentina | Iran | 2:1 | 1:0 | **Yes** | No 77 | 28 | 21/06 | Germany | Ghana | 3:1 | 2:2 | No | No 78 | 29 | 21/06 | Nigeria | Bosnia | 1:2 | 1:0 | No | No 79 | 30 | 22/06 | Belgium | Russia | 2:1 | 1:0 | **Yes** | No 80 | 31 | 22/06 | S Korea | Algeria | 1:1 | 2:4 | No | No 81 | 32 | 22/06 | USA | Portugal | 2:2 | 2:2 | **Yes** | **Yes** 82 | 33 | 23/06 | Netherlands | Chile | 2:2 | 2:0 | No | No 83 | 34 | 23/06 | Australia | Spain | 1:2 | 0:3 | **Yes** | No 84 | 35 | 23/06 | Cameroon | Brazil | 1:2 | 1:4 | **Yes** | No 85 | 36 | 23/06 | Croatia | Mexico | 1:1 | 1:3 | No | No 86 | 37 | 24/06 | Costa Rica | England | 2:2 | 0:0 | **Yes** | No 87 | 38 | 24/06 | Italy | Uruguay | 1:1 | 0:1 | No | No 88 | 39 | 24/06 | Japan | Colombia | 1:2 | 1:4 | **Yes** | No 89 | 40 | 24/06 | Greece | Ivory Coast | 1:1 | 2:1 | No | No 90 | 41 | 25/06 | Nigeria | Argentina | 1:2 | 2:3 | **Yes** | No 91 | 42 | 25/06 | Bosnia | Iran | 1:1 | 3:1 | No | No 92 | 43 | 25/06 | Ecuador | France | 1:2 | 0:0 | No | No 93 | 44 | 25/06 | Honduras | Switzerland | 1:2 | 0:3 | **Yes** | No 94 | 45 | 26/06 | Portugal | Ghana | 2:1 | 2:1 | **Yes** | **Yes** 95 | 46 | 26/06 | USA | Germany | 1:2 | 0:1 | **Yes** | No 96 | 47 | 26/06 | Algeria | Russia | 1:2 | 1:1 | No | No 97 | 48 | 26/06 | S Korea | Belgium | 1:2 | 0:1 | **Yes** | No 98 | 49 | 28/06 | Brazil | Chile | 2:2 | 1:1 | **Yes** | No 99 | 50 | 28/06 | Colombia | Uruguay | 1:1 | 2:0 | No | No 100 | 51 | 29/06 | Netherlands | Mexico | 2:1 | 2:1 | **Yes** | **Yes** 101 | 52 | 29/06 | Costa Rica | Greece | 1:1 | 1:1 | **Yes** | **Yes** 102 | 53 | 30/06 | France | Nigeria | 2:1 | 2:0 | **Yes** | No 103 | 54 | 30/06 | Germany | Algeria | 2:1 | 0:0 | No | No 104 | 55 | 01/07 | Argentina | Switzerland | 3:1 | 0:0 | No | No 105 | 56 | 01/07 | Belgium | USA | 2:1 | 0:0 | No | No 106 | 57 | 04/07 | France | Germany | 1:2 | 0:1 | **Yes** | No 107 | 58 | 04/07 | Brazil | Colombia | 1:1 | 2:1 | No | No 108 | 59 | 05/07 | Argentina | Belgium | 2:1 | 1:0 | **Yes** | No 109 | 60 | 05/07 | Netherlands | Costa Rica | 1:1 | 0:0 | **Yes** | No 110 | 61 | 08/07 | Brazil | Germany | 1:1 | 1:7 | No | No 111 | 62 | 09/07 | Netherlands | Argentina | 1:1 | 0:0 | **Yes** | No 112 | 63 | 12/07 | Brazil | Netherlands | 1:1 | 0:3 | No | No 113 | 64 | 13/07 | Germany | Argentina | 2:1 | 0:0 | No | No 114 | **Summary** | - | - | - | - | - | **35/64** | **10/64** 115 | **Accuracy** | - | - | - | - | - | **54.7%** | **15.6%** 116 | 117 |
118 | 119 | ### Comments 120 | 121 | Match(es) | Comments 122 | ----------|------------- 123 | 1 | Pure guess only. 124 | 2-11 | Bloomberg future predictions used as training data. 125 | 12 | Dropped Bloomberg future predictions. Started using actual results only. 126 | 33 - 48 | Predictions all made on 23/6 as family holiday began. 127 | 49 - 52 | Predictions all made on 28/6 as I travelled to LA for useR! conference. 128 | 129 |
130 | -------------------------------------------------------------------------------- /data/data_wc2014.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/data/data_wc2014.xlsx -------------------------------------------------------------------------------- /output/2014-06-13_162847_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-13_162847_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-14_113816_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-14_113816_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-15_083259_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-15_083259_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-16_021624_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-16_021624_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-17_160505_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-17_160505_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-18_040416_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-18_040416_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-19_141238_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-19_141238_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-20_063723_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-20_063723_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-21_080007_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-21_080007_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-22_052609_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-22_052609_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-23_to_26_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-23_to_26_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-24_065030_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-24_065030_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-25_073743_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-25_073743_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-26_011614_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-26_011614_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-28_111144_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-28_111144_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-29_045202_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-29_045202_pred.pdf -------------------------------------------------------------------------------- /output/2014-06-30_003721_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-30_003721_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-03_232113_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-03_232113_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-05_015503_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-05_015503_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-08_030446_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-08_030446_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-09_020453_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-09_020453_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-12_041544_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-12_041544_pred.pdf -------------------------------------------------------------------------------- /output/2014-07-13_023520_pred.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-13_023520_pred.pdf -------------------------------------------------------------------------------- /predict.R: -------------------------------------------------------------------------------- 1 | ## ============================================================================= 2 | ## WC 2014 Analysis and Predictions 3 | ## ============================================================================= 4 | 5 | ## Initiate 6 | rm(list=ls()) 7 | 8 | 9 | ## Core Paramemters 10 | n_core <- 5 11 | n_total <- 500 12 | p_train <- 0.667 13 | 14 | ## Set seed 15 | set.seed(1234) 16 | 17 | 18 | ## ============================================================================= 19 | ## Load Packages 20 | ## ============================================================================= 21 | 22 | suppressMessages(library(xlsx)) 23 | suppressMessages(library(caret)) 24 | suppressMessages(library(reshape2)) 25 | suppressMessages(library(ggplot2)) 26 | suppressMessages(library(grid)) 27 | suppressMessages(library(gridExtra)) 28 | suppressMessages(library(extrafont)) 29 | suppressMessages(library(e1071)) 30 | suppressMessages(library(randomForest)) 31 | suppressMessages(library(MASS)) 32 | suppressMessages(library(Cubist)) 33 | suppressMessages(library(kknn)) 34 | suppressMessages(library(GA)) 35 | suppressMessages(library(bib)) 36 | 37 | ## ============================================================================= 38 | ## Load Data 39 | ## ============================================================================= 40 | 41 | dat <- read.xlsx(file = './data/data_wc2014.xlsx', sheetName = 'Data') 42 | 43 | 44 | ## ============================================================================= 45 | ## Double Up Data (Swap Position) 46 | ## ============================================================================= 47 | 48 | ## Keep a copy 49 | dat_raw <- dat 50 | 51 | ## Extract Train, Predict and Future 52 | dat_train <- dat[which(dat$Type == 'train'),] 53 | dat_predict <- dat[which(dat$Type == 'predict'),] 54 | dat_future <- dat[which(dat$Type == 'future'),] 55 | 56 | ## Swap Home and Away position 57 | # col_swap <- c('Date', 'TEAM_A', 'TEAM_H', 58 | # 'FTE_L', 'FTE_W', 'FTE_D', 59 | # "BLM_W", "BLM_T", "BLM_A" 60 | # 'BF_2', 'BF_X', 'BF_1', 61 | # 'SPI_A', 'OFF_A', 'DEF_A', 62 | # 'SPI_H', 'OFF_H', 'DEF_H', 63 | # 'RES_A', 'RES_H', 'DIFF', 64 | # 'PRED_A', 'PRED_H', 'PRED_DIFF', 65 | # 'Type') 66 | 67 | col_swap <- c("Date", "TEAM_A", "TEAM_H", 68 | "FTE_A", "FTE_H", 69 | "BLM_A", "BLM_H", 70 | "SPI_A", "OFF_A", "DEF_A", 71 | "SPI_H", "OFF_H", "DEF_H", 72 | "BLM_OFF_A", "BLM_DEF_A", "BLM_OVR_A", 73 | "BLM_OFF_H", "BLM_DEF_H", "BLM_OVR_H", 74 | "RNK_A", "APP_A", "BEST_A", 75 | "RNK_H", "APP_H", "BEST_H", 76 | "BF_A", "BF_D", "BF_H", 77 | "RES_A", "RES_H", "RES_DIFF", 78 | "PRED_A", "PRED_H", "PRED_DIFF", 79 | "Type") 80 | 81 | dat_train_swap <- dat_train[, col_swap] 82 | dat_train_swap$RES_DIFF <- dat_train_swap$RES_DIFF * -1 83 | dat_train_swap$PRED_DIFF <- dat_train_swap$PRED_DIFF * -1 84 | colnames(dat_train_swap) <- colnames(dat) 85 | 86 | dat_predict_swap <- dat_predict[, col_swap] 87 | dat_predict_swap$RES_DIFF <- dat_predict_swap$RES_DIFF * -1 88 | dat_predict_swap$PRED_DIFF <- dat_predict_swap$PRED_DIFF * -1 89 | colnames(dat_predict_swap) <- colnames(dat) 90 | 91 | dat_future_swap <- dat_future[, col_swap] 92 | dat_future_swap$RES_DIFF <- dat_future_swap$RES_DIFF * -1 93 | dat_future_swap$PRED_DIFF <- dat_future_swap$PRED_DIFF * -1 94 | colnames(dat_future_swap) <- colnames(dat) 95 | 96 | ## Combine 97 | dat_combine <- rbind(dat_train, 98 | dat_predict, 99 | dat_future, 100 | dat_train_swap, 101 | dat_predict_swap, 102 | dat_future_swap) 103 | 104 | ## ============================================================================= 105 | ## Normalise Data 106 | ## ============================================================================= 107 | 108 | ## Pre-process predictors 109 | pp <- preProcess(dat_combine[,4:28], method = c("center", "scale", "BoxCox")) 110 | dat_combine[, 4:28] <- predict(pp, dat_combine[, 4:28]) 111 | 112 | 113 | ## ============================================================================= 114 | ## Split Train/Test 115 | ## ============================================================================= 116 | 117 | ## Define Training Set Here! 118 | row_train <- which(dat$Type == 'train') 119 | row_predict <- which(dat$Type == 'predict') 120 | row_future <- which(dat$Type == 'future') 121 | 122 | ## Pred 123 | dat_pred <- data.frame(dat_raw[row_predict, 2:3], Match = NA) 124 | for (n in 1:nrow(dat_pred)) { 125 | dat_pred[n, 3] <- paste0(dat_pred[n, 1], "_v_", dat_pred[n, 2]) 126 | } 127 | 128 | 129 | 130 | ## ============================================================================= 131 | ## Prepare for training 132 | ## ============================================================================= 133 | 134 | ## Activate 135 | activate_core(n_core) 136 | 137 | ## Train four models for each random split 138 | train_four <- function(dat_combine, pred_type, p_train) { 139 | 140 | ## Extract 141 | x_train <- dat_combine[which(dat_combine$Type == 'train'), 4:28] 142 | x_test <- dat_combine[which(dat_combine$Type == 'predict'), 4:28] 143 | 144 | if (pred_type == 'Goal') { 145 | y_train <- dat_combine[which(dat_combine$Type == 'train'), 'RES_H'] 146 | } else { 147 | y_train <- dat_combine[which(dat_combine$Type == 'train'), 'RES_DIFF'] 148 | } 149 | 150 | ## Sub-sample 151 | row_use <- createDataPartition(y_train, p = p_train, list = FALSE) 152 | 153 | ## Global variables 154 | ctrl <- trainControl(method = "adaptive_cv", 155 | repeats = 1, 156 | number = 10, 157 | allowParallel = FALSE) 158 | 159 | ## Empty Shell 160 | yy_train <- matrix(NA, ncol = 4, nrow = nrow(x_train)) 161 | yy_test <- matrix(NA, ncol = 4, nrow = nrow(x_test)) 162 | 163 | ## New Stuff Here 164 | model_svm <- svm(x_train[row_use,], y_train[row_use], cost = 10) 165 | yy_train[, 1] <- predict(model_svm, x_train) 166 | yy_test[, 1] <- predict(model_svm, x_test) 167 | 168 | model_rf <- randomForest(x_train[row_use,], y_train[row_use]) 169 | yy_train[, 2] <- predict(model_rf, x_train) 170 | yy_test[, 2] <- predict(model_rf, x_test) 171 | 172 | model_cb <- train(x_train[row_use,], y_train[row_use], method = 'cubist', trControl = ctrl) 173 | yy_train[, 3] <- predict(model_cb, x_train) 174 | yy_test[, 3] <- predict(model_cb, x_test) 175 | 176 | model_knn <- train(x_train[row_use,], y_train[row_use], method = 'kknn', trControl = ctrl) 177 | yy_train[, 4] <- predict(model_knn, x_train) 178 | yy_test[, 4] <- predict(model_knn, x_test) 179 | 180 | ## Evaluate function 181 | eval_yy <- function(wgts) { 182 | 183 | ## Normalise weights 184 | wgts <- wgts / sum(wgts) 185 | 186 | ## Apply weights 187 | yy_wgts <- rowSums(yy_train * wgts) 188 | 189 | ## Return evaluation 190 | return(nse(yy_wgts, y_train)) 191 | 192 | } 193 | 194 | ## Optimise 195 | model_ga <- ga(type = 'real-valued', 196 | fitness = eval_yy, 197 | min = c(0,0,0,0), 198 | max = c(1,1,1,1)) 199 | 200 | ## Get best weights 201 | wgts <- summary(model_ga)$solution / sum(summary(model_ga)$solution) 202 | 203 | ## Apply weights to yy_test 204 | yy_test_final <- as.matrix(rowSums(yy_test * as.numeric(wgts))) 205 | 206 | ## Return 207 | return(yy_test_final) 208 | 209 | } 210 | 211 | 212 | 213 | ## ============================================================================= 214 | ## Train Models for Goals 215 | ## ============================================================================= 216 | 217 | ## Timer 218 | tt <- start_timer() 219 | 220 | ## Display 221 | cat("Now training models for Goals prediction ...") 222 | 223 | ## Parallelised Train 224 | tmp_yy <- foreach(n_model = 1:n_total, 225 | .combine = cbind, 226 | .multicombine = TRUE, 227 | .errorhandling = 'remove', 228 | .packages = c('caret', 'e1071', 'randomForest','bib','GA')) %dopar% 229 | train_four(dat_combine, pred_type = 'Goal', p_train) 230 | 231 | ## Split into Home/Away 232 | tmp_yy <- data.frame(Match = dat_pred$Match, Team = rep(c('Team1','Team2'), each = nrow(dat_predict)), tmp_yy) 233 | yy_HG <- tmp_yy[1:nrow(dat_predict), ] 234 | yy_AG <- tmp_yy[-1:-nrow(dat_predict), ] 235 | 236 | ## Timer 237 | tt <- stop_timer(tt) 238 | 239 | ## Disp 240 | cat(" Done! ... Duration:", round(tt), "seconds.\n") 241 | 242 | 243 | ## ============================================================================= 244 | ## Train Models for Goal Difference 245 | ## ============================================================================= 246 | 247 | ## Timer 248 | tt <- start_timer() 249 | 250 | ## Display 251 | cat("Now training models for Goals Difference Prediction ...") 252 | 253 | ## Parallelised Train 254 | tmp_yy <- foreach(n_model = 1:n_total, 255 | .combine = cbind, 256 | .multicombine = TRUE, 257 | .errorhandling = 'remove', 258 | .packages = c('caret', 'e1071', 'randomForest','bib','GA')) %dopar% 259 | train_four(dat_combine, pred_type = 'Diff', p_train) 260 | 261 | ## Reverse Away Prediction 262 | yy_DF <- data.frame(Match = dat_pred$Match, Team = 'Pred_Diff', tmp_yy) 263 | n_start <- nrow(yy_DF)/2 + 1 264 | n_end <- nrow(yy_DF) 265 | yy_DF[n_start:n_end, -1:-2] <- yy_DF[n_start:n_end, -1:-2] * -1 266 | 267 | 268 | 269 | ## Timer 270 | tt <- stop_timer(tt) 271 | 272 | ## Disp 273 | cat(" Done! ... Duration:", round(tt), "seconds.\n") 274 | 275 | 276 | ## ============================================================================= 277 | ## Create a ggplot object 278 | ## ============================================================================= 279 | 280 | yy_all <- rbind(melt(yy_HG), 281 | melt(yy_AG)) 282 | #melt(yy_DF)) 283 | 284 | colnames(yy_all) <- c("Match", "Team", "Variable", "Goals") 285 | yy_all$Goals <- round(yy_all$Goals, 3) 286 | 287 | axis_max <- 3 #round(max(yy_all$Goals)) 288 | axis_min <- 0 #round(min(yy_all$Goals)) 289 | 290 | g_density <- ggplot(yy_all, aes(x = Goals, colour = Team, fill = Team)) + 291 | geom_density() + 292 | facet_grid(Team ~ Match) + 293 | scale_colour_manual(name = "Team", values = c("dodgerblue4", "darkorange4")) + 294 | scale_fill_manual(name = "Team", values = c("dodgerblue", "darkorange")) + 295 | theme(title = element_text(size = 18, vjust = 2), 296 | strip.text = element_text(size = 16), 297 | axis.text = element_text(size = 12), 298 | axis.title.y = element_text(vjust = 0.75), 299 | axis.title.x = element_text(vjust = -0.5), 300 | legend.text = element_text(size = 12)) + 301 | ggtitle("Distribution of Predicted Outcomes (Goals) for Each Team") + 302 | xlim(axis_min, axis_max) + 303 | geom_vline(xintercept = 1, linetype = "dotted", size = 0.5) + 304 | geom_vline(xintercept = 2, linetype = "dotted", size = 0.5) 305 | #geom_vline(xintercept = 3, linetype = "dotted", size = 0.5) 306 | 307 | 308 | g_boxplot <- ggplot(yy_all, aes(x = Team, y = Goals, colour = Team, fill = Team)) + 309 | geom_boxplot() + 310 | scale_colour_manual(name = "Team", values = c("dodgerblue4", "darkorange4")) + 311 | scale_fill_manual(name = "Team", values = c("dodgerblue", "darkorange")) + 312 | facet_grid(~ Match) + 313 | theme(title = element_text(size = 18, vjust = 2), 314 | strip.text = element_text(size = 16), 315 | axis.text = element_text(size = 12), 316 | axis.title.y = element_text(vjust = 0.75), 317 | axis.title.x = element_text(vjust = -0.5), 318 | legend.text = element_text(size = 12)) + 319 | ggtitle("Boxplots of Predicted Outcomes (Goals) for Each Team") + 320 | ylim(axis_min, axis_max) + 321 | geom_hline(yintercept = 1, linetype = "dotted", size = 0.5) + 322 | geom_hline(yintercept = 2, linetype = "dotted", size = 0.5) 323 | #geom_hline(yintercept = 3, linetype = "dotted", size = 0.5) 324 | 325 | 326 | 327 | ## ============================================================================= 328 | ## Create Output Data Frame 329 | ## ============================================================================= 330 | 331 | ## Train/Test, Date and Teams 332 | output <- data.frame(matrix(NA, nrow = nrow(dat_raw), ncol = 10)) 333 | output[row_train, 1] <- "Training Set" 334 | output[row_predict, 1] <- "Predictions" 335 | output[row_future, 1] <- "Future" 336 | output[, 2:4] <- dat_raw[, 1:3] 337 | 338 | ## Real Data 339 | output[row_train, 5:7] <- dat_raw[row_train, c("RES_H", "RES_A", "RES_DIFF")] 340 | 341 | ## Predictions (Previous) 342 | output[row_train, 8:10] <- dat_raw[row_train, c("PRED_H", "PRED_A", "PRED_DIFF")] 343 | 344 | ## Predictions (Current) 345 | output[row_predict, 8] <- round(apply(yy_HG[,-1:-2], 1, median), 2) 346 | output[row_predict, 9] <- round(apply(yy_AG[,-1:-2], 1, median), 2) 347 | 348 | tmp_DF <- matrix(round(apply(yy_DF[,-1:-2], 1, median), 2), nrow = 2, byrow = T) 349 | tmp_DF <- round((tmp_DF[1, ] + tmp_DF[2, ]) / 2, 2) 350 | output[row_predict, 10] <- tmp_DF 351 | 352 | ## Rename columns 353 | colnames(output) <- c("Data", "Date", "Home", "Away", 354 | "Real_H", "Real_A", "Real_Df", 355 | "Pred_H", "Pred_A", "Pred_Df") 356 | 357 | 358 | ## ============================================================================= 359 | ## Create Output Data Frame 360 | ## ============================================================================= 361 | 362 | ## Generate File Name 363 | now <- Sys.time() 364 | now <- gsub(":", "",now) 365 | now <- gsub(" ", "_",now) 366 | name_box <- paste0("./output/",now, "_boxplot.png") 367 | name_dis <- paste0("./output/",now, "_dist.png") 368 | name_tab <- paste0("./output/",now, "_summary.png") 369 | name_pdf <- paste0("./output/",now, "_pred.pdf") 370 | 371 | ## Load Extra Fonts 372 | suppressMessages(loadfonts()) 373 | 374 | ## Define output size 375 | row_max <- max(which(output$Data == "Predictions")) 376 | pdf_w <- 10 377 | pdf_h <- 14 378 | 379 | ## Print PDF 380 | pdf(file = name_pdf, height = pdf_h, width = pdf_w, 381 | family = "Ubuntu", title = "WC2014 Predictions by Jo-fai Chow") 382 | 383 | ## Print Summary Table 384 | grid.newpage() 385 | grid.table(output[-1:-32,], show.rownames = F) 386 | 387 | ## Print boxplot and density 388 | grid.newpage() 389 | pushViewport(viewport(layout = grid.layout(1000, 1000))) 390 | vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y) 391 | print(g_boxplot, vp = vplayout(1:500, 1:1000)) 392 | print(g_density, vp = vplayout(501:1000, 1:1000)) 393 | 394 | 395 | ## Close and save 396 | dev.off() 397 | 398 | if (Sys.info()[1] == "Linux") embed_fonts(name_pdf) 399 | -------------------------------------------------------------------------------- /wc2014.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | --------------------------------------------------------------------------------