├── .gitignore
├── LICENSE
├── README.md
├── data
└── data_wc2014.xlsx
├── output
├── 2014-06-13_162847_pred.pdf
├── 2014-06-14_113816_pred.pdf
├── 2014-06-15_083259_pred.pdf
├── 2014-06-16_021624_pred.pdf
├── 2014-06-17_160505_pred.pdf
├── 2014-06-18_040416_pred.pdf
├── 2014-06-19_141238_pred.pdf
├── 2014-06-20_063723_pred.pdf
├── 2014-06-21_080007_pred.pdf
├── 2014-06-22_052609_pred.pdf
├── 2014-06-23_to_26_pred.pdf
├── 2014-06-24_065030_pred.pdf
├── 2014-06-25_073743_pred.pdf
├── 2014-06-26_011614_pred.pdf
├── 2014-06-28_111144_pred.pdf
├── 2014-06-29_045202_pred.pdf
├── 2014-06-30_003721_pred.pdf
├── 2014-07-03_232113_pred.pdf
├── 2014-07-05_015503_pred.pdf
├── 2014-07-08_030446_pred.pdf
├── 2014-07-09_020453_pred.pdf
├── 2014-07-12_041544_pred.pdf
└── 2014-07-13_023520_pred.pdf
├── predict.R
└── wc2014.Rproj
/.gitignore:
--------------------------------------------------------------------------------
1 | # History files
2 | .Rhistory
3 |
4 | # Example code in package build process
5 | *-Ex.R
6 |
7 | # R data files from past sessions
8 | .Rdata
9 | .Rproj.user
10 |
11 | .sync.ffs_db
12 |
13 | README.html
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2014 Jo-fai Chow
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Introduction
2 |
3 | This repository contains all the data, code and outputs for my World Cup 2014 analysis. It is an experiment to see whether data mining can outperform my friends. So far the data mining approach has the upper hand. It predicted 35/64 (or 54.7%) correct match results (Win/Draw/Lose) and 10/64 (or 15.6%) correct scores.
4 |
5 |
6 |
7 | ### Results Comparison
8 |
9 | Despite the simplicity of the model and lack of in-depth data (e.g. individual players' performance, weather ...),
10 | I think that my model still has a clear competitve edge over my friends.
11 | Especially on predicting correct scores - which is one of the most difficult tasks in sports analytics.
12 | Yet, this comparison is only based on a small sample size of 64 WC matches.
13 | In the future, I am going to improve the model and to further investigate its performance for major football leagues in Europe.
14 |
15 |
16 | 
17 |
18 |
19 | Tasks | Me | Friend A | Friend B | Friend C | Friend D | Friend E
20 | |---|----|----|----|----|----|----|
21 | Correct Results (out of 64)| 35 (54.7%) | 29 (45.3%) | 35 (54.7%) | 25 (39.1%) | 28 (43.8%) | 29 (45.3%)
22 | Correct Scores (out of 64)| 10 (15.6%) | 4 (6.3%) | 6 (9.4%) | 4 (6.3%) | 4 (6.3%) | 2 (3.1%)
23 |
24 | **Source**: A copy of the Google Spreadsheet we used.
25 |
26 |
27 |
28 | ### Data Sources
29 |
30 | 1. http://fivethirtyeight.com/interactives/world-cup/
31 | 2. http://www.bloomberg.com/visual-data/world-cup/
32 | 3. https://www.betfair.com/sport
33 |
34 |
35 |
36 | ### Methods
37 |
38 | - The predictors include various information from Bloomberg and FiveThirtyEight such as team performance indicators, probability of Win/Draw/Lose, world ranking as well as latest odds from betfair.com. An Excel spreadsheet in the **data** folder is used for storing the data.
39 | - The predicted results are median values from multiple model predictions. Each model consists of four common regression mini models (Random Forest, SVMs, Cubist and KNN) which are trained with bootstrapped samples and blended for better genearlisation.
40 | - Early on in the tourament, the future match results predicted by Bloomberg had been used as training data as there were not enough actual results.
41 |
42 |
43 |
44 | ### Results
45 |
46 | **Notes**: more detailed results including the distribution of predictions results can be found in the **output** folder.
47 |
48 | Match | Date | Team 1 | Team 2 | Predictions | Results (90 mins) | Correct WDL | Correct Score
49 | ------|------|--------|--------|-------------|---------|--------------|---------------
50 | 1 | 12/06 | Brazil | Croatia | 4:0 | 3:1 | **Yes** | No
51 | 2 | 13/06 | Mexico | Cameroon | 1:0 | 1:0 | **Yes** | **Yes**
52 | 3 | 13/06 | Spain | Netherlands | 1:0 | 1:5 | No | No
53 | 4 | 13/06 | Chile | Australia | 1:0 | 3:1 | **Yes** | No
54 | 5 | 14/06 | Colombia | Greece | 1:0 | 3:0 | **Yes** | No
55 | 6 | 14/06 | Uruguay | Costa Rica | 2:0 | 1:3 | No | No
56 | 7 | 14/06 | England | Italy | 1:1 | 1:2 | No | No
57 | 8 | 15/06 | Ivory Coast | Japan | 1:1 | 2:1 | No | No
58 | 9 | 15/06 | Switzerland | Ecuador | 1:1 | 2:1 | No | No
59 | 10 | 15/06 | France | Honduras | 2:0 | 3:0 | **Yes** | No
60 | 11 | 15/06 | Argentina | Bosnia | 2:1 | 2:1 | **Yes** | **Yes**
61 | 12 | 16/06 | Germany | Portugal | 1:1 | 4:0 | No | No
62 | 13 | 16/06 | Iran | Nigeria | 1:1 | 0:0 | **Yes** | No
63 | 14 | 16/06 | Ghana | USA | 1:1 | 1:2 | No | No
64 | 15 | 17/06 | Belgium | Algeria | 2:1 | 2:1 | **Yes** | **Yes**
65 | 16 | 17/06 | Brazil | Mexico | 3:1 | 0:0 | No | No
66 | 17 | 17/06 | Russia | S Korea | 1:1 | 1:1 | **Yes** | **Yes**
67 | 18 | 18/06 | Australia | Netherlands | 1:3 | 2:3 | **Yes** | No
68 | 19 | 18/06 | Spain | Chile | 2:3 | 0:2 | **Yes** | No
69 | 20 | 18/06 | Cameroon | Croatia | 1:1 | 0:4 | No | No
70 | 21 | 19/06 | Colombia | Ivory Coast | 2:1 | 2:1 | **Yes** | **Yes**
71 | 22 | 19/06 | Uruguay | England | 2:2 | 2:1 | No | No
72 | 23 | 19/06 | Japan | Greece | 0:0 | 1:1 | **Yes** | No
73 | 24 | 20/06 | Italy | Costa Rica | 2:1 | 0:1 | No | No
74 | 25 | 20/06 | Switzerland | France | 1:2 | 2:5 | **Yes** | No
75 | 26 | 20/06 | Honduras | Ecuador | 1:2 | 1:2 | **Yes** | **Yes**
76 | 27 | 21/06 | Argentina | Iran | 2:1 | 1:0 | **Yes** | No
77 | 28 | 21/06 | Germany | Ghana | 3:1 | 2:2 | No | No
78 | 29 | 21/06 | Nigeria | Bosnia | 1:2 | 1:0 | No | No
79 | 30 | 22/06 | Belgium | Russia | 2:1 | 1:0 | **Yes** | No
80 | 31 | 22/06 | S Korea | Algeria | 1:1 | 2:4 | No | No
81 | 32 | 22/06 | USA | Portugal | 2:2 | 2:2 | **Yes** | **Yes**
82 | 33 | 23/06 | Netherlands | Chile | 2:2 | 2:0 | No | No
83 | 34 | 23/06 | Australia | Spain | 1:2 | 0:3 | **Yes** | No
84 | 35 | 23/06 | Cameroon | Brazil | 1:2 | 1:4 | **Yes** | No
85 | 36 | 23/06 | Croatia | Mexico | 1:1 | 1:3 | No | No
86 | 37 | 24/06 | Costa Rica | England | 2:2 | 0:0 | **Yes** | No
87 | 38 | 24/06 | Italy | Uruguay | 1:1 | 0:1 | No | No
88 | 39 | 24/06 | Japan | Colombia | 1:2 | 1:4 | **Yes** | No
89 | 40 | 24/06 | Greece | Ivory Coast | 1:1 | 2:1 | No | No
90 | 41 | 25/06 | Nigeria | Argentina | 1:2 | 2:3 | **Yes** | No
91 | 42 | 25/06 | Bosnia | Iran | 1:1 | 3:1 | No | No
92 | 43 | 25/06 | Ecuador | France | 1:2 | 0:0 | No | No
93 | 44 | 25/06 | Honduras | Switzerland | 1:2 | 0:3 | **Yes** | No
94 | 45 | 26/06 | Portugal | Ghana | 2:1 | 2:1 | **Yes** | **Yes**
95 | 46 | 26/06 | USA | Germany | 1:2 | 0:1 | **Yes** | No
96 | 47 | 26/06 | Algeria | Russia | 1:2 | 1:1 | No | No
97 | 48 | 26/06 | S Korea | Belgium | 1:2 | 0:1 | **Yes** | No
98 | 49 | 28/06 | Brazil | Chile | 2:2 | 1:1 | **Yes** | No
99 | 50 | 28/06 | Colombia | Uruguay | 1:1 | 2:0 | No | No
100 | 51 | 29/06 | Netherlands | Mexico | 2:1 | 2:1 | **Yes** | **Yes**
101 | 52 | 29/06 | Costa Rica | Greece | 1:1 | 1:1 | **Yes** | **Yes**
102 | 53 | 30/06 | France | Nigeria | 2:1 | 2:0 | **Yes** | No
103 | 54 | 30/06 | Germany | Algeria | 2:1 | 0:0 | No | No
104 | 55 | 01/07 | Argentina | Switzerland | 3:1 | 0:0 | No | No
105 | 56 | 01/07 | Belgium | USA | 2:1 | 0:0 | No | No
106 | 57 | 04/07 | France | Germany | 1:2 | 0:1 | **Yes** | No
107 | 58 | 04/07 | Brazil | Colombia | 1:1 | 2:1 | No | No
108 | 59 | 05/07 | Argentina | Belgium | 2:1 | 1:0 | **Yes** | No
109 | 60 | 05/07 | Netherlands | Costa Rica | 1:1 | 0:0 | **Yes** | No
110 | 61 | 08/07 | Brazil | Germany | 1:1 | 1:7 | No | No
111 | 62 | 09/07 | Netherlands | Argentina | 1:1 | 0:0 | **Yes** | No
112 | 63 | 12/07 | Brazil | Netherlands | 1:1 | 0:3 | No | No
113 | 64 | 13/07 | Germany | Argentina | 2:1 | 0:0 | No | No
114 | **Summary** | - | - | - | - | - | **35/64** | **10/64**
115 | **Accuracy** | - | - | - | - | - | **54.7%** | **15.6%**
116 |
117 |
118 |
119 | ### Comments
120 |
121 | Match(es) | Comments
122 | ----------|-------------
123 | 1 | Pure guess only.
124 | 2-11 | Bloomberg future predictions used as training data.
125 | 12 | Dropped Bloomberg future predictions. Started using actual results only.
126 | 33 - 48 | Predictions all made on 23/6 as family holiday began.
127 | 49 - 52 | Predictions all made on 28/6 as I travelled to LA for useR! conference.
128 |
129 |
130 |
--------------------------------------------------------------------------------
/data/data_wc2014.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/data/data_wc2014.xlsx
--------------------------------------------------------------------------------
/output/2014-06-13_162847_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-13_162847_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-14_113816_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-14_113816_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-15_083259_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-15_083259_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-16_021624_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-16_021624_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-17_160505_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-17_160505_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-18_040416_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-18_040416_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-19_141238_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-19_141238_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-20_063723_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-20_063723_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-21_080007_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-21_080007_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-22_052609_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-22_052609_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-23_to_26_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-23_to_26_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-24_065030_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-24_065030_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-25_073743_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-25_073743_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-26_011614_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-26_011614_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-28_111144_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-28_111144_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-29_045202_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-29_045202_pred.pdf
--------------------------------------------------------------------------------
/output/2014-06-30_003721_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-06-30_003721_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-03_232113_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-03_232113_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-05_015503_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-05_015503_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-08_030446_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-08_030446_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-09_020453_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-09_020453_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-12_041544_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-12_041544_pred.pdf
--------------------------------------------------------------------------------
/output/2014-07-13_023520_pred.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/woobe/wc2014/568aadbda278548d7d2009af736f87a0c34bdfaa/output/2014-07-13_023520_pred.pdf
--------------------------------------------------------------------------------
/predict.R:
--------------------------------------------------------------------------------
1 | ## =============================================================================
2 | ## WC 2014 Analysis and Predictions
3 | ## =============================================================================
4 |
5 | ## Initiate
6 | rm(list=ls())
7 |
8 |
9 | ## Core Paramemters
10 | n_core <- 5
11 | n_total <- 500
12 | p_train <- 0.667
13 |
14 | ## Set seed
15 | set.seed(1234)
16 |
17 |
18 | ## =============================================================================
19 | ## Load Packages
20 | ## =============================================================================
21 |
22 | suppressMessages(library(xlsx))
23 | suppressMessages(library(caret))
24 | suppressMessages(library(reshape2))
25 | suppressMessages(library(ggplot2))
26 | suppressMessages(library(grid))
27 | suppressMessages(library(gridExtra))
28 | suppressMessages(library(extrafont))
29 | suppressMessages(library(e1071))
30 | suppressMessages(library(randomForest))
31 | suppressMessages(library(MASS))
32 | suppressMessages(library(Cubist))
33 | suppressMessages(library(kknn))
34 | suppressMessages(library(GA))
35 | suppressMessages(library(bib))
36 |
37 | ## =============================================================================
38 | ## Load Data
39 | ## =============================================================================
40 |
41 | dat <- read.xlsx(file = './data/data_wc2014.xlsx', sheetName = 'Data')
42 |
43 |
44 | ## =============================================================================
45 | ## Double Up Data (Swap Position)
46 | ## =============================================================================
47 |
48 | ## Keep a copy
49 | dat_raw <- dat
50 |
51 | ## Extract Train, Predict and Future
52 | dat_train <- dat[which(dat$Type == 'train'),]
53 | dat_predict <- dat[which(dat$Type == 'predict'),]
54 | dat_future <- dat[which(dat$Type == 'future'),]
55 |
56 | ## Swap Home and Away position
57 | # col_swap <- c('Date', 'TEAM_A', 'TEAM_H',
58 | # 'FTE_L', 'FTE_W', 'FTE_D',
59 | # "BLM_W", "BLM_T", "BLM_A"
60 | # 'BF_2', 'BF_X', 'BF_1',
61 | # 'SPI_A', 'OFF_A', 'DEF_A',
62 | # 'SPI_H', 'OFF_H', 'DEF_H',
63 | # 'RES_A', 'RES_H', 'DIFF',
64 | # 'PRED_A', 'PRED_H', 'PRED_DIFF',
65 | # 'Type')
66 |
67 | col_swap <- c("Date", "TEAM_A", "TEAM_H",
68 | "FTE_A", "FTE_H",
69 | "BLM_A", "BLM_H",
70 | "SPI_A", "OFF_A", "DEF_A",
71 | "SPI_H", "OFF_H", "DEF_H",
72 | "BLM_OFF_A", "BLM_DEF_A", "BLM_OVR_A",
73 | "BLM_OFF_H", "BLM_DEF_H", "BLM_OVR_H",
74 | "RNK_A", "APP_A", "BEST_A",
75 | "RNK_H", "APP_H", "BEST_H",
76 | "BF_A", "BF_D", "BF_H",
77 | "RES_A", "RES_H", "RES_DIFF",
78 | "PRED_A", "PRED_H", "PRED_DIFF",
79 | "Type")
80 |
81 | dat_train_swap <- dat_train[, col_swap]
82 | dat_train_swap$RES_DIFF <- dat_train_swap$RES_DIFF * -1
83 | dat_train_swap$PRED_DIFF <- dat_train_swap$PRED_DIFF * -1
84 | colnames(dat_train_swap) <- colnames(dat)
85 |
86 | dat_predict_swap <- dat_predict[, col_swap]
87 | dat_predict_swap$RES_DIFF <- dat_predict_swap$RES_DIFF * -1
88 | dat_predict_swap$PRED_DIFF <- dat_predict_swap$PRED_DIFF * -1
89 | colnames(dat_predict_swap) <- colnames(dat)
90 |
91 | dat_future_swap <- dat_future[, col_swap]
92 | dat_future_swap$RES_DIFF <- dat_future_swap$RES_DIFF * -1
93 | dat_future_swap$PRED_DIFF <- dat_future_swap$PRED_DIFF * -1
94 | colnames(dat_future_swap) <- colnames(dat)
95 |
96 | ## Combine
97 | dat_combine <- rbind(dat_train,
98 | dat_predict,
99 | dat_future,
100 | dat_train_swap,
101 | dat_predict_swap,
102 | dat_future_swap)
103 |
104 | ## =============================================================================
105 | ## Normalise Data
106 | ## =============================================================================
107 |
108 | ## Pre-process predictors
109 | pp <- preProcess(dat_combine[,4:28], method = c("center", "scale", "BoxCox"))
110 | dat_combine[, 4:28] <- predict(pp, dat_combine[, 4:28])
111 |
112 |
113 | ## =============================================================================
114 | ## Split Train/Test
115 | ## =============================================================================
116 |
117 | ## Define Training Set Here!
118 | row_train <- which(dat$Type == 'train')
119 | row_predict <- which(dat$Type == 'predict')
120 | row_future <- which(dat$Type == 'future')
121 |
122 | ## Pred
123 | dat_pred <- data.frame(dat_raw[row_predict, 2:3], Match = NA)
124 | for (n in 1:nrow(dat_pred)) {
125 | dat_pred[n, 3] <- paste0(dat_pred[n, 1], "_v_", dat_pred[n, 2])
126 | }
127 |
128 |
129 |
130 | ## =============================================================================
131 | ## Prepare for training
132 | ## =============================================================================
133 |
134 | ## Activate
135 | activate_core(n_core)
136 |
137 | ## Train four models for each random split
138 | train_four <- function(dat_combine, pred_type, p_train) {
139 |
140 | ## Extract
141 | x_train <- dat_combine[which(dat_combine$Type == 'train'), 4:28]
142 | x_test <- dat_combine[which(dat_combine$Type == 'predict'), 4:28]
143 |
144 | if (pred_type == 'Goal') {
145 | y_train <- dat_combine[which(dat_combine$Type == 'train'), 'RES_H']
146 | } else {
147 | y_train <- dat_combine[which(dat_combine$Type == 'train'), 'RES_DIFF']
148 | }
149 |
150 | ## Sub-sample
151 | row_use <- createDataPartition(y_train, p = p_train, list = FALSE)
152 |
153 | ## Global variables
154 | ctrl <- trainControl(method = "adaptive_cv",
155 | repeats = 1,
156 | number = 10,
157 | allowParallel = FALSE)
158 |
159 | ## Empty Shell
160 | yy_train <- matrix(NA, ncol = 4, nrow = nrow(x_train))
161 | yy_test <- matrix(NA, ncol = 4, nrow = nrow(x_test))
162 |
163 | ## New Stuff Here
164 | model_svm <- svm(x_train[row_use,], y_train[row_use], cost = 10)
165 | yy_train[, 1] <- predict(model_svm, x_train)
166 | yy_test[, 1] <- predict(model_svm, x_test)
167 |
168 | model_rf <- randomForest(x_train[row_use,], y_train[row_use])
169 | yy_train[, 2] <- predict(model_rf, x_train)
170 | yy_test[, 2] <- predict(model_rf, x_test)
171 |
172 | model_cb <- train(x_train[row_use,], y_train[row_use], method = 'cubist', trControl = ctrl)
173 | yy_train[, 3] <- predict(model_cb, x_train)
174 | yy_test[, 3] <- predict(model_cb, x_test)
175 |
176 | model_knn <- train(x_train[row_use,], y_train[row_use], method = 'kknn', trControl = ctrl)
177 | yy_train[, 4] <- predict(model_knn, x_train)
178 | yy_test[, 4] <- predict(model_knn, x_test)
179 |
180 | ## Evaluate function
181 | eval_yy <- function(wgts) {
182 |
183 | ## Normalise weights
184 | wgts <- wgts / sum(wgts)
185 |
186 | ## Apply weights
187 | yy_wgts <- rowSums(yy_train * wgts)
188 |
189 | ## Return evaluation
190 | return(nse(yy_wgts, y_train))
191 |
192 | }
193 |
194 | ## Optimise
195 | model_ga <- ga(type = 'real-valued',
196 | fitness = eval_yy,
197 | min = c(0,0,0,0),
198 | max = c(1,1,1,1))
199 |
200 | ## Get best weights
201 | wgts <- summary(model_ga)$solution / sum(summary(model_ga)$solution)
202 |
203 | ## Apply weights to yy_test
204 | yy_test_final <- as.matrix(rowSums(yy_test * as.numeric(wgts)))
205 |
206 | ## Return
207 | return(yy_test_final)
208 |
209 | }
210 |
211 |
212 |
213 | ## =============================================================================
214 | ## Train Models for Goals
215 | ## =============================================================================
216 |
217 | ## Timer
218 | tt <- start_timer()
219 |
220 | ## Display
221 | cat("Now training models for Goals prediction ...")
222 |
223 | ## Parallelised Train
224 | tmp_yy <- foreach(n_model = 1:n_total,
225 | .combine = cbind,
226 | .multicombine = TRUE,
227 | .errorhandling = 'remove',
228 | .packages = c('caret', 'e1071', 'randomForest','bib','GA')) %dopar%
229 | train_four(dat_combine, pred_type = 'Goal', p_train)
230 |
231 | ## Split into Home/Away
232 | tmp_yy <- data.frame(Match = dat_pred$Match, Team = rep(c('Team1','Team2'), each = nrow(dat_predict)), tmp_yy)
233 | yy_HG <- tmp_yy[1:nrow(dat_predict), ]
234 | yy_AG <- tmp_yy[-1:-nrow(dat_predict), ]
235 |
236 | ## Timer
237 | tt <- stop_timer(tt)
238 |
239 | ## Disp
240 | cat(" Done! ... Duration:", round(tt), "seconds.\n")
241 |
242 |
243 | ## =============================================================================
244 | ## Train Models for Goal Difference
245 | ## =============================================================================
246 |
247 | ## Timer
248 | tt <- start_timer()
249 |
250 | ## Display
251 | cat("Now training models for Goals Difference Prediction ...")
252 |
253 | ## Parallelised Train
254 | tmp_yy <- foreach(n_model = 1:n_total,
255 | .combine = cbind,
256 | .multicombine = TRUE,
257 | .errorhandling = 'remove',
258 | .packages = c('caret', 'e1071', 'randomForest','bib','GA')) %dopar%
259 | train_four(dat_combine, pred_type = 'Diff', p_train)
260 |
261 | ## Reverse Away Prediction
262 | yy_DF <- data.frame(Match = dat_pred$Match, Team = 'Pred_Diff', tmp_yy)
263 | n_start <- nrow(yy_DF)/2 + 1
264 | n_end <- nrow(yy_DF)
265 | yy_DF[n_start:n_end, -1:-2] <- yy_DF[n_start:n_end, -1:-2] * -1
266 |
267 |
268 |
269 | ## Timer
270 | tt <- stop_timer(tt)
271 |
272 | ## Disp
273 | cat(" Done! ... Duration:", round(tt), "seconds.\n")
274 |
275 |
276 | ## =============================================================================
277 | ## Create a ggplot object
278 | ## =============================================================================
279 |
280 | yy_all <- rbind(melt(yy_HG),
281 | melt(yy_AG))
282 | #melt(yy_DF))
283 |
284 | colnames(yy_all) <- c("Match", "Team", "Variable", "Goals")
285 | yy_all$Goals <- round(yy_all$Goals, 3)
286 |
287 | axis_max <- 3 #round(max(yy_all$Goals))
288 | axis_min <- 0 #round(min(yy_all$Goals))
289 |
290 | g_density <- ggplot(yy_all, aes(x = Goals, colour = Team, fill = Team)) +
291 | geom_density() +
292 | facet_grid(Team ~ Match) +
293 | scale_colour_manual(name = "Team", values = c("dodgerblue4", "darkorange4")) +
294 | scale_fill_manual(name = "Team", values = c("dodgerblue", "darkorange")) +
295 | theme(title = element_text(size = 18, vjust = 2),
296 | strip.text = element_text(size = 16),
297 | axis.text = element_text(size = 12),
298 | axis.title.y = element_text(vjust = 0.75),
299 | axis.title.x = element_text(vjust = -0.5),
300 | legend.text = element_text(size = 12)) +
301 | ggtitle("Distribution of Predicted Outcomes (Goals) for Each Team") +
302 | xlim(axis_min, axis_max) +
303 | geom_vline(xintercept = 1, linetype = "dotted", size = 0.5) +
304 | geom_vline(xintercept = 2, linetype = "dotted", size = 0.5)
305 | #geom_vline(xintercept = 3, linetype = "dotted", size = 0.5)
306 |
307 |
308 | g_boxplot <- ggplot(yy_all, aes(x = Team, y = Goals, colour = Team, fill = Team)) +
309 | geom_boxplot() +
310 | scale_colour_manual(name = "Team", values = c("dodgerblue4", "darkorange4")) +
311 | scale_fill_manual(name = "Team", values = c("dodgerblue", "darkorange")) +
312 | facet_grid(~ Match) +
313 | theme(title = element_text(size = 18, vjust = 2),
314 | strip.text = element_text(size = 16),
315 | axis.text = element_text(size = 12),
316 | axis.title.y = element_text(vjust = 0.75),
317 | axis.title.x = element_text(vjust = -0.5),
318 | legend.text = element_text(size = 12)) +
319 | ggtitle("Boxplots of Predicted Outcomes (Goals) for Each Team") +
320 | ylim(axis_min, axis_max) +
321 | geom_hline(yintercept = 1, linetype = "dotted", size = 0.5) +
322 | geom_hline(yintercept = 2, linetype = "dotted", size = 0.5)
323 | #geom_hline(yintercept = 3, linetype = "dotted", size = 0.5)
324 |
325 |
326 |
327 | ## =============================================================================
328 | ## Create Output Data Frame
329 | ## =============================================================================
330 |
331 | ## Train/Test, Date and Teams
332 | output <- data.frame(matrix(NA, nrow = nrow(dat_raw), ncol = 10))
333 | output[row_train, 1] <- "Training Set"
334 | output[row_predict, 1] <- "Predictions"
335 | output[row_future, 1] <- "Future"
336 | output[, 2:4] <- dat_raw[, 1:3]
337 |
338 | ## Real Data
339 | output[row_train, 5:7] <- dat_raw[row_train, c("RES_H", "RES_A", "RES_DIFF")]
340 |
341 | ## Predictions (Previous)
342 | output[row_train, 8:10] <- dat_raw[row_train, c("PRED_H", "PRED_A", "PRED_DIFF")]
343 |
344 | ## Predictions (Current)
345 | output[row_predict, 8] <- round(apply(yy_HG[,-1:-2], 1, median), 2)
346 | output[row_predict, 9] <- round(apply(yy_AG[,-1:-2], 1, median), 2)
347 |
348 | tmp_DF <- matrix(round(apply(yy_DF[,-1:-2], 1, median), 2), nrow = 2, byrow = T)
349 | tmp_DF <- round((tmp_DF[1, ] + tmp_DF[2, ]) / 2, 2)
350 | output[row_predict, 10] <- tmp_DF
351 |
352 | ## Rename columns
353 | colnames(output) <- c("Data", "Date", "Home", "Away",
354 | "Real_H", "Real_A", "Real_Df",
355 | "Pred_H", "Pred_A", "Pred_Df")
356 |
357 |
358 | ## =============================================================================
359 | ## Create Output Data Frame
360 | ## =============================================================================
361 |
362 | ## Generate File Name
363 | now <- Sys.time()
364 | now <- gsub(":", "",now)
365 | now <- gsub(" ", "_",now)
366 | name_box <- paste0("./output/",now, "_boxplot.png")
367 | name_dis <- paste0("./output/",now, "_dist.png")
368 | name_tab <- paste0("./output/",now, "_summary.png")
369 | name_pdf <- paste0("./output/",now, "_pred.pdf")
370 |
371 | ## Load Extra Fonts
372 | suppressMessages(loadfonts())
373 |
374 | ## Define output size
375 | row_max <- max(which(output$Data == "Predictions"))
376 | pdf_w <- 10
377 | pdf_h <- 14
378 |
379 | ## Print PDF
380 | pdf(file = name_pdf, height = pdf_h, width = pdf_w,
381 | family = "Ubuntu", title = "WC2014 Predictions by Jo-fai Chow")
382 |
383 | ## Print Summary Table
384 | grid.newpage()
385 | grid.table(output[-1:-32,], show.rownames = F)
386 |
387 | ## Print boxplot and density
388 | grid.newpage()
389 | pushViewport(viewport(layout = grid.layout(1000, 1000)))
390 | vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
391 | print(g_boxplot, vp = vplayout(1:500, 1:1000))
392 | print(g_density, vp = vplayout(501:1000, 1:1000))
393 |
394 |
395 | ## Close and save
396 | dev.off()
397 |
398 | if (Sys.info()[1] == "Linux") embed_fonts(name_pdf)
399 |
--------------------------------------------------------------------------------
/wc2014.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: knitr
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------