├── Courses └── Datacamp │ └── Manipulating Time Series Data in Python │ ├── README.md │ ├── chapter1.pdf │ └── chapter2.pdf ├── LICENSE ├── Metrics ├── README.md └── calculating_errors_of_the_forecasting.ipynb ├── Missing Data └── README.md ├── Model Selection └── Cross-validation │ └── README.md ├── Orbit Models ├── README.md └── stable version │ ├── backtest_orbit_model.ipynb │ ├── eval_decompose_pred.ipynb │ └── evaluating_models.ipynb ├── Outliers └── README.md ├── README.md └── header.png /Courses/Datacamp/Manipulating Time Series Data in Python/README.md: -------------------------------------------------------------------------------- 1 | # Manipulating Time Series Data in Python 2 | -------------------------------------------------------------------------------- /Courses/Datacamp/Manipulating Time Series Data in Python/chapter1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ElizaLo/Time-Series/42483d39a09e8945b0650227504fd064f1898219/Courses/Datacamp/Manipulating Time Series Data in Python/chapter1.pdf -------------------------------------------------------------------------------- /Courses/Datacamp/Manipulating Time Series Data in Python/chapter2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ElizaLo/Time-Series/42483d39a09e8945b0650227504fd064f1898219/Courses/Datacamp/Manipulating Time Series Data in Python/chapter2.pdf -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Yelyzaveta Losieva 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Metrics/README.md: -------------------------------------------------------------------------------- 1 | # Time Series Forecast Error Metrics 2 | 3 | 4 | 5 | Keep in mind that **there is no silver bullet, no single best error metric**. The fundamental challenge is, that every statistical measure condenses a large number of data into a single value, so it only provides one projection of the model errors emphasizing a certain aspect of the error characteristics of the model performance _(Chai and Draxler 2014)_. 
6 | 7 | Therefore, it is better to take a practical and pragmatic view and work with a selection of metrics that fit your use case or project. 8 | 9 | # 💠 Scale Dependent Metrics 10 | 11 | Many popular metrics are referred to as **scale-dependent** _(Hyndman, 2006)_. Scale-dependent means the error metrics are **expressed in the units** _(i.e. Dollars, Inches, etc.)_ of the underlying data. 12 | 13 | The main advantage of scale dependent metrics is that they are usually **easy to calculate** and **interpret**. However, they can **not be used to compare different series**, because of their **scale dependency** _(Hyndman, 2006)_. 14 | 15 | > Please note here that _Hyndman (2006)_ includes Mean Squared Error in the scale-dependent group (claiming that the error is “on the same scale as the data”). However, Mean Squared Error has a dimension of the squared scale/unit. To bring MSE to the data’s unit we need to take the square root, which leads to another metric, the RMSE. _(Shcherbakov et al., 2013)_ 16 | 17 | ## 🔹 Mean Error 18 | 19 | The mean error is an informal term that usually refers to the average of all the errors in a set. An _“error”_ in this context is an [uncertainty](https://www.statisticshowto.com/uncertainty-in-statistics/) in a measurement, or the difference between the measured value and the true/correct value. The more formal term for error is [measurement error](https://www.statisticshowto.com/measurement-error/), also called [observational error](https://www.statisticshowto.com/measurement-error/). 20 | 21 | **The mean error usually results in a number that isn’t helpful** because positives and negatives cancel each other out. For example, two errors of +100 and -100 would give a mean error of zero. 22 | 23 | ## 🔹 Mean Absolute Error (MAE) 24 | 25 | 26 | 27 | The **Mean Absolute Error (MAE)** is calculated by taking the mean of the **absolute differences between the actual values** (also called _y_) **and the predicted values** (_y_hat_). 28 | 29 | It is **easy to understand** (even for business users) and **to compute**. It is recommended for **assessing accuracy on a single series** _(Hyndman, 2006)_. 30 | 31 | However, if you want to compare different series (with different units) it is not suitable. Also, you should **not use** it if you want to **penalize outliers**. 32 | 33 | ```python 34 | import numpy as np 35 | 36 | def mae(y, y_hat): 37 | return np.mean(np.abs(y - y_hat)) 38 | ``` 39 | 40 | ## 🔹 Mean Squared Error (MSE) 41 | 42 | If you want to put **more attention on outliers (huge errors)** you can consider the **Mean Squared Error (MSE)**. As its name implies, it takes the mean of the squared errors (differences between _y_ and _y_hat_). 43 | 44 | Due to its squaring, it **weights large errors much more heavily than small ones**, which in some situations can be a **disadvantage**. Therefore, the MSE is suitable for situations where you **really want to focus on large errors**. 45 | 46 | Also keep in mind that due to the squaring the metric **loses its unit**. 47 | 48 | ```python 49 | import numpy as np 50 | 51 | def mse(y, y_hat): 52 | return np.mean(np.square(y - y_hat)) 53 | ``` 54 | 55 | ## 🔹 Root Mean Squared Error (RMSE) 56 | 57 | 58 | 59 | To **avoid the MSE’s loss of its unit** we can take its square root. The outcome is a new error metric called the Root Mean Squared Error (RMSE). 60 | 61 | It comes with the same **advantages** as its siblings MAE and MSE. However, like MSE, it is also **sensitive to outliers**.
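Following the pattern of the MAE and MSE snippets above, a minimal RMSE implementation could look like this (a sketch assuming `y` and `y_hat` are NumPy arrays of the same length):

```python
import numpy as np

def rmse(y, y_hat):
    # square root of the mean squared error, so the result is back in the data's unit
    return np.sqrt(np.mean(np.square(y - y_hat)))
```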
62 | 63 | Some authors, like _Willmott and Matsuura (2005)_, argue that the RMSE is an inappropriate and misinterpreted measure of average error and recommend MAE over RMSE. 64 | 65 | However, Chai and Draxler (2014) partially refuted their arguments and **recommend RMSE over MAE for model optimization** as well as for **evaluating different models where the error distribution is expected to be Gaussian**. 66 | 67 | # 💠 Percentage Error Metrics 68 | 69 | As we know from the previous chapter, **scale dependent metrics are not suitable for comparing different time series**. 70 | 71 | Percentage Error Metrics solve this problem. They are **scale independent** and **used to compare forecast performance between different time series**. However, their **weak spots are zero values in a time series**. They then become **infinite or undefined**, which makes them not interpretable _(Hyndman 2006)_. 72 | 73 | ## 🔹 Mean Absolute Percentage Error (MAPE) 74 | 75 | 76 | 77 | The mean absolute percentage error (MAPE) is one of the **most widely used error metrics** in time series forecasting. It is calculated by taking the average (mean) of the absolute difference between actuals and predicted values divided by the actuals. 78 | 79 | > _Please note, some MAPE formulas do not multiply the result(s) by 100. However, the MAPE is presented as a percentage, so I added the multiplication._ 80 | 81 | MAPE’s **advantages** are its **scale-independence** and **easy interpretability**. As said at the beginning, percentage error metrics can be used to compare the outcome of multiple time series models with different scales. 82 | 83 | However, MAPE also comes with some **disadvantages**. First, **it generates infinite or undefined values for zero or close-to-zero actual values** _(Kim and Kim 2016)_. 84 | 85 | Second, it also puts a **heavier penalty on negative than on positive errors**, which leads to an **asymmetry** _(Hyndman 2014)_. 86 | 87 | And last but not least, MAPE **cannot be used where percentages make no sense**. This is, for example, the case when measuring temperatures: the Fahrenheit and Celsius scales have relatively arbitrary zero points, and it makes no sense to talk about percentages _(Hyndman and Koehler, 2006)_. 88 | 89 | ```python 90 | import numpy as np 91 | 92 | def mape(y, y_hat): 93 | return np.mean(np.abs((y - y_hat)/y)*100) 94 | ``` 95 | 96 | ## 🔹 Symmetric Mean Absolute Percentage Error (sMAPE) 97 | 98 | To **avoid the asymmetry** of the MAPE, a new error metric was proposed: the **Symmetric Mean Absolute Percentage Error (sMAPE)**. The sMAPE is probably one of the **most controversial** error metrics, since not only do different definitions and formulas exist, but critics also claim that this metric **is not symmetric** as the name suggests _(Goodwin and Lawton, 1999)_. 99 | 100 | The original idea of an **“adjusted MAPE”** was proposed by _Armstrong (1985)_. However, by his definition the **error metric can be negative or infinite**, since the values in the denominator **are not made absolute** (which is then correctly mentioned as a disadvantage in some articles that follow his definition). 101 | 102 | 103 | 104 | _Makridakis (1993)_ proposed a similar metric and called it SMAPE. His formula, which can be seen below, **avoids the problems of Armstrong’s formula** by taking absolute values in the denominator _(Hyndman, 2014)_.
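Written out (this is the form implemented by the `smape_original` snippet further below, with actual values A_t and forecasts F_t over n observations):

```latex
\mathrm{sMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{2\,\lvert F_t - A_t \rvert}{\lvert A_t \rvert + \lvert F_t \rvert}
```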
105 | 106 | 107 | 108 | > _**Note:** Makridakis (1993) proposed the formula above in his paper “Accuracy measures: theoretical and practical concerns”. Later, in his publication (Makridakis and Hibon, 2000) “The M3-Competition: results, conclusions and implications”, he used Armstrong’s formula (Hyndman, 2014). This fact has probably also contributed to the confusion about SMAPE’s different definitions._ 109 | 110 | The sMAPE is averaged across all forecasts made for a given horizon. Its **advantages** are that it **avoids MAPE’s problem of large errors** when y-values are close to zero, and the large difference between the absolute percentage errors when y is greater than y-hat and vice versa. Unlike MAPE, which has no upper limit, **it fluctuates between 0% and 200%** _(Makridakis and Hibon, 2000)_. 111 | 112 | For the **sake of interpretation** there is also a **slightly modified version of SMAPE** that ensures that the metric’s results **will always be between 0% and 100%**: 113 | 114 | 115 | 116 | The following code snippet contains the sMAPE metric proposed by _Makridakis (1993)_ and the modified version. 117 | 118 | ```python 119 | import numpy as np 120 | 121 | # SMAPE proposed by Makridakis (1993): 0%-200% 122 | def smape_original(a, f): 123 | return 1/len(a) * np.sum(2 * np.abs(f-a) / (np.abs(a) + np.abs(f))*100) 124 | 125 | 126 | # adjusted SMAPE version to scale metric from 0%-100% 127 | def smape_adjusted(a, f): 128 | return (1/a.size * np.sum(np.abs(f-a) / (np.abs(a) + np.abs(f))*100)) 129 | ``` 130 | 131 | As mentioned at the beginning, there are **controversies around the sMAPE**, and they are justified. _Goodwin and Lawton (1999)_ pointed out that sMAPE **penalizes under-estimates more heavily than over-estimates** _(Chen et al., 2017)_. _Cánovas (2009)_ proves this with a simple example. 132 | 133 | - **_Table 1:_** Example with a **symmetric sMAPE**: 134 | 135 | 136 | 137 | - **_Table 2:_** Example with an **asymmetric sMAPE**: 138 | 139 | 140 | 141 | Starting with **_Table 1_** we have two cases. In **case 1** our actual value **_y_** is **100** and the prediction **_y_hat_** is **150**. This leads to a sMAPE value of 20 %. **Case 2** is the opposite. Here we have an actual value _**y**_ of **150** and a prediction **_y_hat_** of **100**. This also leads to a sMAPE of 20 %. 142 | 143 | Let us now have a look at **_Table 2_**. Here we also have two cases, and as you can see, the sMAPE values **are not the same anymore**. The second case **leads to a different SMAPE value** of 33 %. 144 | 145 | **Modifying the forecast while holding the actual values and the absolute deviation fixed does not produce the same sMAPE value. Simply biasing the model without improving its accuracy should never produce different error values _(Cánovas, 2009)_.** 146 | 147 | # 💭 Conclusions 148 | 149 | - As you have seen, there is **no silver bullet, no single best error metric**. Each category or metric has its **advantages** and **weaknesses**, so it **always depends on your individual use case or purpose and your underlying data**. It is important **not to just look at one single error metric** when evaluating your model’s performance. It is necessary to measure several of the main metrics described above in order to analyze several aspects such as deviation, symmetry of deviation and the largest outliers.
150 | - If all series **are on the same scale**, the **data preprocessing procedures** (data cleaning, anomaly detection) have been performed, and the task is **to evaluate the forecast performance**, then the **MAE can be preferred** because it is simpler to explain _(Hyndman and Koehler, 2006; Shcherbakov et al., 2013)_. 151 | - _Chai and Draxler (2014)_ recommend **preferring RMSE over MAE** when the error distribution is **expected to be Gaussian**. 152 | - In case the data **contain outliers** it is advisable to apply scaled measures like **MASE**. In this situation the **horizon should be large enough**, there **should be no identical values**, and the normalizing factor **should not be equal to zero** _(Shcherbakov et al., 2013)_. 153 | 154 | # 📰 Articles 155 | 156 | - [Time Series Forecast Error Metrics You Should Know](https://towardsdatascience.com/time-series-forecast-error-metrics-you-should-know-cc88b8c67f27) 157 | - [Choosing the correct error metric: MAPE vs. sMAPE](https://towardsdatascience.com/choosing-the-correct-error-metric-mape-vs-smape-5328dec53fac) 158 | - [Types of error](https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/types-error#:~:text=Error%20(statistical%20error)%20describes%20the,data%20are%20of%20the%20population.) 159 | 160 | --- 161 | 162 | # 💠 Margin of Error 163 | 164 | The **margin of error** is defined as the [range](https://www.statisticshowto.com/types-of-functions/domain-and-range-of-a-function/) of values below and above the [sample statistic](https://www.statisticshowto.com/sample-statistic-definition-examples/) in a [confidence interval](https://www.statisticshowto.com/probability-and-statistics/confidence-interval/). The confidence interval is a way to show the [**uncertainty**](https://www.statisticshowto.com/uncertainty-in-statistics/) associated with a certain [statistic](https://www.statisticshowto.com/statistic/) _(i.e. from a poll or survey)_. 165 | 166 | It is a statistical value that determines, with a certain degree of probability, the maximum amount by which the sample results differ from the results of the general population. It is half the length of the confidence interval. 167 | 168 | > **Margin of sampling error** (also _maximum sampling error_) — a statistical value that determines, with a certain degree of probability, the maximum amount by which the results of a sample differ from the results of the general population. It equals half the length of the confidence interval. 169 | 170 | **_Examples:_** 171 | 172 | - _For example,_ a survey indicates that **72%** of respondents favor Brand A over Brand B with a 3% margin of error. In this case, the actual population percentage that prefers Brand A likely falls within the range of **72% ± 3%**, or **69 – 75%**. 173 | - A margin of error tells you how many percentage points your results will differ from the real population value. _For example,_ a 95% confidence interval with a 4 percent margin of error means that your statistic will be within 4 percentage points of the real population value 95% of the time. 174 | 175 | A **smaller margin of error** suggests that **the survey’s results will tend to be close to the correct values**. Conversely, **larger MOEs** indicate that **the survey’s estimates can be further away from the population values**. 176 | 177 | The margin of error is influenced by several factors, including the sample size, variability in the data, and the desired level of confidence.
A larger sample size generally results in a smaller margin of error, indicating a more precise estimate. Similarly, a higher level of confidence requires a larger margin of error to account for the increased certainty. 178 | 179 | The margin of error provides a measure of the precision and reliability of a sample-based estimate. It helps researchers and analysts interpret and communicate the level of confidence and uncertainty associated with the estimated values. 180 | 181 | The code below calculates the error (residual) between the actual and predicted values and adds it as a new column 'Error' in the DataFrame. Then, it calculates the mean and standard deviation of the error. 182 | 183 | Next, you specify the desired confidence level _(e.g., a 95% confidence level)_ and use the `stats.norm.ppf` function from the `scipy.stats` module to calculate the critical value based on the confidence level. Finally, the margin of error is computed by multiplying the critical value by the standard error, which is the standard deviation divided by the square root of the number of observations. 184 | 185 | ```python 186 | import numpy as np 187 | import pandas as pd 188 | import scipy.stats as stats 189 | 190 | # Example DataFrame with actual and predicted values 191 | df = pd.DataFrame({'Actual': [10, 15, 20, 25, 30], 192 | 'Predicted': [12, 18, 22, 28, 32]}) 193 | 194 | # Calculate the error (residual) between actual and predicted values 195 | df['Error'] = df['Actual'] - df['Predicted'] 196 | 197 | # Calculate the mean and standard deviation of the error 198 | error_mean = df['Error'].mean() 199 | error_std = df['Error'].std() 200 | 201 | # Define the desired confidence level (e.g., 95%) 202 | confidence_level = 0.95 203 | 204 | # Calculate the critical value based on the confidence level 205 | z_score = stats.norm.ppf((1 + confidence_level) / 2) 206 | 207 | # Calculate the margin of error 208 | margin_of_error = z_score * (error_std / np.sqrt(len(df))) 209 | 210 | print('Margin of Error:', margin_of_error) 211 | ``` 212 | The formula to calculate the margin of error is: **Margin of Error = Critical Value * Standard Error**, where the standard error is the standard deviation divided by the square root of the sample size. 213 | 214 | Here is an example that calculates the margin of error given a sample size, standard deviation, and confidence level: 215 | 216 | ```python 217 | import scipy.stats as stats 218 | import math 219 | 220 | # Example variables 221 | sample_size = 500 222 | standard_deviation = 0.05 223 | confidence_level = 0.95 224 | 225 | # Calculate the critical value based on the confidence level 226 | critical_value = stats.norm.ppf((1 + confidence_level) / 2) # For a two-tailed test 227 | 228 | # Calculate the standard error and the margin of error 229 | standard_error = standard_deviation / math.sqrt(sample_size) 230 | margin_of_error = critical_value * standard_error 231 | 232 | print('Margin of Error:', margin_of_error) 233 | ``` 234 | 235 | In this example, `sample_size` represents the size of the sample, `standard_deviation` represents the standard deviation of the population (or an estimate if it is unknown), and `confidence_level` represents the desired level of confidence _(e.g., 0.95 for 95% confidence)_. The code uses the `stats.norm.ppf` function from the `scipy.stats` module to calculate the critical value based on the confidence level. It then multiplies the critical value by the standard deviation divided by the square root of the sample size to obtain the margin of error.
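For the example values above, the numbers work out as follows: the critical value for a 95% confidence level is about 1.96, the standard error is 0.05 / √500 ≈ 0.0022, so the margin of error is roughly 1.96 × 0.0022 ≈ 0.0044.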
236 | 237 | ## 📰 Articles 238 | 239 | - [Margin of error](https://en.wikipedia.org/wiki/Margin_of_error), Wikipedia 240 | - [Margin of Error: Formula and Interpreting](https://statisticsbyjim.com/hypothesis-testing/margin-of-error/) 241 | - [Margin of Error: Definition, Calculate in Easy Steps](https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/margin-of-error/) 242 | - [Margin of Error: Definition + Easy Calculation with Examples](https://www.questionpro.com/blog/margin-of-error/) 243 | 244 | --- 245 | 246 | # 💠 Measures 247 | 248 | ## 🔹 Mean directional accuracy (MDA) 249 | 250 | **Mean directional accuracy (MDA)**, also known as mean direction accuracy, is a measure of the prediction accuracy of a forecasting method in statistics. It compares the forecast direction (upward or downward) to the actual realized direction. 251 | 252 | It is a popular metric for forecasting performance in economics and finance. MDA is used where we are interested only in the directional movement of the variable of interest. 253 | 254 | **The formula for calculating the MDA** 255 | 256 | MDA measures how often the predicted direction of a time series matches the actual direction of the time series. To calculate MDA, you look at the signs of the differences between consecutive actual values and the signs of the differences between consecutive predicted values. If the signs are the same (i.e., both positive or both negative), the predicted direction matches the actual direction. You count how many times this happens and divide by the total number of possible comparisons (which is one less than the length of the time series, because you can’t compare the first value to anything). This gives you the MDA value, which ranges from 0 to 1, with 1 indicating perfect directional accuracy. 257 | 258 | MDA = Number of times the signs of the differences between consecutive actual values are the same as the signs of the differences between consecutive predicted values / (N – 1) 259 | 260 | where N is the length of the time series. 261 | 262 | In mathematical notation, this can be expressed as: 263 | 264 | `MDA = 1/(N - 1) * sum(i=2 to N) 1[sign(actual[i] - actual[i-1]) == sign(predicted[i] - predicted[i-1])]` 265 | 266 | where 1[·] is the indicator function (1 if the condition inside the brackets holds, 0 otherwise) and sign(x) returns the sign of x (i.e., -1 if x < 0, 0 if x == 0, and 1 if x > 0). 267 | 268 | ```python 269 | import numpy as np 270 | 271 | def mda(actual, predicted): 272 | """ 273 | Calculates the Mean Directional Accuracy (MDA) for two time series. 274 | 275 | Parameters: 276 | actual (array-like): The actual values for the time series. 277 | predicted (array-like): The predicted values for the time series. 278 | 279 | Returns: 280 | float: The MDA value.
281 | """ 282 | actual = np.array(actual) 283 | predicted = np.array(predicted) 284 | 285 | # calculate the signs of the differences between consecutive values 286 | actual_diff = np.diff(actual) 287 | actual_signs = np.sign(actual_diff) 288 | predicted_diff = np.diff(predicted) 289 | predicted_signs = np.sign(predicted_diff) 290 | 291 | # count the number of times the signs are the same 292 | num_correct = np.sum(actual_signs == predicted_signs) 293 | 294 | # calculate the MDA value 295 | mda = num_correct / (len(actual) - 1) 296 | 297 | return mda 298 | ``` 299 | 300 | ### 📰 Articles 301 | 302 | - [Mean directional accuracy](https://en.wikipedia.org/wiki/Mean_directional_accuracy#:~:text=Mean%20directional%20accuracy%20(MDA)%2C,to%20the%20actual%20realized%20direction.), Wikipedia 303 | - [Mean directional accuracy of time series forecast](https://datasciencestunt.com/mean-directional-accuracy-of-time-series-forecast/) 304 | 305 | ## 🔹 Percentage of Correct Direction (PCD) 306 | 307 | **PCD measures the proportion of time steps where the predicted direction _(e.g., increase or decrease)_ matches the actual direction.** It is useful when the direction of the predicted values is more important than the magnitude of the errors. 308 | 309 | The resulting boolean values are then converted to `1` for **True (correct direction)** and `0` for **False (incorrect direction)**. 310 | 311 | ```python 312 | # Calculate Percentage of Correct Direction (PCD) for each row 313 | df['PCD'] = (df['Actual'].diff() > 0) == (df['Predicted'].diff() > 0) 314 | 315 | # Convert boolean values to 1 for True and 0 for False 316 | df['PCD'] = df['PCD'].astype(int) 317 | 318 | # Calculate PCD for the whole dataset 319 | overall_pcd = df['PCD'].sum() / len(df) * 100 320 | 321 | # Display the updated DataFrame with PCD column 322 | print(df) 323 | 324 | # Display the overall PCD for the whole dataset 325 | print('Overall PCD:', overall_pcd) 326 | ``` 327 | 328 | ## 🔹 Uncertainty coefficient, Theil's U 329 | 330 | The term **“U statistic”** can have several meanings: 331 | 332 | - Unbiased (U) statistics 333 | - Mann Whitney U Statistic 334 | - U statistic in L-estimators 335 | - Theil’s U 336 | 337 | Theil's U statistic compares the root mean squared error of the model predictions to the root mean squared error of a simple benchmark, such as the naive forecast _(e.g., using the previous observation as the prediction)_. It provides a measure of the model's improvement over a naive approach. 338 | 339 | Theil proposed two U statistics, used in finance. The first _**U1**_ is a measure of forecast accuracy _(Theil, 1958, pp 31-42)_; The second _**U2**_ is a measure of forecast quality _(Theil, 1966, chapter 2)_. 340 | 341 | ```python 342 | # Calculate Theil's U statistic for each row 343 | df['U'] = np.sqrt(((df['Predicted'] - df['Actual']) ** 2).mean()) / np.sqrt(((df['Actual'].diff()) ** 2).mean()) 344 | ``` 345 | 346 | ### 📰 Articles 347 | 348 | - [Uncertainty coefficient](https://en.wikipedia.org/wiki/Uncertainty_coefficient) - Wikipedia 349 | - [U Statistic: Definition, Different Types; Theil’s U](https://www.statisticshowto.com/u-statistic-theils/) 350 | - [Theil’s U](https://docs.oracle.com/cd/E40248_01/epm.1112/cb_statistical/frameset.htm?ch07s02s03s04.html) - Oracle 351 | 352 | ## 🔹 Forecast Error Variance Decomposition (FEVD) 353 | 354 | FEVD decomposes the variance of the forecast errors into components attributed to different factors, such as model bias, model variance, and random fluctuations. 
It provides insights into the sources of forecast errors and can help identify areas for improvement in the model. 355 | 356 | ### 📰 Articles 357 | 358 | - [Variance decomposition of forecast errors](https://en.wikipedia.org/wiki/Variance_decomposition_of_forecast_errors) 359 | - [The Intuition Behind Impulse Response Functions and Forecast Error Variance Decomposition](https://www.aptech.com/blog/the-intuition-behind-impulse-response-functions-and-forecast-error-variance-decomposition/) 360 | 361 | ## 🔹 Forecast Coverage 362 | 363 | Forecast coverage measures the proportion of actual values that fall within a certain prediction interval or range. It can be explained as the percentage of the actual values that are captured within the model's prediction bounds, providing an assessment of the model's reliability in capturing the range of possible outcomes. 364 | 365 | ```python 366 | # Calculate Forecast Coverage for each row 367 | df['Coverage'] = ((df['Actual'] >= df['Predicted - Lower Bound']) & 368 | (df['Actual'] <= df['Predicted - Upper Bound'])).astype(int) 369 | 370 | # Calculate Forecast Coverage for the whole dataset 371 | overall_coverage = df['Coverage'].sum() / len(df) * 100 372 | 373 | # Display the updated DataFrame with Coverage column 374 | print(df) 375 | 376 | # Display Forecast Coverage for the whole dataset 377 | print('Overall Forecast Coverage:', overall_coverage) 378 | ``` 379 | 380 | # :octocat: GitHub 381 | 382 | - [forecasting_metrics.py](https://gist.github.com/bshishov/5dc237f59f019b26145648e2124ca1c9) 383 | -------------------------------------------------------------------------------- /Metrics/calculating_errors_of_the_forecasting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "# Calculating errors of the Forecasting - Comparing Actual and Predicted Value\n", 11 | "\n", 12 | "**Errors:**\n", 13 | "\n", 14 | "- Error and Mean Error\n", 15 | "- Absolute Error and Mean Absolute Error (MAE)\n", 16 | "- Squared Error and Mean Squared Error (MSE)\n", 17 | "- Squared Error and Root Mean Squared Error (RMSE)\n", 18 | "- Percentage Error\n", 19 | "- Mean Absolute Percentage Error (MAPE)\n", 20 | "- Symmetric Mean Absolute Percentage Error (SMAPE) \n", 21 | "\n", 22 | "**Measures:**\n", 23 | "\n", 24 | "- Percentage of Correct Direction (PCD)\n", 25 | "- Theil's U statistic\n", 26 | "\n", 27 | "\n", 28 | "Scale error metrics to a 100% range\n", 29 | "\n", 30 | "- Mean Error (%)\n", 31 | "- Mean Absolute Error (MAE) (%)\n", 32 | "- Mean Squared Error (MSE) (%)\n", 33 | "- Root Mean Squared Error (RMSE) (%)\n", 34 | "- Mean Absolute Percentage Error (MAPE) (%)" 35 | ] 36 | }, 37 | { 38 | "attachments": {}, 39 | "cell_type": "markdown", 40 | "metadata": { 41 | "tags": [] 42 | }, 43 | "source": [ 44 | "### Install packages" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "tags": [] 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "!pip install awswrangler" 56 | ] 57 | }, 58 | { 59 | "attachments": {}, 60 | "cell_type": "markdown", 61 | "metadata": { 62 | "tags": [] 63 | }, 64 | "source": [ 65 | "### Import packages" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 12, 71 | "metadata": { 72 | "tags": [] 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "import boto3\n", 77 | "import awswrangler as wr\n", 78 | "import 
pandas as pd\n", 79 | "import numpy as np\n", 80 | "import matplotlib.pyplot as plt\n", 81 | "from sklearn.metrics import mean_absolute_error, mean_squared_error\n", 82 | "from math import sqrt\n", 83 | "\n", 84 | "import warnings\n", 85 | "warnings.filterwarnings(\"ignore\")" 86 | ] 87 | }, 88 | { 89 | "attachments": {}, 90 | "cell_type": "markdown", 91 | "metadata": { 92 | "jp-MarkdownHeadingCollapsed": true, 93 | "tags": [] 94 | }, 95 | "source": [ 96 | "## Load data and pre-processing" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "tags": [] 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "predictions = 'predictions.xlsx'\n", 108 | "predictions_df = pd.DataFrame(pd.read_excel(predictions))\n", 109 | "\n", 110 | "print(\"Shape:\", predictions_df.shape)\n", 111 | "predictions_df.head(3)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": { 118 | "tags": [] 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "predictions_subset = predictions_df[['ID', 'Predicted']]\n", 123 | "\n", 124 | "print(\"Shape:\", predictions_subset.shape)\n", 125 | "predictions_subset.head(3)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "tags": [] 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "actual_value = 'actual_values.xlsx'\n", 137 | "actual_value_df = pd.DataFrame(pd.read_excel(actual_value))\n", 138 | "\n", 139 | "print(\"Shape:\", actual_value_df.shape)\n", 140 | "actual_value_df.head(3)" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "tags": [] 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "actual_value_subset = actual_value_df[['Name', 'id']]\n", 152 | "\n", 153 | "actual_value_subset = actual_value_subset.rename(columns={\"id\": \"ID\"})\n", 154 | "columns_titles = ['ID', 'Name']\n", 155 | "actual_value_subset = actual_value_subset.reindex(columns=columns_titles)\n", 156 | "print(\"Shape:\", actual_value_subset.shape)\n", 157 | "actual_value_subset.head(3)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": { 164 | "tags": [] 165 | }, 166 | "outputs": [], 167 | "source": [ 168 | "predicted_actual_df = pd.merge(actual_value_subset, predictions_subset, on=\"ID\")\n", 169 | "\n", 170 | "print(\"Shape:\", predicted_actual_df.shape)\n", 171 | "predicted_actual_df.head(3)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": null, 177 | "metadata": { 178 | "tags": [] 179 | }, 180 | "outputs": [], 181 | "source": [ 182 | "wr.s3.to_csv(\n", 183 | " df=predicted_actual_df,\n", 184 | " path='s3:/datasets/time_series/predicted_actual.csv'\n", 185 | ")" 186 | ] 187 | }, 188 | { 189 | "attachments": {}, 190 | "cell_type": "markdown", 191 | "metadata": { 192 | "tags": [] 193 | }, 194 | "source": [ 195 | "## Calculating Errors" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": { 202 | "tags": [] 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "predicted_actual_df = wr.s3.read_csv('s3:/datasets/time_series/predicted_actual.csv')\n", 207 | "\n", 208 | "predicted_actual_df = predicted_actual_df.drop(columns=['Unnamed: 0'])\n", 209 | "print(\"Shape:\", predicted_actual_df.shape)\n", 210 | "predicted_actual_df.head(5)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": { 217 | "tags": [] 
218 | }, 219 | "outputs": [], 220 | "source": [ 221 | "# Drop rows with missing values\n", 222 | "predicted_actual_df = predicted_actual_df.dropna()\n", 223 | "\n", 224 | "print(\"Shape:\", predicted_actual_df.shape)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": { 231 | "tags": [] 232 | }, 233 | "outputs": [], 234 | "source": [ 235 | "errors_df = predicted_actual_df.copy()\n", 236 | "\n", 237 | "print(\"Shape:\", errors_df.shape)\n", 238 | "errors_df.head(3)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "tags": [] 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "import numpy as np\n", 250 | "import pandas as pd\n", 251 | "import matplotlib.pyplot as plt\n", 252 | "\n", 253 | "\n", 254 | "# Calculate error metrics for each row\n", 255 | "\n", 256 | "# Calculate the error (residual) between actual and predicted values\n", 257 | "errors_df['Error'] = errors_df['Actual'] - errors_df['Predicted']\n", 258 | "errors_df['Absolute Error'] = np.abs(errors_df['Error'])\n", 259 | "errors_df['MAE'] = mean_absolute_error(errors_df['Actual'], errors_df['Predicted'])\n", 260 | "errors_df['Squared Error'] = errors_df['Error'] ** 2\n", 261 | "errors_df['Percentage Error'] = (errors_df['Error'] / errors_df['Actual']) * 100\n", 262 | "#errors_df['Error (%)'] = ((errors_df['Predicted'] - errors_df['Actual']) / errors_df['Actual']) * 100\n", 263 | "errors_df['MAPE'] = np.abs(errors_df['Percentage Error'])\n", 264 | "# Calculate symmetric mean absolute percentage error (SMAPE) for each row\n", 265 | "errors_df['SMAPE'] = (np.abs(errors_df['Actual'] - errors_df['Predicted']) / ((np.abs(errors_df['Actual']) + np.abs(errors_df['Predicted'])) / 2)) * 100\n", 266 | "\n", 267 | "# Calculate Percentage of Correct Direction (PCD) for each row\n", 268 | "errors_df['PCD'] = (errors_df['Actual'].diff() > 0) == (errors_df['Predicted'].diff() > 0)\n", 269 | "# Convert boolean values to 1 for True and 0 for False\n", 270 | "errors_df['PCD'] = errors_df['PCD'].astype(int)\n", 271 | "\n", 272 | "# Calculate Theil's U statistic for each row\n", 273 | "errors_df['U'] = np.sqrt(((errors_df['Predicted'] - errors_df['Actual']) ** 2).mean()) / np.sqrt(((errors_df['Actual'].diff()) ** 2).mean())\n", 274 | "\n", 275 | "\n", 276 | "# Calculate error metrics for the whole dataset\n", 277 | "mean_error = errors_df['Error'].mean()\n", 278 | "mae = errors_df['Absolute Error'].mean()\n", 279 | "mse = errors_df['Squared Error'].mean()\n", 280 | "rmse = np.sqrt(mse)\n", 281 | "mape = np.mean(np.abs(errors_df['Percentage Error']))\n", 282 | "overall_error = ((errors_df['Predicted'] - errors_df['Actual']).sum() / errors_df['Actual'].sum()) * 100\n", 283 | "# Calculate overall SMAPE for the whole dataset\n", 284 | "overall_smape = (np.abs(errors_df['Actual'] - errors_df['Predicted']).sum() / ((np.abs(errors_df['Actual']) + np.abs(errors_df['Predicted'])).sum() / 2)) * 100\n", 285 | "# Calculate PCD for the whole dataset\n", 286 | "overall_pcd = errors_df['PCD'].sum() / len(errors_df) * 100\n", 287 | "# Calculate Theil's U statistic for the whole dataset\n", 288 | "overall_u = np.sqrt(((errors_df['Predicted'] - errors_df['Actual']) ** 2).mean()) / np.sqrt(((errors_df['Actual'].diff()) ** 2).mean())\n", 289 | "\n", 290 | "\n", 291 | "print('Mean Error:', mean_error)\n", 292 | "print('Mean Absolute Error (MAE):', mae)\n", 293 | "print('Mean Squared Error (MSE):', mse)\n", 294 | "print('Root Mean Squared Error (RMSE):', 
rmse)\n", 295 | "print('Mean Absolute Percentage Error (MAPE):', mape)\n", 296 | "print('Overall SMAPE:', overall_smape)\n", 297 | "print('Overall Error (%):', overall_error)\n", 298 | "print('Overall PCD:', overall_pcd)\n", 299 | "print('Overall U:', overall_u) # Display Theil's U statistic for the whole dataset\n", 300 | "\n", 301 | "print('\\n\\n')\n", 302 | "\n", 303 | "\n", 304 | "# Scale error metrics to a 100% range\n", 305 | "mean_error_percent = (mean_error / errors_df['Actual'].mean()) * 100\n", 306 | "mae_percent = (mae / errors_df['Actual'].mean()) * 100\n", 307 | "mse_percent = (mse / errors_df['Actual'].mean() ** 2) * 100\n", 308 | "rmse_percent = (rmse / errors_df['Actual'].mean()) * 100\n", 309 | "mape_percent = mape\n", 310 | "\n", 311 | "# Print error metrics for the whole dataset\n", 312 | "print('Mean Error (%):', mean_error_percent)\n", 313 | "print('Mean Absolute Error (MAE) (%):', mae_percent)\n", 314 | "print('Mean Squared Error (MSE) (%):', mse_percent)\n", 315 | "print('Root Mean Squared Error (RMSE) (%):', rmse_percent)\n", 316 | "print('Mean Absolute Percentage Error (MAPE) (%):', mape_percent)" 317 | ] 318 | }, 319 | { 320 | "attachments": {}, 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "Calculate error metrics for the whole dataset\n", 325 | "\n", 326 | "- **Mean Error:** ``\n", 327 | "- **Mean Absolute Error (MAE):** ``\n", 328 | "- **Mean Squared Error (MSE):** ``\n", 329 | "- **Root Mean Squared Error (RMSE):** ``\n", 330 | "- **Mean Absolute Percentage Error (MAPE):** `inf`\n", 331 | "- **Overall SMAPE:** ``\n", 332 | "- **Overall Error (%):** ``\n", 333 | "- **Overall PCD:** ``\n", 334 | "- **Overall U:** ``\n", 335 | "\n", 336 | "\n", 337 | "Scale error metrics to a 100% range\n", 338 | "\n", 339 | "- **Mean Error (%):** ``\n", 340 | "- **Mean Absolute Error (MAE) (%):** ``\n", 341 | "- **Mean Squared Error (MSE) (%):** ``\n", 342 | "- **Root Mean Squared Error (RMSE) (%):** ``\n", 343 | "- **Mean Absolute Percentage Error (MAPE) (%):** `inf`" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": null, 349 | "metadata": { 350 | "tags": [] 351 | }, 352 | "outputs": [], 353 | "source": [ 354 | "print(\"Shape:\", errors_df.shape)\n", 355 | "errors_df.head(5)" 356 | ] 357 | }, 358 | { 359 | "attachments": {}, 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "---" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 73, 369 | "metadata": { 370 | "tags": [] 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "errors_df.to_excel(\"errors.xlsx\", sheet_name='Errors Metrics') " 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 63, 380 | "metadata": { 381 | "tags": [] 382 | }, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "{'paths': ['s3://ds-dataset/Penetration_Prediction/Penetration_2022_Q-4/2022-Q4_errors.csv'],\n", 388 | " 'partitions_values': {}}" 389 | ] 390 | }, 391 | "execution_count": 63, 392 | "metadata": {}, 393 | "output_type": "execute_result" 394 | } 395 | ], 396 | "source": [ 397 | "wr.s3.to_csv(\n", 398 | " df=errors_df,\n", 399 | " index = False\n", 400 | " path='s3:/datasets/time_series/errors_metrics.csv'\n", 401 | ")" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": { 408 | "tags": [] 409 | }, 410 | "outputs": [], 411 | "source": [ 412 | "errors_df = wr.s3.read_csv('s3:/datasets/time_series/errors_metrics.csv')\n", 413 | "\n", 414 | 
"print(\"Shape:\", errors_df.shape)\n", 415 | "errors_df.head(5)" 416 | ] 417 | }, 418 | { 419 | "attachments": {}, 420 | "cell_type": "markdown", 421 | "metadata": { 422 | "tags": [] 423 | }, 424 | "source": [ 425 | "## Calculate Margin of Error" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": null, 431 | "metadata": { 432 | "tags": [] 433 | }, 434 | "outputs": [], 435 | "source": [ 436 | "import numpy as np\n", 437 | "import pandas as pd\n", 438 | "import scipy.stats as stats\n", 439 | "\n", 440 | "\n", 441 | "# Calculate the mean and standard deviation of the error\n", 442 | "error_mean = errors_df['Error'].mean()\n", 443 | "error_std = errors_df['Error'].std()\n", 444 | "\n", 445 | "# Define the desired confidence level (e.g., 95%)\n", 446 | "confidence_level = 0.95\n", 447 | "\n", 448 | "# Calculate the critical value based on the confidence level\n", 449 | "z_score = stats.norm.ppf((1 + confidence_level) / 2)\n", 450 | "\n", 451 | "# Calculate the margin of error\n", 452 | "margin_of_error = z_score * (error_std / np.sqrt(len(df)))\n", 453 | "\n", 454 | "print('Margin of Error:', margin_of_error)" 455 | ] 456 | }, 457 | { 458 | "attachments": {}, 459 | "cell_type": "markdown", 460 | "metadata": { 461 | "tags": [] 462 | }, 463 | "source": [ 464 | "## Charts" 465 | ] 466 | }, 467 | { 468 | "attachments": {}, 469 | "cell_type": "markdown", 470 | "metadata": { 471 | "tags": [] 472 | }, 473 | "source": [ 474 | "### Distribution of Mean Absolute Error (MAE)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": { 481 | "tags": [] 482 | }, 483 | "outputs": [], 484 | "source": [ 485 | "# Removing anything below actual of 1\n", 486 | "df = errors_df[errors_df['Actual'] >= 1.0]\n", 487 | "\n", 488 | "# Define the bins\n", 489 | "bins = np.arange(0, 101, 10)\n", 490 | "\n", 491 | "# Create a new column 'bins' in the dataframe to hold the bin each 'actual' value falls into\n", 492 | "df['bins'] = pd.cut(df['Actual'], bins, right=False)\n", 493 | "\n", 494 | "# Group by bins\n", 495 | "grouped = df.groupby('bins')\n", 496 | "\n", 497 | "# Calculate the Mean Absolute Error for each bin\n", 498 | "mae_values = grouped.apply(lambda g: mean_absolute_error(g['Actual'], \n", 499 | " g['Predicted']))\n", 500 | "\n", 501 | "# Plot histogram\n", 502 | "plt.bar(range(len(mae_values)), mae_values, color='lightblue')\n", 503 | "\n", 504 | "# Add MAE values on top of each bar\n", 505 | "for i, v in enumerate(mae_values):\n", 506 | " plt.text(i, v, f'{v:.2f}', ha='center', va='bottom')\n", 507 | "\n", 508 | "# Calculate and add total MAE to title\n", 509 | "total_mae = mean_absolute_error(df['Actual'], df['Predicted'])\n", 510 | "plt.title(f'Total MAE: {total_mae:.2f}')\n", 511 | "\n", 512 | "# Set x-axis labels to bin ranges\n", 513 | "plt.xticks(range(len(mae_values)), [str(i) for i in grouped.groups.keys()], rotation=45)\n", 514 | "\n", 515 | "# Set x and y labels\n", 516 | "plt.xlabel('Actual Value Bins')\n", 517 | "plt.ylabel('Mean Absolute Error')\n", 518 | "\n", 519 | "# Show the plot\n", 520 | "plt.tight_layout()\n", 521 | "plt.show()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": { 528 | "tags": [] 529 | }, 530 | "outputs": [], 531 | "source": [ 532 | "# Count of data points in each bin\n", 533 | "counts = df['bins'].value_counts().sort_index()\n", 534 | "\n", 535 | "# Plot histogram\n", 536 | "plt.bar(range(len(counts)), counts, color='lightgreen')\n", 537 | "\n", 538 | 
"# Add count values on top of each bar\n", 539 | "for i, v in enumerate(counts):\n", 540 | " plt.text(i, v, f'{v}', ha='center', va='bottom')\n", 541 | "\n", 542 | "# Set x-axis labels to bin ranges\n", 543 | "plt.xticks(range(len(counts)), [str(i) for i in counts.index], rotation=45)\n", 544 | "\n", 545 | "# Set x and y labels\n", 546 | "plt.xlabel('Actual Value Bins')\n", 547 | "plt.ylabel('Count of Data Points')\n", 548 | "\n", 549 | "# Show the plot\n", 550 | "plt.tight_layout()\n", 551 | "plt.show()\n" 552 | ] 553 | }, 554 | { 555 | "attachments": {}, 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "### Distribution of Mean Squared Error (MSE)" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [ 568 | "# Removing anything below actual of 1\n", 569 | "df = errors_df[errors_df['Actual'] >= 1.0] # 0\n", 570 | "\n", 571 | "# Define the bins\n", 572 | "bins = np.arange(0, 101, 10)\n", 573 | "\n", 574 | "# Create a new column 'bins' in the dataframe to hold the bin each 'actual' value falls into\n", 575 | "df['bins'] = pd.cut(df['Actual'], bins, right=False)\n", 576 | "\n", 577 | "# Group by bins\n", 578 | "grouped = df.groupby('bins')\n", 579 | "\n", 580 | "# Calculate the Mean Squared Error (MSE) for each bin\n", 581 | "mse_values = grouped.apply(lambda g: mean_squared_error(g['Actual'], \n", 582 | " g['Predicted']))\n", 583 | "\n", 584 | "# Plot histogram\n", 585 | "plt.bar(range(len(mse_values)), mse_values, color='lightblue')\n", 586 | "\n", 587 | "# Add MSE values on top of each bar\n", 588 | "for i, v in enumerate(mse_values):\n", 589 | " plt.text(i, v, f'{v:.2f}', ha='center', va='bottom')\n", 590 | "\n", 591 | "# Calculate and add total MSE to title\n", 592 | "total_mse = mean_squared_error(df['Actual'], df['Predicted'])\n", 593 | "plt.title(f'Total MSE: {total_mse:.2f}')\n", 594 | "\n", 595 | "# Set x-axis labels to bin ranges\n", 596 | "plt.xticks(range(len(mse_values)), [str(i) for i in grouped.groups.keys()], rotation=45)\n", 597 | "\n", 598 | "# Set x and y labels\n", 599 | "plt.xlabel('Actual Value Bins')\n", 600 | "plt.ylabel('Mean Squared Error (MSE)')\n", 601 | "\n", 602 | "# Show the plot\n", 603 | "plt.tight_layout()\n", 604 | "plt.show()" 605 | ] 606 | }, 607 | { 608 | "attachments": {}, 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "### Distribution of Root Mean Squared Error (RMSE)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": null, 618 | "metadata": { 619 | "tags": [] 620 | }, 621 | "outputs": [], 622 | "source": [ 623 | "# Removing anything below actual of 1\n", 624 | "df = errors_df[errors_df['Actual'] >= 0] # 0\n", 625 | "\n", 626 | "# Define the bins\n", 627 | "bins = np.arange(0, 101, 10)\n", 628 | "\n", 629 | "# Create a new column 'bins' in the dataframe to hold the bin each 'actual' value falls into\n", 630 | "df['bins'] = pd.cut(df['Actual'], bins, right=False)\n", 631 | "\n", 632 | "# Group by bins\n", 633 | "grouped = df.groupby('bins')\n", 634 | "\n", 635 | "# Calculate the Root Mean Squared Error (RMSE) for each bin\n", 636 | "rmse_values = grouped.apply(lambda g: sqrt(mean_squared_error(g['Actual'], \n", 637 | " g['Predicted'])))\n", 638 | "\n", 639 | "# Plot histogram\n", 640 | "plt.bar(range(len(rmse_values)), rmse_values, color='lightblue')\n", 641 | "\n", 642 | "# Add RMSE values on top of each bar\n", 643 | "for i, v in enumerate(rmse_values):\n", 644 | " plt.text(i, 
v, f'{v:.2f}', ha='center', va='bottom')\n", 645 | "\n", 646 | "# Calculate and add total MSE to title\n", 647 | "total_rmse = sqrt(mean_squared_error(df['Actual'], df['Predicted']))\n", 648 | "plt.title(f'Total RMSE: {total_rmse:.2f}')\n", 649 | "\n", 650 | "# Set x-axis labels to bin ranges\n", 651 | "plt.xticks(range(len(rmse_values)), [str(i) for i in grouped.groups.keys()], rotation=45)\n", 652 | "\n", 653 | "# Set x and y labels\n", 654 | "plt.xlabel('Actual Value Bins')\n", 655 | "plt.ylabel('Root Mean Squared Error (RMSE)')\n", 656 | "\n", 657 | "# Show the plot\n", 658 | "plt.tight_layout()\n", 659 | "plt.show()" 660 | ] 661 | } 662 | ], 663 | "metadata": { 664 | "availableInstances": [ 665 | { 666 | "_defaultOrder": 0, 667 | "_isFastLaunch": true, 668 | "category": "General purpose", 669 | "gpuNum": 0, 670 | "hideHardwareSpecs": false, 671 | "memoryGiB": 4, 672 | "name": "ml.t3.medium", 673 | "vcpuNum": 2 674 | }, 675 | { 676 | "_defaultOrder": 1, 677 | "_isFastLaunch": false, 678 | "category": "General purpose", 679 | "gpuNum": 0, 680 | "hideHardwareSpecs": false, 681 | "memoryGiB": 8, 682 | "name": "ml.t3.large", 683 | "vcpuNum": 2 684 | }, 685 | { 686 | "_defaultOrder": 2, 687 | "_isFastLaunch": false, 688 | "category": "General purpose", 689 | "gpuNum": 0, 690 | "hideHardwareSpecs": false, 691 | "memoryGiB": 16, 692 | "name": "ml.t3.xlarge", 693 | "vcpuNum": 4 694 | }, 695 | { 696 | "_defaultOrder": 3, 697 | "_isFastLaunch": false, 698 | "category": "General purpose", 699 | "gpuNum": 0, 700 | "hideHardwareSpecs": false, 701 | "memoryGiB": 32, 702 | "name": "ml.t3.2xlarge", 703 | "vcpuNum": 8 704 | }, 705 | { 706 | "_defaultOrder": 4, 707 | "_isFastLaunch": true, 708 | "category": "General purpose", 709 | "gpuNum": 0, 710 | "hideHardwareSpecs": false, 711 | "memoryGiB": 8, 712 | "name": "ml.m5.large", 713 | "vcpuNum": 2 714 | }, 715 | { 716 | "_defaultOrder": 5, 717 | "_isFastLaunch": false, 718 | "category": "General purpose", 719 | "gpuNum": 0, 720 | "hideHardwareSpecs": false, 721 | "memoryGiB": 16, 722 | "name": "ml.m5.xlarge", 723 | "vcpuNum": 4 724 | }, 725 | { 726 | "_defaultOrder": 6, 727 | "_isFastLaunch": false, 728 | "category": "General purpose", 729 | "gpuNum": 0, 730 | "hideHardwareSpecs": false, 731 | "memoryGiB": 32, 732 | "name": "ml.m5.2xlarge", 733 | "vcpuNum": 8 734 | }, 735 | { 736 | "_defaultOrder": 7, 737 | "_isFastLaunch": false, 738 | "category": "General purpose", 739 | "gpuNum": 0, 740 | "hideHardwareSpecs": false, 741 | "memoryGiB": 64, 742 | "name": "ml.m5.4xlarge", 743 | "vcpuNum": 16 744 | }, 745 | { 746 | "_defaultOrder": 8, 747 | "_isFastLaunch": false, 748 | "category": "General purpose", 749 | "gpuNum": 0, 750 | "hideHardwareSpecs": false, 751 | "memoryGiB": 128, 752 | "name": "ml.m5.8xlarge", 753 | "vcpuNum": 32 754 | }, 755 | { 756 | "_defaultOrder": 9, 757 | "_isFastLaunch": false, 758 | "category": "General purpose", 759 | "gpuNum": 0, 760 | "hideHardwareSpecs": false, 761 | "memoryGiB": 192, 762 | "name": "ml.m5.12xlarge", 763 | "vcpuNum": 48 764 | }, 765 | { 766 | "_defaultOrder": 10, 767 | "_isFastLaunch": false, 768 | "category": "General purpose", 769 | "gpuNum": 0, 770 | "hideHardwareSpecs": false, 771 | "memoryGiB": 256, 772 | "name": "ml.m5.16xlarge", 773 | "vcpuNum": 64 774 | }, 775 | { 776 | "_defaultOrder": 11, 777 | "_isFastLaunch": false, 778 | "category": "General purpose", 779 | "gpuNum": 0, 780 | "hideHardwareSpecs": false, 781 | "memoryGiB": 384, 782 | "name": "ml.m5.24xlarge", 783 | "vcpuNum": 96 784 | }, 785 | { 786 | 
"_defaultOrder": 12, 787 | "_isFastLaunch": false, 788 | "category": "General purpose", 789 | "gpuNum": 0, 790 | "hideHardwareSpecs": false, 791 | "memoryGiB": 8, 792 | "name": "ml.m5d.large", 793 | "vcpuNum": 2 794 | }, 795 | { 796 | "_defaultOrder": 13, 797 | "_isFastLaunch": false, 798 | "category": "General purpose", 799 | "gpuNum": 0, 800 | "hideHardwareSpecs": false, 801 | "memoryGiB": 16, 802 | "name": "ml.m5d.xlarge", 803 | "vcpuNum": 4 804 | }, 805 | { 806 | "_defaultOrder": 14, 807 | "_isFastLaunch": false, 808 | "category": "General purpose", 809 | "gpuNum": 0, 810 | "hideHardwareSpecs": false, 811 | "memoryGiB": 32, 812 | "name": "ml.m5d.2xlarge", 813 | "vcpuNum": 8 814 | }, 815 | { 816 | "_defaultOrder": 15, 817 | "_isFastLaunch": false, 818 | "category": "General purpose", 819 | "gpuNum": 0, 820 | "hideHardwareSpecs": false, 821 | "memoryGiB": 64, 822 | "name": "ml.m5d.4xlarge", 823 | "vcpuNum": 16 824 | }, 825 | { 826 | "_defaultOrder": 16, 827 | "_isFastLaunch": false, 828 | "category": "General purpose", 829 | "gpuNum": 0, 830 | "hideHardwareSpecs": false, 831 | "memoryGiB": 128, 832 | "name": "ml.m5d.8xlarge", 833 | "vcpuNum": 32 834 | }, 835 | { 836 | "_defaultOrder": 17, 837 | "_isFastLaunch": false, 838 | "category": "General purpose", 839 | "gpuNum": 0, 840 | "hideHardwareSpecs": false, 841 | "memoryGiB": 192, 842 | "name": "ml.m5d.12xlarge", 843 | "vcpuNum": 48 844 | }, 845 | { 846 | "_defaultOrder": 18, 847 | "_isFastLaunch": false, 848 | "category": "General purpose", 849 | "gpuNum": 0, 850 | "hideHardwareSpecs": false, 851 | "memoryGiB": 256, 852 | "name": "ml.m5d.16xlarge", 853 | "vcpuNum": 64 854 | }, 855 | { 856 | "_defaultOrder": 19, 857 | "_isFastLaunch": false, 858 | "category": "General purpose", 859 | "gpuNum": 0, 860 | "hideHardwareSpecs": false, 861 | "memoryGiB": 384, 862 | "name": "ml.m5d.24xlarge", 863 | "vcpuNum": 96 864 | }, 865 | { 866 | "_defaultOrder": 20, 867 | "_isFastLaunch": false, 868 | "category": "General purpose", 869 | "gpuNum": 0, 870 | "hideHardwareSpecs": true, 871 | "memoryGiB": 0, 872 | "name": "ml.geospatial.interactive", 873 | "supportedImageNames": [ 874 | "sagemaker-geospatial-v1-0" 875 | ], 876 | "vcpuNum": 0 877 | }, 878 | { 879 | "_defaultOrder": 21, 880 | "_isFastLaunch": true, 881 | "category": "Compute optimized", 882 | "gpuNum": 0, 883 | "hideHardwareSpecs": false, 884 | "memoryGiB": 4, 885 | "name": "ml.c5.large", 886 | "vcpuNum": 2 887 | }, 888 | { 889 | "_defaultOrder": 22, 890 | "_isFastLaunch": false, 891 | "category": "Compute optimized", 892 | "gpuNum": 0, 893 | "hideHardwareSpecs": false, 894 | "memoryGiB": 8, 895 | "name": "ml.c5.xlarge", 896 | "vcpuNum": 4 897 | }, 898 | { 899 | "_defaultOrder": 23, 900 | "_isFastLaunch": false, 901 | "category": "Compute optimized", 902 | "gpuNum": 0, 903 | "hideHardwareSpecs": false, 904 | "memoryGiB": 16, 905 | "name": "ml.c5.2xlarge", 906 | "vcpuNum": 8 907 | }, 908 | { 909 | "_defaultOrder": 24, 910 | "_isFastLaunch": false, 911 | "category": "Compute optimized", 912 | "gpuNum": 0, 913 | "hideHardwareSpecs": false, 914 | "memoryGiB": 32, 915 | "name": "ml.c5.4xlarge", 916 | "vcpuNum": 16 917 | }, 918 | { 919 | "_defaultOrder": 25, 920 | "_isFastLaunch": false, 921 | "category": "Compute optimized", 922 | "gpuNum": 0, 923 | "hideHardwareSpecs": false, 924 | "memoryGiB": 72, 925 | "name": "ml.c5.9xlarge", 926 | "vcpuNum": 36 927 | }, 928 | { 929 | "_defaultOrder": 26, 930 | "_isFastLaunch": false, 931 | "category": "Compute optimized", 932 | "gpuNum": 0, 933 | 
"hideHardwareSpecs": false, 934 | "memoryGiB": 96, 935 | "name": "ml.c5.12xlarge", 936 | "vcpuNum": 48 937 | }, 938 | { 939 | "_defaultOrder": 27, 940 | "_isFastLaunch": false, 941 | "category": "Compute optimized", 942 | "gpuNum": 0, 943 | "hideHardwareSpecs": false, 944 | "memoryGiB": 144, 945 | "name": "ml.c5.18xlarge", 946 | "vcpuNum": 72 947 | }, 948 | { 949 | "_defaultOrder": 28, 950 | "_isFastLaunch": false, 951 | "category": "Compute optimized", 952 | "gpuNum": 0, 953 | "hideHardwareSpecs": false, 954 | "memoryGiB": 192, 955 | "name": "ml.c5.24xlarge", 956 | "vcpuNum": 96 957 | }, 958 | { 959 | "_defaultOrder": 29, 960 | "_isFastLaunch": true, 961 | "category": "Accelerated computing", 962 | "gpuNum": 1, 963 | "hideHardwareSpecs": false, 964 | "memoryGiB": 16, 965 | "name": "ml.g4dn.xlarge", 966 | "vcpuNum": 4 967 | }, 968 | { 969 | "_defaultOrder": 30, 970 | "_isFastLaunch": false, 971 | "category": "Accelerated computing", 972 | "gpuNum": 1, 973 | "hideHardwareSpecs": false, 974 | "memoryGiB": 32, 975 | "name": "ml.g4dn.2xlarge", 976 | "vcpuNum": 8 977 | }, 978 | { 979 | "_defaultOrder": 31, 980 | "_isFastLaunch": false, 981 | "category": "Accelerated computing", 982 | "gpuNum": 1, 983 | "hideHardwareSpecs": false, 984 | "memoryGiB": 64, 985 | "name": "ml.g4dn.4xlarge", 986 | "vcpuNum": 16 987 | }, 988 | { 989 | "_defaultOrder": 32, 990 | "_isFastLaunch": false, 991 | "category": "Accelerated computing", 992 | "gpuNum": 1, 993 | "hideHardwareSpecs": false, 994 | "memoryGiB": 128, 995 | "name": "ml.g4dn.8xlarge", 996 | "vcpuNum": 32 997 | }, 998 | { 999 | "_defaultOrder": 33, 1000 | "_isFastLaunch": false, 1001 | "category": "Accelerated computing", 1002 | "gpuNum": 4, 1003 | "hideHardwareSpecs": false, 1004 | "memoryGiB": 192, 1005 | "name": "ml.g4dn.12xlarge", 1006 | "vcpuNum": 48 1007 | }, 1008 | { 1009 | "_defaultOrder": 34, 1010 | "_isFastLaunch": false, 1011 | "category": "Accelerated computing", 1012 | "gpuNum": 1, 1013 | "hideHardwareSpecs": false, 1014 | "memoryGiB": 256, 1015 | "name": "ml.g4dn.16xlarge", 1016 | "vcpuNum": 64 1017 | }, 1018 | { 1019 | "_defaultOrder": 35, 1020 | "_isFastLaunch": false, 1021 | "category": "Accelerated computing", 1022 | "gpuNum": 1, 1023 | "hideHardwareSpecs": false, 1024 | "memoryGiB": 61, 1025 | "name": "ml.p3.2xlarge", 1026 | "vcpuNum": 8 1027 | }, 1028 | { 1029 | "_defaultOrder": 36, 1030 | "_isFastLaunch": false, 1031 | "category": "Accelerated computing", 1032 | "gpuNum": 4, 1033 | "hideHardwareSpecs": false, 1034 | "memoryGiB": 244, 1035 | "name": "ml.p3.8xlarge", 1036 | "vcpuNum": 32 1037 | }, 1038 | { 1039 | "_defaultOrder": 37, 1040 | "_isFastLaunch": false, 1041 | "category": "Accelerated computing", 1042 | "gpuNum": 8, 1043 | "hideHardwareSpecs": false, 1044 | "memoryGiB": 488, 1045 | "name": "ml.p3.16xlarge", 1046 | "vcpuNum": 64 1047 | }, 1048 | { 1049 | "_defaultOrder": 38, 1050 | "_isFastLaunch": false, 1051 | "category": "Accelerated computing", 1052 | "gpuNum": 8, 1053 | "hideHardwareSpecs": false, 1054 | "memoryGiB": 768, 1055 | "name": "ml.p3dn.24xlarge", 1056 | "vcpuNum": 96 1057 | }, 1058 | { 1059 | "_defaultOrder": 39, 1060 | "_isFastLaunch": false, 1061 | "category": "Memory Optimized", 1062 | "gpuNum": 0, 1063 | "hideHardwareSpecs": false, 1064 | "memoryGiB": 16, 1065 | "name": "ml.r5.large", 1066 | "vcpuNum": 2 1067 | }, 1068 | { 1069 | "_defaultOrder": 40, 1070 | "_isFastLaunch": false, 1071 | "category": "Memory Optimized", 1072 | "gpuNum": 0, 1073 | "hideHardwareSpecs": false, 1074 | "memoryGiB": 32, 1075 | 
"name": "ml.r5.xlarge", 1076 | "vcpuNum": 4 1077 | }, 1078 | { 1079 | "_defaultOrder": 41, 1080 | "_isFastLaunch": false, 1081 | "category": "Memory Optimized", 1082 | "gpuNum": 0, 1083 | "hideHardwareSpecs": false, 1084 | "memoryGiB": 64, 1085 | "name": "ml.r5.2xlarge", 1086 | "vcpuNum": 8 1087 | }, 1088 | { 1089 | "_defaultOrder": 42, 1090 | "_isFastLaunch": false, 1091 | "category": "Memory Optimized", 1092 | "gpuNum": 0, 1093 | "hideHardwareSpecs": false, 1094 | "memoryGiB": 128, 1095 | "name": "ml.r5.4xlarge", 1096 | "vcpuNum": 16 1097 | }, 1098 | { 1099 | "_defaultOrder": 43, 1100 | "_isFastLaunch": false, 1101 | "category": "Memory Optimized", 1102 | "gpuNum": 0, 1103 | "hideHardwareSpecs": false, 1104 | "memoryGiB": 256, 1105 | "name": "ml.r5.8xlarge", 1106 | "vcpuNum": 32 1107 | }, 1108 | { 1109 | "_defaultOrder": 44, 1110 | "_isFastLaunch": false, 1111 | "category": "Memory Optimized", 1112 | "gpuNum": 0, 1113 | "hideHardwareSpecs": false, 1114 | "memoryGiB": 384, 1115 | "name": "ml.r5.12xlarge", 1116 | "vcpuNum": 48 1117 | }, 1118 | { 1119 | "_defaultOrder": 45, 1120 | "_isFastLaunch": false, 1121 | "category": "Memory Optimized", 1122 | "gpuNum": 0, 1123 | "hideHardwareSpecs": false, 1124 | "memoryGiB": 512, 1125 | "name": "ml.r5.16xlarge", 1126 | "vcpuNum": 64 1127 | }, 1128 | { 1129 | "_defaultOrder": 46, 1130 | "_isFastLaunch": false, 1131 | "category": "Memory Optimized", 1132 | "gpuNum": 0, 1133 | "hideHardwareSpecs": false, 1134 | "memoryGiB": 768, 1135 | "name": "ml.r5.24xlarge", 1136 | "vcpuNum": 96 1137 | }, 1138 | { 1139 | "_defaultOrder": 47, 1140 | "_isFastLaunch": false, 1141 | "category": "Accelerated computing", 1142 | "gpuNum": 1, 1143 | "hideHardwareSpecs": false, 1144 | "memoryGiB": 16, 1145 | "name": "ml.g5.xlarge", 1146 | "vcpuNum": 4 1147 | }, 1148 | { 1149 | "_defaultOrder": 48, 1150 | "_isFastLaunch": false, 1151 | "category": "Accelerated computing", 1152 | "gpuNum": 1, 1153 | "hideHardwareSpecs": false, 1154 | "memoryGiB": 32, 1155 | "name": "ml.g5.2xlarge", 1156 | "vcpuNum": 8 1157 | }, 1158 | { 1159 | "_defaultOrder": 49, 1160 | "_isFastLaunch": false, 1161 | "category": "Accelerated computing", 1162 | "gpuNum": 1, 1163 | "hideHardwareSpecs": false, 1164 | "memoryGiB": 64, 1165 | "name": "ml.g5.4xlarge", 1166 | "vcpuNum": 16 1167 | }, 1168 | { 1169 | "_defaultOrder": 50, 1170 | "_isFastLaunch": false, 1171 | "category": "Accelerated computing", 1172 | "gpuNum": 1, 1173 | "hideHardwareSpecs": false, 1174 | "memoryGiB": 128, 1175 | "name": "ml.g5.8xlarge", 1176 | "vcpuNum": 32 1177 | }, 1178 | { 1179 | "_defaultOrder": 51, 1180 | "_isFastLaunch": false, 1181 | "category": "Accelerated computing", 1182 | "gpuNum": 1, 1183 | "hideHardwareSpecs": false, 1184 | "memoryGiB": 256, 1185 | "name": "ml.g5.16xlarge", 1186 | "vcpuNum": 64 1187 | }, 1188 | { 1189 | "_defaultOrder": 52, 1190 | "_isFastLaunch": false, 1191 | "category": "Accelerated computing", 1192 | "gpuNum": 4, 1193 | "hideHardwareSpecs": false, 1194 | "memoryGiB": 192, 1195 | "name": "ml.g5.12xlarge", 1196 | "vcpuNum": 48 1197 | }, 1198 | { 1199 | "_defaultOrder": 53, 1200 | "_isFastLaunch": false, 1201 | "category": "Accelerated computing", 1202 | "gpuNum": 4, 1203 | "hideHardwareSpecs": false, 1204 | "memoryGiB": 384, 1205 | "name": "ml.g5.24xlarge", 1206 | "vcpuNum": 96 1207 | }, 1208 | { 1209 | "_defaultOrder": 54, 1210 | "_isFastLaunch": false, 1211 | "category": "Accelerated computing", 1212 | "gpuNum": 8, 1213 | "hideHardwareSpecs": false, 1214 | "memoryGiB": 768, 1215 | "name": 
"ml.g5.48xlarge", 1216 | "vcpuNum": 192 1217 | }, 1218 | { 1219 | "_defaultOrder": 55, 1220 | "_isFastLaunch": false, 1221 | "category": "Accelerated computing", 1222 | "gpuNum": 8, 1223 | "hideHardwareSpecs": false, 1224 | "memoryGiB": 1152, 1225 | "name": "ml.p4d.24xlarge", 1226 | "vcpuNum": 96 1227 | }, 1228 | { 1229 | "_defaultOrder": 56, 1230 | "_isFastLaunch": false, 1231 | "category": "Accelerated computing", 1232 | "gpuNum": 8, 1233 | "hideHardwareSpecs": false, 1234 | "memoryGiB": 1152, 1235 | "name": "ml.p4de.24xlarge", 1236 | "vcpuNum": 96 1237 | } 1238 | ], 1239 | "instance_type": "ml.t3.medium", 1240 | "kernelspec": { 1241 | "display_name": "Python 3 (Data Science 3.0)", 1242 | "language": "python", 1243 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" 1244 | }, 1245 | "language_info": { 1246 | "codemirror_mode": { 1247 | "name": "ipython", 1248 | "version": 3 1249 | }, 1250 | "file_extension": ".py", 1251 | "mimetype": "text/x-python", 1252 | "name": "python", 1253 | "nbconvert_exporter": "python", 1254 | "pygments_lexer": "ipython3", 1255 | "version": "3.10.6" 1256 | } 1257 | }, 1258 | "nbformat": 4, 1259 | "nbformat_minor": 4 1260 | } 1261 | -------------------------------------------------------------------------------- /Missing Data/README.md: -------------------------------------------------------------------------------- 1 | # Handling Missing Data in Time Series 2 | 3 | - Interpolation 4 | - [Handling Missing Response](https://orbit-ml.readthedocs.io/en/stable/tutorials/ets_lgt_dlt_missing_response.html?highlight=OrbitPalette) in Orbit models 5 | - :octocat: [examples/ets_lgt_dlt_missing_response.ipynb](https://github.com/uber/orbit/blob/master/examples/ets_lgt_dlt_missing_response.ipynb) — Handling Missing Response in Exponential Smoothing Models 6 | - 7 | -------------------------------------------------------------------------------- /Model Selection/Cross-validation/README.md: -------------------------------------------------------------------------------- 1 | # Cross validation of Time Series data 2 | 3 | Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model. To achieve this, one solution is provided by TimeSeriesSplit. 
4 | 5 | | Title | Description, Information | 6 | | :---: | :--- | 7 | |**K-Folds cross-validator**| Splits the data into *k* consecutive folds, each used once as a test set; it assumes i.i.d. samples, so it is generally unsuitable for time series. | 8 | |**Time Series Split**| Time-ordered variant of k-fold in which successive training sets are supersets of the preceding ones and the test set always lies after the training set in time (see the sketch above). | 9 | |**Backtest Orbit Model**| Re-training and evaluating an Orbit model over expanding or rolling windows with `BackTester` (see `backtest_orbit_model.ipynb`). | 10 | |**Decompose Prediction**| Evaluating the decomposed predictions of Orbit models (see `eval_decompose_pred.ipynb`). | 11 | 12 | 13 | - [Building a Backtesting Service to Measure Model Performance at Uber-scale](https://eng.uber.com/backtesting-at-scale/) 14 | -------------------------------------------------------------------------------- /Orbit Models/README.md: -------------------------------------------------------------------------------- 1 | # Orbit 2 | > by UBER 3 | 4 | - 📰 **Paper:** [Orbit: Probabilistic Forecast with Exponential Smoothing](https://arxiv.org/abs/2004.08492) 5 | - 📄 **Articles:** 6 | - [The New Version of Orbit (v1.1) is Released: The Improvements, Design Changes, and Exciting Collaborations](https://www.uber.com/en-UA/blog/the-new-version-of-orbit-v1-1-is-released/) 7 | - **Implementation:** 8 | - [SVI Part I: An Introduction to Stochastic Variational Inference in Pyro](https://pyro.ai/examples/svi_part_i.html) 9 | 10 | Orbit (**O**bject-**OR**iented **B**ayes**I**an **T**ime Series) is a general interface for **Bayesian exponential smoothing models**. The goal of the Orbit development team is to create a tool that is easy to use, flexible, interpretable, and high-performing (fast computation). Under the hood, Orbit uses probabilistic programming languages (PPLs), including but not limited to Stan and Pyro, for posterior approximation (i.e., MCMC sampling, SVI). Below is a quadrant chart positioning a few time-series-related packages in our assessment of flexibility and completeness. Orbit is the only tool that allows for easy model specification and analysis while not limiting itself to a small subset of models. For example, Prophet offers a complete end-to-end solution but only one model type, while Pyro offers full model-specification flexibility but no end-to-end solution. Thus Orbit bridges the gap between business problems and statistical solutions. 11 | 12 | Orbit is also computationally efficient: the proposed models consistently outperform baseline time series models in terms of the SMAPE metric. 13 | 14 | ## Models 15 | 16 | ### Local Global Trend (LGT) 17 | 18 | Local Global Trend (LGT) was created to handle outliers and anomalous data. The model is devised to forecast non-seasonal time series and is particularly useful for short series. LGT is constructed on the basis of Holt’s linear trend method; it allows for a more general error term (heteroscedasticity) and adds a constant “global” trend to the model. 19 | 20 | In this model we take a two-sided 68% confidence interval (around the mean). 21 | 22 | > On a normal distribution, about 68% of the data falls within one standard deviation of the mean. 23 | 24 | ### Pyro 25 | 26 | - **Implementation:** 27 | - :octocat: [Pyro](https://github.com/pyro-ppl/pyro) 28 | 29 | 30 | ## Stable version 31 | 32 | - [Orbit: A Python Package for Bayesian Forecasting](https://github.com/uber/orbit/tree/master) 33 | - [Orbit’s Documentation](https://orbit-ml.readthedocs.io/en/stable/) 34 | - [Quick Start](https://orbit-ml.readthedocs.io/en/stable/tutorials/quick_start.html#) 35 | 36 | 37 | Orbit is a general interface for **Bayesian time series modeling**. 38 | Currently, it supports concrete implementations for the following models: 39 | 1. Exponential Smoothing (ETS) 40 | 2. Damped Local Trend (DLT) 41 | 3. Local Global Trend (LGT) 42 | 4. 
Kernel Time-based Regression (KTR-Lite) 43 | 44 | It also supports the following sampling methods for model estimation: 45 | 1. Markov-Chain Monte Carlo (MCMC) as a full sampling method 46 | 2. Maximum a Posteriori (MAP) as a point estimate method 47 | 3. Variational Inference (VI) as a hybrid sampling method on an approximate distribution 48 | 49 | Orbit enables the easy decomposition of a KPI time series into **trend**, **seasonality**, and **marketing channel effects**. This decomposition enables unbiased forecasting and dynamic insights, including cost curves and ROAS of marketing channels. 50 | 51 | ## Conclusions 52 | 53 | ### How can we improve the model? 54 | 55 | Most time series models (classical as well as the Bayesian ones we use) compute and forecast directly from the data they receive. The main way to improve the accuracy of such models on our data is to increase the amount of input data (the length of the series). If we save and feed in more data each year, the model will iteratively learn from the new observations and make more accurate predictions. We can also interpolate, which works well and increases the frequency of the data. 56 | 57 | ### Installation 58 | 59 | **Install from PyPI:** 60 | 61 | pip install orbit-ml 62 | 63 | **Install from source:** 64 | 65 | ```bash 66 | $ git clone https://github.com/uber/orbit.git 67 | $ cd orbit 68 | $ pip install -r requirements.txt 69 | $ pip install . 70 | ``` 71 | 72 | ## Latest dev version 73 | 74 | - [Orbit: A Python Package for Bayesian Forecasting](https://github.com/uber/orbit/tree/dev) 75 | - [Orbit’s Documentation](https://orbit-ml.readthedocs.io/en/latest/) (latest dev version) 76 | - [Quick Start](https://orbit-ml.readthedocs.io/en/latest/tutorials/quick_start.html) 77 | 78 | ### Installation 79 | 80 | **Installing from Dev Branch:** 81 | 82 | pip install git+https://github.com/uber/orbit.git@dev 83 | -------------------------------------------------------------------------------- /Orbit Models/stable version/backtest_orbit_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Backtest Orbit Model\n", 8 | "\n", 9 | "- [Orbit: A Python Package for Bayesian Forecasting](https://github.com/uber/orbit)\n", 10 | "- [Orbit’s Documentation](https://orbit-ml.readthedocs.io/en/stable/index.html)\n", 11 | "- [Quick Start](https://orbit-ml.readthedocs.io/en/stable/tutorials/quick_start.html)\n", 12 | "- [Orbit: Probabilistic Forecast with Exponential Smoothing](https://arxiv.org/abs/2004.08492) Paper\n", 13 | "- [Backtest Orbit Model Documentation](https://orbit-ml.readthedocs.io/en/stable/tutorials/backtest.html)\n", 14 | "\n", 15 | "\n", 16 | "## Backtest\n", 17 | "\n", 18 | "The way to gauge the performance of a time-series model is to re-train it on different historic periods and check its forecasts over a certain number of steps. This is similar to time-based cross-validation. 
More often, we called it `backtest` in time-series modeling.\n", 19 | "\n", 20 | "Two schemes supported for the back-testing engine: **expanding window** and **rolling window**.\n", 21 | "\n", 22 | "### Implemented Models\n", 23 | "\n", 24 | "- ETS (which stands for Error, Trend, and Seasonality) Model\n", 25 | "- Methods of Estimations\n", 26 | " - Maximum a Posteriori (MAP)\n", 27 | " - Full Bayesian Estimation\n", 28 | " - Aggregated Posteriors\n", 29 | "- Damped Local Trend (DLT)\n", 30 | " - Global Trend Configurations:\n", 31 | " - Linear Global Trend\n", 32 | " - Log-Linear Global Trend\n", 33 | " - Flat Global Trend\n", 34 | " - Logistic Global Trend\n", 35 | " - Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 36 | "- Local Global Trend (LGT)\n", 37 | " - Local Global Trend Maximum a Posteriori (LGTMAP)\n", 38 | " - Local Global Trend for full Bayesian prediction (LGTFull)\n", 39 | " - Local Global Trend for aggregated posterior prediction (LGTAggregated)\n", 40 | "- Using Pyro for Estimation\n", 41 | " - MAP Fit and Predict\n", 42 | " - VI Fit and Predict\n", 43 | "- Kernel-based Time-varying Regression (KTR)\n", 44 | " - Kernel-based Time-varying Regression Lite (KTRLite)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "!pip install orbit-ml==1.0.17 --upgrade --no-input" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": null, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "import numpy as np\n", 64 | "import boto3\n", 65 | "from sagemaker import get_execution_role\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "%matplotlib inline\n", 68 | "plt.style.use('ggplot')\n", 69 | "\n", 70 | "import orbit\n", 71 | "from orbit.models.dlt import ETSFull, ETSMAP, ETSAggregated, DLTMAP, DLTFull, DLTMAP, DLTAggregated\n", 72 | "from orbit.models.lgt import LGTMAP, LGTAggregated, LGTFull\n", 73 | "from orbit.models.ktrlite import KTRLiteMAP\n", 74 | "from orbit.estimators.pyro_estimator import PyroEstimatorVI, PyroEstimatorMAP\n", 75 | "from orbit.diagnostics.backtest import BackTester, TimeSeriesSplitter\n", 76 | "\n", 77 | "import warnings\n", 78 | "warnings.filterwarnings('ignore')" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "print(orbit.__version__)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "role = get_execution_role()\n", 97 | "bucket='...'\n", 98 | "data_key = '...'\n", 99 | "data_location = 's3://{}/{}'.format(bucket, data_key)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "df = pd.DataFrame(pd.read_csv(data_location))\n", 109 | "df = df.drop(df.index[11:]) #drop to forecast for N years only\n", 110 | "df['Date'] = pd.to_datetime(df['Date'].astype(str))\n", 111 | "df['Value'] = df['Value'].astype(float)\n", 112 | "\n", 113 | "df.head(5)" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "date_col = 'Date'\n", 123 | "response_col = 'Value'" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## Create a TimeSeriesSplitter\n", 131 | "\n", 132 | "### Expanding window\n", 
133 | "\n", 134 | "- **expanding window:** for each back-testing model training, the train start date is fixed, while the train end date is extended forward." 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# configs\n", 144 | "min_train_len = 4 # minimal length of window length\n", 145 | "forecast_len = 2 # length forecast window\n", 146 | "incremental_len = 1 # step length for moving forward" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# configs\n", 156 | "min_train_len = 5 # in case of rolling window, this specify the length of window length\n", 157 | "forecast_len = 1 # length forecast window\n", 158 | "incremental_len = 1 # step length for moving forward\n", 159 | "window_type = 'expanding' # 'rolling'" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "ex_splitter = TimeSeriesSplitter(df,\n", 169 | " min_train_len=min_train_len,\n", 170 | " incremental_len=incremental_len,\n", 171 | " forecast_len=forecast_len,\n", 172 | " window_type='expanding', \n", 173 | " date_col='Date')" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "print(ex_splitter)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "_ = ex_splitter.plot()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "## Create a BackTester" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "# instantiate a model\n", 208 | "dlt = DLTMAP(\n", 209 | " date_col='Date',\n", 210 | " response_col='Value'\n", 211 | ")\n", 212 | "\n", 213 | "lgt_vi = LGTFull(\n", 214 | " response_col='Value',\n", 215 | " date_col='Date',\n", 216 | " seasonality=52,\n", 217 | " seed=8888,\n", 218 | " num_steps=101,\n", 219 | " num_sample=100,\n", 220 | " learning_rate=0.1,\n", 221 | " n_bootstrap_draws=-1,\n", 222 | " estimator_type=PyroEstimatorVI,\n", 223 | ")\n", 224 | "\n", 225 | "etsMAP_model = ETSMAP(\n", 226 | " response_col='Value',\n", 227 | " date_col='Date',\n", 228 | " seasonality=52,\n", 229 | " seed=8888,\n", 230 | ")" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "# configs\n", 240 | "min_train_len = 3\n", 241 | "forecast_len = 1\n", 242 | "incremental_len = 1\n", 243 | "window_type = 'expanding'\n", 244 | "\n", 245 | "bt = BackTester(\n", 246 | " model=etsMAP_model,\n", 247 | " df=df,\n", 248 | " min_train_len=min_train_len,\n", 249 | " incremental_len=incremental_len,\n", 250 | " forecast_len=forecast_len,\n", 251 | " window_type=window_type,\n", 252 | ")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## Backtest fit and predict" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "bt.fit_predict()" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | 
"source": [ 277 | "predicted_df = bt.get_predicted_df()\n", 278 | "predicted_df.head(50)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "predicted_df.shape" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "predicted_df.loc[predicted_df['training_data'] == False]" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "help(bt.score)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "score = bt.score()" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "def mse_naive(test_actual):\n", 324 | " actual = test_actual[1:]\n", 325 | " predicted = test_actual[:-1]\n", 326 | " return np.mean(np.square(actual - predicted))\n", 327 | "\n", 328 | "def naive_error(train_actual, test_predicted):\n", 329 | " train_mean = np.mean(train_actual)\n", 330 | " return np.mean(np.abs(test_predicted - train_mean))\n", 331 | "\n", 332 | "def rmse(train_actual, test_predicted):\n", 333 | " print(train_actual[-7:])\n", 334 | " print(test_predicted)\n", 335 | " return np.sqrt(np.square(np.subtract(train_actual[-7:],test_predicted[:-1])).mean())" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [ 344 | "bt.score(metrics=[rmse])" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "# \n", 352 | "\n", 353 | "-------------------------\n", 354 | "\n", 355 | "## Backtesting\n", 356 | "\n", 357 | "### Expanding Window\n", 358 | "\n", 359 | "> for each back-testing model training, the train start date is fixed, while the train end date is extended forward.\n", 360 | "\n", 361 | "### Rolling Window\n", 362 | "\n", 363 | "> for each back-testing model training, the training window length is fixed but the window is moving forward." 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": null, 369 | "metadata": {}, 370 | "outputs": [], 371 | "source": [ 372 | "role = get_execution_role()\n", 373 | "bucket='...'\n", 374 | "data_key = '...' 
\n", 375 | "data_location = 's3://{}/{}'.format(bucket, data_key)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "df = pd.DataFrame(pd.read_csv(data_location))" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": null, 390 | "metadata": {}, 391 | "outputs": [], 392 | "source": [ 393 | "df = df.rename({'Unnamed: 0': 'Date'}, axis = 1)\n", 394 | "df.index = df['Date']" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": null, 400 | "metadata": {}, 401 | "outputs": [], 402 | "source": [ 403 | "df.shape" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": null, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "df" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "#### Orbit Models" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "# ETS (which stands for Error, Trend, and Seasonality)\n", 429 | "\n", 430 | "# Methods of Estimations\n", 431 | "\n", 432 | "# Maximum a Posteriori (MAP)\n", 433 | "\n", 434 | "# The advantage of MAP estimation is a faster computational speed.\n", 435 | "\n", 436 | "def ETSMAP_model(date_col, response_col, tmp_df, \n", 437 | " min_train_len, forecast_len, incremental_len, window_type):\n", 438 | " \n", 439 | " ets = ETSMAP(\n", 440 | " response_col=response_col,\n", 441 | " date_col=date_col,\n", 442 | " seasonality=52,\n", 443 | " seed=8888\n", 444 | " )\n", 445 | " \n", 446 | " bt = BackTester(\n", 447 | " model=ets,\n", 448 | " df=tmp_df,\n", 449 | " min_train_len=min_train_len,\n", 450 | " incremental_len=incremental_len,\n", 451 | " forecast_len=forecast_len,\n", 452 | " window_type=window_type,\n", 453 | " )\n", 454 | " \n", 455 | " bt.fit_predict()\n", 456 | " \n", 457 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 458 | "\n", 459 | "\n", 460 | "# Full Bayesian Estimation\n", 461 | "\n", 462 | "\n", 463 | "def ETSFull_model(date_col, response_col, tmp_df, \n", 464 | " min_train_len, forecast_len, incremental_len, window_type):\n", 465 | " \n", 466 | " ets = ETSFull(\n", 467 | " response_col=response_col,\n", 468 | " date_col=date_col,\n", 469 | " seasonality=52,\n", 470 | " seed=8888,\n", 471 | " num_warmup=400,\n", 472 | " num_sample=400,\n", 473 | " )\n", 474 | " \n", 475 | " bt = BackTester(\n", 476 | " model=ets,\n", 477 | " df=tmp_df,\n", 478 | " min_train_len=min_train_len,\n", 479 | " incremental_len=incremental_len,\n", 480 | " forecast_len=forecast_len,\n", 481 | " window_type=window_type,\n", 482 | " )\n", 483 | " \n", 484 | " bt.fit_predict()\n", 485 | " \n", 486 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 487 | "\n", 488 | "# Aggregated Posteriors\n", 489 | "\n", 490 | "\n", 491 | "def ETSAggregated_model(date_col, response_col, tmp_df, \n", 492 | " min_train_len, forecast_len, incremental_len, window_type):\n", 493 | " \n", 494 | " ets = ETSAggregated(\n", 495 | " response_col=response_col,\n", 496 | " date_col=date_col,\n", 497 | " seasonality=52,\n", 498 | " seed=8888,\n", 499 | " )\n", 500 | " \n", 501 | " bt = BackTester(\n", 502 | " model=ets,\n", 503 | " df=tmp_df,\n", 504 | " min_train_len=min_train_len,\n", 505 | " incremental_len=incremental_len,\n", 506 | " forecast_len=forecast_len,\n", 507 | " window_type=window_type,\n", 508 | " )\n", 509 | " \n", 510 | " 
bt.fit_predict()\n", 511 | " \n", 512 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 513 | "\n", 514 | "\n", 515 | "# Damped Local Trend (DLT)\n", 516 | "\n", 517 | "# Global Trend Configurations\n", 518 | "\n", 519 | "# Linear Global Trend\n", 520 | "\n", 521 | "# linear global trend\n", 522 | "def DLTMAP_lin(date_col, response_col, tmp_df, \n", 523 | " min_train_len, forecast_len, incremental_len, window_type):\n", 524 | " \n", 525 | " dlt = DLTMAP(\n", 526 | " response_col=response_col,\n", 527 | " date_col=date_col,\n", 528 | " seasonality=52,\n", 529 | " seed=8888,\n", 530 | " )\n", 531 | "\n", 532 | " bt = BackTester(\n", 533 | " model=dlt,\n", 534 | " df=tmp_df,\n", 535 | " min_train_len=min_train_len,\n", 536 | " incremental_len=incremental_len,\n", 537 | " forecast_len=forecast_len,\n", 538 | " window_type=window_type,\n", 539 | " )\n", 540 | " \n", 541 | " bt.fit_predict()\n", 542 | " \n", 543 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 544 | "\n", 545 | "\n", 546 | "# log-linear global trend\n", 547 | "def DLTMAP_log_lin(date_col, response_col, tmp_df, \n", 548 | " min_train_len, forecast_len, incremental_len, window_type):\n", 549 | " \n", 550 | " dlt = DLTMAP(\n", 551 | " response_col=response_col,\n", 552 | " date_col=date_col,\n", 553 | " seasonality=52,\n", 554 | " seed=8888,\n", 555 | " global_trend_option='loglinear'\n", 556 | " )\n", 557 | "\n", 558 | " bt = BackTester(\n", 559 | " model=dlt,\n", 560 | " df=tmp_df,\n", 561 | " min_train_len=min_train_len,\n", 562 | " incremental_len=incremental_len,\n", 563 | " forecast_len=forecast_len,\n", 564 | " window_type=window_type,\n", 565 | " )\n", 566 | " \n", 567 | " bt.fit_predict()\n", 568 | " \n", 569 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 570 | "\n", 571 | "\n", 572 | "# log-linear global trend\n", 573 | "def DLTMAP_flat(date_col, response_col, tmp_df, \n", 574 | " min_train_len, forecast_len, incremental_len, window_type):\n", 575 | " \n", 576 | " dlt = DLTMAP(\n", 577 | " response_col=response_col,\n", 578 | " date_col=date_col,\n", 579 | " seasonality=52,\n", 580 | " seed=8888,\n", 581 | " global_trend_option='flat'\n", 582 | " )\n", 583 | "\n", 584 | " bt = BackTester(\n", 585 | " model=dlt,\n", 586 | " df=tmp_df,\n", 587 | " min_train_len=min_train_len,\n", 588 | " incremental_len=incremental_len,\n", 589 | " forecast_len=forecast_len,\n", 590 | " window_type=window_type,\n", 591 | " )\n", 592 | " \n", 593 | " bt.fit_predict()\n", 594 | " \n", 595 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 596 | "\n", 597 | "\n", 598 | "# logistic global trend\n", 599 | "def DLTMAP_logistic(date_col, response_col, tmp_df, \n", 600 | " min_train_len, forecast_len, incremental_len, window_type):\n", 601 | " \n", 602 | " dlt = DLTMAP(\n", 603 | " response_col=response_col,\n", 604 | " date_col=date_col,\n", 605 | " seasonality=52,\n", 606 | " seed=8888,\n", 607 | " global_trend_option='logistic'\n", 608 | " )\n", 609 | "\n", 610 | " bt = BackTester(\n", 611 | " model=dlt,\n", 612 | " df=tmp_df,\n", 613 | " min_train_len=min_train_len,\n", 614 | " incremental_len=incremental_len,\n", 615 | " forecast_len=forecast_len,\n", 616 | " window_type=window_type,\n", 617 | " )\n", 618 | " \n", 619 | " bt.fit_predict()\n", 620 | " \n", 621 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 622 | "\n", 623 | "\n", 624 | "# Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 625 | "\n", 626 | "def DLTFull_model(date_col, response_col, tmp_df, \n", 627 | " min_train_len, 
forecast_len, incremental_len, window_type):\n", 628 | " \n", 629 | " dlt = DLTFull(\n", 630 | " response_col=response_col,\n", 631 | " date_col=date_col,\n", 632 | " num_warmup=400,\n", 633 | " num_sample=400,\n", 634 | " seasonality=52,\n", 635 | " seed=8888\n", 636 | " )\n", 637 | " \n", 638 | " bt = BackTester(\n", 639 | " model=dlt,\n", 640 | " df=tmp_df,\n", 641 | " min_train_len=min_train_len,\n", 642 | " incremental_len=incremental_len,\n", 643 | " forecast_len=forecast_len,\n", 644 | " window_type=window_type,\n", 645 | " )\n", 646 | " \n", 647 | " bt.fit_predict()\n", 648 | "\n", 649 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 650 | "\n", 651 | "\n", 652 | "# Damped Local Trend Full (DLTAggregated)\n", 653 | "\n", 654 | "def DLTAggregated_model(date_col, response_col, tmp_df, \n", 655 | " min_train_len, forecast_len, incremental_len, window_type):\n", 656 | " \n", 657 | " ets = DLTAggregated(\n", 658 | " response_col=response_col,\n", 659 | " date_col=date_col,\n", 660 | " seasonality=52,\n", 661 | " seed=8888,\n", 662 | " )\n", 663 | " \n", 664 | " bt = BackTester(\n", 665 | " model=ets,\n", 666 | " df=tmp_df,\n", 667 | " min_train_len=min_train_len,\n", 668 | " incremental_len=incremental_len,\n", 669 | " forecast_len=forecast_len,\n", 670 | " window_type=window_type,\n", 671 | " )\n", 672 | " \n", 673 | " bt.fit_predict()\n", 674 | "\n", 675 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 676 | "\n", 677 | "\n", 678 | "# Local Global Trend (LGT) Model\n", 679 | "\n", 680 | "# Local Global Trend Maximum a Posteriori (LGTMAP)\n", 681 | "\n", 682 | "def LGTMAP_model(date_col, response_col, tmp_df, \n", 683 | " min_train_len, forecast_len, incremental_len, window_type):\n", 684 | " \n", 685 | " lgt = LGTMAP(\n", 686 | " response_col=response_col,\n", 687 | " date_col=date_col,\n", 688 | " seasonality=52,\n", 689 | " seed=8888,\n", 690 | " )\n", 691 | "\n", 692 | " bt = BackTester(\n", 693 | " model=lgt,\n", 694 | " df=tmp_df,\n", 695 | " min_train_len=min_train_len,\n", 696 | " incremental_len=incremental_len,\n", 697 | " forecast_len=forecast_len,\n", 698 | " window_type=window_type,\n", 699 | " )\n", 700 | " \n", 701 | " bt.fit_predict()\n", 702 | "\n", 703 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 704 | "\n", 705 | "# LGTFull\n", 706 | "\n", 707 | "def LGTFull_model(date_col, response_col, tmp_df, \n", 708 | " min_train_len, forecast_len, incremental_len, window_type):\n", 709 | " \n", 710 | " lgt = LGTFull(\n", 711 | " response_col=response_col,\n", 712 | " date_col=date_col,\n", 713 | " seasonality=52,\n", 714 | " seed=8888,\n", 715 | " )\n", 716 | "\n", 717 | " bt = BackTester(\n", 718 | " model=lgt,\n", 719 | " df=tmp_df,\n", 720 | " min_train_len=min_train_len,\n", 721 | " incremental_len=incremental_len,\n", 722 | " forecast_len=forecast_len,\n", 723 | " window_type=window_type,\n", 724 | " )\n", 725 | " \n", 726 | " bt.fit_predict()\n", 727 | "\n", 728 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 729 | "\n", 730 | "# LGTAggregated\n", 731 | "\n", 732 | "def LGTAggregated_model(date_col, response_col, tmp_df, \n", 733 | " min_train_len, forecast_len, incremental_len, window_type):\n", 734 | " \n", 735 | " lgt = LGTAggregated(\n", 736 | " response_col=response_col,\n", 737 | " date_col=date_col,\n", 738 | " seasonality=52,\n", 739 | " seed=8888,\n", 740 | " )\n", 741 | "\n", 742 | " bt = BackTester(\n", 743 | " model=lgt,\n", 744 | " df=tmp_df,\n", 745 | " min_train_len=min_train_len,\n", 746 | " 
incremental_len=incremental_len,\n", 747 | " forecast_len=forecast_len,\n", 748 | " window_type=window_type,\n", 749 | " )\n", 750 | " \n", 751 | " bt.fit_predict()\n", 752 | "\n", 753 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 754 | "\n", 755 | "# Using Pyro for Estimation\n", 756 | "\n", 757 | "# MAP Fit and Predict\n", 758 | "\n", 759 | "def LGTMAP_PyroEstimatorMAP(date_col, response_col, tmp_df, \n", 760 | " min_train_len, forecast_len, incremental_len, window_type):\n", 761 | " \n", 762 | " lgt_map = LGTMAP(\n", 763 | " response_col=response_col,\n", 764 | " date_col=date_col,\n", 765 | " seasonality=52,\n", 766 | " seed=8888,\n", 767 | " estimator_type=PyroEstimatorMAP,\n", 768 | " )\n", 769 | "\n", 770 | " bt = BackTester(\n", 771 | " model=lgt_map,\n", 772 | " df=tmp_df,\n", 773 | " min_train_len=min_train_len,\n", 774 | " incremental_len=incremental_len,\n", 775 | " forecast_len=forecast_len,\n", 776 | " window_type=window_type,\n", 777 | " )\n", 778 | " \n", 779 | " bt.fit_predict()\n", 780 | "\n", 781 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 782 | "\n", 783 | "# VI Fit and Predict\n", 784 | "\n", 785 | "def LGTFull_pyro(date_col, response_col, tmp_df, \n", 786 | " min_train_len, forecast_len, incremental_len, window_type):\n", 787 | " \n", 788 | " lgt_vi = LGTFull(\n", 789 | " response_col=response_col,\n", 790 | " date_col=date_col,\n", 791 | " seasonality=52,\n", 792 | " seed=8888,\n", 793 | " num_steps=101,\n", 794 | " num_sample=100,\n", 795 | " learning_rate=0.1,\n", 796 | " n_bootstrap_draws=-1,\n", 797 | " estimator_type=PyroEstimatorVI,\n", 798 | " )\n", 799 | "\n", 800 | " bt = BackTester(\n", 801 | " model=lgt_vi,\n", 802 | " df=tmp_df,\n", 803 | " min_train_len=min_train_len,\n", 804 | " incremental_len=incremental_len,\n", 805 | " forecast_len=forecast_len,\n", 806 | " window_type=window_type,\n", 807 | " )\n", 808 | " \n", 809 | " bt.fit_predict()\n", 810 | "\n", 811 | " return bt.score().iloc[5]['metric_values'] # rmsse\n", 812 | "\n", 813 | "\n", 814 | "# Kernel-based Time-varying Regression (KTR)\n", 815 | "\n", 816 | "# KTRLite\n", 817 | "\n", 818 | "def ktrlite_MAP(date_col, response_col, tmp_df, \n", 819 | " min_train_len, forecast_len, incremental_len, window_type):\n", 820 | " \n", 821 | " ktrlite = KTRLiteMAP(\n", 822 | " response_col=response_col,\n", 823 | " #response_col=np.log(df[response_col]),\n", 824 | " date_col=date_col,\n", 825 | " level_knot_scale=.1,\n", 826 | " span_level=.05,\n", 827 | " )\n", 828 | " \n", 829 | " bt = BackTester(\n", 830 | " model=ktrlite,\n", 831 | " df=tmp_df,\n", 832 | " min_train_len=min_train_len,\n", 833 | " incremental_len=incremental_len,\n", 834 | " forecast_len=forecast_len,\n", 835 | " window_type=window_type,\n", 836 | " )\n", 837 | " \n", 838 | " bt.fit_predict()\n", 839 | "\n", 840 | " return bt.score().iloc[5]['metric_values'] # rmsse" 841 | ] 842 | }, 843 | { 844 | "cell_type": "markdown", 845 | "metadata": {}, 846 | "source": [ 847 | "#### Backtest all Orbit models for each item" 848 | ] 849 | }, 850 | { 851 | "cell_type": "code", 852 | "execution_count": null, 853 | "metadata": {}, 854 | "outputs": [], 855 | "source": [ 856 | "def backtesing_models(index, column):\n", 857 | " \n", 858 | " '''\n", 859 | " Parameters:\n", 860 | " index: column index\n", 861 | " column: column name\n", 862 | " \n", 863 | " Returns:\n", 864 | " models_df: new dataframe with \n", 865 | " '''\n", 866 | " \n", 867 | " tmp_df['Date'] = pd.to_datetime(df['Date'].astype(str))\n", 868 | " 
tmp_df['Penetration'] = df[column].astype(float)\n", 869 | " \n", 870 | " date_col = 'Date'\n", 871 | " response_col = 'Value'\n", 872 | "\n", 873 | " models_df.at[index ,'Name'] = column\n", 874 | "\n", 875 | " # configs\n", 876 | " min_train_len = 5 # in case of rolling window, this specify the length of window length\n", 877 | " forecast_len = 1 # length forecast window\n", 878 | " incremental_len = 1 # step length for moving forward\n", 879 | " window_type = 'expanding' # 'rolling' 'expanding'\n", 880 | " \n", 881 | " \n", 882 | " # Making backtesting with each model\n", 883 | " try:\n", 884 | " models_df.at[index , 'ETSMAP'] = ETSMAP_model(date_col, response_col, tmp_df, \n", 885 | " min_train_len, forecast_len, incremental_len, \n", 886 | " window_type)\n", 887 | " except:\n", 888 | " models_df.at[index , 'ETSMAP'] = 100\n", 889 | " try: \n", 890 | " models_df.at[index , 'ETSFull'] = ETSFull_model(date_col, response_col, tmp_df, \n", 891 | " min_train_len, forecast_len, incremental_len, \n", 892 | " window_type)\n", 893 | " except:\n", 894 | " models_df.at[index , 'ETSFull'] = 100\n", 895 | " try:\n", 896 | " models_df.at[index , 'ETSAggregated'] = ETSAggregated_model(date_col, response_col, tmp_df, \n", 897 | " min_train_len, forecast_len, incremental_len, \n", 898 | " window_type)\n", 899 | " except:\n", 900 | " models_df.at[index , 'ETSAggregated'] = 100\n", 901 | "\n", 902 | " \n", 903 | " try:\n", 904 | " models_df.at[index , 'DLTMAP_lin'] = DLTMAP_lin(date_col, response_col, tmp_df, \n", 905 | " min_train_len, forecast_len, incremental_len, \n", 906 | " window_type)\n", 907 | " except:\n", 908 | " models_df.at[index , 'DLTMAP_lin'] = 100\n", 909 | " try:\n", 910 | " models_df.at[index , 'DLTMAP_log_lin'] = DLTMAP_log_lin(date_col, response_col, tmp_df, \n", 911 | " min_train_len, forecast_len, incremental_len, \n", 912 | " window_type)\n", 913 | " except:\n", 914 | " models_df.at[index , 'DLTMAP_log_lin'] = 100\n", 915 | " try:\n", 916 | " models_df.at[index , 'DLTMAP_flat'] = DLTMAP_flat(date_col, response_col, tmp_df, \n", 917 | " min_train_len, forecast_len, incremental_len, \n", 918 | " window_type)\n", 919 | " except:\n", 920 | " models_df.at[index , 'DLTMAP_flat'] = 100\n", 921 | " try:\n", 922 | " models_df.at[index , 'DLTMAP_logistic'] = DLTMAP_logistic(date_col, response_col, tmp_df, \n", 923 | " min_train_len, forecast_len, incremental_len, \n", 924 | " window_type)\n", 925 | " except:\n", 926 | " models_df.at[index , 'DLTMAP_logistic'] = 100\n", 927 | " try: \n", 928 | " models_df.at[index , 'DLTFull'] = DLTFull_model(date_col, response_col, tmp_df, \n", 929 | " min_train_len, forecast_len, incremental_len, \n", 930 | " window_type)\n", 931 | " except:\n", 932 | " models_df.at[index , 'DLTFull'] = 100\n", 933 | " try:\n", 934 | " models_df.at[index , 'DLTAggregated'] = DLTAggregated_model(date_col, response_col, tmp_df, \n", 935 | " min_train_len, forecast_len, incremental_len, \n", 936 | " window_type)\n", 937 | " except: \n", 938 | " models_df.at[index , 'DLTAggregated'] = 100\n", 939 | " \n", 940 | " \n", 941 | " try:\n", 942 | " models_df.at[index , 'LGTMAP'] = LGTMAP_model(date_col, response_col, tmp_df, \n", 943 | " min_train_len, forecast_len, incremental_len, \n", 944 | " window_type)\n", 945 | " except:\n", 946 | " models_df.at[index , 'LGTMAP'] = 100\n", 947 | " try:\n", 948 | " models_df.at[index , 'LGTFull'] = LGTFull_model(date_col, response_col, tmp_df, \n", 949 | " min_train_len, forecast_len, incremental_len, \n", 950 | " window_type)\n", 951 | " 
except: \n", 952 | " models_df.at[index , 'LGTFull'] = 100\n", 953 | " try: \n", 954 | " models_df.at[index , 'LGTAggregated'] = LGTAggregated_model(date_col, response_col, tmp_df, \n", 955 | " min_train_len, forecast_len, incremental_len, \n", 956 | " window_type)\n", 957 | " except:\n", 958 | " models_df.at[index , 'LGTAggregated'] = 100\n", 959 | "\n", 960 | " \n", 961 | " # Using Pyro for Estimation\n", 962 | " try:\n", 963 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = LGTMAP_PyroEstimatorMAP(\n", 964 | " date_col, response_col, tmp_df, min_train_len, forecast_len, incremental_len, window_type)\n", 965 | " except:\n", 966 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = 100\n", 967 | " try:\n", 968 | " models_df.at[index , 'LGTFull_pyro4'] = LGTFull_pyro(date_col, response_col, tmp_df, \n", 969 | " min_train_len, forecast_len, incremental_len, \n", 970 | " window_type)\n", 971 | " except:\n", 972 | " models_df.at[index , 'LGTFull_pyro4'] = 100\n", 973 | " \n", 974 | " # Kernel-based Time-varying Regression (KTR)\n", 975 | " try:\n", 976 | " models_df.at[index , 'KTR_Lite_MAP'] = ktrlite_MAP(date_col, response_col, tmp_df, \n", 977 | " min_train_len, forecast_len, incremental_len, \n", 978 | " window_type)\n", 979 | " except:\n", 980 | " models_df.at[index , 'KTR_Lite_MAP'] = 100\n", 981 | " \n", 982 | " \n", 983 | " models_df.at[index, 'Type'] = df[column].iloc[-1]\n", 984 | " \n", 985 | " \n", 986 | " return models_df" 987 | ] 988 | }, 989 | { 990 | "cell_type": "markdown", 991 | "metadata": {}, 992 | "source": [ 993 | "#### Calculating minimal RMSSE value for each item" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": null, 999 | "metadata": {}, 1000 | "outputs": [], 1001 | "source": [ 1002 | "def min_value(df):\n", 1003 | " \n", 1004 | " '''\n", 1005 | " Parameters:\n", 1006 | " df: input dataframe with multiple columns and values in a row\n", 1007 | " \n", 1008 | " Returns:\n", 1009 | " models_df: existing dataframe with added the new 'Model' column filled with \n", 1010 | " the name of the best-fitted model for each item\n", 1011 | " '''\n", 1012 | " \n", 1013 | " df.iloc[:, 1:-1].apply(pd.to_numeric)\n", 1014 | " df['Model'] = df.iloc[:, 1:-1].idxmin(axis=1)\n", 1015 | " \n", 1016 | " return models_df" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "markdown", 1021 | "metadata": {}, 1022 | "source": [ 1023 | "#### Backtest Orbit models for each item" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": {}, 1030 | "outputs": [], 1031 | "source": [ 1032 | "import time\n", 1033 | "\n", 1034 | "\n", 1035 | "tmp_df = pd.DataFrame()\n", 1036 | "models_df = pd.DataFrame()\n", 1037 | "\n", 1038 | "#start = time.time()\n", 1039 | "\n", 1040 | "for index, column in enumerate(df.columns[1:2]):\n", 1041 | " evaluating_models(index, column)\n", 1042 | " \n", 1043 | "#end = time.time()\n", 1044 | "#print(end - start)" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "models_df" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "code", 1058 | "execution_count": null, 1059 | "metadata": {}, 1060 | "outputs": [], 1061 | "source": [ 1062 | "min_value(models_df)" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": null, 1068 | "metadata": {}, 1069 | "outputs": [], 1070 | "source": [] 1071 | } 1072 | ], 1073 | "metadata": { 1074 | "instance_type": "ml.g4dn.xlarge", 1075 
| "kernelspec": { 1076 | "display_name": "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)", 1077 | "language": "python", 1078 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/pytorch-1.6-gpu-py36-cu110-ubuntu18.04-v3" 1079 | }, 1080 | "language_info": { 1081 | "codemirror_mode": { 1082 | "name": "ipython", 1083 | "version": 3 1084 | }, 1085 | "file_extension": ".py", 1086 | "mimetype": "text/x-python", 1087 | "name": "python", 1088 | "nbconvert_exporter": "python", 1089 | "pygments_lexer": "ipython3", 1090 | "version": "3.6.13" 1091 | } 1092 | }, 1093 | "nbformat": 4, 1094 | "nbformat_minor": 4 1095 | } 1096 | -------------------------------------------------------------------------------- /Orbit Models/stable version/eval_decompose_pred.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Evaluating decomposed predictions by Orbit (**O**bject-**OR**iented **B**ayes**I**an **T**ime Series)\n", 7 | "\n", 8 | "- [Orbit: A Python Package for Bayesian Forecasting](https://github.com/uber/orbit/tree/master)\n", 9 | "- [Orbit’s Documentation](https://orbit-ml.readthedocs.io/en/stable/)\n", 10 | "- [Quick Start](https://orbit-ml.readthedocs.io/en/stable/tutorials/quick_start.html#)\n", 11 | "- [Orbit: Probabilistic Forecast with Exponential Smoothing](https://arxiv.org/abs/2004.08492) Paper\n", 12 | "\n", 13 | "\n", 14 | "### Implemented Models\n", 15 | "\n", 16 | "- ETS (which stands for Error, Trend, and Seasonality) Model\n", 17 | "- Methods of Estimations\n", 18 | " - Maximum a Posteriori (MAP)\n", 19 | " - Full Bayesian Estimation\n", 20 | " - Aggregated Posteriors\n", 21 | "- Damped Local Trend (DLT)\n", 22 | " - Global Trend Configurations:\n", 23 | " - Linear Global Trend\n", 24 | " - Log-Linear Global Trend\n", 25 | " - Flat Global Trend\n", 26 | " - Logistic Global Trend\n", 27 | " - Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 28 | "- Local Global Trend (LGT)\n", 29 | " - Local Global Trend Maximum a Posteriori (LGTMAP)\n", 30 | " - Local Global Trend for full Bayesian prediction (LGTFull)\n", 31 | " - Local Global Trend for aggregated posterior prediction (LGTAggregated)\n", 32 | "- Using Pyro for Estimation\n", 33 | " - MAP Fit and Predict\n", 34 | " - VI Fit and Predict\n", 35 | "- Kernel-based Time-varying Regression (KTR)\n", 36 | " - Kernel-based Time-varying Regression Lite (KTRLite)" 37 | ], 38 | "metadata": {} 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "source": [ 44 | "!pip install awswrangler" 45 | ], 46 | "outputs": [], 47 | "metadata": {} 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "source": [ 53 | "!pip install orbit-ml --no-input" 54 | ], 55 | "outputs": [], 56 | "metadata": {} 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "source": [ 62 | "import awswrangler as wr\n", 63 | "import boto3\n", 64 | "from sagemaker import get_execution_role\n", 65 | "import pandas as pd\n", 66 | "import numpy as np\n", 67 | "\n", 68 | "import orbit\n", 69 | "from orbit import *\n", 70 | "from orbit.models.dlt import ETSFull, ETSMAP, ETSAggregated, DLTMAP, DLTFull, DLTMAP, DLTAggregated\n", 71 | "from orbit.models.lgt import LGTMAP, LGTAggregated, LGTFull\n", 72 | "from orbit.models.ktrlite import KTRLiteMAP\n", 73 | "\n", 74 | "from orbit.estimators.pyro_estimator import PyroEstimatorVI, PyroEstimatorMAP" 75 | ], 76 | "outputs": [], 77 
| "metadata": {} 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": null, 82 | "source": [ 83 | "import warnings\n", 84 | "\n", 85 | "warnings.simplefilter('ignore')\n", 86 | "warnings.filterwarnings('ignore')" 87 | ], 88 | "outputs": [], 89 | "metadata": {} 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "source": [ 94 | "## Uploading data\n", 95 | "\n", 96 | "- uploading data for **models**" 97 | ], 98 | "metadata": {} 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "source": [ 104 | "role = get_execution_role()\n", 105 | "bucket='...'\n", 106 | "data_key = '...csv' \n", 107 | "data_location = 's3://{}/{}'.format(bucket, data_key)" 108 | ], 109 | "outputs": [], 110 | "metadata": {} 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "source": [ 116 | "df = pd.DataFrame(pd.read_csv(data_location))" 117 | ], 118 | "outputs": [], 119 | "metadata": {} 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "source": [ 125 | "df = df.rename({'Unnamed: 0': 'Date'}, axis = 1)\n", 126 | "df.index = df['Date']" 127 | ], 128 | "outputs": [], 129 | "metadata": {} 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "source": [ 135 | "df.shape" 136 | ], 137 | "outputs": [], 138 | "metadata": {} 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "source": [ 144 | "df" 145 | ], 146 | "outputs": [], 147 | "metadata": {} 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "source": [ 153 | "curve_df = df.drop(['curve'], axis = 0)" 154 | ], 155 | "outputs": [], 156 | "metadata": {} 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "source": [ 161 | "## Orbit Models" 162 | ], 163 | "metadata": {} 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "source": [ 169 | "# ETS (which stands for Error, Trend, and Seasonality)\n", 170 | "\n", 171 | "# Methods of Estimations\n", 172 | "\n", 173 | "# Maximum a Posteriori (MAP)\n", 174 | "\n", 175 | "# The advantage of MAP estimation is a faster computational speed.\n", 176 | "\n", 177 | "def ETSMAP_model(date_col, response_col, train_df, test_df):\n", 178 | " ets = ETSMAP(\n", 179 | " response_col=response_col,\n", 180 | " date_col=date_col,\n", 181 | " seasonality=52,\n", 182 | " seed=8888,\n", 183 | " )\n", 184 | " \n", 185 | " ets.fit(df=train_df)\n", 186 | " predicted_df_MAP = ets.predict(df=test_df)\n", 187 | " \n", 188 | " return predicted_df_MAP['prediction'][:11]\n", 189 | "\n", 190 | "# Full Bayesian Estimation\n", 191 | "\n", 192 | "\n", 193 | "def ETSFull_model(date_col, response_col, train_df, test_df):\n", 194 | " ets = ETSFull(\n", 195 | " response_col=response_col,\n", 196 | " date_col=date_col,\n", 197 | " seasonality=52,\n", 198 | " seed=8888,\n", 199 | " num_warmup=400,\n", 200 | " num_sample=400,\n", 201 | " )\n", 202 | " \n", 203 | " ets.fit(df=train_df)\n", 204 | " predicted_df_ETSFull = ets.predict(df=test_df)\n", 205 | " \n", 206 | " return predicted_df_ETSFull['prediction'][:11]\n", 207 | "\n", 208 | "# Aggregated Posteriors\n", 209 | "\n", 210 | "def ETSAggregated_model(date_col, response_col, train_df, test_df):\n", 211 | " ets = ETSAggregated(\n", 212 | " response_col=response_col,\n", 213 | " date_col=date_col,\n", 214 | " seasonality=52,\n", 215 | " seed=8888,\n", 216 | " )\n", 217 | " ets.fit(df=train_df)\n", 218 | " predicted_df_ETSAggregated = ets.predict(df=test_df)\n", 219 | " \n", 220 | " return 
predicted_df_ETSAggregated['prediction'][:11]\n", 221 | "\n", 222 | "\n", 223 | "# Damped Local Trend (DLT)\n", 224 | "\n", 225 | "# Global Trend Configurations\n", 226 | "\n", 227 | "# Linear Global Trend\n", 228 | "\n", 229 | "# linear global trend\n", 230 | "def DLTMAP_lin(date_col, response_col, train_df, test_df):\n", 231 | " dlt = DLTMAP(\n", 232 | " response_col=response_col,\n", 233 | " date_col=date_col,\n", 234 | " seasonality=52,\n", 235 | " seed=8888,\n", 236 | " )\n", 237 | "\n", 238 | " dlt.fit(train_df)\n", 239 | " predicted_df_DLTMAP_lin = dlt.predict(test_df)\n", 240 | " \n", 241 | " return predicted_df_DLTMAP_lin['prediction'][:11]\n", 242 | "\n", 243 | "\n", 244 | "# log-linear global trend\n", 245 | "def DLTMAP_log_lin(date_col, response_col, train_df, test_df):\n", 246 | " dlt = DLTMAP(\n", 247 | " response_col=response_col,\n", 248 | " date_col=date_col,\n", 249 | " seasonality=52,\n", 250 | " seed=8888,\n", 251 | " global_trend_option='loglinear'\n", 252 | " )\n", 253 | "\n", 254 | " dlt.fit(train_df)\n", 255 | " predicted_df_DLTMAP_log_lin = dlt.predict(test_df)\n", 256 | " \n", 257 | " return predicted_df_DLTMAP_log_lin['prediction'][:11]\n", 258 | "\n", 259 | "\n", 260 | "# log-linear global trend\n", 261 | "def DLTMAP_flat(date_col, response_col, train_df, test_df):\n", 262 | " dlt = DLTMAP(\n", 263 | " response_col=response_col,\n", 264 | " date_col=date_col,\n", 265 | " seasonality=52,\n", 266 | " seed=8888,\n", 267 | " global_trend_option='flat'\n", 268 | " )\n", 269 | "\n", 270 | " dlt.fit(train_df)\n", 271 | " predicted_df_DLTMAP_flat = dlt.predict(test_df)\n", 272 | " \n", 273 | " return predicted_df_DLTMAP_flat['prediction'][:11]\n", 274 | "\n", 275 | "\n", 276 | "# logistic global trend\n", 277 | "def DLTMAP_logistic(date_col, response_col, train_df, test_df):\n", 278 | " dlt = DLTMAP(\n", 279 | " response_col=response_col,\n", 280 | " date_col=date_col,\n", 281 | " seasonality=52,\n", 282 | " seed=8888,\n", 283 | " global_trend_option='logistic'\n", 284 | " )\n", 285 | "\n", 286 | " dlt.fit(train_df)\n", 287 | " predicted_df_DLTMAP_logistic = dlt.predict(test_df)\n", 288 | " \n", 289 | " return predicted_df_DLTMAP_logistic['prediction'][:11]\n", 290 | "\n", 291 | "\n", 292 | "# Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 293 | "\n", 294 | "def DLTFull_model(date_col, response_col, train_df, test_df):\n", 295 | " dlt = DLTFull(\n", 296 | " response_col=response_col,\n", 297 | " date_col=date_col,\n", 298 | " num_warmup=400,\n", 299 | " num_sample=400,\n", 300 | " seasonality=52,\n", 301 | " seed=8888\n", 302 | " )\n", 303 | " \n", 304 | " dlt.fit(df=train_df)\n", 305 | " predicted_df_DLTFull = dlt.predict(df=test_df)\n", 306 | "\n", 307 | " return predicted_df_DLTFull['prediction'][:11]\n", 308 | "\n", 309 | "\n", 310 | "# Damped Local Trend Full (DLTAggregated)\n", 311 | "\n", 312 | "def DLTAggregated_model(date_col, response_col, train_df, test_df):\n", 313 | " ets = DLTAggregated(\n", 314 | " response_col=response_col,\n", 315 | " date_col=date_col,\n", 316 | " seasonality=52,\n", 317 | " seed=8888,\n", 318 | " )\n", 319 | " \n", 320 | " ets.fit(df=train_df)\n", 321 | " predicted_df_DLTAggregated = ets.predict(df=test_df)\n", 322 | " \n", 323 | " return predicted_df_DLTAggregated['prediction'][:11]\n", 324 | "\n", 325 | "\n", 326 | "# Local Global Trend (LGT) Model\n", 327 | "\n", 328 | "# Local Global Trend Maximum a Posteriori (LGTMAP)\n", 329 | "\n", 330 | "def LGTMAP_model(date_col, response_col, train_df, test_df):\n", 331 | " lgt = 
LGTMAP(\n", 332 | " response_col=response_col,\n", 333 | " date_col=date_col,\n", 334 | " seasonality=52,\n", 335 | " seed=8888,\n", 336 | " )\n", 337 | "\n", 338 | " lgt.fit(df=train_df)\n", 339 | " predicted_df_LGTMAP = lgt.predict(df=test_df)\n", 340 | " \n", 341 | " return predicted_df_LGTMAP['prediction'][:11]\n", 342 | "\n", 343 | "# LGTFull\n", 344 | "\n", 345 | "def LGTFull_model(date_col, response_col, train_df, test_df):\n", 346 | " lgt = LGTFull(\n", 347 | " response_col=response_col,\n", 348 | " date_col=date_col,\n", 349 | " seasonality=52,\n", 350 | " seed=8888,\n", 351 | " )\n", 352 | "\n", 353 | " lgt.fit(df=train_df)\n", 354 | " predicted_df_LGTFull = lgt.predict(df=test_df)\n", 355 | " \n", 356 | " return predicted_df_LGTFull['prediction'][:11]\n", 357 | "\n", 358 | "# LGTAggregated\n", 359 | "\n", 360 | "def LGTAggregated_model(date_col, response_col, train_df, test_df):\n", 361 | " lgt = LGTAggregated(\n", 362 | " response_col=response_col,\n", 363 | " date_col=date_col,\n", 364 | " seasonality=52,\n", 365 | " seed=8888,\n", 366 | " )\n", 367 | "\n", 368 | " lgt.fit(df=train_df)\n", 369 | " predicted_df_LGTAggregated = lgt.predict(df=test_df)\n", 370 | " \n", 371 | " return predicted_df_LGTAggregated['prediction'][:11]\n", 372 | "\n", 373 | "# Using Pyro for Estimation\n", 374 | "\n", 375 | "# MAP Fit and Predict\n", 376 | "\n", 377 | "def LGTMAP_PyroEstimatorMAP(date_col, response_col, train_df, test_df):\n", 378 | " lgt_map = LGTMAP(\n", 379 | " response_col=response_col,\n", 380 | " date_col=date_col,\n", 381 | " seasonality=52,\n", 382 | " seed=8888,\n", 383 | " estimator_type=PyroEstimatorMAP,\n", 384 | " )\n", 385 | "\n", 386 | " lgt_map.fit(df=train_df)\n", 387 | " predicted_df_LGTMAP_pyro = lgt_map.predict(df=test_df)\n", 388 | " \n", 389 | " return predicted_df_LGTMAP_pyro['prediction'][:11]\n", 390 | "\n", 391 | "# VI Fit and Predict\n", 392 | "\n", 393 | "def LGTFull_pyro(date_col, response_col, train_df, test_df):\n", 394 | " lgt_vi = LGTFull(\n", 395 | " response_col=response_col,\n", 396 | " date_col=date_col,\n", 397 | " seasonality=52,\n", 398 | " seed=8888,\n", 399 | " num_steps=101,\n", 400 | " num_sample=100,\n", 401 | " learning_rate=0.1,\n", 402 | " n_bootstrap_draws=-1,\n", 403 | " estimator_type=PyroEstimatorVI,\n", 404 | " )\n", 405 | "\n", 406 | " lgt_vi.fit(df=train_df)\n", 407 | "\n", 408 | " predicted_df_LGTFull_pyro = lgt_vi.predict(df=test_df)\n", 409 | " \n", 410 | " return predicted_df_LGTFull_pyro['prediction'][:11]\n", 411 | "\n", 412 | "\n", 413 | "# Kernel-based Time-varying Regression (KTR)\n", 414 | "\n", 415 | "# KTRLite\n", 416 | "\n", 417 | "def ktrlite_MAP(date_col, response_col, train_df, test_df):\n", 418 | " ktrlite = KTRLiteMAP(\n", 419 | " response_col=response_col,\n", 420 | " #response_col=np.log(df[response_col]),\n", 421 | " date_col=date_col,\n", 422 | " level_knot_scale=.1,\n", 423 | " span_level=.05,\n", 424 | " )\n", 425 | " \n", 426 | " ktrlite.fit(train_df)\n", 427 | " \n", 428 | " predicted_df_ktrlite_MAP = ktrlite.predict(df=test_df, decompose=True)\n", 429 | " \n", 430 | " return predicted_df_ktrlite_MAP['prediction'][:11]" 431 | ], 432 | "outputs": [], 433 | "metadata": {} 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "source": [ 438 | "## Root-Mean-Square Deviation (RMSD) or Root-Mean-Square Error (RMSE)" 439 | ], 440 | "metadata": {} 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "source": [ 446 | "def rmse(actual, pred): \n", 447 | " actual, pred = np.array(actual), 
np.array(pred)\n", 448 | " return np.sqrt(np.square(np.subtract(actual,pred)).mean())" 449 | ], 450 | "outputs": [], 451 | "metadata": {} 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "source": [ 457 | "def evaluating_models(index, column):\n", 458 | " \n", 459 | " '''\n", 460 | " Parameters:\n", 461 | " index: column index\n", 462 | " column: column name\n", 463 | " \n", 464 | " Returns:\n", 465 | " models_df: new dataframe with \n", 466 | " '''\n", 467 | " \n", 468 | " tmp_df['Date'] = pd.to_datetime(curve_df['Date'].astype(str))\n", 469 | " tmp_df['Penetration'] = curve_df[column].astype(float)\n", 470 | " \n", 471 | " date_col = 'Date'\n", 472 | " response_col = 'Penetration'\n", 473 | " \n", 474 | "\n", 475 | " # Decompose Prediction\n", 476 | "\n", 477 | " train_df = tmp_df[tmp_df['Date'] < '2022-01-01']\n", 478 | " test_df = tmp_df[tmp_df['Date'] <= '2025-01-01']\n", 479 | " \n", 480 | " models_df.at[index ,'Item Name'] = column\n", 481 | "\n", 482 | " \n", 483 | " # Making predictions with each model\n", 484 | " try:\n", 485 | " models_df.at[index , 'ETSMAP'] = rmse(\n", 486 | " tmp_df['Penetration'][:11], \n", 487 | " (ETSMAP_model(date_col, response_col, train_df, test_df))).astype(float)\n", 488 | " except:\n", 489 | " models_df.at[index , 'ETSMAP'] = 100\n", 490 | " try: \n", 491 | " models_df.at[index , 'ETSFull'] = rmse(\n", 492 | " tmp_df['Penetration'][:11], \n", 493 | " (ETSFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 494 | " except:\n", 495 | " models_df.at[index , 'ETSFull'] = 100\n", 496 | " try:\n", 497 | " models_df.at[index , 'ETSAggregated'] = rmse(\n", 498 | " tmp_df['Penetration'][:11], \n", 499 | " (ETSAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 500 | " except:\n", 501 | " models_df.at[index , 'ETSAggregated'] = 100\n", 502 | "\n", 503 | " \n", 504 | " try:\n", 505 | " models_df.at[index , 'DLTMAP_lin'] = rmse(\n", 506 | " tmp_df['Penetration'][:11], \n", 507 | " (DLTMAP_lin(date_col, response_col, train_df, test_df))).astype(float)\n", 508 | " except:\n", 509 | " models_df.at[index , 'DLTMAP_lin'] = 100\n", 510 | " try:\n", 511 | " models_df.at[index , 'DLTMAP_log_lin'] = rmse(\n", 512 | " tmp_df['Penetration'][:11], \n", 513 | " (DLTMAP_log_lin(date_col, response_col, train_df, test_df))).astype(float)\n", 514 | " except:\n", 515 | " models_df.at[index , 'DLTMAP_log_lin'] = 100\n", 516 | " try:\n", 517 | " models_df.at[index , 'DLTMAP_flat'] = rmse(\n", 518 | " tmp_df['Penetration'][:11], \n", 519 | " (DLTMAP_flat(date_col, response_col, train_df, test_df))).astype(float)\n", 520 | " except:\n", 521 | " models_df.at[index , 'DLTMAP_flat'] = 100\n", 522 | " try:\n", 523 | " models_df.at[index , 'DLTMAP_logistic'] = rmse(\n", 524 | " tmp_df['Penetration'][:11], \n", 525 | " (DLTMAP_logistic(date_col, response_col, train_df, test_df))).astype(float)\n", 526 | " except:\n", 527 | " models_df.at[index , 'DLTMAP_logistic'] = 100\n", 528 | " try: \n", 529 | " models_df.at[index , 'DLTFull'] = rmse(\n", 530 | " tmp_df['Penetration'][:11], \n", 531 | " (DLTFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 532 | " except:\n", 533 | " models_df.at[index , 'DLTFull'] = 100\n", 534 | " try:\n", 535 | " models_df.at[index , 'DLTAggregated'] = rmse(\n", 536 | " tmp_df['Penetration'][:11], \n", 537 | " (DLTAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 538 | " except: \n", 539 | " models_df.at[index , 'DLTAggregated'] = 
100\n", 540 | " \n", 541 | " \n", 542 | " try:\n", 543 | " models_df.at[index , 'LGTMAP'] = rmse(\n", 544 | " tmp_df['Penetration'][:11], \n", 545 | " (LGTMAP_model(date_col, response_col, train_df, test_df))).astype(float)\n", 546 | " except:\n", 547 | " models_df.at[index , 'LGTMAP'] = 100\n", 548 | " try:\n", 549 | " models_df.at[index , 'LGTFull'] = rmse(\n", 550 | " tmp_df['Penetration'][:11], \n", 551 | " (LGTFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 552 | " except: \n", 553 | " models_df.at[index , 'LGTFull'] = 100\n", 554 | " try: \n", 555 | " models_df.at[index , 'LGTAggregated'] = rmse(\n", 556 | " tmp_df['Penetration'][:11], \n", 557 | " (LGTAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 558 | " except:\n", 559 | " models_df.at[index , 'LGTAggregated'] = 100\n", 560 | "\n", 561 | " \n", 562 | " # Using Pyro for Estimation\n", 563 | " try:\n", 564 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = rmse(\n", 565 | " tmp_df['Penetration'][:11], (LGTMAP_PyroEstimatorMAP(date_col, \n", 566 | " response_col, train_df, test_df))).astype(float)\n", 567 | " except:\n", 568 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = 100\n", 569 | " try:\n", 570 | " models_df.at[index , 'LGTFull_pyro4'] = rmse(\n", 571 | " tmp_df['Penetration'][:11], \n", 572 | " (LGTFull_pyro(date_col, response_col, train_df, test_df))).astype(float)\n", 573 | " except:\n", 574 | " models_df.at[index , 'LGTFull_pyro4'] = 100\n", 575 | " \n", 576 | " # Kernel-based Time-varying Regression (KTR)\n", 577 | " try:\n", 578 | " models_df.at[index , 'KTR_Lite_MAP'] = rmse(\n", 579 | " tmp_df['Penetration'][:11], \n", 580 | " (ktrlite_MAP(date_col, response_col, train_df, test_df))).astype(float)\n", 581 | " except:\n", 582 | " models_df.at[index , 'KTR_Lite_MAP'] = 100\n", 583 | " \n", 584 | " \n", 585 | " models_df.at[index, 'Curve Type'] = df[column].iloc[-1]\n", 586 | " \n", 587 | " \n", 588 | " return models_df" 589 | ], 590 | "outputs": [], 591 | "metadata": {} 592 | }, 593 | { 594 | "cell_type": "markdown", 595 | "source": [ 596 | "## Calculating minimal RMSE value for each item" 597 | ], 598 | "metadata": {} 599 | }, 600 | { 601 | "cell_type": "code", 602 | "execution_count": null, 603 | "source": [ 604 | "def min_value(df):\n", 605 | " \n", 606 | " '''\n", 607 | " Parameters:\n", 608 | " df: input dataframe with multiple columns and values in a row\n", 609 | " \n", 610 | " Returns:\n", 611 | " models_df: existing dataframe with added the new 'Model' column filled with \n", 612 | " the name of the best-fitted model for each item\n", 613 | " '''\n", 614 | " \n", 615 | " df.iloc[:, 1:-1].apply(pd.to_numeric)\n", 616 | " df['Model'] = df.iloc[:, 1:-1].idxmin(axis=1)\n", 617 | " \n", 618 | " return models_df" 619 | ], 620 | "outputs": [], 621 | "metadata": {} 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "source": [ 626 | "## Evaluating Orbit models for each item" 627 | ], 628 | "metadata": {} 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": null, 633 | "source": [ 634 | "import time\n", 635 | "\n", 636 | "\n", 637 | "tmp_df = pd.DataFrame()\n", 638 | "models_df = pd.DataFrame()\n", 639 | "\n", 640 | "start = time.time()\n", 641 | "\n", 642 | "for index, column in enumerate(curve_df.columns[1:2]):\n", 643 | " evaluating_models(index, column)\n", 644 | " \n", 645 | "end = time.time()\n", 646 | "print(end - start)" 647 | ], 648 | "outputs": [], 649 | "metadata": {} 650 | }, 651 | { 652 | "cell_type": "code", 653 | 
"execution_count": null, 654 | "source": [ 655 | "models_df" 656 | ], 657 | "outputs": [], 658 | "metadata": {} 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": null, 663 | "source": [ 664 | "min_value(models_df)" 665 | ], 666 | "outputs": [], 667 | "metadata": {} 668 | } 669 | ], 670 | "metadata": { 671 | "instance_type": "ml.g4dn.xlarge", 672 | "kernelspec": { 673 | "display_name": "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)", 674 | "language": "python", 675 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/pytorch-1.6-gpu-py36-cu110-ubuntu18.04-v3" 676 | }, 677 | "language_info": { 678 | "codemirror_mode": { 679 | "name": "ipython", 680 | "version": 3 681 | }, 682 | "file_extension": ".py", 683 | "mimetype": "text/x-python", 684 | "name": "python", 685 | "nbconvert_exporter": "python", 686 | "pygments_lexer": "ipython3", 687 | "version": "3.6.13" 688 | } 689 | }, 690 | "nbformat": 4, 691 | "nbformat_minor": 4 692 | } 693 | -------------------------------------------------------------------------------- /Orbit Models/stable version/evaluating_models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# Orbit (**O**bject-**OR**iented **B**ayes**I**an **T**ime Series)\n", 7 | "\n", 8 | "- [Orbit: A Python Package for Bayesian Forecasting](https://github.com/uber/orbit/tree/master)\n", 9 | "- [Orbit’s Documentation](https://orbit-ml.readthedocs.io/en/stable/)\n", 10 | "- [Quick Start](https://orbit-ml.readthedocs.io/en/stable/tutorials/quick_start.html#)\n", 11 | "- [Orbit: Probabilistic Forecast with Exponential Smoothing](https://arxiv.org/abs/2004.08492) Paper\n", 12 | "\n", 13 | "\n", 14 | "### Implemented Models\n", 15 | "\n", 16 | "- ETS (which stands for Error, Trend, and Seasonality) Model\n", 17 | "- Methods of Estimations\n", 18 | " - Maximum a Posteriori (MAP)\n", 19 | " - Full Bayesian Estimation\n", 20 | " - Aggregated Posteriors\n", 21 | "- Damped Local Trend (DLT)\n", 22 | " - Global Trend Configurations:\n", 23 | " - Linear Global Trend\n", 24 | " - Log-Linear Global Trend\n", 25 | " - Flat Global Trend\n", 26 | " - Logistic Global Trend\n", 27 | " - Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 28 | "- Local Global Trend (LGT)\n", 29 | " - Local Global Trend Maximum a Posteriori (LGTMAP)\n", 30 | " - Local Global Trend for full Bayesian prediction (LGTFull)\n", 31 | " - Local Global Trend for aggregated posterior prediction (LGTAggregated)\n", 32 | "- Using Pyro for Estimation\n", 33 | " - MAP Fit and Predict\n", 34 | " - VI Fit and Predict\n", 35 | "- Kernel-based Time-varying Regression (KTR)\n", 36 | " - Kernel-based Time-varying Regression Lite (KTRLite)" 37 | ], 38 | "metadata": {} 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "source": [ 44 | "!pip install orbit-ml --no-input" 45 | ], 46 | "outputs": [], 47 | "metadata": {} 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "source": [ 53 | "import awswrangler as wr\n", 54 | "import boto3\n", 55 | "from sagemaker import get_execution_role\n", 56 | "import pandas as pd\n", 57 | "import numpy as np\n", 58 | "import json\n", 59 | "import urllib3\n", 60 | "from scipy.optimize import curve_fit\n", 61 | "\n", 62 | "import orbit\n", 63 | "from orbit import *\n", 64 | "from orbit.models.dlt import ETSFull, ETSMAP, ETSAggregated, DLTMAP, DLTFull, DLTMAP, DLTAggregated\n", 65 | "from 
orbit.models.lgt import LGTMAP, LGTAggregated, LGTFull\n", 66 | "from orbit.models.ktrlite import KTRLiteMAP\n", 67 | "\n", 68 | "from orbit.estimators.pyro_estimator import PyroEstimatorVI, PyroEstimatorMAP" 69 | ], 70 | "outputs": [], 71 | "metadata": {} 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "source": [ 77 | "import warnings\n", 78 | "\n", 79 | "warnings.simplefilter('ignore')\n", 80 | "warnings.filterwarnings('ignore')" 81 | ], 82 | "outputs": [], 83 | "metadata": {} 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "source": [ 88 | "# Uploading data\n", 89 | "\n", 90 | "- uploading data for **models**" 91 | ], 92 | "metadata": {} 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "source": [ 98 | "role = get_execution_role()\n", 99 | "bucket='...'\n", 100 | "data_key = '...csv' \n", 101 | "data_location = 's3://{}/{}'.format(bucket, data_key)" 102 | ], 103 | "outputs": [], 104 | "metadata": {} 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "source": [ 110 | "df = pd.DataFrame(pd.read_csv(data_location))" 111 | ], 112 | "outputs": [], 113 | "metadata": {} 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "source": [ 119 | "df = df.rename({'Unnamed: 0': 'Date'}, axis = 1)\n", 120 | "df.index = df['Date']" 121 | ], 122 | "outputs": [], 123 | "metadata": {} 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "source": [ 129 | "df.shape" 130 | ], 131 | "outputs": [], 132 | "metadata": {} 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "source": [ 138 | "#df = df.drop(['curve'], axis = 0)" 139 | ], 140 | "outputs": [], 141 | "metadata": {} 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "source": [ 147 | "df" 148 | ], 149 | "outputs": [], 150 | "metadata": {} 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "source": [ 155 | "### Orbit Models" 156 | ], 157 | "metadata": {} 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "source": [ 163 | "# ETS (which stands for Error, Trend, and Seasonality)\n", 164 | "\n", 165 | "# Methods of Estimations\n", 166 | "\n", 167 | "# Maximum a Posteriori (MAP)\n", 168 | "\n", 169 | "# The advantage of MAP estimation is a faster computational speed.\n", 170 | "\n", 171 | "def ETSMAP_model(date_col, response_col, train_df, test_df):\n", 172 | " ets = ETSMAP(\n", 173 | " response_col=response_col,\n", 174 | " date_col=date_col,\n", 175 | " seasonality=52,\n", 176 | " seed=8888,\n", 177 | " )\n", 178 | " \n", 179 | " ets.fit(df=train_df)\n", 180 | " predicted_df_MAP = ets.predict(df=test_df)\n", 181 | " \n", 182 | " return predicted_df_MAP['prediction']\n", 183 | "\n", 184 | "# Full Bayesian Estimation\n", 185 | "\n", 186 | "\n", 187 | "def ETSFull_model(date_col, response_col, train_df, test_df):\n", 188 | " ets = ETSFull(\n", 189 | " response_col=response_col,\n", 190 | " date_col=date_col,\n", 191 | " seasonality=52,\n", 192 | " seed=8888,\n", 193 | " num_warmup=400,\n", 194 | " num_sample=400,\n", 195 | " )\n", 196 | " \n", 197 | " ets.fit(df=train_df)\n", 198 | " predicted_df_ETSFull = ets.predict(df=test_df)\n", 199 | " \n", 200 | " return predicted_df_ETSFull['prediction']\n", 201 | "\n", 202 | "# Aggregated Posteriors\n", 203 | "\n", 204 | "def ETSAggregated_model(date_col, response_col, train_df, test_df):\n", 205 | " ets = ETSAggregated(\n", 206 | " response_col=response_col,\n", 207 | " 
date_col=date_col,\n", 208 | " seasonality=52,\n", 209 | " seed=8888,\n", 210 | " )\n", 211 | " ets.fit(df=train_df)\n", 212 | " predicted_df_ETSAggregated = ets.predict(df=test_df)\n", 213 | " \n", 214 | " return predicted_df_ETSAggregated['prediction']\n", 215 | "\n", 216 | "\n", 217 | "# Damped Local Trend (DLT)\n", 218 | "\n", 219 | "# Global Trend Configurations\n", 220 | "\n", 221 | "# Linear Global Trend\n", 222 | "\n", 223 | "# linear global trend\n", 224 | "def DLTMAP_lin(date_col, response_col, train_df, test_df):\n", 225 | " dlt = DLTMAP(\n", 226 | " response_col=response_col,\n", 227 | " date_col=date_col,\n", 228 | " seasonality=52,\n", 229 | " seed=8888,\n", 230 | " )\n", 231 | "\n", 232 | " dlt.fit(train_df)\n", 233 | " predicted_df_DLTMAP_lin = dlt.predict(test_df)\n", 234 | " \n", 235 | " return predicted_df_DLTMAP_lin['prediction']\n", 236 | "\n", 237 | "\n", 238 | "# log-linear global trend\n", 239 | "def DLTMAP_log_lin(date_col, response_col, train_df, test_df):\n", 240 | " dlt = DLTMAP(\n", 241 | " response_col=response_col,\n", 242 | " date_col=date_col,\n", 243 | " seasonality=52,\n", 244 | " seed=8888,\n", 245 | " global_trend_option='loglinear'\n", 246 | " )\n", 247 | "\n", 248 | " dlt.fit(train_df)\n", 249 | " predicted_df_DLTMAP_log_lin = dlt.predict(test_df)\n", 250 | " \n", 251 | " return predicted_df_DLTMAP_log_lin['prediction']\n", 252 | "\n", 253 | "\n", 254 | "# log-linear global trend\n", 255 | "def DLTMAP_flat(date_col, response_col, train_df, test_df):\n", 256 | " dlt = DLTMAP(\n", 257 | " response_col=response_col,\n", 258 | " date_col=date_col,\n", 259 | " seasonality=52,\n", 260 | " seed=8888,\n", 261 | " global_trend_option='flat'\n", 262 | " )\n", 263 | "\n", 264 | " dlt.fit(train_df)\n", 265 | " predicted_df_DLTMAP_flat = dlt.predict(test_df)\n", 266 | " \n", 267 | " return predicted_df_DLTMAP_flat['prediction']\n", 268 | "\n", 269 | "\n", 270 | "# logistic global trend\n", 271 | "def DLTMAP_logistic(date_col, response_col, train_df, test_df):\n", 272 | " dlt = DLTMAP(\n", 273 | " response_col=response_col,\n", 274 | " date_col=date_col,\n", 275 | " seasonality=52,\n", 276 | " seed=8888,\n", 277 | " global_trend_option='logistic'\n", 278 | " )\n", 279 | "\n", 280 | " dlt.fit(train_df)\n", 281 | " predicted_df_DLTMAP_logistic = dlt.predict(test_df)\n", 282 | " \n", 283 | " return predicted_df_DLTMAP_logistic['prediction']\n", 284 | "\n", 285 | "\n", 286 | "# Damped Local Trend Full Bayesian Estimation (DLTFull)\n", 287 | "\n", 288 | "def DLTFull_model(date_col, response_col, train_df, test_df):\n", 289 | " dlt = DLTFull(\n", 290 | " response_col=response_col,\n", 291 | " date_col=date_col,\n", 292 | " num_warmup=400,\n", 293 | " num_sample=400,\n", 294 | " seasonality=52,\n", 295 | " seed=8888\n", 296 | " )\n", 297 | " \n", 298 | " dlt.fit(df=train_df)\n", 299 | " predicted_df_DLTFull = dlt.predict(df=test_df)\n", 300 | "\n", 301 | " return predicted_df_DLTFull['prediction']\n", 302 | "\n", 303 | "\n", 304 | "# Damped Local Trend Full (DLTAggregated)\n", 305 | "\n", 306 | "def DLTAggregated_model(date_col, response_col, train_df, test_df):\n", 307 | " ets = DLTAggregated(\n", 308 | " response_col=response_col,\n", 309 | " date_col=date_col,\n", 310 | " seasonality=52,\n", 311 | " seed=8888,\n", 312 | " )\n", 313 | " \n", 314 | " ets.fit(df=train_df)\n", 315 | " predicted_df_DLTAggregated = ets.predict(df=test_df)\n", 316 | " \n", 317 | " return predicted_df_DLTAggregated['prediction']\n", 318 | "\n", 319 | "\n", 320 | "# Local Global Trend (LGT) 
Model\n", 321 | "\n", 322 | "# Local Global Trend Maximum a Posteriori (LGTMAP)\n", 323 | "\n", 324 | "def LGTMAP_model(date_col, response_col, train_df, test_df):\n", 325 | " lgt = LGTMAP(\n", 326 | " response_col=response_col,\n", 327 | " date_col=date_col,\n", 328 | " seasonality=52,\n", 329 | " seed=8888,\n", 330 | " )\n", 331 | "\n", 332 | " lgt.fit(df=train_df)\n", 333 | " predicted_df_LGTMAP = lgt.predict(df=test_df)\n", 334 | " \n", 335 | " return predicted_df_LGTMAP['prediction']\n", 336 | "\n", 337 | "# LGTFull\n", 338 | "\n", 339 | "def LGTFull_model(date_col, response_col, train_df, test_df):\n", 340 | " lgt = LGTFull(\n", 341 | " response_col=response_col,\n", 342 | " date_col=date_col,\n", 343 | " seasonality=52,\n", 344 | " seed=8888,\n", 345 | " )\n", 346 | "\n", 347 | " lgt.fit(df=train_df)\n", 348 | " predicted_df_LGTFull = lgt.predict(df=test_df)\n", 349 | " \n", 350 | " return predicted_df_LGTFull['prediction']\n", 351 | "\n", 352 | "# LGTAggregated\n", 353 | "\n", 354 | "def LGTAggregated_model(date_col, response_col, train_df, test_df):\n", 355 | " lgt = LGTAggregated(\n", 356 | " response_col=response_col,\n", 357 | " date_col=date_col,\n", 358 | " seasonality=52,\n", 359 | " seed=8888,\n", 360 | " )\n", 361 | "\n", 362 | " lgt.fit(df=train_df)\n", 363 | " predicted_df_LGTAggregated = lgt.predict(df=test_df)\n", 364 | " \n", 365 | " return predicted_df_LGTAggregated['prediction']\n", 366 | "\n", 367 | "# Using Pyro for Estimation\n", 368 | "\n", 369 | "# MAP Fit and Predict\n", 370 | "\n", 371 | "def LGTMAP_PyroEstimatorMAP(date_col, response_col, train_df, test_df):\n", 372 | " lgt_map = LGTMAP(\n", 373 | " response_col=response_col,\n", 374 | " date_col=date_col,\n", 375 | " seasonality=52,\n", 376 | " seed=8888,\n", 377 | " estimator_type=PyroEstimatorMAP,\n", 378 | " )\n", 379 | "\n", 380 | " lgt_map.fit(df=train_df)\n", 381 | " predicted_df_LGTMAP_pyro = lgt_map.predict(df=test_df)\n", 382 | " \n", 383 | " return predicted_df_LGTMAP_pyro['prediction']\n", 384 | "\n", 385 | "# VI Fit and Predict\n", 386 | "\n", 387 | "def LGTFull_pyro(date_col, response_col, train_df, test_df):\n", 388 | " lgt_vi = LGTFull(\n", 389 | " response_col=response_col,\n", 390 | " date_col=date_col,\n", 391 | " seasonality=52,\n", 392 | " seed=8888,\n", 393 | " num_steps=101,\n", 394 | " num_sample=100,\n", 395 | " learning_rate=0.1,\n", 396 | " n_bootstrap_draws=-1,\n", 397 | " estimator_type=PyroEstimatorVI,\n", 398 | " )\n", 399 | "\n", 400 | " lgt_vi.fit(df=train_df)\n", 401 | "\n", 402 | " predicted_df_LGTFull_pyro = lgt_vi.predict(df=test_df)\n", 403 | " \n", 404 | " return predicted_df_LGTFull_pyro['prediction']\n", 405 | "\n", 406 | "\n", 407 | "# Kernel-based Time-varying Regression (KTR)\n", 408 | "\n", 409 | "# KTRLite\n", 410 | "\n", 411 | "def ktrlite_MAP(date_col, response_col, train_df, test_df):\n", 412 | " ktrlite = KTRLiteMAP(\n", 413 | " response_col=response_col,\n", 414 | " #response_col=np.log(df[response_col]),\n", 415 | " date_col=date_col,\n", 416 | " level_knot_scale=.1,\n", 417 | " span_level=.05,\n", 418 | " )\n", 419 | " \n", 420 | " ktrlite.fit(train_df)\n", 421 | " \n", 422 | " predicted_df_ktrlite_MAP = ktrlite.predict(df=test_df, decompose=True)\n", 423 | " \n", 424 | " return predicted_df_ktrlite_MAP['prediction']" 425 | ], 426 | "outputs": [], 427 | "metadata": {} 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "source": [ 432 | "### Root-Mean-Square Deviation (RMSD) or Root-Mean-Square Error (RMSE)" 433 | ], 434 | "metadata": {} 435 | }, 436 | { 
437 | "cell_type": "code", 438 | "execution_count": null, 439 | "source": [ 440 | "def rmse(actual, pred): \n", 441 | " actual, pred = np.array(actual), np.array(pred)\n", 442 | " return np.sqrt(np.square(np.subtract(actual,pred)).mean())" 443 | ], 444 | "outputs": [], 445 | "metadata": {} 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "source": [ 451 | "def evaluating_models(index, column):\n", 452 | " \n", 453 | " '''\n", 454 | " Parameters:\n", 455 | " index: column index\n", 456 | " column: column name\n", 457 | " \n", 458 | " Returns:\n", 459 | " models_df: new dataframe with \n", 460 | " '''\n", 461 | "\n", 462 | " cocompared_values = compared_function(column) # the values with which the model values are compared\n", 463 | "\n", 464 | " tmp_df['Date'] = pd.to_datetime(df['Date'].astype(str))\n", 465 | " tmp_df['Penetration'] = df[column].astype(float)\n", 466 | " \n", 467 | " date_col = 'Date'\n", 468 | " response_col = 'Penetration'\n", 469 | " \n", 470 | " # Forecasting for N years ahead\n", 471 | " \n", 472 | " test_size = 4 # 5 - without 2021, 4 - with 2021\n", 473 | " train_df = tmp_df[:-test_size]\n", 474 | " test_df = tmp_df[-test_size:]\n", 475 | "\n", 476 | " # Decompose Prediction\n", 477 | "\n", 478 | " #train_df = tmp_df[tmp_df['Date'] < '2022-01-01']\n", 479 | " #test_df = tmp_df[tmp_df['Date'] <= '2025-01-01']\n", 480 | " \n", 481 | " models_df.at[index ,'Item Name'] = column\n", 482 | "\n", 483 | " \n", 484 | " # Making predictions with each model\n", 485 | " try:\n", 486 | " models_df.at[index , 'ETSMAP'] = rmse(\n", 487 | " cocompared_values, (ETSMAP_model(date_col, response_col, train_df, test_df))).astype(float)\n", 488 | " except:\n", 489 | " models_df.at[index , 'ETSMAP'] = 100\n", 490 | " try: \n", 491 | " models_df.at[index , 'ETSFull'] = rmse(\n", 492 | " cocompared_values, (ETSFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 493 | " except:\n", 494 | " models_df.at[index , 'ETSFull'] = 100\n", 495 | " try:\n", 496 | " models_df.at[index , 'ETSAggregated'] = rmse(\n", 497 | " cocompared_values, (ETSAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 498 | " except:\n", 499 | " models_df.at[index , 'ETSAggregated'] = 100\n", 500 | "\n", 501 | " \n", 502 | " try:\n", 503 | " models_df.at[index , 'DLTMAP_lin'] = rmse(\n", 504 | " cocompared_values, (DLTMAP_lin(date_col, response_col, train_df, test_df))).astype(float)\n", 505 | " except:\n", 506 | " models_df.at[index , 'DLTMAP_lin'] = 100\n", 507 | " try:\n", 508 | " models_df.at[index , 'DLTMAP_log_lin'] = rmse(\n", 509 | " cocompared_values, (DLTMAP_log_lin(date_col, response_col, train_df, test_df))).astype(float)\n", 510 | " except:\n", 511 | " models_df.at[index , 'DLTMAP_log_lin'] = 100\n", 512 | " try:\n", 513 | " models_df.at[index , 'DLTMAP_flat'] = rmse(\n", 514 | " cocompared_values, (DLTMAP_flat(date_col, response_col, train_df, test_df))).astype(float)\n", 515 | " except:\n", 516 | " models_df.at[index , 'DLTMAP_flat'] = 100\n", 517 | " try:\n", 518 | " models_df.at[index , 'DLTMAP_logistic'] = rmse(\n", 519 | " cocompared_values, (DLTMAP_logistic(date_col, response_col, train_df, test_df))).astype(float)\n", 520 | " except:\n", 521 | " models_df.at[index , 'DLTMAP_logistic'] = 100\n", 522 | " try: \n", 523 | " models_df.at[index , 'DLTFull'] = rmse(\n", 524 | " cocompared_values, (DLTFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 525 | " except:\n", 526 | " models_df.at[index , 
'DLTFull'] = 100\n", 527 | " try:\n", 528 | " models_df.at[index , 'DLTAggregated'] = rmse(\n", 529 | " cocompared_values, (DLTAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 530 | " except: \n", 531 | " models_df.at[index , 'DLTAggregated'] = 100\n", 532 | " \n", 533 | " \n", 534 | " try:\n", 535 | " models_df.at[index , 'LGTMAP'] = rmse(\n", 536 | " cocompared_values, (LGTMAP_model(date_col, response_col, train_df, test_df))).astype(float)\n", 537 | " except:\n", 538 | " models_df.at[index , 'LGTMAP'] = 100\n", 539 | " try:\n", 540 | " models_df.at[index , 'LGTFull'] = rmse(\n", 541 | " cocompared_values, (LGTFull_model(date_col, response_col, train_df, test_df))).astype(float)\n", 542 | " except: \n", 543 | " models_df.at[index , 'LGTFull'] = 100\n", 544 | " try: \n", 545 | " models_df.at[index , 'LGTAggregated'] = rmse(\n", 546 | " cocompared_values, (LGTAggregated_model(date_col, response_col, train_df, test_df))).astype(float)\n", 547 | " except:\n", 548 | " models_df.at[index , 'LGTAggregated'] = 100\n", 549 | "\n", 550 | " \n", 551 | " # Using Pyro for Estimation\n", 552 | " try:\n", 553 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = rmse(\n", 554 | " cocompared_values, (LGTMAP_PyroEstimatorMAP(date_col, \n", 555 | " response_col, train_df, test_df))).astype(float)\n", 556 | " except:\n", 557 | " models_df.at[index , 'LGTMAP_PyroEstimatorMAP'] = 100\n", 558 | " try:\n", 559 | " models_df.at[index , 'LGTFull_pyro4'] = rmse(\n", 560 | " cocompared_values, (LGTFull_pyro(date_col, response_col, train_df, test_df))).astype(float)\n", 561 | " except:\n", 562 | " models_df.at[index , 'LGTFull_pyro4'] = 100\n", 563 | " \n", 564 | " # Kernel-based Time-varying Regression (KTR)\n", 565 | " try:\n", 566 | " models_df.at[index , 'KTR_Lite_MAP'] = rmse(\n", 567 | " cocompared_values, (ktrlite_MAP(date_col, response_col, train_df, test_df))).astype(float)\n", 568 | " except:\n", 569 | " models_df.at[index , 'KTR_Lite_MAP'] = 100\n", 570 | " \n", 571 | " \n", 572 | " models_df.at[index, 'Curve Type'] = df[column].iloc[-1]\n", 573 | " \n", 574 | " \n", 575 | " return models_df" 576 | ], 577 | "outputs": [], 578 | "metadata": {} 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "source": [ 583 | "### Calculating minimal RMSE value for each item" 584 | ], 585 | "metadata": {} 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": null, 590 | "source": [ 591 | "def min_value(df):\n", 592 | " \n", 593 | " '''\n", 594 | " Parameters:\n", 595 | " df: input dataframe with multiple columns and values in a row\n", 596 | " \n", 597 | " Returns:\n", 598 | " models_df: existing dataframe with added the new 'Model' column filled with \n", 599 | " the name of the best-fitted model for each item\n", 600 | " '''\n", 601 | " \n", 602 | " df.iloc[:, 1:-1].apply(pd.to_numeric)\n", 603 | " df['Model'] = df.iloc[:, 1:-1].idxmin(axis=1)\n", 604 | " \n", 605 | " return models_df" 606 | ], 607 | "outputs": [], 608 | "metadata": {} 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "source": [ 613 | "### Evaluating Orbit models for each item" 614 | ], 615 | "metadata": {} 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": null, 620 | "source": [ 621 | "import time\n", 622 | "\n", 623 | "\n", 624 | "tmp_df = pd.DataFrame()\n", 625 | "models_df = pd.DataFrame()\n", 626 | "\n", 627 | "start = time.time()\n", 628 | "\n", 629 | "for index, column in enumerate(curve_df.columns[1:2]):\n", 630 | " evaluating_models(index, column)\n", 631 | " \n", 632 | 
"end = time.time()\n", 633 | "print(end - start)" 634 | ], 635 | "outputs": [], 636 | "metadata": {} 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "source": [ 642 | "models_df" 643 | ], 644 | "outputs": [], 645 | "metadata": {} 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": null, 650 | "source": [ 651 | "min_value(models_df)" 652 | ], 653 | "outputs": [], 654 | "metadata": {} 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "source": [ 659 | "# \n", 660 | "\n", 661 | "-------------------------\n", 662 | "\n", 663 | "## ETS (which stands for Error, Trend, and Seasonality)\n", 664 | "\n", 665 | "### Methods of Estimations\n", 666 | "\n", 667 | "#### Maximum a Posteriori (MAP)\n", 668 | "\n", 669 | "> The advantage of MAP estimation is a **faster computational speed**." 670 | ], 671 | "metadata": {} 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": null, 676 | "source": [ 677 | "def ETSMAP_model(date_col, response_col, train_df, test_df):\n", 678 | " ets = ETSMAP(\n", 679 | " response_col=response_col,\n", 680 | " date_col=date_col,\n", 681 | " seasonality=52,\n", 682 | " seed=8888,\n", 683 | " )\n", 684 | " \n", 685 | " ets.fit(df=train_df)\n", 686 | " predicted_df_MAP = ets.predict(df=test_df)\n", 687 | " \n", 688 | " return predicted_df_MAP['prediction']" 689 | ], 690 | "outputs": [], 691 | "metadata": {} 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "source": [ 696 | "## Full Bayesian Estimation\n", 697 | "\n", 698 | "> Compared to MAP, it usually takes longer time to fit a full Bayesian models where **No-U-Turn Sampler (NUTS)** ([Hoffman and Gelman 2011](https://arxiv.org/abs/1111.4246)) is carried out under the hood. The advantage is that the inference and estimation are usually more robust." 699 | ], 700 | "metadata": {} 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "source": [ 706 | "def ETSFull_model(date_col, response_col, train_df, test_df):\n", 707 | " ets = ETSFull(\n", 708 | " response_col=response_col,\n", 709 | " date_col=date_col,\n", 710 | " seasonality=52,\n", 711 | " seed=8888,\n", 712 | " num_warmup=400,\n", 713 | " num_sample=400,\n", 714 | " )\n", 715 | " \n", 716 | " ets.fit(df=train_df)\n", 717 | " predicted_df_ETSFull = ets.predict(df=test_df)\n", 718 | " \n", 719 | " return predicted_df_ETSFull['prediction']" 720 | ], 721 | "outputs": [], 722 | "metadata": {} 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "source": [ 727 | "## Aggregated Posteriors\n", 728 | "\n", 729 | "> Just like the full Bayesian method, it runs through the MCMC algorithm which is **NUTS** by default. The difference from a full model is that aggregated model first aggregates the posterior samples based on mean or median (via aggregate_method) then does the prediction using the aggreated posterior." 
730 | ], 731 | "metadata": {} 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "source": [ 737 | "def ETSAggregated_model(date_col, response_col, train_df, test_df):\n", 738 | " ets = ETSAggregated(\n", 739 | " response_col=response_col,\n", 740 | " date_col=date_col,\n", 741 | " seasonality=52,\n", 742 | " seed=8888,\n", 743 | " )\n", 744 | " ets.fit(df=train_df)\n", 745 | " predicted_df_ETSAggregated = ets.predict(df=test_df)\n", 746 | " \n", 747 | " return predicted_df_ETSAggregated['prediction']" 748 | ], 749 | "outputs": [], 750 | "metadata": {} 751 | }, 752 | { 753 | "cell_type": "markdown", 754 | "source": [ 755 | "## Damped Local Trend (DLT)\n", 756 | "\n", 757 | "> DLT is one of the main exponential smoothing models support in `orbit`. Performance is benchmarked with M3 monthly, M4 weekly dataset and some Uber internal dataset ([Ng and Wang et al., 2020](https://arxiv.org/abs/2004.08492)). The model is a fusion between the classical ETS ([Hyndman et. al., 2008](http://www.exponentialsmoothing.net/home))) with some refinement leveraging ideas from Rlgt ([Smyl et al., 2019](https://cran.r-project.org/web/packages/Rlgt/index.html)). \n", 758 | "\n", 759 | "### Global Trend Configurations\n", 760 | "\n", 761 | "#### Linear Global Trend" 762 | ], 763 | "metadata": {} 764 | }, 765 | { 766 | "cell_type": "code", 767 | "execution_count": null, 768 | "source": [ 769 | "# linear global trend\n", 770 | "def DLTMAP_lin(date_col, response_col, train_df, test_df):\n", 771 | " dlt = DLTMAP(\n", 772 | " response_col=response_col,\n", 773 | " date_col=date_col,\n", 774 | " seasonality=52,\n", 775 | " seed=8888,\n", 776 | " )\n", 777 | "\n", 778 | " dlt.fit(train_df)\n", 779 | " predicted_df_DLTMAP_lin = dlt.predict(test_df)\n", 780 | " \n", 781 | " return predicted_df_DLTMAP_lin['prediction']" 782 | ], 783 | "outputs": [], 784 | "metadata": {} 785 | }, 786 | { 787 | "cell_type": "code", 788 | "execution_count": null, 789 | "source": [ 790 | "# log-linear global trend\n", 791 | "def DLTMAP_log_lin(date_col, response_col, train_df, test_df):\n", 792 | " dlt = DLTMAP(\n", 793 | " response_col=response_col,\n", 794 | " date_col=date_col,\n", 795 | " seasonality=52,\n", 796 | " seed=8888,\n", 797 | " global_trend_option='loglinear'\n", 798 | " )\n", 799 | "\n", 800 | " dlt.fit(train_df)\n", 801 | " predicted_df_DLTMAP_log_lin = dlt.predict(test_df)\n", 802 | " \n", 803 | " return predicted_df_DLTMAP_log_lin['prediction']" 804 | ], 805 | "outputs": [], 806 | "metadata": {} 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": null, 811 | "source": [ 812 | "# log-linear global trend\n", 813 | "def DLTMAP_flat(date_col, response_col, train_df, test_df):\n", 814 | " dlt = DLTMAP(\n", 815 | " response_col=response_col,\n", 816 | " date_col=date_col,\n", 817 | " seasonality=52,\n", 818 | " seed=8888,\n", 819 | " global_trend_option='flat'\n", 820 | " )\n", 821 | "\n", 822 | " dlt.fit(train_df)\n", 823 | " predicted_df_DLTMAP_flat = dlt.predict(test_df)\n", 824 | " \n", 825 | " return predicted_df_DLTMAP_flat['prediction']" 826 | ], 827 | "outputs": [], 828 | "metadata": {} 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "source": [ 833 | "#### Logistic Global Trend" 834 | ], 835 | "metadata": {} 836 | }, 837 | { 838 | "cell_type": "code", 839 | "execution_count": null, 840 | "source": [ 841 | "# logistic global trend\n", 842 | "def DLTMAP_logistic(date_col, response_col, train_df, test_df):\n", 843 | " dlt = DLTMAP(\n", 844 | " 
response_col=response_col,\n", 845 | " date_col=date_col,\n", 846 | " seasonality=52,\n", 847 | " seed=8888,\n", 848 | " global_trend_option='logistic'\n", 849 | " )\n", 850 | "\n", 851 | " dlt.fit(train_df)\n", 852 | " predicted_df_DLTMAP_logistic = dlt.predict(test_df)\n", 853 | " \n", 854 | " return predicted_df_DLTMAP_logistic['prediction']" 855 | ], 856 | "outputs": [], 857 | "metadata": {} 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "source": [ 862 | "### Damped Local Trend Full Bayesian Estimation (DLTFull)" 863 | ], 864 | "metadata": {} 865 | }, 866 | { 867 | "cell_type": "code", 868 | "execution_count": null, 869 | "source": [ 870 | "def DLTFull_model(date_col, response_col, train_df, test_df):\n", 871 | " dlt = DLTFull(\n", 872 | " response_col=response_col,\n", 873 | " date_col=date_col,\n", 874 | " num_warmup=400,\n", 875 | " num_sample=400,\n", 876 | " seasonality=52,\n", 877 | " seed=8888\n", 878 | " )\n", 879 | " \n", 880 | " dlt.fit(df=train_df)\n", 881 | " predicted_df_DLTFull = dlt.predict(df=test_df)\n", 882 | "\n", 883 | " return predicted_df_DLTFull['prediction']" 884 | ], 885 | "outputs": [], 886 | "metadata": {} 887 | }, 888 | { 889 | "cell_type": "markdown", 890 | "source": [ 891 | "### Damped Local Trend Full (DLTAggregated)" 892 | ], 893 | "metadata": {} 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": null, 898 | "source": [ 899 | "def DLTAggregated_model(date_col, response_col, train_df, test_df):\n", 900 | " ets = DLTAggregated(\n", 901 | " response_col=response_col,\n", 902 | " date_col=date_col,\n", 903 | " seasonality=52,\n", 904 | " seed=8888,\n", 905 | " )\n", 906 | " \n", 907 | " ets.fit(df=train_df)\n", 908 | " predicted_df_DLTAggregated = ets.predict(df=test_df)\n", 909 | " \n", 910 | " return predicted_df_DLTAggregated['prediction']" 911 | ], 912 | "outputs": [], 913 | "metadata": {} 914 | }, 915 | { 916 | "cell_type": "markdown", 917 | "source": [ 918 | "## Local Global Trend (LGT) Model\n", 919 | "\n", 920 | "> LGT stands for Local and Global Trend and is a refined model from **Rlgt** ([Smyl et al., 2019](https://cran.r-project.org/web/packages/Rlgt/index.html)). The main difference is that LGT is an additive form taking log-transformation response as the modeling response. This essentially converts the model into a multicplicative with some advantages ([Ng and Wang et al., 2020](https://arxiv.org/abs/2004.08492)). \n", 921 | "\n", 922 | "### Local Global Trend Maximum a Posteriori (LGTMAP)\n", 923 | "\n", 924 | "> LGTMAP is the model class for MAP (Maximum a Posteriori) estimation." 925 | ], 926 | "metadata": {} 927 | }, 928 | { 929 | "cell_type": "code", 930 | "execution_count": null, 931 | "source": [ 932 | "def LGTMAP_model(date_col, response_col, train_df, test_df):\n", 933 | " lgt = LGTMAP(\n", 934 | " response_col=response_col,\n", 935 | " date_col=date_col,\n", 936 | " seasonality=52,\n", 937 | " seed=8888,\n", 938 | " )\n", 939 | "\n", 940 | " lgt.fit(df=train_df)\n", 941 | " predicted_df_LGTMAP = lgt.predict(df=test_df)\n", 942 | " \n", 943 | " return predicted_df_LGTMAP['prediction']" 944 | ], 945 | "outputs": [], 946 | "metadata": {} 947 | }, 948 | { 949 | "cell_type": "markdown", 950 | "source": [ 951 | "### LGTFull\n", 952 | "\n", 953 | "> LGTFull is the model class for full Bayesian prediction. In full Bayesian prediction, the prediction will be conducted once for each parameter posterior sample, and the final prediction results are aggregated. 
Prediction will always return the median (aka 50th percentile) along with any additional percentiles that are provided." 954 | ], 955 | "metadata": {} 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": null, 960 | "source": [ 961 | "def LGTFull_model(date_col, response_col, train_df, test_df):\n", 962 | " lgt = LGTFull(\n", 963 | " response_col=response_col,\n", 964 | " date_col=date_col,\n", 965 | " seasonality=52,\n", 966 | " seed=8888,\n", 967 | " )\n", 968 | "\n", 969 | " lgt.fit(df=train_df)\n", 970 | " predicted_df_LGTFull = lgt.predict(df=test_df)\n", 971 | " \n", 972 | " return predicted_df_LGTFull['prediction']" 973 | ], 974 | "outputs": [], 975 | "metadata": {} 976 | }, 977 | { 978 | "cell_type": "markdown", 979 | "source": [ 980 | "#### Observations\n", 981 | "\n", 982 | "- **Time:** This model takes longer time to fit a full Bayesian models where No-U-Turn Sampler (NUTS) ([Hoffman and Gelman 2011](https://arxiv.org/abs/1111.4246)) is carried out under the hood." 983 | ], 984 | "metadata": {} 985 | }, 986 | { 987 | "cell_type": "markdown", 988 | "source": [ 989 | "### LGTAggregated\n", 990 | "\n", 991 | "> LGTAggregated is the model class for aggregated posterior prediction. In aggregated prediction, the parameter posterior samples are reduced using `aggregate_method ({ 'mean', 'median' })` before performing a single prediction." 992 | ], 993 | "metadata": {} 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": null, 998 | "source": [ 999 | "def LGTAggregated_model(date_col, response_col, train_df, test_df):\n", 1000 | " lgt = LGTAggregated(\n", 1001 | " response_col=response_col,\n", 1002 | " date_col=date_col,\n", 1003 | " seasonality=52,\n", 1004 | " seed=8888,\n", 1005 | " )\n", 1006 | "\n", 1007 | " lgt.fit(df=train_df)\n", 1008 | " predicted_df_LGTAggregated = lgt.predict(df=test_df)\n", 1009 | " \n", 1010 | " return predicted_df_LGTAggregated['prediction']" 1011 | ], 1012 | "outputs": [], 1013 | "metadata": {} 1014 | }, 1015 | { 1016 | "cell_type": "markdown", 1017 | "source": [ 1018 | "## Using Pyro for Estimation\n", 1019 | "\n", 1020 | "> Currently are still experimenting Pyro and support Pyro only with LGT.\n", 1021 | "\n", 1022 | "### MAP Fit and Predict" 1023 | ], 1024 | "metadata": {} 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "source": [ 1030 | "def LGTMAP_PyroEstimatorMAP(date_col, response_col, train_df, test_df):\n", 1031 | " lgt_map = LGTMAP(\n", 1032 | " response_col=response_col,\n", 1033 | " date_col=date_col,\n", 1034 | " seasonality=52,\n", 1035 | " seed=8888,\n", 1036 | " estimator_type=PyroEstimatorMAP,\n", 1037 | " )\n", 1038 | "\n", 1039 | " lgt_map.fit(df=train_df)\n", 1040 | " predicted_df_LGTMAP_pyro = lgt_map.predict(df=test_df)\n", 1041 | " \n", 1042 | " return predicted_df_LGTMAP_pyro['prediction']" 1043 | ], 1044 | "outputs": [], 1045 | "metadata": {} 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "source": [ 1050 | "### VI Fit and Predict" 1051 | ], 1052 | "metadata": {} 1053 | }, 1054 | { 1055 | "cell_type": "code", 1056 | "execution_count": null, 1057 | "source": [ 1058 | "def LGTFull_pyro(date_col, response_col, train_df, test_df):\n", 1059 | " lgt_vi = LGTFull(\n", 1060 | " response_col=response_col,\n", 1061 | " date_col=date_col,\n", 1062 | " seasonality=52,\n", 1063 | " seed=8888,\n", 1064 | " num_steps=101,\n", 1065 | " num_sample=100,\n", 1066 | " learning_rate=0.1,\n", 1067 | " n_bootstrap_draws=-1,\n", 1068 | " estimator_type=PyroEstimatorVI,\n", 
1069 | " )\n", 1070 | "\n", 1071 | " lgt_vi.fit(df=train_df)\n", 1072 | "\n", 1073 | " predicted_df_LGTFull_pyro = lgt_vi.predict(df=test_df)\n", 1074 | " \n", 1075 | " return predicted_df_LGTFull_pyro['prediction']" 1076 | ], 1077 | "outputs": [], 1078 | "metadata": {} 1079 | }, 1080 | { 1081 | "cell_type": "markdown", 1082 | "source": [ 1083 | "## Kernel-based Time-varying Regression (KTR)\n", 1084 | "\n", 1085 | "> Implemented in the stable version of Orbit library\n", 1086 | ">\n", 1087 | "> [Documentation](https://orbit-ml.readthedocs.io/en/stable/tutorials/ktrlite.html)\n", 1088 | "\n", 1089 | "### KTRLite" 1090 | ], 1091 | "metadata": {} 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": null, 1096 | "source": [ 1097 | "def ktrlite_MAP(date_col, response_col, train_df, test_df):\n", 1098 | " ktrlite = KTRLiteMAP(\n", 1099 | " response_col=response_col,\n", 1100 | " #response_col=np.log(df[response_col]),\n", 1101 | " date_col=date_col,\n", 1102 | " level_knot_scale=.1,\n", 1103 | " span_level=.05,\n", 1104 | " )\n", 1105 | " \n", 1106 | " ktrlite.fit(train_df)\n", 1107 | " \n", 1108 | " predicted_df_ktrlite_MAP = ktrlite.predict(df=test_df, decompose=True)\n", 1109 | " \n", 1110 | " return predicted_df_ktrlite_MAP['prediction']" 1111 | ], 1112 | "outputs": [], 1113 | "metadata": {} 1114 | } 1115 | ], 1116 | "metadata": { 1117 | "instance_type": "ml.g4dn.xlarge", 1118 | "interpreter": { 1119 | "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" 1120 | }, 1121 | "kernelspec": { 1122 | "display_name": "Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)", 1123 | "language": "python", 1124 | "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/pytorch-1.6-gpu-py36-cu110-ubuntu18.04-v3" 1125 | }, 1126 | "language_info": { 1127 | "codemirror_mode": { 1128 | "name": "ipython", 1129 | "version": 3 1130 | }, 1131 | "file_extension": ".py", 1132 | "mimetype": "text/x-python", 1133 | "name": "python", 1134 | "nbconvert_exporter": "python", 1135 | "pygments_lexer": "ipython3", 1136 | "version": "3.6.13" 1137 | } 1138 | }, 1139 | "nbformat": 4, 1140 | "nbformat_minor": 4 1141 | } 1142 | -------------------------------------------------------------------------------- /Outliers/README.md: -------------------------------------------------------------------------------- 1 | # Outliers / Anomaly Detection in Time Series 2 | 3 | ## 📰 Articles 4 | 5 | - [TODS: Detecting Different Types of Outliers from Time Series Data](https://towardsdatascience.com/tods-detecting-outliers-from-time-series-data-2d4bd2e91381) 6 | - [Understanding outliers in time series analysis](https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/understanding-outliers-in-time-series-analysis.htm) 7 | - [Anomaly Detection in Time Series](https://neptune.ai/blog/anomaly-detection-in-time-series) 8 | - [What are outliers in the data?](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## 🎓 University Courses 4 | 5 | | Title | Description | 6 | | :---: | :--- | 7 | |[MIT 18.S096 Topics in Mathematics w Applications in Finance](https://ocw.mit.edu/courses/mathematics/18-s096-topics-in-mathematics-with-applications-in-finance-fall-2013/)|| 8 | 9 | ## Courses 10 | 11 | - [Time Series 
Forecasting](https://www.udacity.com/course/time-series-forecasting--ud980), Udacity 12 | > The Time Series Forecasting course provides students with the foundational knowledge to build and apply time series forecasting models in a variety of business contexts 13 | - [Methods of Time Series Analysis and Forecasting (Методы анализа и прогнозирования временных рядов)](https://openedu.ru/course/urfu/METHODS/) 14 | 15 | ## 📚 Books 16 | 17 | - [Forecasting: Principles and Practice](https://otexts.com/fpp2/#) **(2nd ed)** by Rob J Hyndman and George Athanasopoulos, Monash University, Australia 18 | > This textbook is intended to provide a comprehensive introduction to forecasting methods and to present enough information about each method for readers to be able to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details. 19 | - [Practical Time Series Analysis: Prediction with Statistics and Machine Learning](https://www.amazon.com/Practical-Time-Analysis-Prediction-Statistics/dp/1492041653) by Aileen Nielsen 20 | - [Introduction to Time Series Forecasting With Python](https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/) by Jason Brownlee 21 | - [Analysis of Financial Time Series](https://www.amazon.com/Analysis-Financial-Time-Ruey-Tsay/dp/0470414359/) by Ruey S. Tsay 22 | - [Economic Forecasting](https://press.princeton.edu/books/hardcover/9780691140131/economic-forecasting) by Graham Elliott and Allan Timmermann, 2016 23 | - [Forecasting Economic Time Series](https://www.amazon.com/Forecasting-Economic-ECONOMETRICS-MATHEMATICAL-ECONOMICS-ebook/dp/B01DY7LSWO) (ECONOMIC THEORY, ECONOMETRICS, AND MATHEMATICAL ECONOMICS) by C. W. J. Granger, 1986 24 | - [sits: Data Analysis and Machine Learning on Earth Observation Data Cubes with Satellite Image Time Series](https://e-sensing.github.io/sitsbook/) 25 | - [Bayesian Time Series Models](https://www.amazon.com/Bayesian-Time-Models-David-Barber/dp/0521196760) by David Barber, A. 
Taylan Cemgil, Silvia Chiappa 26 | 27 | ## YouTube 28 | 29 | - [Time Series Analysis](https://www.youtube.com/playlist?list=PLvcbYUQ5t0UHOLnBzl46_Q6QKtFgfMGc3) playlist 30 | 31 | ## Blogs 32 | 33 | - []() 34 | - []() 35 | - []() 36 | 37 | ## Articles 38 | 39 | - [An overview of time series forecasting models](https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb) 40 | > We describe 10 forecasting models and we apply them to predict the evolution of an industrial production index 41 | - [21 Great Articles and Tutorials on Time Series](https://www.datasciencecentral.com/profiles/blogs/21-great-articles-and-tutorials-on-time-series?utm_content=bufferfc0a4&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer) 42 | - [Introduction to the Fundamentals of Time Series Data and Analysis](https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/) 43 | - [The Complete Guide to Time Series Analysis and Forecasting](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775) 44 | 45 | ### Small Time Series Dataset 46 | 47 | - [Making predictions on a very small time series dataset](https://medium.com/@amit.arora15/making-predictions-using-a-very-small-dataset-230dd579dca8) 48 | - [Forecasting very short time series](https://otexts.com/fpp2/long-short-ts.html) 49 | 50 | ### Topic specific 51 | 52 | - [Difference between **estimation** and **prediction**?](https://stats.stackexchange.com/questions/17773/what-is-the-difference-between-estimation-and-prediction) 53 | 54 | ## Models, Algorithms 55 | 56 | | Title | Description, key points, related links | 57 | | :---: | :--- | 58 | |**MEAN**|| 59 | 60 | ## Tools 61 | 62 | ### Frameworks 63 | 64 | | Title | Description | 65 | | :---: | :--- | 66 | |[Kats](https://facebookresearch.github.io/Kats/) by Facebook|Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis. Time series analysis is an essential component of Data Science and Engineering work in industry, from understanding the key statistics and characteristics, detecting regressions and anomalies, to forecasting future trends. Kats aims to provide the one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, etc. Kats is released by Facebook's Infrastructure Data Science team. It is available for download on [PyPI](https://pypi.org/project/kats/).| 67 | |[Prophet](https://facebook.github.io/prophet/) by Facebook|| 68 | |[Orbit](https://eng.uber.com/orbit/) by Uber|Orbit (**O**bject-**OR**iented **B**ayes**I**an **T**ime Series) is a general interface for **Bayesian exponential smoothing model**. The goal of the Orbit development team is to create a tool that is easy to use, flexible, interpretable, and high performing (fast computation). Under the hood, Orbit uses the probabilistic programming languages (PPL) including but not limited to Stan and Pyro for posterior approximation (i.e., MCMC sampling, SVI). Below is a quadrant chart to position a few time series related packages in our assessment in terms of flexibility and completeness. Orbit is the only tool that allows for easy model specification and analysis while not limiting itself to a small subset of models. For example, Prophet has a complete end to end solution but only has one model type and Pyro has total specification model flexibility but does not give an end to end solution. 
Thus Orbit bridges the gap between business problems and statistical solutions.| 69 | |[Greykite](https://linkedin.github.io/greykite/) by LinkedIn|| 70 | |[statsmodels](https://www.statsmodels.org/stable/index.html)| statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Modified BSD (3-clause) license. The online documentation is hosted at statsmodels.org.| 71 | |[Merlion](https://opensource.salesforce.com/Merlion/v1.0.0/index.html)|Merlion is a Python library for time series intelligence. It features a unified interface for many commonly used models and datasets for anomaly detection and forecasting on both univariate and multivariate time series, along with standard pre-processing and post-processing layers. It has several modules to improve ease-of-use, including visualization, anomaly score calibration to improve interpretability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides a unique evaluation framework that simulates the live deployment and re-training of a model in production. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs, and benchmark them across multiple time series datasets.| 72 | |[Auto_TS: Auto_TimeSeries](https://github.com/AutoViML/Auto_TS)|Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.| 73 | |[TensorFlow Probability](https://www.tensorflow.org/probability)|TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU). It's for data scientists, statisticians, ML researchers, and practitioners who want to encode domain knowledge to understand data and make predictions.| 74 | |[Pyro](https://pyro.ai)|| 75 | |[ArviZ: Exploratory analysis of Bayesian models](https://arviz-devs.github.io/arviz/#)|ArviZ is a Python package for exploratory analysis of Bayesian models. 
Includes functions for posterior analysis, data storage, sample diagnostics, model checking, and comparison.| 76 | |[PyStan](https://pystan.readthedocs.io/en/latest/)|PyStan is a Python interface to Stan, a package for Bayesian inference.| 77 | |[StatsForecast](https://github.com/Nixtla/statsforecast)|| 78 | | Transformers || 79 | 80 | 81 | ## GitHub Repositories :octocat: 82 | 83 | | Title | Description | 84 | | :---: | :--- | 85 | |[awesome_time_series_in_python](https://github.com/MaxBenChrist/awesome_time_series_in_python)|This curated list contains python packages for time series analysis| 86 | 87 | ## Podcasts 🎧 88 | 89 | | Title | Description | 90 | | :---: | :--- | 91 | |[Data Skeptic](https://podcasts.apple.com/ua/podcast/data-skeptic/id890348705)|Episode - [Forecasting Principles and Practice](https://podcasts.apple.com/ua/podcast/forecasting-principles-and-practice/id890348705?i=1000522928916)| 92 | |[Seriously Social](https://podcasts.apple.com/ua/podcast/seriously-social/id1509419418)|Episode - [Forecasting the future: the science of prediction](https://podcasts.apple.com/ua/podcast/forecasting-the-future-the-science-of-prediction/id1509419418?i=1000516647970) | 93 | |[The Curious Quant](https://podcasts.apple.com/ua/podcast/the-curious-quant/id1481550488)|Episode - [Forecasting COVID, time series, and why causality doesnt matter as much as you think‪](https://podcasts.apple.com/ua/podcast/ep20-prof-rob-hyndman-forecasting-covid-time-series/id1481550488?i=1000485268452)| 94 | |[Forecasting Impact](https://podcasts.apple.com/ua/podcast/forecasting-impact/id1550584556)|| 95 | |[The Random Sample](https://podcasts.apple.com/ua/podcast/the-random-sample/id1439750898)|Episode - [Forecasting the future & the future of forecasting](https://podcasts.apple.com/ua/podcast/forecasting-the-future-the-future-of-forecasting/id1439750898?i=1000475866199) | 96 | |[Thought Capital](https://podcasts.apple.com/ua/podcast/thought-capital/id1434491776)|Episode - [Forecasts are always wrong (but we need them anyway)](https://podcasts.apple.com/ua/podcast/forecasts-are-always-wrong-but-we-need-them-anyway/id1434491776?i=1000452853638)| 97 | -------------------------------------------------------------------------------- /header.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ElizaLo/Time-Series/42483d39a09e8945b0650227504fd064f1898219/header.png --------------------------------------------------------------------------------