├── .gitignore
├── LICENSE
├── README.md
├── gpu
│   ├── 01-Intro_to_cuDF.ipynb
│   ├── 01-Intro_to_cuGraph.ipynb
│   ├── 01-Introduction-LinearRegression-Hyperparam.ipynb
│   ├── 02-Intro_to_cuDF_UDFs.ipynb
│   ├── 02-LogisticRegression.ipynb
│   ├── 02-Louvain.ipynb
│   ├── 03-Pagerank.ipynb
│   ├── 03-UMAP.ipynb
│   ├── Introduction.ipynb
│   ├── README.md
│   ├── data
│   │   └── karate-data.csv
│   └── index.ipynb
└── modern_time_series_analysis
    ├── ModernTimeSeriesAnalysis
    │   ├── DeepLearning
    │   │   ├── Electricity
    │   │   │   ├── 5_Forecasting_electric_use_with_mxnet_INSTRUCTOR.ipynb
    │   │   │   ├── 5_Forecasting_electric_use_with_mxnet_STUDENT.ipynb
    │   │   │   ├── electricity.diff.txt
    │   │   │   ├── models.json
    │   │   │   ├── perf.py
    │   │   │   └── run.json
    │   │   └── Stocks
    │   │       ├── 6_Stocks_INSTRUCTOR.ipynb
    │   │       ├── 6_Stocks_STUDENT.ipynb
    │   │       └── sp500.csv
    │   ├── MachineLearning
    │   │   ├── 3_Trees_for_Classification_and_Prediction_INSTRUCTOR.ipynb
    │   │   ├── 3_Trees_for_Classification_and_Prediction_STUDENT.ipynb
    │   │   ├── 4_Clustering_INSTRUCTOR.ipynb
    │   │   ├── 4_Clustering_STUDENT.ipynb
    │   │   ├── data
    │   │   │   ├── 50words.csv
    │   │   │   ├── AirPassengers.csv
    │   │   │   ├── featurized_hists.csv
    │   │   │   ├── featurized_words.csv
    │   │   │   ├── full_eeg_data_features.csv
    │   │   │   ├── pairwise_word_distances.npy
    │   │   │   └── training_eeg.csv
    │   │   ├── dtaidistance
    │   │   │   ├── __init__.py
    │   │   │   ├── alignment.py
    │   │   │   ├── clustering.py
    │   │   │   ├── dp.py
    │   │   │   ├── dtw.py
    │   │   │   ├── dtw_c.pyx
    │   │   │   ├── dtw_ndim.py
    │   │   │   ├── dtw_ndim_visualisation.py
    │   │   │   ├── dtw_visualisation.py
    │   │   │   ├── dtw_weighted.py
    │   │   │   └── util.py
    │   │   └── full_eeg_data_features.csv
    │   └── StateSpaceModels
    │       ├── 1_Structural_Time_Series_INSTRUCTOR.ipynb
    │       ├── 1_Structural_Time_Series_STUDENT.ipynb
    │       ├── 2_Gaussian_HMM_INSTRUCTOR.ipynb
    │       ├── 2_Gaussian_HMM_STUDENT.ipynb
    │       ├── Nile.csv
    │       └── global_temps.csv
    ├── README.md
    ├── SciPyModernTimeSeries.pdf
    └── requirements.txt

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 John Stilley 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tutorial Sessions for SciPy Con 2019 2 | 3 | I couldn't attend every tutorial, but these seemed like fun: 4 | 5 | 6 | ## Monday, 8AM (Room 203) Modern Time Series Analysis 7 | 8 | * [Here](https://github.com/john-science/scipy_con_2019/tree/main/modern_time_series_analysis/) is my version of this tutorial, since it was not available online. 9 | 10 | ## Monday, 8AM (Room 106) PyTest 11 | 12 | * [Here](https://leemangeophysicalllc.github.io/testing-with-python/) is a link to their official GitHub. 13 | 14 | ## Monday 1PM (Room 203) Bayesian Data Science: Probabilistic Programming 15 | 16 | * [Here](https://github.com/john-science/bayesian-stats-modelling-tutorial) is my fork of their GitHub. 
## Tuesday 8AM (Room 106) RAPIDS: Open GPU Data Science

* [Here](https://github.com/john-science/scipy_con_2019/tree/main/gpu) is my version of their tutorial, since it was not available online.

## Tuesday 1PM (Room 104) Escape from Auto-manual Testing with Hypothesis!

* [Here](https://github.com/john-science/escape-from-automanual-testing) is my fork of their GitHub.

--------------------------------------------------------------------------------
/gpu/01-Intro_to_cuDF.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/01-Intro_to_cuGraph.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/01-Introduction-LinearRegression-Hyperparam.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/02-Intro_to_cuDF_UDFs.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/02-LogisticRegression.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/02-Louvain.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/03-Pagerank.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/03-UMAP.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/Introduction.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/gpu/README.md:
--------------------------------------------------------------------------------
# GPU Data Science

> Okay, so they are giving us cloud IP addresses to log into, so I may not have much to share here, except my notes.

## Notes

* Fun! They provide a cloud instance for everyone in the room, so we can play with a machine with sufficient GPU power.
* The old solution was Hadoop, but that had too much IO. So people moved to Spark.
* RAPIDS is NVIDIA's open-source suite of GPU-accelerated data science libraries.
* RAPIDS uses Apache Arrow (which tries to standardize the storage of columnar data in memory).
* cuDF has nearly identical syntax to Pandas DataFrames.
* The API for cuDF is super handy, I'll give them that.
* I guess that means you can also use cuDF as a drop-in replacement in your code base... I see.
* String Example: big string replace operations on a pd.DataFrame
  * 50x speed-up using cuDF and fixed-length NumPy strings
  * TODO: Whoops. At work, are we not converting our strings in pd.DataFrames to NumPy string arrays?
* [RAPIDS docs](https://docs.rapids.ai/)
* [Play around with RAPIDS](https://rapids.ai/start.html)
* [cuDF GitHub](https://github.com/rapidsai/cudf)
* [cupy](https://cupy.chainer.org/) is a replacement for NumPy, but on the GPU.
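A minimal sketch of the pandas-like cuDF API mentioned above (my own illustration, not code from the tutorial; it assumes a CUDA GPU with cuDF installed):

```python
import cudf

gdf = cudf.DataFrame({"a": [1, 1, 2], "b": [0.1, 0.2, 0.3]})
gdf["c"] = gdf["a"] * gdf["b"]   # column arithmetic runs on the GPU
print(gdf.groupby("a").mean())   # same groupby API shape as pandas
print(gdf.to_pandas())           # copy back to host memory when needed
```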
--------------------------------------------------------------------------------
/gpu/data/karate-data.csv:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/gpu/index.ipynb:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/models.json:
--------------------------------------------------------------------------------
[
  {
    "scenarios": {
      "drop": [0.2],
      "batch-n": [128, 256],
      "lr": [0.0001],
      "n-epochs": [10],
      "model": ["simple_lstnet_mode", "lstnet_mode", "rnn_model", "cnn_model", "fc_model"],
      "data_src": [
        "electricity.txt",
        "electricity.diff.txt"
      ],
      "win": [128, 64],
      "seasonal_period": [24]
    },

    "drop": "$",
    "win": "$",
    "batch-n": "$",
    "lr": "$",
    "n-epochs": "$",
    "data-file": "/data/${data_src}",
    "save-dir": "${model}_${lr}_${batch-n}_${win}_${drop}/${data_src}"
  }
]

--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/perf.py:
--------------------------------------------------------------------------------

import numpy as np
import mxnet as mx

np.seterr(divide='ignore', invalid='ignore')

## for saving
import pandas as pd
import os


def COR(label, pred):
    """Mean Pearson correlation between matching columns of label and pred.

    NaN correlations (e.g. from constant columns) are skipped via np.nanmean.
    """
    label_demeaned = label - label.mean(0)
    label_sumsquares = np.sum(np.square(label_demeaned), 0)

    pred_demeaned = pred - pred.mean(0)
    pred_sumsquares = np.sum(np.square(pred_demeaned), 0)

    cor_coef = np.diagonal(np.dot(label_demeaned.T, pred_demeaned)) / \
        np.sqrt(label_sumsquares * pred_sumsquares)

    return np.nanmean(cor_coef)


def write_eval(pred, label, save_dir, mode, epoch):
    """Write this epoch's predictions and labels to CSV and return the COR."""
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    pred_df = pd.DataFrame(pred)
    label_df = pd.DataFrame(label)
    # zero-pad the epoch number so the per-epoch files sort correctly
    pred_df.to_csv(os.path.join(save_dir, '%s_pred_%02d.csv' % (mode, epoch)))
    label_df.to_csv(os.path.join(save_dir, '%s_label_%02d.csv' % (mode, epoch)))

    return {'COR': COR(label, pred)}

--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/run.json:
--------------------------------------------------------------------------------
{
  "script": "/demo/fits.py",
  "models": "/demo/models.json"
}

--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/3_Trees_for_Classification_and_Prediction_STUDENT.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['figure.figsize'] = [10, 10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cesium\n",
    "import xgboost as xgb\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import 
time\n", 25 | "\n", 26 | "from cesium import datasets\n", 27 | "from cesium import featurize as ft\n", 28 | "\n", 29 | "import scipy\n", 30 | "from scipy.stats import pearsonr, spearmanr\n", 31 | "from scipy.stats import skew\n", 32 | "\n", 33 | "import sklearn\n", 34 | "from sklearn.ensemble import RandomForestClassifier\n", 35 | "from sklearn.metrics import accuracy_score\n", 36 | "from sklearn.model_selection import train_test_split" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "print(cesium.__version__)\n", 46 | "print(xgb.__version__)\n", 47 | "print(scipy.__version__)\n", 48 | "print(sklearn.__version__)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Load data and generate some features of interest" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "eeg = datasets.fetch_andrzejak()" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "type(eeg)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "eeg.keys()" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### Visually inspect" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [ 98 | "plt.subplot(3, 1, 1)\n", 99 | "plt.plot(eeg[\"measurements\"][0])\n", 100 | "plt.legend(eeg['classes'][0])\n", 101 | "plt.subplot(3, 1, 2)\n", 102 | "plt.plot(eeg[\"measurements\"][300])\n", 103 | "plt.legend(eeg['classes'][300])\n", 104 | "plt.subplot(3, 1, 3)\n", 105 | "plt.plot(eeg[\"measurements\"][450])\n", 106 | "plt.legend(eeg['classes'][450])" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "type(eeg[\"measurements\"][0])" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "type(eeg)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "eeg.keys()" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "type(eeg['measurements'])" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "len(eeg['measurements'])" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [ 160 | "eeg['measurements'][0].shape" 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "## Generate the features" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": null, 173 | "metadata": {}, 174 | "outputs": [], 175 | "source": [ 176 | "# from cesium import featurize as ft\n", 177 | "# features_to_use = [\"amplitude\",\n", 178 | "# \"percent_beyond_1_std\",\n", 179 | "# \"percent_close_to_median\",\n", 180 | "# \"skew\",\n", 181 | "# \"max_slope\"]\n", 182 | "# fset_cesium = 
ft.featurize_time_series(times=eeg[\"times\"],\n", 183 | "# values=eeg[\"measurements\"],\n", 184 | "# errors=None,\n", 185 | "# features_to_use=features_to_use,\n", 186 | "# scheduler = None)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": null, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "fset_cesium = pd.read_csv(\"data/full_eeg_data_features.csv\", header = [0, 1])" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": { 202 | "scrolled": true 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "fset_cesium.head()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "# fset_cesium.to_csv(\"full_eeg_data_features.csv\")" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [ 224 | "fset_cesium.shape" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "## Exercise: validate/calculate these features by hand\n", 232 | "#### look up feature definitions here: http://cesium-ml.org/docs/feature_table.html\n", 233 | "confirm the values by hand coding these features for the first EEG measurement\n", 234 | "(that is eeg[\"measurements\"][0])" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [ 243 | "ex = eeg[\"measurements\"][0]\n", 244 | "ex_mean = np.mean(ex)\n", 245 | "ex_std = np.std(ex)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## Prepare data for training" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "X_train, X_test, y_train, y_test = train_test_split(\n", 269 | " fset_cesium.iloc[:, 1:6].values, eeg[\"classes\"], random_state=21)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "## Try a random forest with these features" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "clf = RandomForestClassifier(n_estimators=10, max_depth=3,\n", 286 | " random_state=21)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "clf.fit(X_train, y_train)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": {}, 302 | "outputs": [], 303 | "source": [ 304 | "clf.score(X_train, y_train)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "clf.score(X_test, y_test)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "np.unique(y_test, return_counts=True)" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "y_test" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | 
"metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "y_test.shape" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "y_train.shape" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "## Try XGBoost with these features" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "model = xgb.XGBClassifier(n_estimators=10, max_depth=3,\n", 366 | " random_state=21)\n", 367 | "model.fit(X_train, y_train)" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": null, 373 | "metadata": {}, 374 | "outputs": [], 375 | "source": [ 376 | "model.score(X_test, y_test)" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "model.score(X_train, y_train)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": null, 391 | "metadata": {}, 392 | "outputs": [], 393 | "source": [ 394 | "xgb.plot_importance(model)" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "## Time Series Forecasting with Decision Trees" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### Explore the data" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "ap = pd.read_csv(\"data/AirPassengers.csv\", parse_dates=[0])" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": null, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "ap.head()" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "ap.set_index('Month', inplace=True)" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "ap.head()" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "plt.plot(ap)" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [ 462 | "plt.plot(np.diff(np.log(ap.values[:, 0])))" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "ts = np.diff(np.log(ap.values[:, 0]))" 472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "## Exercise: now that we have 1 time series, how can we convert it to many samples?" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 5, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "NSTEPS = 12" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "## Exercise: now that we have the time series broken down into a set of samples, how to featurize?" 
502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": null, 507 | "metadata": {}, 508 | "outputs": [], 509 | "source": [ 510 | "measures = [vals[i][0:(NSTEPS - 1)] for i in range(vals.shape[0])]" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "metadata": {}, 517 | "outputs": [], 518 | "source": [ 519 | "times = [[j for j in range(NSTEPS - 1)] for i in range(vals.shape[0])]" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": {}, 526 | "outputs": [], 527 | "source": [] 528 | }, 529 | { 530 | "cell_type": "markdown", 531 | "metadata": {}, 532 | "source": [ 533 | "## Exercise: can you fit an XGBRegressor to this problem? Let's use the first 100 'time series' as the training data" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### RMSE can be hard to digest .... Use other assessments to determine how well the model performs" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [] 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "### What went wrong? Let's revisit the feature set" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "fset_ap.head()" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": {}, 577 | "outputs": [], 578 | "source": [ 579 | "plt.plot(vals[0])\n", 580 | "plt.plot(vals[1])\n", 581 | "plt.plot(vals[2])" 582 | ] 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "## We need to find a way to generate features that encode positional information" 589 | ] 590 | }, 591 | { 592 | "cell_type": "markdown", 593 | "metadata": {}, 594 | "source": [ 595 | "### now we will generate our own features" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "vals.shape" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": null, 610 | "metadata": {}, 611 | "outputs": [], 612 | "source": [ 613 | "feats = np.zeros( (vals.shape[0], 6), dtype = np.float32)\n", 614 | "for i in range(vals.shape[0]):\n", 615 | " feats[i, 0] = np.where(vals[i] == np.max(vals[i]))[0][0]\n", 616 | " feats[i, 1] = np.where(vals[i] == np.min(vals[i]))[0][0]\n", 617 | " feats[i, 2] = feats[i, 0] - feats[i, 1]\n", 618 | " feats[i, 3] = np.max(vals[i][-3:])\n", 619 | " feats[i, 4] = vals[i][-1] - vals[i][-2]\n", 620 | " feats[i, 5] = vals[i][-1] - vals[i][-3]" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "feats[0:3]" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "### How do these look compared to the first set of features?" 
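,
    "\n",
    "A quick check (an illustrative sketch; assumes `vals` from the windowing exercise) of why the positional features matter: the distribution-style cesium features are largely position-invariant, while the hand-built ones react to *where* values occur in the window:\n",
    "\n",
    "```python\n",
    "shuffled = np.random.permutation(vals[0])\n",
    "print(np.max(vals[0]) - np.max(shuffled))       # amplitude-style: always 0.0\n",
    "print(np.argmax(vals[0]), np.argmax(shuffled))  # positional: almost always differs\n",
    "```"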
637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": null, 642 | "metadata": {}, 643 | "outputs": [], 644 | "source": [ 645 | "pd.DataFrame(feats[0:3])" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [ 654 | "X_train, y_train = feats[:100, :], outcomes[:100]\n", 655 | "X_test, y_test = feats[100:, :], outcomes[100:]" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": null, 661 | "metadata": {}, 662 | "outputs": [], 663 | "source": [ 664 | "model = xgb.XGBRegressor(n_estimators=20, max_depth=2,\n", 665 | " random_state=21)\n", 666 | "eval_set = [(X_test, y_test)]\n", 667 | "model.fit(X_train, y_train, eval_metric=\"rmse\", eval_set=eval_set, verbose=True)" 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": null, 673 | "metadata": {}, 674 | "outputs": [], 675 | "source": [ 676 | "plt.scatter(model.predict(X_test), y_test)" 677 | ] 678 | }, 679 | { 680 | "cell_type": "code", 681 | "execution_count": null, 682 | "metadata": {}, 683 | "outputs": [], 684 | "source": [ 685 | "print(pearsonr(model.predict(X_test), y_test))\n", 686 | "print(spearmanr(model.predict(X_test), y_test))" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": null, 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "plt.scatter(model.predict(X_train), y_train)" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": null, 701 | "metadata": {}, 702 | "outputs": [], 703 | "source": [ 704 | "print(pearsonr(model.predict(X_train), y_train))\n", 705 | "print(spearmanr(model.predict(X_train), y_train))" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": null, 711 | "metadata": {}, 712 | "outputs": [], 713 | "source": [] 714 | } 715 | ], 716 | "metadata": { 717 | "kernelspec": { 718 | "display_name": "Python 3", 719 | "language": "python", 720 | "name": "python3" 721 | }, 722 | "language_info": { 723 | "codemirror_mode": { 724 | "name": "ipython", 725 | "version": 3 726 | }, 727 | "file_extension": ".py", 728 | "mimetype": "text/x-python", 729 | "name": "python", 730 | "nbconvert_exporter": "python", 731 | "pygments_lexer": "ipython3", 732 | "version": "3.6.8" 733 | } 734 | }, 735 | "nbformat": 4, 736 | "nbformat_minor": 2 737 | } 738 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/50words.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/50words.csv -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/AirPassengers.csv: -------------------------------------------------------------------------------- 1 | Month,#Passengers 2 | 1949-01,112 3 | 1949-02,118 4 | 1949-03,132 5 | 1949-04,129 6 | 1949-05,121 7 | 1949-06,135 8 | 1949-07,148 9 | 1949-08,148 10 | 1949-09,136 11 | 1949-10,119 12 | 1949-11,104 13 | 1949-12,118 14 | 1950-01,115 15 | 1950-02,126 16 | 1950-03,141 17 | 1950-04,135 18 | 1950-05,125 19 | 1950-06,149 20 | 1950-07,170 21 | 1950-08,170 22 | 1950-09,158 23 | 1950-10,133 24 | 1950-11,114 25 | 1950-12,140 26 | 1951-01,145 27 | 1951-02,150 28 | 
1951-03,178 29 | 1951-04,163 30 | 1951-05,172 31 | 1951-06,178 32 | 1951-07,199 33 | 1951-08,199 34 | 1951-09,184 35 | 1951-10,162 36 | 1951-11,146 37 | 1951-12,166 38 | 1952-01,171 39 | 1952-02,180 40 | 1952-03,193 41 | 1952-04,181 42 | 1952-05,183 43 | 1952-06,218 44 | 1952-07,230 45 | 1952-08,242 46 | 1952-09,209 47 | 1952-10,191 48 | 1952-11,172 49 | 1952-12,194 50 | 1953-01,196 51 | 1953-02,196 52 | 1953-03,236 53 | 1953-04,235 54 | 1953-05,229 55 | 1953-06,243 56 | 1953-07,264 57 | 1953-08,272 58 | 1953-09,237 59 | 1953-10,211 60 | 1953-11,180 61 | 1953-12,201 62 | 1954-01,204 63 | 1954-02,188 64 | 1954-03,235 65 | 1954-04,227 66 | 1954-05,234 67 | 1954-06,264 68 | 1954-07,302 69 | 1954-08,293 70 | 1954-09,259 71 | 1954-10,229 72 | 1954-11,203 73 | 1954-12,229 74 | 1955-01,242 75 | 1955-02,233 76 | 1955-03,267 77 | 1955-04,269 78 | 1955-05,270 79 | 1955-06,315 80 | 1955-07,364 81 | 1955-08,347 82 | 1955-09,312 83 | 1955-10,274 84 | 1955-11,237 85 | 1955-12,278 86 | 1956-01,284 87 | 1956-02,277 88 | 1956-03,317 89 | 1956-04,313 90 | 1956-05,318 91 | 1956-06,374 92 | 1956-07,413 93 | 1956-08,405 94 | 1956-09,355 95 | 1956-10,306 96 | 1956-11,271 97 | 1956-12,306 98 | 1957-01,315 99 | 1957-02,301 100 | 1957-03,356 101 | 1957-04,348 102 | 1957-05,355 103 | 1957-06,422 104 | 1957-07,465 105 | 1957-08,467 106 | 1957-09,404 107 | 1957-10,347 108 | 1957-11,305 109 | 1957-12,336 110 | 1958-01,340 111 | 1958-02,318 112 | 1958-03,362 113 | 1958-04,348 114 | 1958-05,363 115 | 1958-06,435 116 | 1958-07,491 117 | 1958-08,505 118 | 1958-09,404 119 | 1958-10,359 120 | 1958-11,310 121 | 1958-12,337 122 | 1959-01,360 123 | 1959-02,342 124 | 1959-03,406 125 | 1959-04,396 126 | 1959-05,420 127 | 1959-06,472 128 | 1959-07,548 129 | 1959-08,559 130 | 1959-09,463 131 | 1959-10,407 132 | 1959-11,362 133 | 1959-12,405 134 | 1960-01,417 135 | 1960-02,391 136 | 1960-03,419 137 | 1960-04,461 138 | 1960-05,472 139 | 1960-06,535 140 | 1960-07,622 141 | 1960-08,606 142 | 1960-09,508 143 | 1960-10,461 144 | 1960-11,390 145 | 1960-12,432 -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/pairwise_word_distances.npy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/pairwise_word_distances.npy -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance 4 | ~~~~~~~~~~~~ 5 | 6 | Time series distance methods. 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import logging 14 | 15 | 16 | logger = logging.getLogger("be.kuleuven.dtai.distance") 17 | 18 | 19 | from . import dtw 20 | try: 21 | from . import dtw_c 22 | except ImportError: 23 | # Try to compile automatically 24 | # try: 25 | # import numpy as np 26 | # import pyximport 27 | # pyximport.install(setup_args={'include_dirs': np.get_include()}) 28 | # from . 
import dtw_c 29 | # except ImportError: 30 | # logger.warning("\nDTW C variant not available.\n\n" + 31 | # "If you want to use the C libraries (not required, depends on cython), " + 32 | # "then run `cd {};python3 setup.py build_ext --inplace`.".format(dtaidistance_dir)) 33 | dtw_c = None 34 | 35 | __version__ = "1.2.2" 36 | __author__ = "Wannes Meert" 37 | __copyright__ = "Copyright 2017-2019 KU Leuven, DTAI Research Group" 38 | __license__ = "Apache License, Version 2.0" 39 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/alignment.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.alignment 4 | ~~~~~~~~~~~~~~~~~~~~~~ 5 | 6 | Sequence alignment (e.g. Needleman–Wunsch). 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import logging 14 | import math 15 | import numpy as np 16 | 17 | from .dp import dp 18 | 19 | 20 | def needleman_wunsch(s1, s2, window=None, max_dist=None, 21 | max_step=None, max_length_diff=None, psi=None): 22 | """Needleman-Wunsch global sequence alignment. 23 | 24 | Example: 25 | 26 | >> s1 = "GATTACA" 27 | >> s2 = "GCATGCU" 28 | >> value, matrix = alignment.needleman_wunsch(s1, s2) 29 | >> algn, s1a, s2a = alignment.best_alignment(matrix, s1, s2) 30 | >> print(matrix) 31 | [[-0., -1., -2., -3., -4., -5., -6., -7.], 32 | [-1., 1., -0., -1., -2., -3., -4., -5.], 33 | [-2., -0., -0., 1., -0., -1., -2., -3.], 34 | [-3., -1., -1., -0., 2., 1., -0., -1.], 35 | [-4., -2., -2., -1., 1., 1., -0., -1.], 36 | [-5., -3., -3., -1., -0., -0., -0., -1.], 37 | [-6., -4., -2., -2., -1., -1., 1., -0.], 38 | [-7., -5., -3., -1., -2., -2., -0., -0.]] 39 | >> print(''.join(s1a), ''.join(s2a)) 40 | 'G-ATTACA', 'GCAT-GCU' 41 | 42 | """ 43 | value, matrix = dp(s1, s2, 44 | _needleman_wunsch_fn, border=_needleman_wunsch_border, 45 | penalty=0, window=window, max_dist=max_dist, 46 | max_step=max_step, max_length_diff=max_length_diff, psi=psi) 47 | matrix = -matrix 48 | return value, matrix 49 | 50 | 51 | def _needleman_wunsch_fn(v1, v2): 52 | """Needleman-Wunsch 53 | 54 | Match: +1 -> -1 55 | Mismatch or Indel: −1 -> +1 56 | 57 | The values are reversed because our general dynamic programming algorithm 58 | selects the minimal value instead of the maximal value. 59 | """ 60 | d_indel = 1 # gap / indel 61 | if v1 == v2: 62 | d = -1 # match 63 | else: 64 | d = 1 # mismatch 65 | return d, d_indel 66 | 67 | 68 | def _needleman_wunsch_border(ri, ci): 69 | if ri == 0: 70 | return ci 71 | if ci == 0: 72 | return ri 73 | return 0 74 | 75 | 76 | def best_alignment(paths, s1=None, s2=None, gap="-", order=None): 77 | """Compute the optimal alignment from the nxm paths matrix. 78 | 79 | :param paths: Paths matrix (e.g. from needleman_wunsch) 80 | :param s1: First sequence, if given the aligned sequence will be created 81 | :param s2: Second sequence, if given the aligned sequence will be created 82 | :param gap: Gap symbol that is inserted into s1 and s2 to align the sequences 83 | :param order: Array with order of comparisons (there might be multiple optimal paths) 84 | The default order is 0,1,2: (-1,-1), (-1,-0), (-0,-1) 85 | For example, 1,0,2 is (-1,-0), (-1,-1), (-0,-1) 86 | There might be more optimal paths than covered by these orderings. 
For example, 87 | when using combinations of these orderings in different parts of the matrix. 88 | """ 89 | i, j = int(paths.shape[0] - 1), int(paths.shape[1] - 1) 90 | p = [] 91 | if paths[i, j] != -1: 92 | p.append((i - 1, j - 1)) 93 | ops = [(-1,-1), (-1,-0), (-0,-1)] 94 | if order is None: 95 | order = [0, 1, 2] 96 | while i > 0 and j > 0: 97 | prev_vals = [paths[i + ops[orderi][0], j + ops[orderi][1]] for orderi in order] 98 | # c = np.argmax([paths[i - 1, j - 1], paths[i - 1, j], paths[i, j - 1]]) 99 | c = int(np.argmax(prev_vals)) 100 | opi, opj = ops[order[c]] 101 | i, j = i + opi, j + opj 102 | if paths[i, j] != -1: 103 | p.append((i - 1, j - 1)) 104 | p.pop() 105 | p.reverse() 106 | if s1 is not None: 107 | s1a = [] 108 | s1ip = -1 109 | for s1i, _ in p: 110 | if s1i == s1ip + 1: 111 | s1a.append(s1[s1i]) 112 | else: 113 | s1a.append(gap) 114 | s1ip = s1i 115 | else: 116 | s1a = None 117 | if s2 is not None: 118 | s2a = [] 119 | s2ip = -1 120 | for _, s2i in p: 121 | if s2i == s2ip + 1: 122 | s2a.append(s2[s2i]) 123 | else: 124 | s2a.append(gap) 125 | s2ip = s2i 126 | else: 127 | s2a = None 128 | 129 | return p, s1a, s2a 130 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/clustering.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.clustering 4 | ~~~~~~~~~~~~~~~~~~~~~~~ 5 | 6 | Time series clustering. 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import logging 14 | from pathlib import Path 15 | from collections import deque 16 | import numpy as np 17 | 18 | from .util import SeriesContainer 19 | 20 | try: 21 | from tqdm import tqdm 22 | except ImportError: 23 | tqdm = None 24 | 25 | 26 | logger = logging.getLogger("be.kuleuven.dtai.distance") 27 | 28 | 29 | class Hierarchical: 30 | """Hierarchical clustering. 31 | 32 | Note: This method first computes the entire distance matrix. This is not ideal for extremely large 33 | data sets. 34 | 35 | :param dists_fun: Function to compute pairwise distance matrix between set of series. 36 | :param dists_options: Arguments to pass to dists_fun. 37 | :param max_dist: Do not merge or cluster series that are further apart than this. 38 | :param merge_hook: Function that is called when two series are clustered. 39 | The function definition is `def merge_hook(from_idx, to_idx, distance)`, where idx is the index of the series. 40 | :param order_hook: Function that is called to decide on the next idx out of all shortest distances 41 | :param show_progress: Use a tqdm progress bar 42 | """ 43 | 44 | def __init__(self, dists_fun, dists_options, max_dist=np.inf, 45 | merge_hook=None, order_hook=None, show_progress=True): 46 | self.dists_fun = dists_fun 47 | self.dists_options = dists_options 48 | self.max_dist = max_dist 49 | self.merge_hook = merge_hook 50 | self.order_hook = order_hook 51 | self.show_progress = show_progress 52 | 53 | def fit(self, series): 54 | """Merge sequences. 55 | 56 | :param series: Iterator over series. 57 | :return: Dictionary with as keys the prototype indicices and as values all the indicides of the series in 58 | that cluster. 
59 | """ 60 | nb_series = len(series) 61 | cluster_idx = dict() 62 | dists = self.dists_fun(series, **self.dists_options) 63 | min_value = np.min(dists) 64 | min_idxs = np.argwhere(dists == min_value) 65 | if self.order_hook: 66 | min_idx = self.order_hook(min_idxs) 67 | else: 68 | min_idx = min_idxs[0, :] 69 | deleted = set() 70 | cnt_merge = 0 71 | logger.debug('Merging patterns') 72 | if self.show_progress and tqdm: 73 | pbar = tqdm(total=dists.shape[0]) 74 | else: 75 | pbar = None 76 | # Hierarchical clustering (distance to prototype) 77 | while min_value <= self.max_dist: 78 | cnt_merge += 1 79 | i1, i2 = int(min_idx[0]), int(min_idx[1]) 80 | if self.merge_hook: 81 | result = self.merge_hook(i2, i1, min_value) 82 | if result: 83 | i1, i2 = result 84 | logger.debug("Merge {} <- {} ({:.3f})".format(i1, i2, min_value)) 85 | if i1 not in cluster_idx: 86 | cluster_idx[i1] = {i1} 87 | if i2 in cluster_idx: 88 | cluster_idx[i1].update(cluster_idx[i2]) 89 | del cluster_idx[i2] 90 | else: 91 | cluster_idx[i1].add(i2) 92 | # if recompute: 93 | # for r in range(i1): 94 | # if r not in deleted and abs(len(cur_seqs[r]) - len(cur_seqs[i1])) <= max_length_diff: 95 | # dists[r, i1] = self.dist(cur_seqs[r], cur_seqs[i1], **dist_opts) 96 | # for c in range(i1+1, len(cur_seqs)): 97 | # if c not in deleted and abs(len(cur_seqs[i1]) - len(cur_seqs[c])) <= max_length_diff: 98 | # dists[i1, c] = self.dist(cur_seqs[i1], cur_seqs[c], **dist_opts) 99 | for r in range(i2): 100 | dists[r, i2] = np.inf 101 | for c in range(i2 + 1, len(series)): 102 | dists[i2, c] = np.inf 103 | deleted.add(i2) 104 | if len(deleted) == nb_series - 1: 105 | break 106 | if pbar: 107 | pbar.update(1) 108 | # min_idx = np.unravel_index(np.argmin(dists), dists.shape) 109 | # min_value = dists[min_idx] 110 | min_value = np.min(dists) 111 | # if np.isinf(min_value): 112 | # break 113 | min_idxs = np.argwhere(dists == min_value) 114 | if self.order_hook: 115 | min_idx = self.order_hook(min_idxs) 116 | else: 117 | min_idx = min_idxs[0, :] 118 | if pbar: 119 | pbar.update(dists.shape[0] - cnt_merge) 120 | 121 | prototypes = [] 122 | for i in range(len(series)): 123 | if i not in deleted: 124 | prototypes.append(i) 125 | if i not in cluster_idx: 126 | cluster_idx[i] = set(i) 127 | return cluster_idx 128 | 129 | 130 | class BaseTree: 131 | """Base Tree abstract class. 132 | 133 | Returns a datastructure compatible with the Scipy clustering methods: 134 | 135 | https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html 136 | 137 | A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are 138 | combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original 139 | observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value 140 | Z[i, 3] represents the number of original observations in the newly formed cluster. 
141 | """ 142 | 143 | def __init__(self, **kwargs): 144 | self.linkage = None 145 | self.series = None 146 | self._series_y = None 147 | self.ts_height_factor = None 148 | 149 | @property 150 | def maxnode(self): 151 | return len(self.series) - 1 + len(self.linkage) 152 | 153 | def get_linkage(self, node): 154 | if node < len(self.series): 155 | return None 156 | idx = int(node - len(self.series)) 157 | return self.linkage[idx] 158 | 159 | def plot(self, filename=None, axes=None, ts_height=10, 160 | bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1, 161 | tr_label_margin=3, tr_left_margin=2, ts_label_margin=0, 162 | show_ts_label=None, show_tr_label=None, 163 | cmap='viridis_r', ts_color=None): 164 | """Plot the hierarchy and time series. 165 | 166 | :param filename: If a filename is passed, the image is written to this file. 167 | :param axes: If a axes array is passed the image is added to this figure. 168 | Expects axes[0] and axes[1] to be present. 169 | :param ts_height: Height of a time series 170 | :param bottom_margin: Margin on bottom 171 | :param top_margin: Margin on top 172 | :param ts_left_margin: Margin on left of time series image 173 | :param ts_sample_length: Space between two points in the time series 174 | :param tr_label_margin: Margin between tree split and label 175 | :param tr_left_margin: Left margin for tree 176 | :param ts_label_margin: Margin between start of series and label 177 | :param show_ts_label: Show label indices. Boolean, callable or subscriptable object. 178 | If it is a callable object, the index of the time series will be given and the 179 | return string will be printed. 180 | :param show_tr_label: Show tree distances. Boolean, callable or subscriptable object. 181 | If it is a callable object, the index of the time series will be given and the 182 | return string will be printed. 
183 | :param cmap: Matplotlib colormap name 184 | :param ts_color: function that takes the index and returns a color 185 | (compatible with the matplotlib.color color argument) 186 | """ 187 | # print('linkage') 188 | # for l in self.linkage: 189 | # print(l) 190 | from matplotlib import pyplot as plt 191 | from matplotlib.lines import Line2D 192 | import matplotlib.colors as colors 193 | import matplotlib.cm as cmx 194 | 195 | if show_ts_label is True: 196 | show_ts_label = lambda idx: str(int(idx)) 197 | elif show_ts_label is False or show_ts_label is None: 198 | show_ts_label = lambda idx: "" 199 | elif callable(show_ts_label): 200 | pass 201 | elif hasattr(show_ts_label, "__getitem__"): 202 | show_ts_label_prev = show_ts_label 203 | show_ts_label = lambda idx: show_ts_label_prev[idx] 204 | else: 205 | raise AttributeError("Unknown type for show_ts_label, expecting boolean, subscriptable or callable, " 206 | "got {}".format(type(show_ts_label))) 207 | if show_tr_label is True: 208 | show_tr_label = lambda dist: "{:.2f}".format(dist) 209 | elif show_tr_label is False or show_tr_label is None: 210 | show_tr_label = lambda dist: "" 211 | elif callable(show_tr_label): 212 | pass 213 | elif hasattr(show_tr_label, "__getitem__"): 214 | show_tr_label_prev = show_tr_label 215 | show_tr_label = lambda idx: show_tr_label_prev[idx] 216 | else: 217 | raise AttributeError("Unknown type for show_ts_label, expecting boolean, subscriptable or callable, " 218 | "got {}".format(type(show_ts_label))) 219 | 220 | self._series_y = [0] * len(self.series) 221 | 222 | max_dist = 0 223 | for _, _, d, _ in self.linkage: 224 | if not np.isinf(d): 225 | max_dist = max(max_dist, d) 226 | 227 | node_props = dict() 228 | 229 | max_y = self.series.get_max_y() 230 | self.ts_height_factor = (ts_height / max_y) * 0.9 231 | 232 | def count(node, height): 233 | # print('count({},{})'.format(node, height)) 234 | maxheight = None 235 | maxcumdist = None 236 | curdepth = None 237 | cnt = 0 238 | left_cnt = None 239 | right_cnt = None 240 | if node < len(self.series): 241 | # Leaf 242 | cnt += 1 243 | maxheight = height 244 | maxcumdist = 0 245 | curdepth = 0 246 | left_cnt = 0 247 | right_cnt = 0 248 | else: 249 | # Inner node 250 | child_left, child_right, dist, cnt2 = self.get_linkage(int(node)) 251 | child_left, child_right, cnt2 = int(child_left), int(child_right), int(cnt2) 252 | if child_left == child_right: 253 | raise Exception("Row in linkage contains same node as left and right child: {}-{}". 
254 | format(child_left, child_right)) 255 | if np.isinf(dist): 256 | dist = 1.5*max_dist 257 | # Left 258 | nc, nmh, ncd, nmd = count(child_left, height + 1) 259 | cnt += nc 260 | maxheight = nmh 261 | maxcumdist = nmd + dist 262 | curdepth = ncd + 1 263 | left_cnt = nc 264 | # Right 265 | nc, nmh, ncd, nmd = count(child_right, height + 1) 266 | cnt += nc 267 | maxheight = max(maxheight, nmh) 268 | maxcumdist = max(maxcumdist, nmd + dist) 269 | curdepth = max(curdepth, ncd + 1) 270 | right_cnt = nc 271 | # if cnt != cnt2: 272 | # raise Exception("Count in linkage not correct") 273 | # print('c', node, c) 274 | node_props[int(node)] = (cnt, curdepth, left_cnt, right_cnt, maxcumdist) 275 | # print('count({},{}) = {}, {}, {}, {}'.format(node, height, cnt, maxheight, curdepth, maxcumdist)) 276 | return cnt, maxheight, curdepth, maxcumdist 277 | 278 | cnt, maxheight, curdepth, maxcumdist = count(self.maxnode, 0) 279 | # for node, props in node_props.items(): 280 | # print("{:<3}: {}".format(node, props)) 281 | 282 | if axes is None: 283 | fig, ax = plt.subplots(nrows=1, ncols=2, frameon=False) 284 | else: 285 | fig, ax = None, axes 286 | ax[0].set_axis_off() 287 | # ax[0].set_xlim(left=0, right=curdept) 288 | ax[0].set_xlim(left=0, right=tr_left_margin + maxcumdist + 0.05) 289 | ax[0].set_ylim(bottom=0, top=bottom_margin + ts_height * len(self.series) + top_margin) 290 | # ax[0].plot([0,1],[1,2]) 291 | # ax[0].add_line(Line2D((0,1),(2,2), lw=2, color='black', axes=ax[0])) 292 | 293 | ax[1].set_axis_off() 294 | ax[1].set_xlim(left=0, right=ts_left_margin + ts_sample_length * len(self.series[0])) 295 | ax[1].set_ylim(bottom=0, top=bottom_margin + ts_height * len(self.series) + top_margin) 296 | 297 | if type(cmap) == str: 298 | cmap = plt.get_cmap(cmap) 299 | else: 300 | pass 301 | line_colors = cmx.ScalarMappable(norm=colors.Normalize(vmin=0, vmax=max_dist), cmap=cmap) 302 | 303 | cnt_ts = 0 304 | 305 | def plot_i(node, depth, cnt_ts, prev_lcnt, ax, left): 306 | # print('plot_i', node, depth, cnt_ts, prev_lcnt) 307 | pcnt, pdepth, plcnt, prcnt, pcdist = node_props[node] 308 | # px = maxheight - pdepth 309 | px = tr_left_margin + maxcumdist - pcdist 310 | py = prev_lcnt * ts_height 311 | if node < len(self.series): 312 | # Plot series 313 | # print('plot series y={}'.format(ts_bottom_margin + ts_height * cnt_ts + self.ts_height_factor)) 314 | self._series_y[int(node)] = bottom_margin + ts_height * cnt_ts 315 | serie = self.series[int(node)] 316 | ax[1].text(ts_left_margin + ts_label_margin, 317 | bottom_margin + ts_height * cnt_ts + ts_height / 2, 318 | show_ts_label(int(node)), ha='left', va='center') 319 | if ts_color: 320 | curcolor = ts_color(int(node)) 321 | else: 322 | curcolor = None 323 | ax[1].plot(ts_left_margin + ts_sample_length * np.arange(len(serie)), 324 | bottom_margin + ts_height * cnt_ts + self.ts_height_factor * serie, 325 | color=curcolor) 326 | cnt_ts += 1 327 | 328 | else: 329 | child_left, child_right, dist, _ = self.get_linkage(node) 330 | color = line_colors.to_rgba(dist) 331 | ax[0].text(px + tr_label_margin, py, 332 | show_tr_label(dist), ha='left', va='center', color=color) 333 | 334 | # Left 335 | ccnt, cdepth, clcntl, crcntl, clcdist = node_props[child_left] 336 | # print('left', ccnt, cdepth, clcntl, crcntl) 337 | # cx = maxheight - cdepth 338 | cx = tr_left_margin + maxcumdist - clcdist 339 | cy = (prev_lcnt - crcntl) * ts_height 340 | if py == cy: 341 | cy -= 1 / 2 * ts_height 342 | # print('plot line', (px, cx), (py, cy)) 343 | # ax[0].add_line(Line2D((px, cx), 
(py, cy), lw=2, color='black', axes=ax[0])) 344 | ax[0].add_line(Line2D((px, px), (py, cy), lw=1, color=color, axes=ax[0])) 345 | ax[0].add_line(Line2D((px, cx), (cy, cy), lw=1, color=color, axes=ax[0])) 346 | cnt_ts = plot_i(child_left, depth + 1, cnt_ts, prev_lcnt - crcntl, ax, True) 347 | 348 | # Right 349 | ccnt, cdepth, clcntr, crcntr, crcdist = node_props[child_right] 350 | # print('right', ccnt, cdepth, clcntr, crcntr) 351 | # cx = maxheight - cdepth 352 | cx = tr_left_margin + maxcumdist - crcdist 353 | cy = (prev_lcnt + clcntr) * ts_height 354 | if py == cy: 355 | cy += 1 / 2 * ts_height 356 | # print('plot line', (px, cx), (py, cy)) 357 | # ax[0].add_line(Line2D((px, cx), (py, cy), lw=2, color='black', axes=ax[0])) 358 | ax[0].add_line(Line2D((px, px), (py, cy), lw=1, color=color, axes=ax[0])) 359 | ax[0].add_line(Line2D((px, cx), (cy, cy), lw=1, color=color, axes=ax[0])) 360 | cnt_ts = plot_i(child_right, depth + 1, cnt_ts, prev_lcnt + clcntr, ax, False) 361 | return cnt_ts 362 | 363 | plot_i(self.maxnode, 0, 0, node_props[self.maxnode][2], ax, True) 364 | 365 | if filename: 366 | if isinstance(filename, Path): 367 | filename = str(filename) 368 | plt.savefig(filename, bbox_inches='tight', pad_inches=0) 369 | plt.close() 370 | fig, ax = None, None 371 | 372 | return fig, ax 373 | 374 | def to_dot(self): 375 | child_left, child_right, dist, cnt = self.get_linkage(self.maxnode) 376 | node_deque = deque([(self.maxnode, child_left), (self.maxnode, child_right)]) 377 | # print(node_deque) 378 | s = ["digraph tree {"] 379 | while len(node_deque) > 0: 380 | from_node, to_node = node_deque.popleft() 381 | s.append(" {} -> {};".format(from_node, to_node)) 382 | if to_node >= len(self.series): 383 | child_left, child_right, dist, cnt = self.get_linkage(to_node) 384 | node_deque.append((to_node, child_left)) 385 | node_deque.append((to_node, child_right)) 386 | # print(node_deque) 387 | s.append("}") 388 | return "\n".join(s) 389 | 390 | 391 | class HierarchicalTree(BaseTree): 392 | """Wrapper to keep track of the full tree that represents the hierarchical clustering. 393 | 394 | :param model: Clustering object. For example of class :class:`Hierarchical`. 395 | If no model is given, the arguments are identical to those of class :class:`Hierarchical`. 
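
    Example (an illustrative sketch; `dtw.distance_matrix_fast` is assumed to be
    available from this package's dtw module, and `series` is a list of 1-D arrays):

        from dtaidistance import dtw, clustering

        model = clustering.HierarchicalTree(
            dists_fun=dtw.distance_matrix_fast, dists_options={})
        cluster_idx = model.fit(series)
        model.plot("hierarchy.png")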
396 | """ 397 | 398 | def __init__(self, model=None, **kwargs): 399 | if model is None: 400 | self._model = Hierarchical(**kwargs) 401 | else: 402 | self._model = model 403 | super().__init__(**kwargs) 404 | self._model.max_dist = np.inf 405 | 406 | def fit(self, series, *args, **kwargs): 407 | self.series = SeriesContainer.wrap(series) 408 | self.linkage = [] 409 | new_nodes = {i: i for i in range(len(series))} 410 | if self._model.merge_hook: 411 | old_merge_hook = self._model.merge_hook 412 | else: 413 | old_merge_hook = None 414 | 415 | def merge_hook(from_idx, to_idx, distance): 416 | # print('merge_hook', from_idx, to_idx) 417 | new_idx = len(self.series) + len(self.linkage) 418 | # print('adding to linkage: ', new_nodes[from_idx], new_nodes[to_idx], distance, 0) 419 | if new_nodes[from_idx] is None: 420 | raise Exception('Trying to merge series that is already merged') 421 | self.linkage.append((new_nodes[from_idx], new_nodes[to_idx], distance, 0)) 422 | new_nodes[to_idx] = new_idx 423 | new_nodes[from_idx] = None 424 | if old_merge_hook: 425 | old_merge_hook(from_idx, to_idx, distance) 426 | 427 | self._model.merge_hook = merge_hook 428 | 429 | result = self._model.fit(series, *args, **kwargs) 430 | self._model.merge_hook = old_merge_hook 431 | return result 432 | 433 | 434 | class LinkageTree(BaseTree): 435 | """Hierarchical clustering using the Scipy linkage function. 436 | 437 | This is the same but faster algorithm as available in Hierarchical (~10 times faster). But with less 438 | options to steer the clustering (e.g. no possibility to give weights). It still computes the entire 439 | distance matrix first and is thus not ideal for extremely large data sets. 440 | """ 441 | 442 | def __init__(self, dists_fun, dists_options, method='complete'): 443 | """ 444 | 445 | :param dists_fun: Distance funcion, e.g. 
dtw.distance 446 | :param dists_options: Options passed to dists_fun 447 | :param method: Linkage method (see scipy.cluster.hierarchy.linkage) 448 | """ 449 | super().__init__() 450 | self.dists_fun = dists_fun 451 | self.dists_options = dists_options 452 | self.method = method 453 | 454 | def fit(self, series): 455 | self.series = SeriesContainer.wrap(series) 456 | try: 457 | from scipy.cluster.hierarchy import linkage 458 | except ImportError: 459 | logger.error("The LinkageTree class requires the scipy package to be installed.") 460 | self.linkage = None 461 | linkage = None 462 | return 463 | dists = self.dists_fun(self.series, **self.dists_options) 464 | dists_cond = np.zeros(self._size_cond(len(series))) 465 | idx = 0 466 | for r in range(len(series) - 1): 467 | dists_cond[idx:idx + len(series) - r - 1] = dists[r, r + 1:] 468 | idx += len(series) - r - 1 469 | 470 | self.linkage = linkage(dists_cond, method=self.method, metric='euclidean') 471 | 472 | def _size_cond(self, size): 473 | n = int(size) 474 | return int((n * (n - 1)) / 2) 475 | 476 | 477 | class Hooks: 478 | @staticmethod 479 | def create_weighthook(weights, series): 480 | def newhook(i1, i2, dist): 481 | w1 = weights[i1] 482 | w2 = weights[i2] 483 | p1 = series[i1] 484 | p2 = series[i2] 485 | if w1 < w2 or (w1 == w2 and len(p1) > len(p2)): 486 | i1, i2 = i2, i1 487 | weights[i1] = w1 + w2 488 | return i1, i2 489 | return newhook 490 | 491 | @staticmethod 492 | def create_orderhook(weights): 493 | def newhook(idxs): 494 | min_idx = -1 495 | max_weight = -1 496 | for r, c in [idxs[ii, :] for ii in range(idxs.shape[0])]: 497 | total = weights[r] + weights[c] 498 | if total > max_weight: 499 | max_weight = total 500 | min_idx = (r, c) 501 | return min_idx 502 | return newhook 503 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dp.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.dp 4 | ~~~~~~~~~~~~~~~ 5 | 6 | Generic Dynamic Programming functions 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import logging 14 | import numpy as np 15 | 16 | 17 | logger = logging.getLogger("be.kuleuven.dtai.distance") 18 | 19 | 20 | def dp(s1, s2, fn, border=None, window=None, max_dist=None, 21 | max_step=None, max_length_diff=None, penalty=None, psi=None): 22 | """ 23 | Generic dynamic programming. 24 | 25 | This function does not optimize storage when a window size is given (e.g. in contrast with 26 | the fast DTW functions). 27 | 28 | :param s1: First sequence 29 | :param s2: Second sequence 30 | :param fn: Function to compare two items from both sequences 31 | :param border: Callable object to fill in the initial borders (border(row_idx, col_idx). 
32 | :param window: see :meth:`distance` 33 | :param max_dist: see :meth:`distance` 34 | :param max_step: see :meth:`distance` 35 | :param max_length_diff: see :meth:`distance` 36 | :param penalty: see :meth:`distance` 37 | :param psi: see :meth:`distance` 38 | :returns: (DTW distance, DTW matrix) 39 | """ 40 | r, c = len(s1), len(s2) 41 | if max_length_diff is not None and abs(r - c) > max_length_diff: 42 | return np.inf 43 | if window is None: 44 | window = max(r, c) 45 | if not max_step: 46 | max_step = np.inf 47 | if not max_dist: 48 | max_dist = np.inf 49 | if not penalty: 50 | penalty = 0 51 | if psi is None: 52 | psi = 0 53 | dtw = np.full((r + 1, c + 1), np.inf) 54 | if border: 55 | for ci in range(c + 1): 56 | dtw[0, ci] = border(0, ci) 57 | for ri in range(1, r + 1): 58 | dtw[ri, 0] = border(ri, 0) 59 | for i in range(psi + 1): 60 | dtw[0, i] = 0 61 | dtw[i, 0] = 0 62 | last_under_max_dist = 0 63 | i0 = 1 64 | i1 = 0 65 | for i in range(r): 66 | if last_under_max_dist == -1: 67 | prev_last_under_max_dist = np.inf 68 | else: 69 | prev_last_under_max_dist = last_under_max_dist 70 | last_under_max_dist = -1 71 | i0 = i 72 | i1 = i + 1 73 | for j in range(max(0, i - max(0, r - c) - window + 1), min(c, i + max(0, c - r) + window)): 74 | d, d_indel = fn(s1[i], s2[j]) 75 | if max_step is not None: 76 | if d > max_step: 77 | d = np.inf 78 | if d_indel > max_step: 79 | d_indel = np.inf 80 | if d > max_step and d_indel > max_step: 81 | continue 82 | # print(f"[{i1},{j+1}] -> [{s1[i]},{s2[j]}] -> {d},{d_indel}") 83 | dtw[i1, j + 1] = min(d + dtw[i0, j], 84 | d_indel + dtw[i0, j + 1] + penalty, 85 | d_indel + dtw[i1, j] + penalty) 86 | if max_dist is not None: 87 | if dtw[i1, j + 1] <= max_dist: 88 | last_under_max_dist = j 89 | else: 90 | dtw[i1, j + 1] = np.inf 91 | if prev_last_under_max_dist < j + 1: 92 | break 93 | if max_dist is not None and last_under_max_dist == -1: 94 | return np.inf, dtw 95 | if psi == 0: 96 | d = dtw[i1, min(c, c + window - 1)] 97 | else: 98 | ir = i1 99 | ic = min(c, c + window - 1) 100 | vr = dtw[ir-psi:ir+1, ic] 101 | vc = dtw[ir, ic-psi:ic+1] 102 | mir = np.argmin(vr) 103 | mic = np.argmin(vc) 104 | if vr[mir] < vc[mic]: 105 | dtw[ir-psi+mir+1:ir+1, ic] = -1 106 | d = vr[mir] 107 | else: 108 | dtw[ir, ic - psi + mic + 1:ic+1] = -1 109 | d = vc[mic] 110 | return d, dtw 111 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_ndim.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.dtw_ndim 4 | ~~~~~~~~~~~~~~~~~~~~~ 5 | 6 | Dynamic Time Warping (DTW) for N-dimensional series. 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import os 14 | import logging 15 | import math 16 | import numpy as np 17 | 18 | logger = logging.getLogger("be.kuleuven.dtai.distance") 19 | dtaidistance_dir = os.path.join(os.path.abspath(os.path.dirname(__file__)), os.pardir) 20 | 21 | try: 22 | from . 
import dtw_c 23 | except ImportError: 24 | # logger.info('C library not available') 25 | dtw_c = None 26 | 27 | try: 28 | from tqdm import tqdm 29 | except ImportError: 30 | logger.info('tqdm library not available') 31 | tqdm = None 32 | 33 | 34 | def distance(s1, s2, window=None, max_dist=None, 35 | max_step=None, max_length_diff=None, penalty=None, psi=None, 36 | use_c=False): 37 | """Dynamic Time Warping using multidimensional sequences. 38 | 39 | cost = EuclideanDistance(s1[i], s2[j]) 40 | 41 | See :py:meth:`dtaidistance.dtw.distance` for parameters. 42 | """ 43 | if use_c: 44 | logger.error("No C version implemented (yet)") 45 | return 46 | r, c = len(s1), len(s2) 47 | if max_length_diff is not None and abs(r - c) > max_length_diff: 48 | return np.inf 49 | if window is None: 50 | window = max(r, c) 51 | if not max_step: 52 | max_step = np.inf 53 | else: 54 | max_step *= max_step 55 | if not max_dist: 56 | max_dist = np.inf 57 | else: 58 | max_dist *= max_dist 59 | if not penalty: 60 | penalty = 0 61 | else: 62 | penalty *= penalty 63 | if psi is None: 64 | psi = 0 65 | length = min(c + 1, abs(r - c) + 2 * (window - 1) + 1 + 1 + 1) 66 | # print("length (py) = {}".format(length)) 67 | dtw = np.full((2, length), np.inf) 68 | # dtw[0, 0] = 0 69 | for i in range(psi + 1): 70 | dtw[0, i] = 0 71 | last_under_max_dist = 0 72 | skip = 0 73 | i0 = 1 74 | i1 = 0 75 | psi_shortest = np.inf 76 | for i in range(r): 77 | # print("i={}".format(i)) 78 | # print(dtw) 79 | if last_under_max_dist == -1: 80 | prev_last_under_max_dist = np.inf 81 | else: 82 | prev_last_under_max_dist = last_under_max_dist 83 | last_under_max_dist = -1 84 | skipp = skip 85 | skip = max(0, i - max(0, r - c) - window + 1) 86 | i0 = 1 - i0 87 | i1 = 1 - i1 88 | dtw[i1, :] = np.inf 89 | j_start = max(0, i - max(0, r - c) - window + 1) 90 | j_end = min(c, i + max(0, c - r) + window) 91 | if dtw.shape[1] == c + 1: 92 | skip = 0 93 | if psi != 0 and j_start == 0 and i < psi: 94 | dtw[i1, 0] = 0 95 | for j in range(j_start, j_end): 96 | d = np.sum((s1[i] - s2[j]) ** 2) 97 | if d > max_step: 98 | continue 99 | assert j + 1 - skip >= 0 100 | assert j - skipp >= 0 101 | assert j + 1 - skipp >= 0 102 | assert j - skip >= 0 103 | dtw[i1, j + 1 - skip] = d + min(dtw[i0, j - skipp], 104 | dtw[i0, j + 1 - skipp] + penalty, 105 | dtw[i1, j - skip] + penalty) 106 | # print('({},{}), ({},{}), ({},{})'.format(i0, j - skipp, i0, j + 1 - skipp, i1, j - skip)) 107 | # print('{}, {}, {}'.format(dtw[i0, j - skipp], dtw[i0, j + 1 - skipp], dtw[i1, j - skip])) 108 | # print('i={}, j={}, d={}, skip={}, skipp={}'.format(i,j,d,skip,skipp)) 109 | # print(dtw) 110 | if dtw[i1, j + 1 - skip] <= max_dist: 111 | last_under_max_dist = j 112 | else: 113 | # print('above max_dist', dtw[i1, j + 1 - skip], i1, j + 1 - skip) 114 | dtw[i1, j + 1 - skip] = np.inf 115 | if prev_last_under_max_dist + 1 - skipp < j + 1 - skip: 116 | # print("break") 117 | break 118 | if last_under_max_dist == -1: 119 | # print('early stop') 120 | # print(dtw) 121 | return np.inf 122 | if psi != 0 and j_end == len(s2) and len(s1) - 1 - i <= psi: 123 | psi_shortest = min(psi_shortest, dtw[i1, length - 1]) 124 | if psi == 0: 125 | d = math.sqrt(dtw[i1, min(c, c + window - 1) - skip]) 126 | else: 127 | ic = min(c, c + window - 1) - skip 128 | vc = dtw[i1, ic - psi:ic + 1] 129 | d = min(np.min(vc), psi_shortest) 130 | d = math.sqrt(d) 131 | return d 132 | 133 | 134 | def _distance_with_params(t): 135 | return distance(t[0], t[1], **t[2]) 136 | 137 | 138 | def warping_paths(s1, s2, 
window=None, max_dist=None, 139 | max_step=None, max_length_diff=None, penalty=None, psi=None,): 140 | """ 141 | Dynamic Time Warping (keep full matrix) using multidimensional sequences. 142 | 143 | cost = EuclideanDistance(s1[i], s2[j]) 144 | 145 | See :py:meth:`dtaidistance.dtw.warping_paths` for parameters. 146 | """ 147 | r, c = len(s1), len(s2) 148 | if max_length_diff is not None and abs(r - c) > max_length_diff: 149 | return np.inf 150 | if window is None: 151 | window = max(r, c) 152 | if not max_step: 153 | max_step = np.inf 154 | else: 155 | max_step *= max_step 156 | if not max_dist: 157 | max_dist = np.inf 158 | else: 159 | max_dist *= max_dist 160 | if not penalty: 161 | penalty = 0 162 | else: 163 | penalty *= penalty 164 | if psi is None: 165 | psi = 0 166 | dtw = np.full((r + 1, c + 1), np.inf) 167 | # dtw[0, 0] = 0 168 | for i in range(psi + 1): 169 | dtw[0, i] = 0 170 | dtw[i, 0] = 0 171 | last_under_max_dist = 0 172 | i0 = 1 173 | i1 = 0 174 | for i in range(r): 175 | if last_under_max_dist == -1: 176 | prev_last_under_max_dist = np.inf 177 | else: 178 | prev_last_under_max_dist = last_under_max_dist 179 | last_under_max_dist = -1 180 | i0 = i 181 | i1 = i + 1 182 | # print('i =', i, 'skip =',skip, 'skipp =', skipp) 183 | # jmin = max(0, i - max(0, r - c) - window + 1) 184 | # jmax = min(c, i + max(0, c - r) + window) 185 | # print(i,jmin,jmax) 186 | # x = dtw[i, jmin-skipp:jmax-skipp] 187 | # y = dtw[i, jmin+1-skipp:jmax+1-skipp] 188 | # print(x,y,dtw[i+1, jmin+1-skip:jmax+1-skip]) 189 | # dtw[i+1, jmin+1-skip:jmax+1-skip] = np.minimum(x, 190 | # y) 191 | for j in range(max(0, i - max(0, r - c) - window + 1), min(c, i + max(0, c - r) + window)): 192 | # print('j =', j, 'max=',min(c, c - r + i + window)) 193 | d = np.sum((s1[i] - s2[j]) ** 2) 194 | if max_step is not None and d > max_step: 195 | continue 196 | # print(i, j + 1 - skip, j - skipp, j + 1 - skipp, j - skip) 197 | dtw[i1, j + 1] = d + min(dtw[i0, j], 198 | dtw[i0, j + 1] + penalty, 199 | dtw[i1, j] + penalty) 200 | # dtw[i + 1, j + 1 - skip] = d + min(dtw[i + 1, j + 1 - skip], dtw[i + 1, j - skip]) 201 | if max_dist is not None: 202 | if dtw[i1, j + 1] <= max_dist: 203 | last_under_max_dist = j 204 | else: 205 | dtw[i1, j + 1] = np.inf 206 | if prev_last_under_max_dist < j + 1: 207 | break 208 | if max_dist is not None and last_under_max_dist == -1: 209 | # print('early stop') 210 | # print(dtw) 211 | return np.inf, dtw 212 | dtw = np.sqrt(dtw) 213 | if psi == 0: 214 | d = dtw[i1, min(c, c + window - 1)] 215 | else: 216 | ir = i1 217 | ic = min(c, c + window - 1) 218 | vr = dtw[ir-psi:ir+1, ic] 219 | vc = dtw[ir, ic-psi:ic+1] 220 | mir = np.argmin(vr) 221 | mic = np.argmin(vc) 222 | if vr[mir] < vc[mic]: 223 | dtw[ir-psi+mir+1:ir+1, ic] = -1 224 | d = vr[mir] 225 | else: 226 | dtw[ir, ic - psi + mic + 1:ic+1] = -1 227 | d = vc[mic] 228 | return d, dtw 229 | 230 | 231 | def distance_matrix(s, max_dist=None, max_length_diff=None, 232 | window=None, max_step=None, penalty=None, psi=None, 233 | block=None, parallel=False, 234 | use_c=False, show_progress=False): 235 | """Dynamic Time Warping distance matrix using multidimensional sequences. 236 | 237 | cost = EuclideanDistance(s1[i], s2[j]) 238 | 239 | See :py:meth:`dtaidistance.dtw.distance_matrix` for parameters. 
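    Example (editorial sketch; the import path assumes the package layout
    vendored in this repo)::

        import numpy as np
        from dtaidistance import dtw_ndim

        # Three short 2-dimensional series of unequal length.
        series = [
            np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]),
            np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 2.0], [3.0, 1.0]]),
            np.array([[2.0, 0.0], [2.0, 1.0]]),
        ]
        m = dtw_ndim.distance_matrix(series)
        # Only the upper triangle is filled with pairwise DTW distances;
        # the remaining entries stay at inf.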
240 | """ 241 | if parallel and not use_c: 242 | try: 243 | import multiprocessing as mp 244 | logger.info('Using multiprocessing') 245 | except ImportError: 246 | parallel = False 247 | mp = None 248 | else: 249 | mp = None 250 | dist_opts = { 251 | 'max_dist': max_dist, 252 | 'max_step': max_step, 253 | 'window': window, 254 | 'max_length_diff': max_length_diff, 255 | 'penalty': penalty, 256 | 'psi': psi 257 | } 258 | dists = None 259 | if max_length_diff is None: 260 | max_length_diff = np.inf 261 | large_value = np.inf 262 | logger.info('Computing distances') 263 | if use_c: 264 | logger.error("No C version available (yet)") 265 | if not use_c: 266 | logger.info("Compute distances in Python") 267 | if isinstance(s, np.ndarray) and len(s.shape) == 2: 268 | ss = [np.asarray(s[i]).reshape(-1) for i in range(s.shape[0])] 269 | s = ss 270 | if parallel: 271 | logger.info("Use parallel computation") 272 | dists = np.zeros((len(s), len(s))) + large_value 273 | if block is None: 274 | idxs = np.triu_indices(len(s), k=1) 275 | else: 276 | idxsl_r = [] 277 | idxsl_c = [] 278 | for r in range(block[0][0], block[0][1]): 279 | for c in range(max(r + 1, block[1][0]), min(len(s), block[1][1])): 280 | idxsl_r.append(r) 281 | idxsl_c.append(c) 282 | idxs = (np.array(idxsl_r), np.array(idxsl_c)) 283 | with mp.Pool() as p: 284 | dists[idxs] = p.map(_distance_with_params, [(s[r], s[c], dist_opts) for c, r in zip(*idxs)]) 285 | # pbar = tqdm(total=int((len(s)*(len(s)-1)/2))) 286 | # for r in range(len(s)): 287 | # dists[r,r+1:len(s)] = p.map(distance, [(s[r],s[c], dist_opts) for c in range(r+1,len(cur))]) 288 | # pbar.update(len(s) - r - 1) 289 | # pbar.close() 290 | else: 291 | logger.info("Use serial computation") 292 | dists = np.zeros((len(s), len(s))) + large_value 293 | if block is None: 294 | it_r = range(len(s)) 295 | else: 296 | it_r = range(block[0][0], block[0][1]) 297 | if show_progress: 298 | it_r = tqdm(it_r) 299 | for r in it_r: 300 | if block is None: 301 | it_c = range(r + 1, len(s)) 302 | else: 303 | it_c = range(max(r + 1, block[1][0]), min(len(s), block[1][1])) 304 | for c in it_c: 305 | if abs(len(s[r]) - len(s[c])) <= max_length_diff: 306 | dists[r, c] = distance(s[r], s[c], **dist_opts) 307 | return dists 308 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_ndim_visualisation.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.dtw_visualisation 4 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5 | 6 | Dynamic Time Warping (DTW) visualisations. 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import os 14 | import logging 15 | import math 16 | import numpy as np 17 | 18 | from .util import dtaidistance_dir 19 | 20 | logger = logging.getLogger("be.kuleuven.dtai.distance") 21 | 22 | from . import dtw 23 | try: 24 | from . import dtw_c 25 | except ImportError: 26 | # logger.info('C library not available') 27 | dtw_c = None 28 | 29 | try: 30 | from tqdm import tqdm 31 | except ImportError: 32 | logger.info('tqdm library not available') 33 | tqdm = None 34 | 35 | 36 | def plot_warping(s1, s2, path, filename=None): 37 | """Plot the optimal warping between to sequences. 38 | 39 | :param s1: From sequence. 40 | :param s2: To sequence. 41 | :param path: Optimal warping path. 
42 | :param filename: Filename path (optional). 43 | """ 44 | import matplotlib.pyplot as plt 45 | import matplotlib as mpl 46 | fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True) 47 | ax[0].pcolormesh(np.transpose(s1)) 48 | ax[1].pcolormesh(np.transpose(s2)) 49 | transFigure = fig.transFigure.inverted() 50 | lines = [] 51 | line_options = {'linewidth': 2, 'color': 'orange', 'alpha': 0.8} 52 | for r_c, c_c in path: 53 | if r_c < 0 or c_c < 0: 54 | continue 55 | coord1 = transFigure.transform(ax[0].transData.transform([r_c+.5, 0])) 56 | coord2 = transFigure.transform(ax[1].transData.transform([c_c+.5, 0])) 57 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]), 58 | transform=fig.transFigure, **line_options)) 59 | fig.lines = lines 60 | if filename: 61 | plt.savefig(filename) 62 | plt.close() 63 | fig, ax = None, None 64 | return fig, ax 65 | 66 | 67 | def plot_warpingpaths(s1, s2, paths, path=None, filename=None, shownumbers=False): 68 | """Plot the warping paths matrix. 69 | 70 | :param s1: Series 1 71 | :param s2: Series 2 72 | :param paths: Warping paths matrix 73 | :param path: Path to draw (typically this is the best path) 74 | :param filename: Filename for the image (optional) 75 | :param shownumbers: Show distances also as numbers 76 | """ 77 | from matplotlib import pyplot as plt 78 | from matplotlib import gridspec 79 | 80 | fig = plt.figure(figsize=(10, 10), frameon=True) 81 | gs = gridspec.GridSpec(2, 2, wspace=1, hspace=1, 82 | left=0, right=1.0, bottom=0, top=1.0, 83 | height_ratios=[1, 6], 84 | width_ratios=[1, 6]) 85 | 86 | if path is None: 87 | p = dtw.best_path(paths) 88 | else: 89 | p = path 90 | 91 | ax0 = fig.add_subplot(gs[0, 0]) 92 | ax0.set_axis_off() 93 | ax0.text(0, 0, "Dist = {:.4f}".format(paths[p[-1][0], p[-1][1]])) 94 | ax0.xaxis.set_major_locator(plt.NullLocator()) 95 | ax0.yaxis.set_major_locator(plt.NullLocator()) 96 | 97 | # Top time series 98 | ax1 = fig.add_subplot(gs[0, 1:]) 99 | ax1.set_ylim([0, s2.shape[1]]) 100 | ax1.set_axis_off() 101 | ax1.xaxis.tick_top() 102 | ax1.pcolormesh(np.transpose(s2)) 103 | ax1.xaxis.set_major_locator(plt.NullLocator()) 104 | ax1.yaxis.set_major_locator(plt.NullLocator()) 105 | 106 | # Left time series 107 | ax2 = fig.add_subplot(gs[1:, 0]) 108 | ax2.set_xlim([0, s1.shape[1]]) 109 | ax2.set_axis_off() 110 | ax2.xaxis.set_major_locator(plt.NullLocator()) 111 | ax2.yaxis.set_major_locator(plt.NullLocator()) 112 | ax2.pcolormesh(np.flipud(s1)) 113 | 114 | ax3 = fig.add_subplot(gs[1:, 1:]) 115 | ax3.matshow(paths[1:, 1:]) 116 | py, px = zip(*p) 117 | ax3.plot(px, py, ".-", color="red") 118 | if shownumbers: 119 | for r in range(1, paths.shape[0]): 120 | for c in range(1, paths.shape[1]): 121 | ax3.text(c - 1, r - 1, "{:.2f}".format(paths[r, c])) 122 | 123 | gs.tight_layout(fig, pad=1.0, h_pad=1.0, w_pad=1.0) 124 | 125 | ax = fig.axes 126 | 127 | if filename: 128 | if type(filename) != str: 129 | filename = str(filename) 130 | plt.savefig(filename) 131 | plt.close() 132 | fig, ax = None, None 133 | return fig, ax -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_visualisation.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.dtw_visualisation 4 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5 | 6 | Dynamic Time Warping (DTW) visualisations. 
7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import os 14 | import logging 15 | import math 16 | import numpy as np 17 | 18 | from .util import dtaidistance_dir 19 | 20 | logger = logging.getLogger("be.kuleuven.dtai.distance") 21 | 22 | from . import dtw 23 | try: 24 | from . import dtw_c 25 | except ImportError: 26 | # logger.info('C library not available') 27 | dtw_c = None 28 | 29 | try: 30 | from tqdm import tqdm 31 | except ImportError: 32 | logger.info('tqdm library not available') 33 | tqdm = None 34 | 35 | 36 | def plot_warp(from_s, to_s, new_s, path, filename=None): 37 | """Plot the warped sequence and its relation to the original sequence 38 | and the target sequence. 39 | 40 | :param from_s: From sequence. 41 | :param to_s: To sequence. 42 | :param new_s: Warped version of from sequence. 43 | :param path: Optimal warping path. 44 | :param filename: Filename path (optional). 45 | """ 46 | try: 47 | import matplotlib.pyplot as plt 48 | import matplotlib as mpl 49 | except ImportError: 50 | logger.error("The plot_warp function requires the matplotlib package to be installed.") 51 | return 52 | fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=True) 53 | ax[0].plot(from_s, label="From") 54 | ax[0].legend() 55 | ax[1].plot(to_s, label="To") 56 | ax[1].legend() 57 | transFigure = fig.transFigure.inverted() 58 | lines = [] 59 | line_options = {'linewidth': 0.5, 'color': 'orange', 'alpha': 0.8} 60 | for r_c, c_c in path: 61 | if r_c < 0 or c_c < 0: 62 | continue 63 | coord1 = transFigure.transform(ax[0].transData.transform([r_c, from_s[r_c]])) 64 | coord2 = transFigure.transform(ax[1].transData.transform([c_c, to_s[c_c]])) 65 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]), 66 | transform=fig.transFigure, **line_options)) 67 | ax[2].plot(new_s, label="From-warped") 68 | ax[2].legend() 69 | for i in range(len(to_s)): 70 | coord1 = transFigure.transform(ax[1].transData.transform([i, to_s[i]])) 71 | coord2 = transFigure.transform(ax[2].transData.transform([i, new_s[i]])) 72 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]), 73 | transform=fig.transFigure, **line_options)) 74 | fig.lines = lines 75 | if filename: 76 | plt.savefig(filename) 77 | plt.close() 78 | fig, ax = None, None 79 | return fig, ax 80 | 81 | 82 | def plot_warping(s1, s2, path, filename=None): 83 | """Plot the optimal warping between to sequences. 84 | 85 | :param s1: From sequence. 86 | :param s2: To sequence. 87 | :param path: Optimal warping path. 88 | :param filename: Filename path (optional). 
89 | """ 90 | import matplotlib.pyplot as plt 91 | import matplotlib as mpl 92 | fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True) 93 | ax[0].plot(s1) 94 | ax[1].plot(s2) 95 | transFigure = fig.transFigure.inverted() 96 | lines = [] 97 | line_options = {'linewidth': 0.5, 'color': 'orange', 'alpha': 0.8} 98 | for r_c, c_c in path: 99 | if r_c < 0 or c_c < 0: 100 | continue 101 | coord1 = transFigure.transform(ax[0].transData.transform([r_c, s1[r_c]])) 102 | coord2 = transFigure.transform(ax[1].transData.transform([c_c, s2[c_c]])) 103 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]), 104 | transform=fig.transFigure, **line_options)) 105 | fig.lines = lines 106 | if filename: 107 | plt.savefig(filename) 108 | plt.close() 109 | fig, ax = None, None 110 | return fig, ax 111 | 112 | 113 | def plot_warpingpaths(s1, s2, paths, path=None, filename=None, shownumbers=False): 114 | """Plot the warping paths matrix. 115 | 116 | :param s1: Series 1 117 | :param s2: Series 2 118 | :param paths: Warping paths matrix 119 | :param path: Path to draw (typically this is the best path) 120 | :param filename: Filename for the image (optional) 121 | :param shownumbers: Show distances also as numbers 122 | """ 123 | from matplotlib import pyplot as plt 124 | from matplotlib import gridspec 125 | from matplotlib.ticker import FuncFormatter 126 | 127 | ratio = max(len(s1), len(s2)) 128 | min_y = min(np.min(s1), np.min(s2)) 129 | max_y = max(np.max(s1), np.max(s2)) 130 | 131 | fig = plt.figure(figsize=(10, 10), frameon=True) 132 | gs = gridspec.GridSpec(2, 2, wspace=1, hspace=1, 133 | left=0, right=1.0, bottom=0, top=1.0, 134 | height_ratios=[1, 6], 135 | width_ratios=[1, 6]) 136 | max_s2_x = np.max(s2) 137 | max_s2_y = len(s2) 138 | max_s1_x = np.max(s1) 139 | min_s1_x = np.min(s1) 140 | max_s1_y = len(s1) 141 | 142 | if path is None: 143 | p = dtw.best_path(paths) 144 | else: 145 | p = path 146 | 147 | def format_fn2_x(tick_val, tick_pos): 148 | return max_s2_x - tick_val 149 | 150 | def format_fn2_y(tick_val, tick_pos): 151 | return int(max_s2_y - tick_val) 152 | 153 | ax0 = fig.add_subplot(gs[0, 0]) 154 | ax0.set_axis_off() 155 | ax0.text(0, 0, "Dist = {:.4f}".format(paths[p[-1][0], p[-1][1]])) 156 | ax0.xaxis.set_major_locator(plt.NullLocator()) 157 | ax0.yaxis.set_major_locator(plt.NullLocator()) 158 | 159 | ax1 = fig.add_subplot(gs[0, 1:]) 160 | ax1.set_ylim([min_y, max_y]) 161 | ax1.set_axis_off() 162 | ax1.xaxis.tick_top() 163 | # ax1.set_aspect(0.454) 164 | ax1.plot(range(len(s2)), s2, ".-") 165 | ax1.xaxis.set_major_locator(plt.NullLocator()) 166 | ax1.yaxis.set_major_locator(plt.NullLocator()) 167 | 168 | ax2 = fig.add_subplot(gs[1:, 0]) 169 | ax2.set_xlim([-max_y, -min_y]) 170 | ax2.set_axis_off() 171 | # ax2.set_aspect(0.8) 172 | # ax2.xaxis.set_major_formatter(FuncFormatter(format_fn2_x)) 173 | # ax2.yaxis.set_major_formatter(FuncFormatter(format_fn2_y)) 174 | ax2.xaxis.set_major_locator(plt.NullLocator()) 175 | ax2.yaxis.set_major_locator(plt.NullLocator()) 176 | ax2.plot(-s1, range(max_s1_y, 0, -1), ".-") 177 | 178 | ax3 = fig.add_subplot(gs[1:, 1:]) 179 | # ax3.set_aspect(1) 180 | ax3.matshow(paths[1:, 1:]) 181 | # ax3.grid(which='major', color='w', linestyle='-', linewidth=0) 182 | # ax3.set_axis_off() 183 | py, px = zip(*p) 184 | ax3.plot(px, py, ".-", color="red") 185 | # ax3.xaxis.set_major_locator(plt.NullLocator()) 186 | # ax3.yaxis.set_major_locator(plt.NullLocator()) 187 | if shownumbers: 188 | for r in range(1, paths.shape[0]): 189 | for c in 
range(1, paths.shape[1]): 190 | ax3.text(c - 1, r - 1, "{:.2f}".format(paths[r, c])) 191 | 192 | gs.tight_layout(fig, pad=1.0, h_pad=1.0, w_pad=1.0) 193 | # fig.subplots_adjust(hspace=0, wspace=0) 194 | 195 | ax = fig.axes 196 | 197 | if filename: 198 | if type(filename) != str: 199 | filename = str(filename) 200 | plt.savefig(filename) 201 | plt.close() 202 | fig, ax = None, None 203 | return fig, ax 204 | 205 | def plot_matrix(distances, filename=None, ax=None, shownumbers=False): 206 | from matplotlib import pyplot as plt 207 | 208 | if ax is None: 209 | if shownumbers: 210 | figsize = (15, 15) 211 | else: 212 | figsize = None 213 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=figsize) 214 | else: 215 | fig = None 216 | 217 | ax.xaxis.set_ticks_position('top') 218 | ax.yaxis.set_ticks_position('both') 219 | 220 | im = ax.imshow(distances) 221 | idxs = [str(i) for i in range(len(distances))] 222 | # Show all ticks 223 | ax.set_xticks(np.arange(len(idxs))) 224 | ax.set_xticklabels(idxs) 225 | ax.set_yticks(np.arange(len(idxs))) 226 | ax.set_yticklabels(idxs) 227 | 228 | ax.set_title("Distances between series", pad=30) 229 | 230 | if shownumbers: 231 | for i in range(len(idxs)): 232 | for j in range(len(idxs)): 233 | if not np.isinf(distances[i, j]): 234 | l = "{:.2f}".format(distances[i, j]) 235 | ax.text(j, i, l, ha="center", va="center", color="w") 236 | 237 | if filename: 238 | if type(filename) != str: 239 | filename = str(filename) 240 | plt.savefig(filename) 241 | plt.close() 242 | fig, ax = None, None 243 | return fig, ax 244 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/util.py: -------------------------------------------------------------------------------- 1 | # -*- coding: UTF-8 -*- 2 | """ 3 | dtaidistance.util 4 | ~~~~~~~~~~~~~~~~~ 5 | 6 | Utility functions for DTAIDistance. 7 | 8 | :author: Wannes Meert 9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group. 10 | :license: Apache License, Version 2.0, see LICENSE for details. 11 | 12 | """ 13 | import os 14 | import sys 15 | import logging 16 | from array import array 17 | from pathlib import Path 18 | import tempfile 19 | 20 | import numpy as np 21 | 22 | 23 | logger = logging.getLogger("be.kuleuven.dtai.distance") 24 | 25 | 26 | dtaidistance_dir = os.path.abspath(os.path.dirname(__file__)) 27 | 28 | 29 | def prepare_directory(directory=None): 30 | """Prepare the given directory, create it if necessary. 31 | If no directory is given, a new directory will be created in the system's temp directory. 32 | """ 33 | if directory is not None: 34 | directory = Path(directory) 35 | if not directory.exists(): 36 | directory.mkdir(parents=True) 37 | logger.debug("Using directory: {}".format(directory)) 38 | return Path(directory) 39 | directory = tempfile.mkdtemp(prefix="dtaidistance_") 40 | logger.debug("Using directory: {}".format(directory)) 41 | return Path(directory) 42 | 43 | 44 | class SeriesContainer: 45 | def __init__(self, series): 46 | """Container for a list of series. 47 | 48 | This wrapper class knows how to deal with multiple types of datastructures to represent 49 | a list of sequences: 50 | - List[array.array] 51 | - List[numpy.array] 52 | - List[List] 53 | - numpy.array 54 | - numpy.matrix 55 | 56 | When using the C-based extensions, the data is automatically verified and converted. 
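        Example (editorial sketch)::

            import numpy as np
            from dtaidistance.util import SeriesContainer

            sc = SeriesContainer.wrap([np.array([1.0, 2.0, 3.0]),
                                       np.array([2.0, 3.0])])
            len(sc)           # 2
            sc[1]             # the second series
            sc.get_max_y()    # largest absolute value across all series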
57 | """ 58 | if isinstance(series, SeriesContainer): 59 | self.series = series.series 60 | elif isinstance(series, np.ndarray) and len(series.shape) == 2: 61 | # A matrix always returns a 2D array, also if you select one row (to be consistent 62 | # and always be a matrix datastructure). The methods in this toolbox expect a 63 | # 1D array thus we need to convert to a 1D or 2D array. 64 | # self.series = [np.asarray(series[i]).reshape(-1) for i in range(series.shape[0])] 65 | self.series = np.asarray(series, order='C') 66 | elif type(series) == set or type(series) == tuple: 67 | self.series = list(series) 68 | else: 69 | self.series = series 70 | 71 | def c_data(self): 72 | """Return a datastructure that the C-component knows how to handle. 73 | The method tries to avoid copying or reallocating memory. 74 | 75 | :return: Either a list of buffers or a two-dimensional buffer. The 76 | buffers are guaranteed to be C-contiguous and can thus be used 77 | as regular pointer-based arrays in C. 78 | """ 79 | if type(self.series) == list: 80 | for i in range(len(self.series)): 81 | serie = self.series[i] 82 | if isinstance(serie, np.ndarray): 83 | if not serie.flags.c_contiguous: 84 | serie = np.asarray(serie, order='C') 85 | self.series[i] = serie 86 | elif isinstance(serie, array): 87 | pass 88 | else: 89 | raise Exception("Type of series not supported, " 90 | "expected numpy.array or array.array but got {}".format(type(serie))) 91 | elif isinstance(self.series, np.ndarray): 92 | if not self.series.flags.c_contiguous: 93 | self.series = self.series.copy(order='C') 94 | return self.series 95 | 96 | def get_max_y(self): 97 | max_y = 0 98 | if isinstance(self.series, np.ndarray): 99 | max_y = max(np.max(self.series), abs(np.min(self.series))) 100 | else: 101 | for serie in self.series: 102 | max_y = max(max_y, np.max(serie), abs(np.min(serie))) 103 | return max_y 104 | 105 | def __getitem__(self, item): 106 | return self.series[item] 107 | 108 | def __len__(self): 109 | return len(self.series) 110 | 111 | def __str__(self): 112 | return "SeriesContainer:\n{}".format(self.series) 113 | 114 | @staticmethod 115 | def wrap(series): 116 | if isinstance(series, SeriesContainer): 117 | return series 118 | return SeriesContainer(series) 119 | 120 | 121 | def recompile(): 122 | import subprocess as sp 123 | sp.run([sys.executable, 'setup.py', 'build_ext', '--inplace'], cwd=dtaidistance_dir) 124 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_INSTRUCTOR.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline\n", 10 | "import matplotlib\n", 11 | "matplotlib.rcParams['figure.figsize'] = [8, 3]\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "\n", 14 | "import pandas as pd\n", 15 | "import numpy as np\n", 16 | "import statsmodels.api as sm\n", 17 | "import statsmodels\n", 18 | "\n", 19 | "import scipy\n", 20 | "from scipy.stats import pearsonr\n", 21 | "\n", 22 | "from pandas.plotting import register_matplotlib_converters\n", 23 | "register_matplotlib_converters()" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "print(matplotlib.__version__)\n", 33 | "print(pd.__version__)\n", 34 | 
"print(np.__version__)\n", 35 | "print(statsmodels.__version__)\n", 36 | "print(scipy.__version__)\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Obtain and visualize data" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "## data obtained from https://datahub.io/core/global-temp#data\n", 53 | "df = pd.read_csv(\"global_temps.csv\")\n", 54 | "df.head()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "df.Mean[:100].plot()" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## Exercise: what is wrong with the data and plot above? How can we fix this?" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | " df = df.pivot(index='Date', columns='Source', values='Mean')" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "df.head()" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "df.GCAG.plot()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "type(df.index)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## Exercise: how can we make the index more time aware?" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "df.index = pd.to_datetime(df.index)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "type(df.index)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "df.GCAG.plot()" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "df['1880']" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "plt.plot(df['1880':'1950'][['GCAG', 'GISTEMP']])" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "plt.plot(df['1950':][['GISTEMP']])" 168 | ] 169 | }, 170 | { 171 | "cell_type": "markdown", 172 | "metadata": {}, 173 | "source": [ 174 | "## Exercise: How strongly do these measurements correlate contemporaneously? What about with a time lag?" 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "plt.scatter(df['1880':'1900'][['GCAG']], df['1880':'1900'][['GISTEMP']])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "plt.scatter(df['1880':'1899'][['GCAG']], df['1881':'1900'][['GISTEMP']])" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "pearsonr(df['1880':'1899'].GCAG, df['1881':'1900'].GISTEMP)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "df['1880':'1899'][['GCAG']].head()" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "df['1881':'1900'][['GISTEMP']].head()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "min(df.index)" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "max(df.index)" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "## Unobserved component model" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": null, 250 | "metadata": {}, 251 | "outputs": [], 252 | "source": [ 253 | "train = df['1960':]" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "### model parameters" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "# smooth trend model without seasonal or cyclical components\n", 270 | "model = {\n", 271 | " 'level': 'smooth trend', 'cycle': False, 'seasonal': None, \n", 272 | "}\n" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "### fitting a model" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "scrolled": true 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "# https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.structural.UnobservedComponents.html\n", 291 | "gcag_mod = sm.tsa.UnobservedComponents(train['GCAG'], **model)\n", 292 | "gcag_res = gcag_mod.fit()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "fig = gcag_res.plot_components(legend_loc='lower right', figsize=(15, 9));" 302 | ] 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "## Plotting predictions" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "# Perform rolling prediction and multistep forecast\n", 318 | "num_steps = 20\n", 319 | "predict_res = gcag_res.get_prediction(dynamic=train['GCAG'].shape[0] - num_steps)\n", 320 | "\n", 321 | "predict = predict_res.predicted_mean\n", 322 | "ci = predict_res.conf_int()" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 
| "source": [ 331 | "plt.plot(predict)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "plt.scatter(train['GCAG'], predict)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "fig, ax = plt.subplots()\n", 350 | "# Plot the results\n", 351 | "ax.plot(train['GCAG'], 'k.', label='Observations');\n", 352 | "ax.plot(train.index[:-num_steps], predict[:-num_steps], label='One-step-ahead Prediction');\n", 353 | "\n", 354 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n", 355 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n", 356 | "\n", 357 | "# Cleanup the image\n", 358 | "legend = ax.legend(loc='upper left');" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": null, 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [ 367 | "fig, ax = plt.subplots()\n", 368 | "# Plot the results\n", 369 | "ax.plot(train.index[-40:], train['GCAG'][-40:], 'k.', label='Observations');\n", 370 | "ax.plot(train.index[-40:-num_steps], predict[-40:-num_steps], label='One-step-ahead Prediction');\n", 371 | "\n", 372 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n", 373 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n", 374 | "\n", 375 | "# Cleanup the image\n", 376 | "legend = ax.legend(loc='upper left');" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "## Exercise: consider adding a seasonal term for 12 periods for the model fit above. Does this improve the fit of the model?" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": null, 389 | "metadata": {}, 390 | "outputs": [], 391 | "source": [ 392 | "seasonal_model = {\n", 393 | " 'level': 'local linear trend',\n", 394 | " 'seasonal': 12\n", 395 | "}\n", 396 | "mod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n", 397 | "res = mod.fit(method='powell', disp=False)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": null, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "fig = res.plot_components(legend_loc='lower right', figsize=(15, 9));" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "## How does this compare to the original model?" 
414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [ 422 | "pearsonr(gcag_res.predict(), train['GCAG'])" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": {}, 429 | "outputs": [], 430 | "source": [ 431 | "np.mean(np.abs(gcag_res.predict() - train['GCAG']))" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [ 440 | "np.mean(np.abs(res.predict() - train['GCAG']))" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "## Explore the seasonality more" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": null, 453 | "metadata": {}, 454 | "outputs": [], 455 | "source": [ 456 | "seasonal_model = {\n", 457 | " 'level': 'local level',\n", 458 | " 'seasonal': 12\n", 459 | "}\n", 460 | "llmod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n", 461 | "ll_level_res = llmod.fit(method='powell', disp=False)" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [ 470 | "fig = ll_level_res.plot_components(legend_loc='lower right', figsize=(15, 9));" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": null, 476 | "metadata": {}, 477 | "outputs": [], 478 | "source": [ 479 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))" 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "train[:48].GCAG.plot()" 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": null, 494 | "metadata": {}, 495 | "outputs": [], 496 | "source": [] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "## Exercise: a common null model for time series is to predict the value at time t-1 for the value at time t. How does such a model compare to the models we fit here?" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "### Consider correlation" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "pearsonr(ll_level_res.predict(), train['GCAG'])" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "pearsonr(train['GCAG'].iloc[:-1, ], train['GCAG'].iloc[1:, ])" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "### What about mean absolute error?" 
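For the baseline comparison in the cells that follow, a sketch of the persistence (predict t-1 for time t) MAE. Note that `np.abs` takes a single array, so the difference must be computed first; a second positional argument would be interpreted as numpy's `out` buffer, not a second operand:

```python
gcag = train['GCAG'].values
persistence_mae = np.mean(np.abs(gcag[1:] - gcag[:-1]))
print(persistence_mae)
```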
535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": {}, 541 | "outputs": [], 542 | "source": [ 543 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))" 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": null, 549 | "metadata": {}, 550 | "outputs": [], 551 | "source": [ 552 | "np.mean(np.abs(train['GCAG'].iloc[:-1, ].values, train['GCAG'].iloc[1:, ].values))" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": {}, 559 | "outputs": [], 560 | "source": [] 561 | } 562 | ], 563 | "metadata": { 564 | "kernelspec": { 565 | "display_name": "Python 3", 566 | "language": "python", 567 | "name": "python3" 568 | }, 569 | "language_info": { 570 | "codemirror_mode": { 571 | "name": "ipython", 572 | "version": 3 573 | }, 574 | "file_extension": ".py", 575 | "mimetype": "text/x-python", 576 | "name": "python", 577 | "nbconvert_exporter": "python", 578 | "pygments_lexer": "ipython3", 579 | "version": "3.6.8" 580 | } 581 | }, 582 | "nbformat": 4, 583 | "nbformat_minor": 2 584 | } 585 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_STUDENT.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "%matplotlib inline\n", 10 | "import matplotlib\n", 11 | "matplotlib.rcParams['figure.figsize'] = [8, 3]\n", 12 | "import matplotlib.pyplot as plt\n", 13 | "\n", 14 | "import pandas as pd\n", 15 | "import numpy as np\n", 16 | "import statsmodels.api as sm\n", 17 | "import statsmodels\n", 18 | "\n", 19 | "import scipy\n", 20 | "from scipy.stats import pearsonr\n", 21 | "\n", 22 | "from pandas.plotting import register_matplotlib_converters\n", 23 | "register_matplotlib_converters()" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "print(matplotlib.__version__)\n", 33 | "print(pd.__version__)\n", 34 | "print(np.__version__)\n", 35 | "print(statsmodels.__version__)\n", 36 | "print(scipy.__version__)\n" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Obtain and visualize data" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "## data obtained from https://datahub.io/core/global-temp#data\n", 53 | "df = pd.read_csv(\"global_temps.csv\")\n", 54 | "df.head()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "df.Mean[:100].plot()" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## Exercise: what is wrong with the data and plot above? How can we fix this?" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## Exercise: how can we make the index more time aware?" 
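If you want a hint for this exercise, a minimal sketch mirroring the instructor notebook: parse the string index into a `DatetimeIndex` so that label-based slicing such as `df['1880':'1950']` works:

```python
df.index = pd.to_datetime(df.index)
type(df.index)  # pandas DatetimeIndex
```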
85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## Exercise: How strongly do these measurements correlate contemporaneously? What about with a time lag?" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## Unobserved component model" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "train = df['1960':]" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### model parameters" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# smooth trend model without seasonal or cyclical components\n", 138 | "model = {\n", 139 | " 'level': 'smooth trend', 'cycle': False, 'seasonal': None, \n", 140 | "}\n" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### fitting a model" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": { 154 | "scrolled": true 155 | }, 156 | "outputs": [], 157 | "source": [ 158 | "# https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.structural.UnobservedComponents.html\n", 159 | "gcag_mod = sm.tsa.UnobservedComponents(train['GCAG'], **model)\n", 160 | "gcag_res = gcag_mod.fit()" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": null, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "fig = gcag_res.plot_components(legend_loc='lower right', figsize=(15, 9));" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "## Plotting predictions" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [ 185 | "# Perform rolling prediction and multistep forecast\n", 186 | "num_steps = 20\n", 187 | "predict_res = gcag_res.get_prediction(dynamic=train['GCAG'].shape[0] - num_steps)\n", 188 | "\n", 189 | "predict = predict_res.predicted_mean\n", 190 | "ci = predict_res.conf_int()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "plt.plot(predict)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "plt.scatter(train['GCAG'], predict)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "fig, ax = plt.subplots()\n", 218 | "# Plot the results\n", 219 | "ax.plot(train['GCAG'], 'k.', label='Observations');\n", 220 | "ax.plot(train.index[:-num_steps], predict[:-num_steps], label='One-step-ahead Prediction');\n", 221 | "\n", 222 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n", 223 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n", 224 | "\n", 225 | "# Cleanup the image\n", 226 | "legend = ax.legend(loc='upper left');" 227 | ] 228 | }, 
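An optional sanity check on the plot above: the mean absolute error over the final `num_steps` dynamic (multistep) predictions, using objects already defined in this notebook:

```python
np.mean(np.abs(predict[-num_steps:] - train['GCAG'].iloc[-num_steps:]))
```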
229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "fig, ax = plt.subplots()\n", 236 | "# Plot the results\n", 237 | "ax.plot(train.index[-40:], train['GCAG'][-40:], 'k.', label='Observations');\n", 238 | "ax.plot(train.index[-40:-num_steps], predict[-40:-num_steps], label='One-step-ahead Prediction');\n", 239 | "\n", 240 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n", 241 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n", 242 | "\n", 243 | "# Cleanup the image\n", 244 | "legend = ax.legend(loc='upper left');" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "## Exercise: consider adding a seasonal term for 12 periods for the model fit above. Does this improve the fit of the model?" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "## How does this compare to the original model?" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "## Let's explore the seasonality more" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": {}, 286 | "outputs": [], 287 | "source": [ 288 | "seasonal_model = {\n", 289 | " 'level': 'local level',\n", 290 | " 'seasonal': 12\n", 291 | "}\n", 292 | "llmod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n", 293 | "ll_level_res = llmod.fit(method='powell', disp=False)" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": null, 299 | "metadata": {}, 300 | "outputs": [], 301 | "source": [ 302 | "fig = ll_level_res.plot_components(legend_loc='lower right', figsize=(15, 9));" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": {}, 309 | "outputs": [], 310 | "source": [ 311 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "train[:48].GCAG.plot()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "## Exercise: a common null model for time series is to predict the value at time t-1 for the value at time t. How does such a model compare to the models we fit here?" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### Consider correlation" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": {}, 341 | "outputs": [], 342 | "source": [] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "### What about mean absolute error?" 
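A sketch of the full null-model comparison, mirroring the instructor notebook: the correlation and MAE of the t-1 persistence forecast:

```python
r, p = pearsonr(train['GCAG'].iloc[:-1], train['GCAG'].iloc[1:])
mae = np.mean(np.abs(train['GCAG'].values[1:] - train['GCAG'].values[:-1]))
print(r, mae)
```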
349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": null, 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [] 357 | } 358 | ], 359 | "metadata": { 360 | "kernelspec": { 361 | "display_name": "Python 3", 362 | "language": "python", 363 | "name": "python3" 364 | }, 365 | "language_info": { 366 | "codemirror_mode": { 367 | "name": "ipython", 368 | "version": 3 369 | }, 370 | "file_extension": ".py", 371 | "mimetype": "text/x-python", 372 | "name": "python", 373 | "nbconvert_exporter": "python", 374 | "pygments_lexer": "ipython3", 375 | "version": "3.6.8" 376 | } 377 | }, 378 | "nbformat": 4, 379 | "nbformat_minor": 2 380 | } 381 | -------------------------------------------------------------------------------- /modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/Nile.csv: -------------------------------------------------------------------------------- 1 | "","year","val" 2 | "1",1871,1120 3 | "2",1872,1160 4 | "3",1873,963 5 | "4",1874,1210 6 | "5",1875,1160 7 | "6",1876,1160 8 | "7",1877,813 9 | "8",1878,1230 10 | "9",1879,1370 11 | "10",1880,1140 12 | "11",1881,995 13 | "12",1882,935 14 | "13",1883,1110 15 | "14",1884,994 16 | "15",1885,1020 17 | "16",1886,960 18 | "17",1887,1180 19 | "18",1888,799 20 | "19",1889,958 21 | "20",1890,1140 22 | "21",1891,1100 23 | "22",1892,1210 24 | "23",1893,1150 25 | "24",1894,1250 26 | "25",1895,1260 27 | "26",1896,1220 28 | "27",1897,1030 29 | "28",1898,1100 30 | "29",1899,774 31 | "30",1900,840 32 | "31",1901,874 33 | "32",1902,694 34 | "33",1903,940 35 | "34",1904,833 36 | "35",1905,701 37 | "36",1906,916 38 | "37",1907,692 39 | "38",1908,1020 40 | "39",1909,1050 41 | "40",1910,969 42 | "41",1911,831 43 | "42",1912,726 44 | "43",1913,456 45 | "44",1914,824 46 | "45",1915,702 47 | "46",1916,1120 48 | "47",1917,1100 49 | "48",1918,832 50 | "49",1919,764 51 | "50",1920,821 52 | "51",1921,768 53 | "52",1922,845 54 | "53",1923,864 55 | "54",1924,862 56 | "55",1925,698 57 | "56",1926,845 58 | "57",1927,744 59 | "58",1928,796 60 | "59",1929,1040 61 | "60",1930,759 62 | "61",1931,781 63 | "62",1932,865 64 | "63",1933,845 65 | "64",1934,944 66 | "65",1935,984 67 | "66",1936,897 68 | "67",1937,822 69 | "68",1938,1010 70 | "69",1939,771 71 | "70",1940,676 72 | "71",1941,649 73 | "72",1942,846 74 | "73",1943,812 75 | "74",1944,742 76 | "75",1945,801 77 | "76",1946,1040 78 | "77",1947,860 79 | "78",1948,874 80 | "79",1949,848 81 | "80",1950,890 82 | "81",1951,744 83 | "82",1952,749 84 | "83",1953,838 85 | "84",1954,1050 86 | "85",1955,918 87 | "86",1956,986 88 | "87",1957,797 89 | "88",1958,923 90 | "89",1959,975 91 | "90",1960,815 92 | "91",1961,1020 93 | "92",1962,906 94 | "93",1963,901 95 | "94",1964,1170 96 | "95",1965,912 97 | "96",1966,746 98 | "97",1967,919 99 | "98",1968,718 100 | "99",1969,714 101 | "100",1970,740 102 | -------------------------------------------------------------------------------- /modern_time_series_analysis/README.md: -------------------------------------------------------------------------------- 1 | # Modern Time Series Analysis 2 | 3 | ## Install and Setup 4 | 5 | Okay, I wanted to streamline this. 
So you can just use this repo: 6 | 7 | pip install -r requirements.txt 8 | 9 | OR, since this entire tutorial is run from Jupyter notebooks, you probably want to install the requirements into whichever Python environment your Jupyter runs from: 10 | 11 | ```shell 12 | $ which jupyter 13 | /home/my_user_name/stuff/bin/jupyter 14 | $ /home/my_user_name/stuff/bin/pip install -r requirements.txt 15 | 16 | $ cd /full/path/to/modern_time_series_analysis/ 17 | $ jupyter notebook 18 | ``` 19 | 20 | ## Syllabus 21 | 22 | All the data and notebooks you need are [here](https://github.com/theJollySin/scipy_con_2019/tree/master/modern_time_series_analysis/ModernTimeSeriesAnalysis): 23 | 24 | 1. [Structural Time Series](ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_INSTRUCTOR.ipynb) 25 | 2. [Gaussian HMM](ModernTimeSeriesAnalysis/StateSpaceModels/2_Gaussian_HMM_INSTRUCTOR.ipynb) 26 | 3. [Trees for Classification and Prediction](ModernTimeSeriesAnalysis/MachineLearning/3_Trees_for_Classification_and_Prediction_INSTRUCTOR.ipynb) 27 | 4. [Clustering](ModernTimeSeriesAnalysis/MachineLearning/4_Clustering_INSTRUCTOR.ipynb) 28 | 5. [Forecasting electricity use with mxnet](ModernTimeSeriesAnalysis/DeepLearning/Electricity/5_Forecasting_electric_use_with_mxnet_INSTRUCTOR.ipynb) 29 | 6. [Stocks](ModernTimeSeriesAnalysis/DeepLearning/Stocks/6_Stocks_INSTRUCTOR.ipynb) 30 | 31 | 32 | ## Notes 33 | 34 | Don't expect anything exciting here. These are literally just my notes. 35 | 36 | 37 | ### Structural Time Series 38 | 39 | * [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) models are old school, but if your machine learning toys don't do better than this, why bother? 40 | * [Kalman Filtering!](https://en.wikipedia.org/wiki/Kalman_filter) - Love me a Kalman Filter 41 | 42 | 43 | ### Machine Learning - Time Series Trees for Classification and Prediction 44 | 45 | * Most machine learning on time series uses tools that weren't designed for time series data. We always have to use other tools and "make 'em work". 46 | * Doctors and nurses do "feature detection" on "time series data" when they look at heart rates from ECGs. Everything old is new again. 47 | * We're going to cover [Random Forests](https://en.wikipedia.org/wiki/Random_forest) and [Gradient Boosted Trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting) with [xgboost](https://xgboost.readthedocs.io/en/latest/). 48 | * We're going to look at different time series data sets and figure out which ones look the most similar. If we had little snippets of time series data, the human eye might be able to find similar sets, but feature detection by human labor is costly and slow. ...It'll probably still be computationally intensive. 49 | * [Cesium](https://github.com/cesium-ml/cesium) is a feature generation library - mostly just for initial exploration. 50 | * Are we chopping up our time series into blocks because it is cyclical in nature? Or would we do that anyway? 51 | * We used a sliding window; we DID NOT chop it into blocks (see the sketch after this list). 52 | * We used a machine learning method on a problem that had very little data. This was not a good choice. 53 | * The important lessons here were supposed to be: 54 | 1. Chop up your time series data to get more data. 55 | 2. If you don't have enough data, machine learning is the wrong approach.
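Since I'll forget what the sliding-window trick actually looks like in code, here is a minimal sketch of it with xgboost. This is my own toy example on made-up sine-wave data (all names here are mine), not code from the tutorial notebooks:

```python
# A toy sliding-window sketch (my own example, NOT from the tutorial):
# turn a single series into (window -> next value) samples, then fit boosted trees.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
series = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.randn(500)  # fake data

window = 12  # each sample is the previous 12 points
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]  # the value right after each window

model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
model.fit(X[:-50], y[:-50])      # hold out the last 50 windows for testing
preds = model.predict(X[-50:])
print(np.mean(np.abs(preds - y[-50:])))  # mean absolute error on the holdout
```

One caveat to remember: adjacent windows overlap, so chopping the series up this way multiplies the row count without giving you truly independent samples.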
56 | 57 | 58 | ### Machine Learning - Clustering 59 | 60 | * [Dynamic Time Warping](https://en.wikipedia.org/wiki/Dynamic_time_warping) 61 | * [Here](https://github.com/wannesm/dtaidistance) is one little library for DTW 62 | 63 | 64 | ### Deep Learning - Electric Use 65 | 66 | * Typically, if you want to put time series data into a neural network, you use an [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network). 67 | * Research later: [GRU vs LSTM](https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm) 68 | * You can also use a [CNN](https://en.wikipedia.org/wiki/Convolutional_neural_network). 69 | * Compared to RNNs, CNNs might be a little better for classification than for prediction. 70 | * This example finally processes parallel time series signals together, a *MUCH* more interesting problem. 71 | 72 | 73 | ### Deep Learning - Stocks 74 | 75 | * Do this on your own later. 76 | 77 | -------------------------------------------------------------------------------- /modern_time_series_analysis/SciPyModernTimeSeries.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/SciPyModernTimeSeries.pdf -------------------------------------------------------------------------------- /modern_time_series_analysis/requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.7.1 2 | astor==0.8.0 3 | cesium==0.9.9 4 | certifi==2024.7.4 5 | chardet==3.0.4 6 | hmmlearn==0.2.2 7 | gast==0.2.2 8 | google-pasta==0.1.7 9 | graphviz==0.8.4 10 | grpcio==1.53.2 11 | h5py==2.9.0 12 | idna==3.7 13 | joblib>=1.2.0 14 | Keras-Applications==1.0.8 15 | Keras-Preprocessing==1.1.0 16 | Markdown==3.1.1 17 | mxnet==1.9.1 18 | numpy==1.22.0 19 | pandas==0.24.2 20 | patsy==0.5.1 21 | protobuf==3.18.3 22 | python-dateutil==2.8.0 23 | pytz==2019.1 24 | requests==2.32.0 25 | scikit-learn==1.5.0 26 | scipy==1.10.0 27 | six==1.12.0 28 | sklearn==0.0 29 | statsmodels==0.10.0 30 | tensorboard==1.14.0 31 | tensorflow==2.12.1 32 | tensorflow-estimator==1.14.0 33 | termcolor==1.1.0 34 | urllib3==1.26.19 35 | Werkzeug==3.0.6 36 | wrapt==1.11.2 37 | xgboost==0.90 38 | --------------------------------------------------------------------------------