├── .gitignore
├── LICENSE
├── README.md
├── gpu
│   ├── 01-Intro_to_cuDF.ipynb
│   ├── 01-Intro_to_cuGraph.ipynb
│   ├── 01-Introduction-LinearRegression-Hyperparam.ipynb
│   ├── 02-Intro_to_cuDF_UDFs.ipynb
│   ├── 02-LogisticRegression.ipynb
│   ├── 02-Louvain.ipynb
│   ├── 03-Pagerank.ipynb
│   ├── 03-UMAP.ipynb
│   ├── Introduction.ipynb
│   ├── README.md
│   ├── data
│   │   └── karate-data.csv
│   └── index.ipynb
└── modern_time_series_analysis
    ├── ModernTimeSeriesAnalysis
    │   ├── DeepLearning
    │   │   ├── Electricity
    │   │   │   ├── 5_Forecasting_electric_use_with_mxnet_INSTRUCTOR.ipynb
    │   │   │   ├── 5_Forecasting_electric_use_with_mxnet_STUDENT.ipynb
    │   │   │   ├── electricity.diff.txt
    │   │   │   ├── models.json
    │   │   │   ├── perf.py
    │   │   │   └── run.json
    │   │   └── Stocks
    │   │       ├── 6_Stocks_INSTRUCTOR.ipynb
    │   │       ├── 6_Stocks_STUDENT.ipynb
    │   │       └── sp500.csv
    │   ├── MachineLearning
    │   │   ├── 3_Trees_for_Classification_and_Prediction_INSTRUCTOR.ipynb
    │   │   ├── 3_Trees_for_Classification_and_Prediction_STUDENT.ipynb
    │   │   ├── 4_Clustering_INSTRUCTOR.ipynb
    │   │   ├── 4_Clustering_STUDENT.ipynb
    │   │   ├── data
    │   │   │   ├── 50words.csv
    │   │   │   ├── AirPassengers.csv
    │   │   │   ├── featurized_hists.csv
    │   │   │   ├── featurized_words.csv
    │   │   │   ├── full_eeg_data_features.csv
    │   │   │   ├── pairwise_word_distances.npy
    │   │   │   └── training_eeg.csv
    │   │   ├── dtaidistance
    │   │   │   ├── __init__.py
    │   │   │   ├── alignment.py
    │   │   │   ├── clustering.py
    │   │   │   ├── dp.py
    │   │   │   ├── dtw.py
    │   │   │   ├── dtw_c.pyx
    │   │   │   ├── dtw_ndim.py
    │   │   │   ├── dtw_ndim_visualisation.py
    │   │   │   ├── dtw_visualisation.py
    │   │   │   ├── dtw_weighted.py
    │   │   │   └── util.py
    │   │   └── full_eeg_data_features.csv
    │   └── StateSpaceModels
    │       ├── 1_Structural_Time_Series_INSTRUCTOR.ipynb
    │       ├── 1_Structural_Time_Series_STUDENT.ipynb
    │       ├── 2_Gaussian_HMM_INSTRUCTOR.ipynb
    │       ├── 2_Gaussian_HMM_STUDENT.ipynb
    │       ├── Nile.csv
    │       └── global_temps.csv
    ├── README.md
    ├── SciPyModernTimeSeries.pdf
    └── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 John Stilley
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial Sessions for SciPy Con 2019
2 |
3 | I couldn't attend every tutorial, but these seemed like fun:
4 |
5 |
6 | ## Monday, 8AM (Room 203) Modern Time Series Analysis
7 |
8 | * [Here](https://github.com/john-science/scipy_con_2019/tree/main/modern_time_series_analysis/) is my version of this tutorial, since it was not available online.
9 |
10 | ## Monday, 8AM (Room 106) PyTest
11 |
12 | * [Here](https://leemangeophysicalllc.github.io/testing-with-python/) is a link to their official GitHub.
13 |
14 | ## Monday, 1PM (Room 203) Bayesian Data Science: Probabilistic Programming
15 |
16 | * [Here](https://github.com/john-science/bayesian-stats-modelling-tutorial) is my fork of their GitHub.
17 |
18 |
19 | ## Tuesday, 8AM (Room 106) RAPIDS: Open GPU Data Science
20 |
21 | * [Here](https://github.com/john-science/scipy_con_2019/tree/main/gpu) is my version of their tutorial, since it was not available online.
22 |
23 | ## Tuesday, 1PM (Room 104) Escape from Auto-manual Testing with Hypothesis!
24 |
25 | * [Here](https://github.com/john-science/escape-from-automanual-testing) is my fork of their GitHub.
26 |
--------------------------------------------------------------------------------
/gpu/01-Intro_to_cuDF.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/01-Intro_to_cuGraph.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/01-Introduction-LinearRegression-Hyperparam.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/02-Intro_to_cuDF_UDFs.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/02-LogisticRegression.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/02-Louvain.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/03-Pagerank.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/03-UMAP.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/Introduction.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/gpu/README.md:
--------------------------------------------------------------------------------
1 | # GPU Data Science
2 |
3 | > Okay, so they are giving us cloud IP addresses to log into, so I may not have much to share here, except my notes.
4 |
5 | ## Notes
6 |
7 | * Fun! They provide a cloud instance for everyone in the room, so we can play with a machine with sufficient GPU power.
8 | * The old solution was Hadoop, but it required too much disk IO, so people moved to Spark.
9 | * RAPIDS is Nvidia's suite of open-source GPU data science libraries
10 | * RAPIDS uses Apache Arrow (which tries to standardize the storage of columnar data in memory)
11 | * cuDF has nearly identical syntax to Pandas DataFrames
12 | * The API for cuDF is super handy, I'll give them that.
13 | * I guess that means you can also use cuDF as a drop-in replacement in your code base... I see. (A quick sketch of this is below.)
14 | * String Example: big string replace operations on pd.DataFrame
15 | * 50x speed-up using cuDF and fixed-length NumPy strings
16 | * TODO: Whoops. At work, are we not converting our strings in pd.DataFrames to NumPy string arrays?
17 | * [RAPIDS docs](https://docs.rapids.ai/)
18 | * [Play around with RAPIDS](https://rapids.ai/start.html)
19 | * [cuDF GitHub](https://github.com/rapidsai/cudf)
20 | * [cupy](https://cupy.chainer.org/) is a replacement for NumPy, but on the GPU
21 |
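A minimal sketch of the "drop-in replacement" idea from the notes above. It assumes a working RAPIDS install on an NVIDIA GPU; the toy data and column names are made up for illustration:

```python
import pandas as pd
import cudf  # requires a RAPIDS installation and an NVIDIA GPU

pdf = pd.DataFrame({"key": [0, 1, 0, 1], "val": [1.0, 2.0, 3.0, 4.0]})
gdf = cudf.DataFrame.from_pandas(pdf)  # copy the frame into GPU memory

# The cuDF API mirrors pandas, so the same expression runs on both frames
print(pdf.groupby("key")["val"].sum())
print(gdf.groupby("key")["val"].sum())
```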
22 |
--------------------------------------------------------------------------------
/gpu/data/karate-data.csv:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No CSV content was captured.]
--------------------------------------------------------------------------------
/gpu/index.ipynb:
--------------------------------------------------------------------------------
[HTML stub of the Jupyter web UI: "Jupyter Notebook requires JavaScript. Please enable it to proceed." No notebook content was captured.]
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/models.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "scenarios": {
4 | "drop": [0.2],
5 | "batch-n": [128, 256],
6 | "lr": [0.0001],
7 | "n-epochs":[10],
8 | "model": [ "simple_lstnet_mode", "lstnet_mode", "rnn_model", "cnn_model", "fc_model"],
9 | "data_src": [
10 | "electricity.txt",
11 | "electricity.diff.txt",
12 | ],
13 | "win": [128, 64,],
14 | "seasonal_period": [24]
15 | },
16 |
17 | "drop": "$",
18 | "win": "$",
19 | "batch-n": "$",
20 | "lr": "$",
21 | "n-epochs": "$",
22 | "data-file": "/data/${data_src}",
23 | "save-dir": "${model}_${lr}_${batch-n}_${win}_${drop}/${data_src}"
24 | }
25 |
26 | ]
27 |
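A sketch of how the `scenarios` grid above could expand into one run configuration per combination. The actual runner (`/demo/fits.py`, referenced in run.json) is not included in this snapshot, so the `$` substitution semantics here are an assumption:

```python
import itertools

# Subset of the grid from models.json
scenarios = {
    "drop": [0.2],
    "batch-n": [128, 256],
    "lr": [0.0001],
    "model": ["rnn_model", "cnn_model", "fc_model"],
    "data_src": ["electricity.txt", "electricity.diff.txt"],
    "win": [128, 64],
}

keys = list(scenarios)
for values in itertools.product(*(scenarios[k] for k in keys)):
    cfg = dict(zip(keys, values))  # "$" -> value from the current combination (assumed)
    cfg["data-file"] = "/data/{}".format(cfg["data_src"])
    cfg["save-dir"] = "{}_{}_{}_{}_{}/{}".format(
        cfg["model"], cfg["lr"], cfg["batch-n"],
        cfg["win"], cfg["drop"], cfg["data_src"])
    print(cfg["save-dir"])
```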
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/perf.py:
--------------------------------------------------------------------------------
1 |
2 | import numpy as np
3 | import mxnet as mx
4 | import pdb
5 |
6 | np.seterr(divide='ignore', invalid='ignore')
7 |
8 | ## for saving
9 | import pandas as pd
10 | import os
11 |
12 | def COR(label, pred):
13 | label_demeaned = label - label.mean(0)
14 | label_sumsquares = np.sum(np.square(label_demeaned), 0)
15 |
16 | pred_demeaned = pred - pred.mean(0)
17 | pred_sumsquares = np.sum(np.square(pred_demeaned), 0)
18 |
19 | cor_coef = np.diagonal(np.dot(label_demeaned.T, pred_demeaned)) / \
20 | np.sqrt(label_sumsquares * pred_sumsquares)
21 |
22 | return np.nanmean(cor_coef)
23 |
24 |
25 | def write_eval(pred, label, save_dir, mode, epoch):
26 | if not os.path.exists(save_dir):
27 | os.makedirs(save_dir)
28 |
29 | pred_df = pd.DataFrame(pred)
30 | label_df = pd.DataFrame(label)
31 | if epoch < 10:
32 | pred_df.to_csv( os.path.join(save_dir, '%s_pred_0%d.csv' % (mode, epoch)))
33 | label_df.to_csv(os.path.join(save_dir, '%s_label_0%d.csv' % (mode, epoch)))
34 | else:
35 | pred_df.to_csv( os.path.join(save_dir, '%s_pred_%d.csv' % (mode, epoch)))
36 | label_df.to_csv(os.path.join(save_dir, '%s_label_%d.csv' % (mode, epoch)))
37 |
38 | return { 'COR': COR(label,pred) }
39 |
40 |
41 |
42 |
43 |
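A quick sanity check of `COR` with made-up data: it computes the Pearson correlation of each column of `pred` against the matching column of `label` and averages the coefficients. The import path is hypothetical and assumes perf.py is importable:

```python
import numpy as np
from perf import COR  # assumes perf.py above is on the import path

label = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
pred = np.array([[2.0, 3.0], [4.0, 2.0], [6.0, 1.0]])  # col 0: r = +1, col 1: r = -1
print(COR(label, pred))  # -> 0.0, the mean of +1 and -1
```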
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/DeepLearning/Electricity/run.json:
--------------------------------------------------------------------------------
1 | {
2 | "script": "/demo/fits.py",
3 | "models": "/demo/models.json",
4 | }
5 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/3_Trees_for_Classification_and_Prediction_STUDENT.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import matplotlib.pyplot as plt\n",
10 | "plt.rcParams['figure.figsize'] = [10, 10]"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "import cesium\n",
20 | "import xgboost as xgb\n",
21 | "import numpy as np\n",
22 | "import matplotlib.pyplot as plt\n",
23 | "import pandas as pd\n",
24 | "import time\n",
25 | "\n",
26 | "from cesium import datasets\n",
27 | "from cesium import featurize as ft\n",
28 | "\n",
29 | "import scipy\n",
30 | "from scipy.stats import pearsonr, spearmanr\n",
31 | "from scipy.stats import skew\n",
32 | "\n",
33 | "import sklearn\n",
34 | "from sklearn.ensemble import RandomForestClassifier\n",
35 | "from sklearn.metrics import accuracy_score\n",
36 | "from sklearn.model_selection import train_test_split"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "print(cesium.__version__)\n",
46 | "print(xgb.__version__)\n",
47 | "print(scipy.__version__)\n",
48 | "print(sklearn.__version__)"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "## Load data and generate some features of interest"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": null,
61 | "metadata": {},
62 | "outputs": [],
63 | "source": [
64 | "eeg = datasets.fetch_andrzejak()"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "metadata": {},
71 | "outputs": [],
72 | "source": [
73 | "type(eeg)"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": null,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "eeg.keys()"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "### Visually inspect"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "plt.subplot(3, 1, 1)\n",
99 | "plt.plot(eeg[\"measurements\"][0])\n",
100 | "plt.legend(eeg['classes'][0])\n",
101 | "plt.subplot(3, 1, 2)\n",
102 | "plt.plot(eeg[\"measurements\"][300])\n",
103 | "plt.legend(eeg['classes'][300])\n",
104 | "plt.subplot(3, 1, 3)\n",
105 | "plt.plot(eeg[\"measurements\"][450])\n",
106 | "plt.legend(eeg['classes'][450])"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": null,
112 | "metadata": {},
113 | "outputs": [],
114 | "source": [
115 | "type(eeg[\"measurements\"][0])"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": null,
121 | "metadata": {},
122 | "outputs": [],
123 | "source": [
124 | "type(eeg)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "eeg.keys()"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "type(eeg['measurements'])"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "execution_count": null,
148 | "metadata": {},
149 | "outputs": [],
150 | "source": [
151 | "len(eeg['measurements'])"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "eeg['measurements'][0].shape"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "## Generate the features"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {},
174 | "outputs": [],
175 | "source": [
176 | "# from cesium import featurize as ft\n",
177 | "# features_to_use = [\"amplitude\",\n",
178 | "# \"percent_beyond_1_std\",\n",
179 | "# \"percent_close_to_median\",\n",
180 | "# \"skew\",\n",
181 | "# \"max_slope\"]\n",
182 | "# fset_cesium = ft.featurize_time_series(times=eeg[\"times\"],\n",
183 | "# values=eeg[\"measurements\"],\n",
184 | "# errors=None,\n",
185 | "# features_to_use=features_to_use,\n",
186 | "# scheduler = None)"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": null,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "fset_cesium = pd.read_csv(\"data/full_eeg_data_features.csv\", header = [0, 1])"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {
202 | "scrolled": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "fset_cesium.head()"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "metadata": {},
213 | "outputs": [],
214 | "source": [
215 | "# fset_cesium.to_csv(\"full_eeg_data_features.csv\")"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": [
224 | "fset_cesium.shape"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "## Exercise: validate/calculate these features by hand\n",
232 | "#### look up feature definitions here: http://cesium-ml.org/docs/feature_table.html\n",
233 | "confirm the values by hand coding these features for the first EEG measurement\n",
234 | "(that is eeg[\"measurements\"][0])"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": [
243 | "ex = eeg[\"measurements\"][0]\n",
244 | "ex_mean = np.mean(ex)\n",
245 | "ex_std = np.std(ex)"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": []
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {},
258 | "source": [
259 | "## Prepare data for training"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": null,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "X_train, X_test, y_train, y_test = train_test_split(\n",
269 | " fset_cesium.iloc[:, 1:6].values, eeg[\"classes\"], random_state=21)"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "## Try a random forest with these features"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": null,
282 | "metadata": {},
283 | "outputs": [],
284 | "source": [
285 | "clf = RandomForestClassifier(n_estimators=10, max_depth=3,\n",
286 | " random_state=21)"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": null,
292 | "metadata": {},
293 | "outputs": [],
294 | "source": [
295 | "clf.fit(X_train, y_train)"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": null,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "clf.score(X_train, y_train)"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "clf.score(X_test, y_test)"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "np.unique(y_test, return_counts=True)"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "y_test"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "y_test.shape"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "y_train.shape"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "## Try XGBoost with these features"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "model = xgb.XGBClassifier(n_estimators=10, max_depth=3,\n",
366 | " random_state=21)\n",
367 | "model.fit(X_train, y_train)"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": null,
373 | "metadata": {},
374 | "outputs": [],
375 | "source": [
376 | "model.score(X_test, y_test)"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": null,
382 | "metadata": {},
383 | "outputs": [],
384 | "source": [
385 | "model.score(X_train, y_train)"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {},
392 | "outputs": [],
393 | "source": [
394 | "xgb.plot_importance(model)"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "metadata": {},
400 | "source": [
401 | "## Time Series Forecasting with Decision Trees"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "### Explore the data"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": null,
414 | "metadata": {},
415 | "outputs": [],
416 | "source": [
417 | "ap = pd.read_csv(\"data/AirPassengers.csv\", parse_dates=[0])"
418 | ]
419 | },
420 | {
421 | "cell_type": "code",
422 | "execution_count": null,
423 | "metadata": {},
424 | "outputs": [],
425 | "source": [
426 | "ap.head()"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": null,
432 | "metadata": {},
433 | "outputs": [],
434 | "source": [
435 | "ap.set_index('Month', inplace=True)"
436 | ]
437 | },
438 | {
439 | "cell_type": "code",
440 | "execution_count": null,
441 | "metadata": {},
442 | "outputs": [],
443 | "source": [
444 | "ap.head()"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": null,
450 | "metadata": {},
451 | "outputs": [],
452 | "source": [
453 | "plt.plot(ap)"
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": [
462 | "plt.plot(np.diff(np.log(ap.values[:, 0])))"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "ts = np.diff(np.log(ap.values[:, 0]))"
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "## Exercise: now that we have 1 time series, how can we convert it to many samples?"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 5,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": [
487 | "NSTEPS = 12"
488 | ]
489 | },
490 | {
491 | "cell_type": "code",
492 | "execution_count": null,
493 | "metadata": {},
494 | "outputs": [],
495 | "source": []
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "## Exercise: now that we have the time series broken down into a set of samples, how to featurize?"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": null,
507 | "metadata": {},
508 | "outputs": [],
509 | "source": [
510 | "measures = [vals[i][0:(NSTEPS - 1)] for i in range(vals.shape[0])]"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {},
517 | "outputs": [],
518 | "source": [
519 | "times = [[j for j in range(NSTEPS - 1)] for i in range(vals.shape[0])]"
520 | ]
521 | },
522 | {
523 | "cell_type": "code",
524 | "execution_count": null,
525 | "metadata": {},
526 | "outputs": [],
527 | "source": []
528 | },
529 | {
530 | "cell_type": "markdown",
531 | "metadata": {},
532 | "source": [
533 | "## Exercise: can you fit an XGBRegressor to this problem? Let's use the first 100 'time series' as the training data"
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": null,
539 | "metadata": {},
540 | "outputs": [],
541 | "source": []
542 | },
543 | {
544 | "cell_type": "markdown",
545 | "metadata": {},
546 | "source": [
547 | "### RMSE can be hard to digest .... Use other assessments to determine how well the model performs"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": null,
553 | "metadata": {},
554 | "outputs": [],
555 | "source": []
556 | },
557 | {
558 | "cell_type": "markdown",
559 | "metadata": {},
560 | "source": [
561 | "### What went wrong? Let's revisit the feature set"
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": null,
567 | "metadata": {},
568 | "outputs": [],
569 | "source": [
570 | "fset_ap.head()"
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "execution_count": null,
576 | "metadata": {},
577 | "outputs": [],
578 | "source": [
579 | "plt.plot(vals[0])\n",
580 | "plt.plot(vals[1])\n",
581 | "plt.plot(vals[2])"
582 | ]
583 | },
584 | {
585 | "cell_type": "markdown",
586 | "metadata": {},
587 | "source": [
588 | "## We need to find a way to generate features that encode positional information"
589 | ]
590 | },
591 | {
592 | "cell_type": "markdown",
593 | "metadata": {},
594 | "source": [
595 | "### now we will generate our own features"
596 | ]
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": null,
601 | "metadata": {},
602 | "outputs": [],
603 | "source": [
604 | "vals.shape"
605 | ]
606 | },
607 | {
608 | "cell_type": "code",
609 | "execution_count": null,
610 | "metadata": {},
611 | "outputs": [],
612 | "source": [
613 | "feats = np.zeros( (vals.shape[0], 6), dtype = np.float32)\n",
614 | "for i in range(vals.shape[0]):\n",
615 | " feats[i, 0] = np.where(vals[i] == np.max(vals[i]))[0][0]\n",
616 | " feats[i, 1] = np.where(vals[i] == np.min(vals[i]))[0][0]\n",
617 | " feats[i, 2] = feats[i, 0] - feats[i, 1]\n",
618 | " feats[i, 3] = np.max(vals[i][-3:])\n",
619 | " feats[i, 4] = vals[i][-1] - vals[i][-2]\n",
620 | " feats[i, 5] = vals[i][-1] - vals[i][-3]"
621 | ]
622 | },
623 | {
624 | "cell_type": "code",
625 | "execution_count": null,
626 | "metadata": {},
627 | "outputs": [],
628 | "source": [
629 | "feats[0:3]"
630 | ]
631 | },
632 | {
633 | "cell_type": "markdown",
634 | "metadata": {},
635 | "source": [
636 | "### How do these look compared to the first set of features?"
637 | ]
638 | },
639 | {
640 | "cell_type": "code",
641 | "execution_count": null,
642 | "metadata": {},
643 | "outputs": [],
644 | "source": [
645 | "pd.DataFrame(feats[0:3])"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": null,
651 | "metadata": {},
652 | "outputs": [],
653 | "source": [
654 | "X_train, y_train = feats[:100, :], outcomes[:100]\n",
655 | "X_test, y_test = feats[100:, :], outcomes[100:]"
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": null,
661 | "metadata": {},
662 | "outputs": [],
663 | "source": [
664 | "model = xgb.XGBRegressor(n_estimators=20, max_depth=2,\n",
665 | " random_state=21)\n",
666 | "eval_set = [(X_test, y_test)]\n",
667 | "model.fit(X_train, y_train, eval_metric=\"rmse\", eval_set=eval_set, verbose=True)"
668 | ]
669 | },
670 | {
671 | "cell_type": "code",
672 | "execution_count": null,
673 | "metadata": {},
674 | "outputs": [],
675 | "source": [
676 | "plt.scatter(model.predict(X_test), y_test)"
677 | ]
678 | },
679 | {
680 | "cell_type": "code",
681 | "execution_count": null,
682 | "metadata": {},
683 | "outputs": [],
684 | "source": [
685 | "print(pearsonr(model.predict(X_test), y_test))\n",
686 | "print(spearmanr(model.predict(X_test), y_test))"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": null,
692 | "metadata": {},
693 | "outputs": [],
694 | "source": [
695 | "plt.scatter(model.predict(X_train), y_train)"
696 | ]
697 | },
698 | {
699 | "cell_type": "code",
700 | "execution_count": null,
701 | "metadata": {},
702 | "outputs": [],
703 | "source": [
704 | "print(pearsonr(model.predict(X_train), y_train))\n",
705 | "print(spearmanr(model.predict(X_train), y_train))"
706 | ]
707 | },
708 | {
709 | "cell_type": "code",
710 | "execution_count": null,
711 | "metadata": {},
712 | "outputs": [],
713 | "source": []
714 | }
715 | ],
716 | "metadata": {
717 | "kernelspec": {
718 | "display_name": "Python 3",
719 | "language": "python",
720 | "name": "python3"
721 | },
722 | "language_info": {
723 | "codemirror_mode": {
724 | "name": "ipython",
725 | "version": 3
726 | },
727 | "file_extension": ".py",
728 | "mimetype": "text/x-python",
729 | "name": "python",
730 | "nbconvert_exporter": "python",
731 | "pygments_lexer": "ipython3",
732 | "version": "3.6.8"
733 | }
734 | },
735 | "nbformat": 4,
736 | "nbformat_minor": 2
737 | }
738 |
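For the "convert 1 time series to many samples" exercise in the notebook above, one possible windowing approach (a sketch, not the instructor's solution; it assumes `ts` and `NSTEPS` from the notebook and produces the `vals` array that the later cells reference):

```python
import numpy as np

# Slide a length-NSTEPS window over the differenced log series: in each row,
# the first NSTEPS - 1 points are the history and the last point is the target.
vals = np.array([ts[i:i + NSTEPS] for i in range(len(ts) - NSTEPS + 1)])
measures = vals[:, :NSTEPS - 1]  # inputs, matching the notebook's `measures`
outcomes = vals[:, -1]           # values to forecast
print(vals.shape)                # (132, 12) for the 143-point differenced series
```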
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/50words.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/50words.csv
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/AirPassengers.csv:
--------------------------------------------------------------------------------
1 | Month,#Passengers
2 | 1949-01,112
3 | 1949-02,118
4 | 1949-03,132
5 | 1949-04,129
6 | 1949-05,121
7 | 1949-06,135
8 | 1949-07,148
9 | 1949-08,148
10 | 1949-09,136
11 | 1949-10,119
12 | 1949-11,104
13 | 1949-12,118
14 | 1950-01,115
15 | 1950-02,126
16 | 1950-03,141
17 | 1950-04,135
18 | 1950-05,125
19 | 1950-06,149
20 | 1950-07,170
21 | 1950-08,170
22 | 1950-09,158
23 | 1950-10,133
24 | 1950-11,114
25 | 1950-12,140
26 | 1951-01,145
27 | 1951-02,150
28 | 1951-03,178
29 | 1951-04,163
30 | 1951-05,172
31 | 1951-06,178
32 | 1951-07,199
33 | 1951-08,199
34 | 1951-09,184
35 | 1951-10,162
36 | 1951-11,146
37 | 1951-12,166
38 | 1952-01,171
39 | 1952-02,180
40 | 1952-03,193
41 | 1952-04,181
42 | 1952-05,183
43 | 1952-06,218
44 | 1952-07,230
45 | 1952-08,242
46 | 1952-09,209
47 | 1952-10,191
48 | 1952-11,172
49 | 1952-12,194
50 | 1953-01,196
51 | 1953-02,196
52 | 1953-03,236
53 | 1953-04,235
54 | 1953-05,229
55 | 1953-06,243
56 | 1953-07,264
57 | 1953-08,272
58 | 1953-09,237
59 | 1953-10,211
60 | 1953-11,180
61 | 1953-12,201
62 | 1954-01,204
63 | 1954-02,188
64 | 1954-03,235
65 | 1954-04,227
66 | 1954-05,234
67 | 1954-06,264
68 | 1954-07,302
69 | 1954-08,293
70 | 1954-09,259
71 | 1954-10,229
72 | 1954-11,203
73 | 1954-12,229
74 | 1955-01,242
75 | 1955-02,233
76 | 1955-03,267
77 | 1955-04,269
78 | 1955-05,270
79 | 1955-06,315
80 | 1955-07,364
81 | 1955-08,347
82 | 1955-09,312
83 | 1955-10,274
84 | 1955-11,237
85 | 1955-12,278
86 | 1956-01,284
87 | 1956-02,277
88 | 1956-03,317
89 | 1956-04,313
90 | 1956-05,318
91 | 1956-06,374
92 | 1956-07,413
93 | 1956-08,405
94 | 1956-09,355
95 | 1956-10,306
96 | 1956-11,271
97 | 1956-12,306
98 | 1957-01,315
99 | 1957-02,301
100 | 1957-03,356
101 | 1957-04,348
102 | 1957-05,355
103 | 1957-06,422
104 | 1957-07,465
105 | 1957-08,467
106 | 1957-09,404
107 | 1957-10,347
108 | 1957-11,305
109 | 1957-12,336
110 | 1958-01,340
111 | 1958-02,318
112 | 1958-03,362
113 | 1958-04,348
114 | 1958-05,363
115 | 1958-06,435
116 | 1958-07,491
117 | 1958-08,505
118 | 1958-09,404
119 | 1958-10,359
120 | 1958-11,310
121 | 1958-12,337
122 | 1959-01,360
123 | 1959-02,342
124 | 1959-03,406
125 | 1959-04,396
126 | 1959-05,420
127 | 1959-06,472
128 | 1959-07,548
129 | 1959-08,559
130 | 1959-09,463
131 | 1959-10,407
132 | 1959-11,362
133 | 1959-12,405
134 | 1960-01,417
135 | 1960-02,391
136 | 1960-03,419
137 | 1960-04,461
138 | 1960-05,472
139 | 1960-06,535
140 | 1960-07,622
141 | 1960-08,606
142 | 1960-09,508
143 | 1960-10,461
144 | 1960-11,390
145 | 1960-12,432
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/pairwise_word_distances.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/data/pairwise_word_distances.npy
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance
4 | ~~~~~~~~~~~~
5 |
6 | Time series distance methods.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import logging
14 |
15 |
16 | logger = logging.getLogger("be.kuleuven.dtai.distance")
17 |
18 |
19 | from . import dtw
20 | try:
21 | from . import dtw_c
22 | except ImportError:
23 | # Try to compile automatically
24 | # try:
25 | # import numpy as np
26 | # import pyximport
27 | # pyximport.install(setup_args={'include_dirs': np.get_include()})
28 | # from . import dtw_c
29 | # except ImportError:
30 | # logger.warning("\nDTW C variant not available.\n\n" +
31 | # "If you want to use the C libraries (not required, depends on cython), " +
32 | # "then run `cd {};python3 setup.py build_ext --inplace`.".format(dtaidistance_dir))
33 | dtw_c = None
34 |
35 | __version__ = "1.2.2"
36 | __author__ = "Wannes Meert"
37 | __copyright__ = "Copyright 2017-2019 KU Leuven, DTAI Research Group"
38 | __license__ = "Apache License, Version 2.0"
39 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/alignment.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.alignment
4 | ~~~~~~~~~~~~~~~~~~~~~~
5 |
6 | Sequence alignment (e.g. Needleman–Wunsch).
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import logging
14 | import math
15 | import numpy as np
16 |
17 | from .dp import dp
18 |
19 |
20 | def needleman_wunsch(s1, s2, window=None, max_dist=None,
21 | max_step=None, max_length_diff=None, psi=None):
22 | """Needleman-Wunsch global sequence alignment.
23 |
24 | Example:
25 |
26 | >> s1 = "GATTACA"
27 | >> s2 = "GCATGCU"
28 | >> value, matrix = alignment.needleman_wunsch(s1, s2)
29 | >> algn, s1a, s2a = alignment.best_alignment(matrix, s1, s2)
30 | >> print(matrix)
31 | [[-0., -1., -2., -3., -4., -5., -6., -7.],
32 | [-1., 1., -0., -1., -2., -3., -4., -5.],
33 | [-2., -0., -0., 1., -0., -1., -2., -3.],
34 | [-3., -1., -1., -0., 2., 1., -0., -1.],
35 | [-4., -2., -2., -1., 1., 1., -0., -1.],
36 | [-5., -3., -3., -1., -0., -0., -0., -1.],
37 | [-6., -4., -2., -2., -1., -1., 1., -0.],
38 | [-7., -5., -3., -1., -2., -2., -0., -0.]]
39 | >> print(''.join(s1a), ''.join(s2a))
40 | 'G-ATTACA', 'GCAT-GCU'
41 |
42 | """
43 | value, matrix = dp(s1, s2,
44 | _needleman_wunsch_fn, border=_needleman_wunsch_border,
45 | penalty=0, window=window, max_dist=max_dist,
46 | max_step=max_step, max_length_diff=max_length_diff, psi=psi)
47 | matrix = -matrix
48 | return value, matrix
49 |
50 |
51 | def _needleman_wunsch_fn(v1, v2):
52 | """Needleman-Wunsch
53 |
54 | Match: +1 -> -1
55 | Mismatch or Indel: −1 -> +1
56 |
57 | The values are reversed because our general dynamic programming algorithm
58 | selects the minimal value instead of the maximal value.
59 | """
60 | d_indel = 1 # gap / indel
61 | if v1 == v2:
62 | d = -1 # match
63 | else:
64 | d = 1 # mismatch
65 | return d, d_indel
66 |
67 |
68 | def _needleman_wunsch_border(ri, ci):
69 | if ri == 0:
70 | return ci
71 | if ci == 0:
72 | return ri
73 | return 0
74 |
75 |
76 | def best_alignment(paths, s1=None, s2=None, gap="-", order=None):
77 | """Compute the optimal alignment from the nxm paths matrix.
78 |
79 | :param paths: Paths matrix (e.g. from needleman_wunsch)
80 | :param s1: First sequence, if given the aligned sequence will be created
81 | :param s2: Second sequence, if given the aligned sequence will be created
82 | :param gap: Gap symbol that is inserted into s1 and s2 to align the sequences
83 | :param order: Array with order of comparisons (there might be multiple optimal paths)
84 | The default order is 0,1,2: (-1,-1), (-1,-0), (-0,-1)
85 | For example, 1,0,2 is (-1,-0), (-1,-1), (-0,-1)
86 | There might be more optimal paths than covered by these orderings. For example,
87 | when using combinations of these orderings in different parts of the matrix.
88 | """
89 | i, j = int(paths.shape[0] - 1), int(paths.shape[1] - 1)
90 | p = []
91 | if paths[i, j] != -1:
92 | p.append((i - 1, j - 1))
93 | ops = [(-1,-1), (-1,-0), (-0,-1)]
94 | if order is None:
95 | order = [0, 1, 2]
96 | while i > 0 and j > 0:
97 | prev_vals = [paths[i + ops[orderi][0], j + ops[orderi][1]] for orderi in order]
98 | # c = np.argmax([paths[i - 1, j - 1], paths[i - 1, j], paths[i, j - 1]])
99 | c = int(np.argmax(prev_vals))
100 | opi, opj = ops[order[c]]
101 | i, j = i + opi, j + opj
102 | if paths[i, j] != -1:
103 | p.append((i - 1, j - 1))
104 | p.pop()
105 | p.reverse()
106 | if s1 is not None:
107 | s1a = []
108 | s1ip = -1
109 | for s1i, _ in p:
110 | if s1i == s1ip + 1:
111 | s1a.append(s1[s1i])
112 | else:
113 | s1a.append(gap)
114 | s1ip = s1i
115 | else:
116 | s1a = None
117 | if s2 is not None:
118 | s2a = []
119 | s2ip = -1
120 | for _, s2i in p:
121 | if s2i == s2ip + 1:
122 | s2a.append(s2[s2i])
123 | else:
124 | s2a.append(gap)
125 | s2ip = s2i
126 | else:
127 | s2a = None
128 |
129 | return p, s1a, s2a
130 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/clustering.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.clustering
4 | ~~~~~~~~~~~~~~~~~~~~~~~
5 |
6 | Time series clustering.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import logging
14 | from pathlib import Path
15 | from collections import deque
16 | import numpy as np
17 |
18 | from .util import SeriesContainer
19 |
20 | try:
21 | from tqdm import tqdm
22 | except ImportError:
23 | tqdm = None
24 |
25 |
26 | logger = logging.getLogger("be.kuleuven.dtai.distance")
27 |
28 |
29 | class Hierarchical:
30 | """Hierarchical clustering.
31 |
32 | Note: This method first computes the entire distance matrix. This is not ideal for extremely large
33 | data sets.
34 |
35 | :param dists_fun: Function to compute pairwise distance matrix between set of series.
36 | :param dists_options: Arguments to pass to dists_fun.
37 | :param max_dist: Do not merge or cluster series that are further apart than this.
38 | :param merge_hook: Function that is called when two series are clustered.
39 | The function definition is `def merge_hook(from_idx, to_idx, distance)`, where idx is the index of the series.
40 | :param order_hook: Function that is called to decide on the next idx out of all shortest distances
41 | :param show_progress: Use a tqdm progress bar
42 | """
43 |
44 | def __init__(self, dists_fun, dists_options, max_dist=np.inf,
45 | merge_hook=None, order_hook=None, show_progress=True):
46 | self.dists_fun = dists_fun
47 | self.dists_options = dists_options
48 | self.max_dist = max_dist
49 | self.merge_hook = merge_hook
50 | self.order_hook = order_hook
51 | self.show_progress = show_progress
52 |
53 | def fit(self, series):
54 | """Merge sequences.
55 |
56 | :param series: Iterator over series.
57 | :return: Dictionary with as keys the prototype indices and as values all the indices of the series in
58 | that cluster.
59 | """
60 | nb_series = len(series)
61 | cluster_idx = dict()
62 | dists = self.dists_fun(series, **self.dists_options)
63 | min_value = np.min(dists)
64 | min_idxs = np.argwhere(dists == min_value)
65 | if self.order_hook:
66 | min_idx = self.order_hook(min_idxs)
67 | else:
68 | min_idx = min_idxs[0, :]
69 | deleted = set()
70 | cnt_merge = 0
71 | logger.debug('Merging patterns')
72 | if self.show_progress and tqdm:
73 | pbar = tqdm(total=dists.shape[0])
74 | else:
75 | pbar = None
76 | # Hierarchical clustering (distance to prototype)
77 | while min_value <= self.max_dist:
78 | cnt_merge += 1
79 | i1, i2 = int(min_idx[0]), int(min_idx[1])
80 | if self.merge_hook:
81 | result = self.merge_hook(i2, i1, min_value)
82 | if result:
83 | i1, i2 = result
84 | logger.debug("Merge {} <- {} ({:.3f})".format(i1, i2, min_value))
85 | if i1 not in cluster_idx:
86 | cluster_idx[i1] = {i1}
87 | if i2 in cluster_idx:
88 | cluster_idx[i1].update(cluster_idx[i2])
89 | del cluster_idx[i2]
90 | else:
91 | cluster_idx[i1].add(i2)
92 | # if recompute:
93 | # for r in range(i1):
94 | # if r not in deleted and abs(len(cur_seqs[r]) - len(cur_seqs[i1])) <= max_length_diff:
95 | # dists[r, i1] = self.dist(cur_seqs[r], cur_seqs[i1], **dist_opts)
96 | # for c in range(i1+1, len(cur_seqs)):
97 | # if c not in deleted and abs(len(cur_seqs[i1]) - len(cur_seqs[c])) <= max_length_diff:
98 | # dists[i1, c] = self.dist(cur_seqs[i1], cur_seqs[c], **dist_opts)
99 | for r in range(i2):
100 | dists[r, i2] = np.inf
101 | for c in range(i2 + 1, len(series)):
102 | dists[i2, c] = np.inf
103 | deleted.add(i2)
104 | if len(deleted) == nb_series - 1:
105 | break
106 | if pbar:
107 | pbar.update(1)
108 | # min_idx = np.unravel_index(np.argmin(dists), dists.shape)
109 | # min_value = dists[min_idx]
110 | min_value = np.min(dists)
111 | # if np.isinf(min_value):
112 | # break
113 | min_idxs = np.argwhere(dists == min_value)
114 | if self.order_hook:
115 | min_idx = self.order_hook(min_idxs)
116 | else:
117 | min_idx = min_idxs[0, :]
118 | if pbar:
119 | pbar.update(dists.shape[0] - cnt_merge)
120 |
121 | prototypes = []
122 | for i in range(len(series)):
123 | if i not in deleted:
124 | prototypes.append(i)
125 | if i not in cluster_idx:
126 | cluster_idx[i] = {i}  # note: set(i) on an int index would raise TypeError
127 | return cluster_idx
128 |
129 |
130 | class BaseTree:
131 | """Base Tree abstract class.
132 |
133 | Returns a datastructure compatible with the Scipy clustering methods:
134 |
135 | https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
136 |
137 | A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are
138 | combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original
139 | observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value
140 | Z[i, 3] represents the number of original observations in the newly formed cluster.
141 | """
142 |
143 | def __init__(self, **kwargs):
144 | self.linkage = None
145 | self.series = None
146 | self._series_y = None
147 | self.ts_height_factor = None
148 |
149 | @property
150 | def maxnode(self):
151 | return len(self.series) - 1 + len(self.linkage)
152 |
153 | def get_linkage(self, node):
154 | if node < len(self.series):
155 | return None
156 | idx = int(node - len(self.series))
157 | return self.linkage[idx]
158 |
159 | def plot(self, filename=None, axes=None, ts_height=10,
160 | bottom_margin=2, top_margin=2, ts_left_margin=0, ts_sample_length=1,
161 | tr_label_margin=3, tr_left_margin=2, ts_label_margin=0,
162 | show_ts_label=None, show_tr_label=None,
163 | cmap='viridis_r', ts_color=None):
164 | """Plot the hierarchy and time series.
165 |
166 | :param filename: If a filename is passed, the image is written to this file.
167 | :param axes: If a axes array is passed the image is added to this figure.
168 | Expects axes[0] and axes[1] to be present.
169 | :param ts_height: Height of a time series
170 | :param bottom_margin: Margin on bottom
171 | :param top_margin: Margin on top
172 | :param ts_left_margin: Margin on left of time series image
173 | :param ts_sample_length: Space between two points in the time series
174 | :param tr_label_margin: Margin between tree split and label
175 | :param tr_left_margin: Left margin for tree
176 | :param ts_label_margin: Margin between start of series and label
177 | :param show_ts_label: Show label indices. Boolean, callable or subscriptable object.
178 | If it is a callable object, the index of the time series will be given and the
179 | return string will be printed.
180 | :param show_tr_label: Show tree distances. Boolean, callable or subscriptable object.
181 | If it is a callable object, the index of the time series will be given and the
182 | return string will be printed.
183 | :param cmap: Matplotlib colormap name
184 | :param ts_color: function that takes the index and returns a color
185 | (compatible with the matplotlib.color color argument)
186 | """
187 | # print('linkage')
188 | # for l in self.linkage:
189 | # print(l)
190 | from matplotlib import pyplot as plt
191 | from matplotlib.lines import Line2D
192 | import matplotlib.colors as colors
193 | import matplotlib.cm as cmx
194 |
195 | if show_ts_label is True:
196 | show_ts_label = lambda idx: str(int(idx))
197 | elif show_ts_label is False or show_ts_label is None:
198 | show_ts_label = lambda idx: ""
199 | elif callable(show_ts_label):
200 | pass
201 | elif hasattr(show_ts_label, "__getitem__"):
202 | show_ts_label_prev = show_ts_label
203 | show_ts_label = lambda idx: show_ts_label_prev[idx]
204 | else:
205 | raise AttributeError("Unknown type for show_ts_label, expecting boolean, subscriptable or callable, "
206 | "got {}".format(type(show_ts_label)))
207 | if show_tr_label is True:
208 | show_tr_label = lambda dist: "{:.2f}".format(dist)
209 | elif show_tr_label is False or show_tr_label is None:
210 | show_tr_label = lambda dist: ""
211 | elif callable(show_tr_label):
212 | pass
213 | elif hasattr(show_tr_label, "__getitem__"):
214 | show_tr_label_prev = show_tr_label
215 | show_tr_label = lambda idx: show_tr_label_prev[idx]
216 | else:
217 | raise AttributeError("Unknown type for show_ts_label, expecting boolean, subscriptable or callable, "
218 | "got {}".format(type(show_ts_label)))
219 |
220 | self._series_y = [0] * len(self.series)
221 |
222 | max_dist = 0
223 | for _, _, d, _ in self.linkage:
224 | if not np.isinf(d):
225 | max_dist = max(max_dist, d)
226 |
227 | node_props = dict()
228 |
229 | max_y = self.series.get_max_y()
230 | self.ts_height_factor = (ts_height / max_y) * 0.9
231 |
232 | def count(node, height):
233 | # print('count({},{})'.format(node, height))
234 | maxheight = None
235 | maxcumdist = None
236 | curdepth = None
237 | cnt = 0
238 | left_cnt = None
239 | right_cnt = None
240 | if node < len(self.series):
241 | # Leaf
242 | cnt += 1
243 | maxheight = height
244 | maxcumdist = 0
245 | curdepth = 0
246 | left_cnt = 0
247 | right_cnt = 0
248 | else:
249 | # Inner node
250 | child_left, child_right, dist, cnt2 = self.get_linkage(int(node))
251 | child_left, child_right, cnt2 = int(child_left), int(child_right), int(cnt2)
252 | if child_left == child_right:
253 | raise Exception("Row in linkage contains same node as left and right child: {}-{}".
254 | format(child_left, child_right))
255 | if np.isinf(dist):
256 | dist = 1.5*max_dist
257 | # Left
258 | nc, nmh, ncd, nmd = count(child_left, height + 1)
259 | cnt += nc
260 | maxheight = nmh
261 | maxcumdist = nmd + dist
262 | curdepth = ncd + 1
263 | left_cnt = nc
264 | # Right
265 | nc, nmh, ncd, nmd = count(child_right, height + 1)
266 | cnt += nc
267 | maxheight = max(maxheight, nmh)
268 | maxcumdist = max(maxcumdist, nmd + dist)
269 | curdepth = max(curdepth, ncd + 1)
270 | right_cnt = nc
271 | # if cnt != cnt2:
272 | # raise Exception("Count in linkage not correct")
273 | # print('c', node, c)
274 | node_props[int(node)] = (cnt, curdepth, left_cnt, right_cnt, maxcumdist)
275 | # print('count({},{}) = {}, {}, {}, {}'.format(node, height, cnt, maxheight, curdepth, maxcumdist))
276 | return cnt, maxheight, curdepth, maxcumdist
277 |
278 | cnt, maxheight, curdepth, maxcumdist = count(self.maxnode, 0)
279 | # for node, props in node_props.items():
280 | # print("{:<3}: {}".format(node, props))
281 |
282 | if axes is None:
283 | fig, ax = plt.subplots(nrows=1, ncols=2, frameon=False)
284 | else:
285 | fig, ax = None, axes
286 | ax[0].set_axis_off()
287 | # ax[0].set_xlim(left=0, right=curdept)
288 | ax[0].set_xlim(left=0, right=tr_left_margin + maxcumdist + 0.05)
289 | ax[0].set_ylim(bottom=0, top=bottom_margin + ts_height * len(self.series) + top_margin)
290 | # ax[0].plot([0,1],[1,2])
291 | # ax[0].add_line(Line2D((0,1),(2,2), lw=2, color='black', axes=ax[0]))
292 |
293 | ax[1].set_axis_off()
294 | ax[1].set_xlim(left=0, right=ts_left_margin + ts_sample_length * len(self.series[0]))
295 | ax[1].set_ylim(bottom=0, top=bottom_margin + ts_height * len(self.series) + top_margin)
296 |
297 | if type(cmap) == str:
298 | cmap = plt.get_cmap(cmap)
299 | else:
300 | pass
301 | line_colors = cmx.ScalarMappable(norm=colors.Normalize(vmin=0, vmax=max_dist), cmap=cmap)
302 |
303 | cnt_ts = 0
304 |
305 | def plot_i(node, depth, cnt_ts, prev_lcnt, ax, left):
306 | # print('plot_i', node, depth, cnt_ts, prev_lcnt)
307 | pcnt, pdepth, plcnt, prcnt, pcdist = node_props[node]
308 | # px = maxheight - pdepth
309 | px = tr_left_margin + maxcumdist - pcdist
310 | py = prev_lcnt * ts_height
311 | if node < len(self.series):
312 | # Plot series
313 | # print('plot series y={}'.format(ts_bottom_margin + ts_height * cnt_ts + self.ts_height_factor))
314 | self._series_y[int(node)] = bottom_margin + ts_height * cnt_ts
315 | serie = self.series[int(node)]
316 | ax[1].text(ts_left_margin + ts_label_margin,
317 | bottom_margin + ts_height * cnt_ts + ts_height / 2,
318 | show_ts_label(int(node)), ha='left', va='center')
319 | if ts_color:
320 | curcolor = ts_color(int(node))
321 | else:
322 | curcolor = None
323 | ax[1].plot(ts_left_margin + ts_sample_length * np.arange(len(serie)),
324 | bottom_margin + ts_height * cnt_ts + self.ts_height_factor * serie,
325 | color=curcolor)
326 | cnt_ts += 1
327 |
328 | else:
329 | child_left, child_right, dist, _ = self.get_linkage(node)
330 | color = line_colors.to_rgba(dist)
331 | ax[0].text(px + tr_label_margin, py,
332 | show_tr_label(dist), ha='left', va='center', color=color)
333 |
334 | # Left
335 | ccnt, cdepth, clcntl, crcntl, clcdist = node_props[child_left]
336 | # print('left', ccnt, cdepth, clcntl, crcntl)
337 | # cx = maxheight - cdepth
338 | cx = tr_left_margin + maxcumdist - clcdist
339 | cy = (prev_lcnt - crcntl) * ts_height
340 | if py == cy:
341 | cy -= 1 / 2 * ts_height
342 | # print('plot line', (px, cx), (py, cy))
343 | # ax[0].add_line(Line2D((px, cx), (py, cy), lw=2, color='black', axes=ax[0]))
344 | ax[0].add_line(Line2D((px, px), (py, cy), lw=1, color=color, axes=ax[0]))
345 | ax[0].add_line(Line2D((px, cx), (cy, cy), lw=1, color=color, axes=ax[0]))
346 | cnt_ts = plot_i(child_left, depth + 1, cnt_ts, prev_lcnt - crcntl, ax, True)
347 |
348 | # Right
349 | ccnt, cdepth, clcntr, crcntr, crcdist = node_props[child_right]
350 | # print('right', ccnt, cdepth, clcntr, crcntr)
351 | # cx = maxheight - cdepth
352 | cx = tr_left_margin + maxcumdist - crcdist
353 | cy = (prev_lcnt + clcntr) * ts_height
354 | if py == cy:
355 | cy += 1 / 2 * ts_height
356 | # print('plot line', (px, cx), (py, cy))
357 | # ax[0].add_line(Line2D((px, cx), (py, cy), lw=2, color='black', axes=ax[0]))
358 | ax[0].add_line(Line2D((px, px), (py, cy), lw=1, color=color, axes=ax[0]))
359 | ax[0].add_line(Line2D((px, cx), (cy, cy), lw=1, color=color, axes=ax[0]))
360 | cnt_ts = plot_i(child_right, depth + 1, cnt_ts, prev_lcnt + clcntr, ax, False)
361 | return cnt_ts
362 |
363 | plot_i(self.maxnode, 0, 0, node_props[self.maxnode][2], ax, True)
364 |
365 | if filename:
366 | if isinstance(filename, Path):
367 | filename = str(filename)
368 | plt.savefig(filename, bbox_inches='tight', pad_inches=0)
369 | plt.close()
370 | fig, ax = None, None
371 |
372 | return fig, ax
373 |
374 | def to_dot(self):
375 | child_left, child_right, dist, cnt = self.get_linkage(self.maxnode)
376 | node_deque = deque([(self.maxnode, child_left), (self.maxnode, child_right)])
377 | # print(node_deque)
378 | s = ["digraph tree {"]
379 | while len(node_deque) > 0:
380 | from_node, to_node = node_deque.popleft()
381 | s.append(" {} -> {};".format(from_node, to_node))
382 | if to_node >= len(self.series):
383 | child_left, child_right, dist, cnt = self.get_linkage(to_node)
384 | node_deque.append((to_node, child_left))
385 | node_deque.append((to_node, child_right))
386 | # print(node_deque)
387 | s.append("}")
388 | return "\n".join(s)
389 |
390 |
391 | class HierarchicalTree(BaseTree):
392 | """Wrapper to keep track of the full tree that represents the hierarchical clustering.
393 |
394 | :param model: Clustering object. For example of class :class:`Hierarchical`.
395 | If no model is given, the arguments are identical to those of class :class:`Hierarchical`.
396 | """
397 |
398 | def __init__(self, model=None, **kwargs):
399 | if model is None:
400 | self._model = Hierarchical(**kwargs)
401 | else:
402 | self._model = model
403 | super().__init__(**kwargs)
404 | self._model.max_dist = np.inf
405 |
406 | def fit(self, series, *args, **kwargs):
407 | self.series = SeriesContainer.wrap(series)
408 | self.linkage = []
409 | new_nodes = {i: i for i in range(len(series))}
410 | if self._model.merge_hook:
411 | old_merge_hook = self._model.merge_hook
412 | else:
413 | old_merge_hook = None
414 |
415 | def merge_hook(from_idx, to_idx, distance):
416 | # print('merge_hook', from_idx, to_idx)
417 | new_idx = len(self.series) + len(self.linkage)
418 | # print('adding to linkage: ', new_nodes[from_idx], new_nodes[to_idx], distance, 0)
419 | if new_nodes[from_idx] is None:
420 | raise Exception('Trying to merge series that is already merged')
421 | self.linkage.append((new_nodes[from_idx], new_nodes[to_idx], distance, 0))
422 | new_nodes[to_idx] = new_idx
423 | new_nodes[from_idx] = None
424 | if old_merge_hook:
425 | old_merge_hook(from_idx, to_idx, distance)
426 |
427 | self._model.merge_hook = merge_hook
428 |
429 | result = self._model.fit(series, *args, **kwargs)
430 | self._model.merge_hook = old_merge_hook
431 | return result
432 |
433 |
434 | class LinkageTree(BaseTree):
435 | """Hierarchical clustering using the Scipy linkage function.
436 |
437 |     This is the same algorithm as in Hierarchical, but faster (~10 times). It offers fewer
438 |     options to steer the clustering (e.g. no possibility to give weights). It still computes
439 |     the entire distance matrix first and is thus not ideal for extremely large data sets.
440 | """
441 |
442 | def __init__(self, dists_fun, dists_options, method='complete'):
443 | """
444 |
445 |         :param dists_fun: Distance function, e.g. dtw.distance
446 | :param dists_options: Options passed to dists_fun
447 | :param method: Linkage method (see scipy.cluster.hierarchy.linkage)
448 | """
449 | super().__init__()
450 | self.dists_fun = dists_fun
451 | self.dists_options = dists_options
452 | self.method = method
453 |
454 | def fit(self, series):
455 | self.series = SeriesContainer.wrap(series)
456 | try:
457 | from scipy.cluster.hierarchy import linkage
458 | except ImportError:
459 | logger.error("The LinkageTree class requires the scipy package to be installed.")
460 | self.linkage = None
461 | linkage = None
462 | return
463 | dists = self.dists_fun(self.series, **self.dists_options)
464 | dists_cond = np.zeros(self._size_cond(len(series)))
465 | idx = 0
466 | for r in range(len(series) - 1):
467 | dists_cond[idx:idx + len(series) - r - 1] = dists[r, r + 1:]
468 | idx += len(series) - r - 1
469 |
470 | self.linkage = linkage(dists_cond, method=self.method, metric='euclidean')
471 |
472 | def _size_cond(self, size):
473 | n = int(size)
474 | return int((n * (n - 1)) / 2)
475 |
476 |
477 | class Hooks:
478 | @staticmethod
479 | def create_weighthook(weights, series):
480 | def newhook(i1, i2, dist):
481 | w1 = weights[i1]
482 | w2 = weights[i2]
483 | p1 = series[i1]
484 | p2 = series[i2]
485 | if w1 < w2 or (w1 == w2 and len(p1) > len(p2)):
486 | i1, i2 = i2, i1
487 | weights[i1] = w1 + w2
488 | return i1, i2
489 | return newhook
490 |
491 | @staticmethod
492 | def create_orderhook(weights):
493 | def newhook(idxs):
494 |             best_idx = -1
495 |             max_weight = -1
496 |             for r, c in [idxs[ii, :] for ii in range(idxs.shape[0])]:
497 |                 total = weights[r] + weights[c]
498 |                 if total > max_weight:
499 |                     max_weight = total
500 |                     best_idx = (r, c)
501 |             return best_idx
502 | return newhook
503 |
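504 | 
505 | if __name__ == "__main__":
506 |     # Demo sketch (added for illustration, not part of the original module):
507 |     # cluster three short series with LinkageTree, using the pure-Python
508 |     # dtw.distance_matrix from this package as dists_fun. Assumes scipy is
509 |     # installed and that the module is run as part of the package
510 |     # (python -m dtaidistance.clustering).
511 |     from . import dtw
512 |     series = [np.array([0., 0, 1, 2, 1, 0, 1, 0, 0]),
513 |               np.array([0., 1, 2, 0, 0, 0, 0, 0, 0]),
514 |               np.array([1., 2, 0, 0, 0, 0, 0, 1, 1])]
515 |     tree = LinkageTree(dtw.distance_matrix, {})
516 |     tree.fit(series)
517 |     print(tree.linkage)  # rows: merged node pair, distance, cluster size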
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dp.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.dp
4 | ~~~~~~~~~~~~~~~
5 |
6 | Generic Dynamic Programming functions
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import logging
14 | import numpy as np
15 |
16 |
17 | logger = logging.getLogger("be.kuleuven.dtai.distance")
18 |
19 |
20 | def dp(s1, s2, fn, border=None, window=None, max_dist=None,
21 | max_step=None, max_length_diff=None, penalty=None, psi=None):
22 | """
23 | Generic dynamic programming.
24 |
25 |     This function does not optimize storage when a window size is given (in contrast
26 |     with, for example, the fast DTW functions).
27 |
28 | :param s1: First sequence
29 | :param s2: Second sequence
30 | :param fn: Function to compare two items from both sequences
31 |     :param border: Callable object to fill in the initial borders (border(row_idx, col_idx)).
32 | :param window: see :meth:`distance`
33 | :param max_dist: see :meth:`distance`
34 | :param max_step: see :meth:`distance`
35 | :param max_length_diff: see :meth:`distance`
36 | :param penalty: see :meth:`distance`
37 | :param psi: see :meth:`distance`
38 | :returns: (DTW distance, DTW matrix)
39 | """
40 | r, c = len(s1), len(s2)
41 | if max_length_diff is not None and abs(r - c) > max_length_diff:
42 |         return np.inf, None  # keep the (distance, matrix) return shape consistent
43 | if window is None:
44 | window = max(r, c)
45 | if not max_step:
46 | max_step = np.inf
47 | if not max_dist:
48 | max_dist = np.inf
49 | if not penalty:
50 | penalty = 0
51 | if psi is None:
52 | psi = 0
53 | dtw = np.full((r + 1, c + 1), np.inf)
54 | if border:
55 | for ci in range(c + 1):
56 | dtw[0, ci] = border(0, ci)
57 | for ri in range(1, r + 1):
58 | dtw[ri, 0] = border(ri, 0)
59 | for i in range(psi + 1):
60 | dtw[0, i] = 0
61 | dtw[i, 0] = 0
62 | last_under_max_dist = 0
63 | i0 = 1
64 | i1 = 0
65 | for i in range(r):
66 | if last_under_max_dist == -1:
67 | prev_last_under_max_dist = np.inf
68 | else:
69 | prev_last_under_max_dist = last_under_max_dist
70 | last_under_max_dist = -1
71 | i0 = i
72 | i1 = i + 1
73 | for j in range(max(0, i - max(0, r - c) - window + 1), min(c, i + max(0, c - r) + window)):
74 | d, d_indel = fn(s1[i], s2[j])
75 | if max_step is not None:
76 | if d > max_step:
77 | d = np.inf
78 | if d_indel > max_step:
79 | d_indel = np.inf
80 | if d > max_step and d_indel > max_step:
81 | continue
82 | # print(f"[{i1},{j+1}] -> [{s1[i]},{s2[j]}] -> {d},{d_indel}")
83 | dtw[i1, j + 1] = min(d + dtw[i0, j],
84 | d_indel + dtw[i0, j + 1] + penalty,
85 | d_indel + dtw[i1, j] + penalty)
86 | if max_dist is not None:
87 | if dtw[i1, j + 1] <= max_dist:
88 | last_under_max_dist = j
89 | else:
90 | dtw[i1, j + 1] = np.inf
91 | if prev_last_under_max_dist < j + 1:
92 | break
93 | if max_dist is not None and last_under_max_dist == -1:
94 | return np.inf, dtw
95 | if psi == 0:
96 | d = dtw[i1, min(c, c + window - 1)]
97 | else:
98 | ir = i1
99 | ic = min(c, c + window - 1)
100 | vr = dtw[ir-psi:ir+1, ic]
101 | vc = dtw[ir, ic-psi:ic+1]
102 | mir = np.argmin(vr)
103 | mic = np.argmin(vc)
104 | if vr[mir] < vc[mic]:
105 | dtw[ir-psi+mir+1:ir+1, ic] = -1
106 | d = vr[mir]
107 | else:
108 | dtw[ir, ic - psi + mic + 1:ic+1] = -1
109 | d = vc[mic]
110 | return d, dtw
111 |
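112 | 
113 | if __name__ == "__main__":
114 |     # Demo sketch (added for illustration, not part of the original module):
115 |     # with unit substitution/indel costs and a cumulative border, dp()
116 |     # reduces to the classic edit distance between two symbol sequences.
117 |     def _cmp(a, b):
118 |         return (0.0 if a == b else 1.0), 1.0  # (substitution cost, indel cost)
119 |     d, _ = dp(list("kitten"), list("sitting"), fn=_cmp,
120 |               border=lambda ri, ci: ri + ci)
121 |     print(d)  # expected: 3.0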
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_ndim.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.dtw_ndim
4 | ~~~~~~~~~~~~~~~~~~~~~
5 |
6 | Dynamic Time Warping (DTW) for N-dimensional series.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import os
14 | import logging
15 | import math
16 | import numpy as np
17 |
18 | logger = logging.getLogger("be.kuleuven.dtai.distance")
19 | dtaidistance_dir = os.path.join(os.path.abspath(os.path.dirname(__file__)), os.pardir)
20 |
21 | try:
22 | from . import dtw_c
23 | except ImportError:
24 | # logger.info('C library not available')
25 | dtw_c = None
26 |
27 | try:
28 | from tqdm import tqdm
29 | except ImportError:
30 | logger.info('tqdm library not available')
31 | tqdm = None
32 |
33 |
34 | def distance(s1, s2, window=None, max_dist=None,
35 | max_step=None, max_length_diff=None, penalty=None, psi=None,
36 | use_c=False):
37 | """Dynamic Time Warping using multidimensional sequences.
38 |
39 | cost = EuclideanDistance(s1[i], s2[j])
40 |
41 | See :py:meth:`dtaidistance.dtw.distance` for parameters.
42 | """
43 | if use_c:
44 | logger.error("No C version implemented (yet)")
45 | return
46 | r, c = len(s1), len(s2)
47 | if max_length_diff is not None and abs(r - c) > max_length_diff:
48 | return np.inf
49 | if window is None:
50 | window = max(r, c)
51 | if not max_step:
52 | max_step = np.inf
53 | else:
54 | max_step *= max_step
55 | if not max_dist:
56 | max_dist = np.inf
57 | else:
58 | max_dist *= max_dist
59 | if not penalty:
60 | penalty = 0
61 | else:
62 | penalty *= penalty
63 | if psi is None:
64 | psi = 0
65 | length = min(c + 1, abs(r - c) + 2 * (window - 1) + 1 + 1 + 1)
66 | # print("length (py) = {}".format(length))
67 | dtw = np.full((2, length), np.inf)
68 | # dtw[0, 0] = 0
69 | for i in range(psi + 1):
70 | dtw[0, i] = 0
71 | last_under_max_dist = 0
72 | skip = 0
73 | i0 = 1
74 | i1 = 0
75 | psi_shortest = np.inf
76 | for i in range(r):
77 | # print("i={}".format(i))
78 | # print(dtw)
79 | if last_under_max_dist == -1:
80 | prev_last_under_max_dist = np.inf
81 | else:
82 | prev_last_under_max_dist = last_under_max_dist
83 | last_under_max_dist = -1
84 | skipp = skip
85 | skip = max(0, i - max(0, r - c) - window + 1)
86 | i0 = 1 - i0
87 | i1 = 1 - i1
88 | dtw[i1, :] = np.inf
89 | j_start = max(0, i - max(0, r - c) - window + 1)
90 | j_end = min(c, i + max(0, c - r) + window)
91 | if dtw.shape[1] == c + 1:
92 | skip = 0
93 | if psi != 0 and j_start == 0 and i < psi:
94 | dtw[i1, 0] = 0
95 | for j in range(j_start, j_end):
96 | d = np.sum((s1[i] - s2[j]) ** 2)
97 | if d > max_step:
98 | continue
99 | assert j + 1 - skip >= 0
100 | assert j - skipp >= 0
101 | assert j + 1 - skipp >= 0
102 | assert j - skip >= 0
103 | dtw[i1, j + 1 - skip] = d + min(dtw[i0, j - skipp],
104 | dtw[i0, j + 1 - skipp] + penalty,
105 | dtw[i1, j - skip] + penalty)
106 | # print('({},{}), ({},{}), ({},{})'.format(i0, j - skipp, i0, j + 1 - skipp, i1, j - skip))
107 | # print('{}, {}, {}'.format(dtw[i0, j - skipp], dtw[i0, j + 1 - skipp], dtw[i1, j - skip]))
108 | # print('i={}, j={}, d={}, skip={}, skipp={}'.format(i,j,d,skip,skipp))
109 | # print(dtw)
110 | if dtw[i1, j + 1 - skip] <= max_dist:
111 | last_under_max_dist = j
112 | else:
113 | # print('above max_dist', dtw[i1, j + 1 - skip], i1, j + 1 - skip)
114 | dtw[i1, j + 1 - skip] = np.inf
115 | if prev_last_under_max_dist + 1 - skipp < j + 1 - skip:
116 | # print("break")
117 | break
118 | if last_under_max_dist == -1:
119 | # print('early stop')
120 | # print(dtw)
121 | return np.inf
122 | if psi != 0 and j_end == len(s2) and len(s1) - 1 - i <= psi:
123 | psi_shortest = min(psi_shortest, dtw[i1, length - 1])
124 | if psi == 0:
125 | d = math.sqrt(dtw[i1, min(c, c + window - 1) - skip])
126 | else:
127 | ic = min(c, c + window - 1) - skip
128 | vc = dtw[i1, ic - psi:ic + 1]
129 | d = min(np.min(vc), psi_shortest)
130 | d = math.sqrt(d)
131 | return d
132 |
133 |
134 | def _distance_with_params(t):
135 | return distance(t[0], t[1], **t[2])
136 |
137 |
138 | def warping_paths(s1, s2, window=None, max_dist=None,
139 | max_step=None, max_length_diff=None, penalty=None, psi=None,):
140 | """
141 | Dynamic Time Warping (keep full matrix) using multidimensional sequences.
142 |
143 | cost = EuclideanDistance(s1[i], s2[j])
144 |
145 | See :py:meth:`dtaidistance.dtw.warping_paths` for parameters.
146 | """
147 | r, c = len(s1), len(s2)
148 | if max_length_diff is not None and abs(r - c) > max_length_diff:
149 |         return np.inf, None  # keep the (d, paths) return shape consistent
150 | if window is None:
151 | window = max(r, c)
152 | if not max_step:
153 | max_step = np.inf
154 | else:
155 | max_step *= max_step
156 | if not max_dist:
157 | max_dist = np.inf
158 | else:
159 | max_dist *= max_dist
160 | if not penalty:
161 | penalty = 0
162 | else:
163 | penalty *= penalty
164 | if psi is None:
165 | psi = 0
166 | dtw = np.full((r + 1, c + 1), np.inf)
167 | # dtw[0, 0] = 0
168 | for i in range(psi + 1):
169 | dtw[0, i] = 0
170 | dtw[i, 0] = 0
171 | last_under_max_dist = 0
172 | i0 = 1
173 | i1 = 0
174 | for i in range(r):
175 | if last_under_max_dist == -1:
176 | prev_last_under_max_dist = np.inf
177 | else:
178 | prev_last_under_max_dist = last_under_max_dist
179 | last_under_max_dist = -1
180 | i0 = i
181 | i1 = i + 1
182 | # print('i =', i, 'skip =',skip, 'skipp =', skipp)
183 | # jmin = max(0, i - max(0, r - c) - window + 1)
184 | # jmax = min(c, i + max(0, c - r) + window)
185 | # print(i,jmin,jmax)
186 | # x = dtw[i, jmin-skipp:jmax-skipp]
187 | # y = dtw[i, jmin+1-skipp:jmax+1-skipp]
188 | # print(x,y,dtw[i+1, jmin+1-skip:jmax+1-skip])
189 | # dtw[i+1, jmin+1-skip:jmax+1-skip] = np.minimum(x,
190 | # y)
191 | for j in range(max(0, i - max(0, r - c) - window + 1), min(c, i + max(0, c - r) + window)):
192 | # print('j =', j, 'max=',min(c, c - r + i + window))
193 | d = np.sum((s1[i] - s2[j]) ** 2)
194 | if max_step is not None and d > max_step:
195 | continue
196 | # print(i, j + 1 - skip, j - skipp, j + 1 - skipp, j - skip)
197 | dtw[i1, j + 1] = d + min(dtw[i0, j],
198 | dtw[i0, j + 1] + penalty,
199 | dtw[i1, j] + penalty)
200 | # dtw[i + 1, j + 1 - skip] = d + min(dtw[i + 1, j + 1 - skip], dtw[i + 1, j - skip])
201 | if max_dist is not None:
202 | if dtw[i1, j + 1] <= max_dist:
203 | last_under_max_dist = j
204 | else:
205 | dtw[i1, j + 1] = np.inf
206 | if prev_last_under_max_dist < j + 1:
207 | break
208 | if max_dist is not None and last_under_max_dist == -1:
209 | # print('early stop')
210 | # print(dtw)
211 | return np.inf, dtw
212 | dtw = np.sqrt(dtw)
213 | if psi == 0:
214 | d = dtw[i1, min(c, c + window - 1)]
215 | else:
216 | ir = i1
217 | ic = min(c, c + window - 1)
218 | vr = dtw[ir-psi:ir+1, ic]
219 | vc = dtw[ir, ic-psi:ic+1]
220 | mir = np.argmin(vr)
221 | mic = np.argmin(vc)
222 | if vr[mir] < vc[mic]:
223 | dtw[ir-psi+mir+1:ir+1, ic] = -1
224 | d = vr[mir]
225 | else:
226 | dtw[ir, ic - psi + mic + 1:ic+1] = -1
227 | d = vc[mic]
228 | return d, dtw
229 |
230 |
231 | def distance_matrix(s, max_dist=None, max_length_diff=None,
232 | window=None, max_step=None, penalty=None, psi=None,
233 | block=None, parallel=False,
234 | use_c=False, show_progress=False):
235 | """Dynamic Time Warping distance matrix using multidimensional sequences.
236 |
237 | cost = EuclideanDistance(s1[i], s2[j])
238 |
239 | See :py:meth:`dtaidistance.dtw.distance_matrix` for parameters.
240 | """
241 | if parallel and not use_c:
242 | try:
243 | import multiprocessing as mp
244 | logger.info('Using multiprocessing')
245 | except ImportError:
246 | parallel = False
247 | mp = None
248 | else:
249 | mp = None
250 | dist_opts = {
251 | 'max_dist': max_dist,
252 | 'max_step': max_step,
253 | 'window': window,
254 | 'max_length_diff': max_length_diff,
255 | 'penalty': penalty,
256 | 'psi': psi
257 | }
258 | dists = None
259 | if max_length_diff is None:
260 | max_length_diff = np.inf
261 | large_value = np.inf
262 | logger.info('Computing distances')
263 | if use_c:
264 | logger.error("No C version available (yet)")
265 | if not use_c:
266 | logger.info("Compute distances in Python")
267 | if isinstance(s, np.ndarray) and len(s.shape) == 2:
268 | ss = [np.asarray(s[i]).reshape(-1) for i in range(s.shape[0])]
269 | s = ss
270 | if parallel:
271 | logger.info("Use parallel computation")
272 | dists = np.zeros((len(s), len(s))) + large_value
273 | if block is None:
274 | idxs = np.triu_indices(len(s), k=1)
275 | else:
276 | idxsl_r = []
277 | idxsl_c = []
278 | for r in range(block[0][0], block[0][1]):
279 | for c in range(max(r + 1, block[1][0]), min(len(s), block[1][1])):
280 | idxsl_r.append(r)
281 | idxsl_c.append(c)
282 | idxs = (np.array(idxsl_r), np.array(idxsl_c))
283 | with mp.Pool() as p:
284 |             dists[idxs] = p.map(_distance_with_params, [(s[r], s[c], dist_opts) for r, c in zip(*idxs)])
285 | # pbar = tqdm(total=int((len(s)*(len(s)-1)/2)))
286 | # for r in range(len(s)):
287 | # dists[r,r+1:len(s)] = p.map(distance, [(s[r],s[c], dist_opts) for c in range(r+1,len(cur))])
288 | # pbar.update(len(s) - r - 1)
289 | # pbar.close()
290 | else:
291 | logger.info("Use serial computation")
292 | dists = np.zeros((len(s), len(s))) + large_value
293 | if block is None:
294 | it_r = range(len(s))
295 | else:
296 | it_r = range(block[0][0], block[0][1])
297 | if show_progress:
298 | it_r = tqdm(it_r)
299 | for r in it_r:
300 | if block is None:
301 | it_c = range(r + 1, len(s))
302 | else:
303 | it_c = range(max(r + 1, block[1][0]), min(len(s), block[1][1]))
304 | for c in it_c:
305 | if abs(len(s[r]) - len(s[c])) <= max_length_diff:
306 | dists[r, c] = distance(s[r], s[c], **dist_opts)
307 | return dists
308 |
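309 | 
310 | if __name__ == "__main__":
311 |     # Demo sketch (added for illustration, not part of the original module):
312 |     # DTW between two short 2-dimensional series; each row is a time step,
313 |     # each column a dimension, and the local cost is Euclidean distance.
314 |     s1 = np.array([[0., 0.], [1., 2.], [2., 1.], [0., 0.]])
315 |     s2 = np.array([[0., 0.], [2., 1.], [0., 0.], [0., 0.]])
316 |     print(distance(s1, s2))
317 |     print(distance_matrix([s1, s2]))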
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_ndim_visualisation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.dtw_ndim_visualisation
4 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5 | 
6 | Dynamic Time Warping (DTW) visualisations for N-dimensional series.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import os
14 | import logging
15 | import math
16 | import numpy as np
17 |
18 | from .util import dtaidistance_dir
19 |
20 | logger = logging.getLogger("be.kuleuven.dtai.distance")
21 |
22 | from . import dtw
23 | try:
24 | from . import dtw_c
25 | except ImportError:
26 | # logger.info('C library not available')
27 | dtw_c = None
28 |
29 | try:
30 | from tqdm import tqdm
31 | except ImportError:
32 | logger.info('tqdm library not available')
33 | tqdm = None
34 |
35 |
36 | def plot_warping(s1, s2, path, filename=None):
37 | """Plot the optimal warping between to sequences.
38 |
39 | :param s1: From sequence.
40 | :param s2: To sequence.
41 | :param path: Optimal warping path.
42 | :param filename: Filename path (optional).
43 | """
44 | import matplotlib.pyplot as plt
45 | import matplotlib as mpl
46 | fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True)
47 | ax[0].pcolormesh(np.transpose(s1))
48 | ax[1].pcolormesh(np.transpose(s2))
49 | transFigure = fig.transFigure.inverted()
50 | lines = []
51 | line_options = {'linewidth': 2, 'color': 'orange', 'alpha': 0.8}
52 | for r_c, c_c in path:
53 | if r_c < 0 or c_c < 0:
54 | continue
55 | coord1 = transFigure.transform(ax[0].transData.transform([r_c+.5, 0]))
56 | coord2 = transFigure.transform(ax[1].transData.transform([c_c+.5, 0]))
57 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]),
58 | transform=fig.transFigure, **line_options))
59 | fig.lines = lines
60 | if filename:
61 | plt.savefig(filename)
62 | plt.close()
63 | fig, ax = None, None
64 | return fig, ax
65 |
66 |
67 | def plot_warpingpaths(s1, s2, paths, path=None, filename=None, shownumbers=False):
68 | """Plot the warping paths matrix.
69 |
70 | :param s1: Series 1
71 | :param s2: Series 2
72 | :param paths: Warping paths matrix
73 | :param path: Path to draw (typically this is the best path)
74 | :param filename: Filename for the image (optional)
75 | :param shownumbers: Show distances also as numbers
76 | """
77 | from matplotlib import pyplot as plt
78 | from matplotlib import gridspec
79 |
80 | fig = plt.figure(figsize=(10, 10), frameon=True)
81 | gs = gridspec.GridSpec(2, 2, wspace=1, hspace=1,
82 | left=0, right=1.0, bottom=0, top=1.0,
83 | height_ratios=[1, 6],
84 | width_ratios=[1, 6])
85 |
86 | if path is None:
87 | p = dtw.best_path(paths)
88 | else:
89 | p = path
90 |
91 | ax0 = fig.add_subplot(gs[0, 0])
92 | ax0.set_axis_off()
93 | ax0.text(0, 0, "Dist = {:.4f}".format(paths[p[-1][0], p[-1][1]]))
94 | ax0.xaxis.set_major_locator(plt.NullLocator())
95 | ax0.yaxis.set_major_locator(plt.NullLocator())
96 |
97 | # Top time series
98 | ax1 = fig.add_subplot(gs[0, 1:])
99 | ax1.set_ylim([0, s2.shape[1]])
100 | ax1.set_axis_off()
101 | ax1.xaxis.tick_top()
102 | ax1.pcolormesh(np.transpose(s2))
103 | ax1.xaxis.set_major_locator(plt.NullLocator())
104 | ax1.yaxis.set_major_locator(plt.NullLocator())
105 |
106 | # Left time series
107 | ax2 = fig.add_subplot(gs[1:, 0])
108 | ax2.set_xlim([0, s1.shape[1]])
109 | ax2.set_axis_off()
110 | ax2.xaxis.set_major_locator(plt.NullLocator())
111 | ax2.yaxis.set_major_locator(plt.NullLocator())
112 | ax2.pcolormesh(np.flipud(s1))
113 |
114 | ax3 = fig.add_subplot(gs[1:, 1:])
115 | ax3.matshow(paths[1:, 1:])
116 | py, px = zip(*p)
117 | ax3.plot(px, py, ".-", color="red")
118 | if shownumbers:
119 | for r in range(1, paths.shape[0]):
120 | for c in range(1, paths.shape[1]):
121 | ax3.text(c - 1, r - 1, "{:.2f}".format(paths[r, c]))
122 |
123 | gs.tight_layout(fig, pad=1.0, h_pad=1.0, w_pad=1.0)
124 |
125 | ax = fig.axes
126 |
127 | if filename:
128 | if type(filename) != str:
129 | filename = str(filename)
130 | plt.savefig(filename)
131 | plt.close()
132 | fig, ax = None, None
133 | return fig, ax
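134 | 
135 | 
136 | if __name__ == "__main__":
137 |     # Demo sketch (added for illustration, not part of the original module):
138 |     # align two short 2-dimensional series with dtw_ndim and plot the
139 |     # warping. Assumes matplotlib is installed and that the module is run
140 |     # as part of the package (python -m dtaidistance.dtw_ndim_visualisation).
141 |     from . import dtw_ndim
142 |     s1 = np.array([[0., 0.], [1., 2.], [2., 1.], [0., 0.]])
143 |     s2 = np.array([[0., 0.], [2., 1.], [0., 0.], [0., 0.]])
144 |     _, paths = dtw_ndim.warping_paths(s1, s2)
145 |     plot_warping(s1, s2, dtw.best_path(paths), filename="warping_ndim.png")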
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/dtw_visualisation.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.dtw_visualisation
4 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5 |
6 | Dynamic Time Warping (DTW) visualisations.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import os
14 | import logging
15 | import math
16 | import numpy as np
17 |
18 | from .util import dtaidistance_dir
19 |
20 | logger = logging.getLogger("be.kuleuven.dtai.distance")
21 |
22 | from . import dtw
23 | try:
24 | from . import dtw_c
25 | except ImportError:
26 | # logger.info('C library not available')
27 | dtw_c = None
28 |
29 | try:
30 | from tqdm import tqdm
31 | except ImportError:
32 | logger.info('tqdm library not available')
33 | tqdm = None
34 |
35 |
36 | def plot_warp(from_s, to_s, new_s, path, filename=None):
37 | """Plot the warped sequence and its relation to the original sequence
38 | and the target sequence.
39 |
40 | :param from_s: From sequence.
41 | :param to_s: To sequence.
42 | :param new_s: Warped version of from sequence.
43 | :param path: Optimal warping path.
44 | :param filename: Filename path (optional).
45 | """
46 | try:
47 | import matplotlib.pyplot as plt
48 | import matplotlib as mpl
49 | except ImportError:
50 | logger.error("The plot_warp function requires the matplotlib package to be installed.")
51 | return
52 | fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=True)
53 | ax[0].plot(from_s, label="From")
54 | ax[0].legend()
55 | ax[1].plot(to_s, label="To")
56 | ax[1].legend()
57 | transFigure = fig.transFigure.inverted()
58 | lines = []
59 | line_options = {'linewidth': 0.5, 'color': 'orange', 'alpha': 0.8}
60 | for r_c, c_c in path:
61 | if r_c < 0 or c_c < 0:
62 | continue
63 | coord1 = transFigure.transform(ax[0].transData.transform([r_c, from_s[r_c]]))
64 | coord2 = transFigure.transform(ax[1].transData.transform([c_c, to_s[c_c]]))
65 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]),
66 | transform=fig.transFigure, **line_options))
67 | ax[2].plot(new_s, label="From-warped")
68 | ax[2].legend()
69 | for i in range(len(to_s)):
70 | coord1 = transFigure.transform(ax[1].transData.transform([i, to_s[i]]))
71 | coord2 = transFigure.transform(ax[2].transData.transform([i, new_s[i]]))
72 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]),
73 | transform=fig.transFigure, **line_options))
74 | fig.lines = lines
75 | if filename:
76 | plt.savefig(filename)
77 | plt.close()
78 | fig, ax = None, None
79 | return fig, ax
80 |
81 |
82 | def plot_warping(s1, s2, path, filename=None):
83 | """Plot the optimal warping between to sequences.
84 |
85 | :param s1: From sequence.
86 | :param s2: To sequence.
87 | :param path: Optimal warping path.
88 | :param filename: Filename path (optional).
89 | """
90 | import matplotlib.pyplot as plt
91 | import matplotlib as mpl
92 | fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True)
93 | ax[0].plot(s1)
94 | ax[1].plot(s2)
95 | transFigure = fig.transFigure.inverted()
96 | lines = []
97 | line_options = {'linewidth': 0.5, 'color': 'orange', 'alpha': 0.8}
98 | for r_c, c_c in path:
99 | if r_c < 0 or c_c < 0:
100 | continue
101 | coord1 = transFigure.transform(ax[0].transData.transform([r_c, s1[r_c]]))
102 | coord2 = transFigure.transform(ax[1].transData.transform([c_c, s2[c_c]]))
103 | lines.append(mpl.lines.Line2D((coord1[0], coord2[0]), (coord1[1], coord2[1]),
104 | transform=fig.transFigure, **line_options))
105 | fig.lines = lines
106 | if filename:
107 | plt.savefig(filename)
108 | plt.close()
109 | fig, ax = None, None
110 | return fig, ax
111 |
112 |
113 | def plot_warpingpaths(s1, s2, paths, path=None, filename=None, shownumbers=False):
114 | """Plot the warping paths matrix.
115 |
116 | :param s1: Series 1
117 | :param s2: Series 2
118 | :param paths: Warping paths matrix
119 | :param path: Path to draw (typically this is the best path)
120 | :param filename: Filename for the image (optional)
121 | :param shownumbers: Show distances also as numbers
122 | """
123 | from matplotlib import pyplot as plt
124 | from matplotlib import gridspec
125 | from matplotlib.ticker import FuncFormatter
126 |
127 | ratio = max(len(s1), len(s2))
128 | min_y = min(np.min(s1), np.min(s2))
129 | max_y = max(np.max(s1), np.max(s2))
130 |
131 | fig = plt.figure(figsize=(10, 10), frameon=True)
132 | gs = gridspec.GridSpec(2, 2, wspace=1, hspace=1,
133 | left=0, right=1.0, bottom=0, top=1.0,
134 | height_ratios=[1, 6],
135 | width_ratios=[1, 6])
136 | max_s2_x = np.max(s2)
137 | max_s2_y = len(s2)
138 | max_s1_x = np.max(s1)
139 | min_s1_x = np.min(s1)
140 | max_s1_y = len(s1)
141 |
142 | if path is None:
143 | p = dtw.best_path(paths)
144 | else:
145 | p = path
146 |
147 | def format_fn2_x(tick_val, tick_pos):
148 | return max_s2_x - tick_val
149 |
150 | def format_fn2_y(tick_val, tick_pos):
151 | return int(max_s2_y - tick_val)
152 |
153 | ax0 = fig.add_subplot(gs[0, 0])
154 | ax0.set_axis_off()
155 | ax0.text(0, 0, "Dist = {:.4f}".format(paths[p[-1][0], p[-1][1]]))
156 | ax0.xaxis.set_major_locator(plt.NullLocator())
157 | ax0.yaxis.set_major_locator(plt.NullLocator())
158 |
159 | ax1 = fig.add_subplot(gs[0, 1:])
160 | ax1.set_ylim([min_y, max_y])
161 | ax1.set_axis_off()
162 | ax1.xaxis.tick_top()
163 | # ax1.set_aspect(0.454)
164 | ax1.plot(range(len(s2)), s2, ".-")
165 | ax1.xaxis.set_major_locator(plt.NullLocator())
166 | ax1.yaxis.set_major_locator(plt.NullLocator())
167 |
168 | ax2 = fig.add_subplot(gs[1:, 0])
169 | ax2.set_xlim([-max_y, -min_y])
170 | ax2.set_axis_off()
171 | # ax2.set_aspect(0.8)
172 | # ax2.xaxis.set_major_formatter(FuncFormatter(format_fn2_x))
173 | # ax2.yaxis.set_major_formatter(FuncFormatter(format_fn2_y))
174 | ax2.xaxis.set_major_locator(plt.NullLocator())
175 | ax2.yaxis.set_major_locator(plt.NullLocator())
176 | ax2.plot(-s1, range(max_s1_y, 0, -1), ".-")
177 |
178 | ax3 = fig.add_subplot(gs[1:, 1:])
179 | # ax3.set_aspect(1)
180 | ax3.matshow(paths[1:, 1:])
181 | # ax3.grid(which='major', color='w', linestyle='-', linewidth=0)
182 | # ax3.set_axis_off()
183 | py, px = zip(*p)
184 | ax3.plot(px, py, ".-", color="red")
185 | # ax3.xaxis.set_major_locator(plt.NullLocator())
186 | # ax3.yaxis.set_major_locator(plt.NullLocator())
187 | if shownumbers:
188 | for r in range(1, paths.shape[0]):
189 | for c in range(1, paths.shape[1]):
190 | ax3.text(c - 1, r - 1, "{:.2f}".format(paths[r, c]))
191 |
192 | gs.tight_layout(fig, pad=1.0, h_pad=1.0, w_pad=1.0)
193 | # fig.subplots_adjust(hspace=0, wspace=0)
194 |
195 | ax = fig.axes
196 |
197 | if filename:
198 | if type(filename) != str:
199 | filename = str(filename)
200 | plt.savefig(filename)
201 | plt.close()
202 | fig, ax = None, None
203 | return fig, ax
204 |
205 | def plot_matrix(distances, filename=None, ax=None, shownumbers=False):
206 | from matplotlib import pyplot as plt
207 |
208 | if ax is None:
209 | if shownumbers:
210 | figsize = (15, 15)
211 | else:
212 | figsize = None
213 | fig, ax = plt.subplots(nrows=1, ncols=1, figsize=figsize)
214 | else:
215 | fig = None
216 |
217 | ax.xaxis.set_ticks_position('top')
218 | ax.yaxis.set_ticks_position('both')
219 |
220 | im = ax.imshow(distances)
221 | idxs = [str(i) for i in range(len(distances))]
222 | # Show all ticks
223 | ax.set_xticks(np.arange(len(idxs)))
224 | ax.set_xticklabels(idxs)
225 | ax.set_yticks(np.arange(len(idxs)))
226 | ax.set_yticklabels(idxs)
227 |
228 | ax.set_title("Distances between series", pad=30)
229 |
230 | if shownumbers:
231 | for i in range(len(idxs)):
232 | for j in range(len(idxs)):
233 | if not np.isinf(distances[i, j]):
234 | l = "{:.2f}".format(distances[i, j])
235 | ax.text(j, i, l, ha="center", va="center", color="w")
236 |
237 | if filename:
238 | if type(filename) != str:
239 | filename = str(filename)
240 | plt.savefig(filename)
241 | plt.close()
242 | fig, ax = None, None
243 | return fig, ax
244 |
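245 | 
246 | if __name__ == "__main__":
247 |     # Demo sketch (added for illustration, not part of the original module):
248 |     # warp one short series onto another and plot both the point-to-point
249 |     # alignment and the warping-paths matrix. Assumes matplotlib is
250 |     # installed and that the module is run as part of the package.
251 |     s1 = np.array([0., 0, 1, 2, 1, 0, 1, 0, 0])
252 |     s2 = np.array([0., 1, 2, 0, 0, 0, 0, 0, 0])
253 |     _, paths = dtw.warping_paths(s1, s2)
254 |     plot_warping(s1, s2, dtw.best_path(paths), filename="warp.png")
255 |     plot_warpingpaths(s1, s2, paths, filename="paths.png")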
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/MachineLearning/dtaidistance/util.py:
--------------------------------------------------------------------------------
1 | # -*- coding: UTF-8 -*-
2 | """
3 | dtaidistance.util
4 | ~~~~~~~~~~~~~~~~~
5 |
6 | Utility functions for DTAIDistance.
7 |
8 | :author: Wannes Meert
9 | :copyright: Copyright 2017-2018 KU Leuven, DTAI Research Group.
10 | :license: Apache License, Version 2.0, see LICENSE for details.
11 |
12 | """
13 | import os
14 | import sys
15 | import logging
16 | from array import array
17 | from pathlib import Path
18 | import tempfile
19 |
20 | import numpy as np
21 |
22 |
23 | logger = logging.getLogger("be.kuleuven.dtai.distance")
24 |
25 |
26 | dtaidistance_dir = os.path.abspath(os.path.dirname(__file__))
27 |
28 |
29 | def prepare_directory(directory=None):
30 | """Prepare the given directory, create it if necessary.
31 | If no directory is given, a new directory will be created in the system's temp directory.
32 | """
33 | if directory is not None:
34 | directory = Path(directory)
35 | if not directory.exists():
36 | directory.mkdir(parents=True)
37 | logger.debug("Using directory: {}".format(directory))
38 | return Path(directory)
39 | directory = tempfile.mkdtemp(prefix="dtaidistance_")
40 | logger.debug("Using directory: {}".format(directory))
41 | return Path(directory)
42 |
43 |
44 | class SeriesContainer:
45 | def __init__(self, series):
46 | """Container for a list of series.
47 |
48 | This wrapper class knows how to deal with multiple types of datastructures to represent
49 | a list of sequences:
50 | - List[array.array]
51 | - List[numpy.array]
52 | - List[List]
53 | - numpy.array
54 | - numpy.matrix
55 |
56 | When using the C-based extensions, the data is automatically verified and converted.
57 | """
58 | if isinstance(series, SeriesContainer):
59 | self.series = series.series
60 | elif isinstance(series, np.ndarray) and len(series.shape) == 2:
61 |             # A matrix always returns a 2D array, even if you select a single row (to be
62 |             # consistent and always return a matrix datastructure). The methods in this
63 |             # toolbox expect a 1D array, thus we need to convert to a 1D or 2D array.
64 | # self.series = [np.asarray(series[i]).reshape(-1) for i in range(series.shape[0])]
65 | self.series = np.asarray(series, order='C')
66 | elif type(series) == set or type(series) == tuple:
67 | self.series = list(series)
68 | else:
69 | self.series = series
70 |
71 | def c_data(self):
72 | """Return a datastructure that the C-component knows how to handle.
73 | The method tries to avoid copying or reallocating memory.
74 |
75 | :return: Either a list of buffers or a two-dimensional buffer. The
76 | buffers are guaranteed to be C-contiguous and can thus be used
77 | as regular pointer-based arrays in C.
78 | """
79 | if type(self.series) == list:
80 | for i in range(len(self.series)):
81 | serie = self.series[i]
82 | if isinstance(serie, np.ndarray):
83 | if not serie.flags.c_contiguous:
84 | serie = np.asarray(serie, order='C')
85 | self.series[i] = serie
86 | elif isinstance(serie, array):
87 | pass
88 | else:
89 | raise Exception("Type of series not supported, "
90 | "expected numpy.array or array.array but got {}".format(type(serie)))
91 | elif isinstance(self.series, np.ndarray):
92 | if not self.series.flags.c_contiguous:
93 | self.series = self.series.copy(order='C')
94 | return self.series
95 |
96 | def get_max_y(self):
97 | max_y = 0
98 | if isinstance(self.series, np.ndarray):
99 | max_y = max(np.max(self.series), abs(np.min(self.series)))
100 | else:
101 | for serie in self.series:
102 | max_y = max(max_y, np.max(serie), abs(np.min(serie)))
103 | return max_y
104 |
105 | def __getitem__(self, item):
106 | return self.series[item]
107 |
108 | def __len__(self):
109 | return len(self.series)
110 |
111 | def __str__(self):
112 | return "SeriesContainer:\n{}".format(self.series)
113 |
114 | @staticmethod
115 | def wrap(series):
116 | if isinstance(series, SeriesContainer):
117 | return series
118 | return SeriesContainer(series)
119 |
120 |
121 | def recompile():
122 | import subprocess as sp
123 | sp.run([sys.executable, 'setup.py', 'build_ext', '--inplace'], cwd=dtaidistance_dir)
124 |
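125 | 
126 | if __name__ == "__main__":
127 |     # Demo sketch (added for illustration, not part of the original module):
128 |     # SeriesContainer gives lists, arrays and matrices a uniform interface.
129 |     sc = SeriesContainer.wrap([np.array([1., 2., 3.]), np.array([-4., 2.])])
130 |     print(len(sc), sc.get_max_y())  # expected: 2 4.0
131 |     print(sc[0])                    # expected: [1. 2. 3.]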
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_INSTRUCTOR.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "%matplotlib inline\n",
10 | "import matplotlib\n",
11 | "matplotlib.rcParams['figure.figsize'] = [8, 3]\n",
12 | "import matplotlib.pyplot as plt\n",
13 | "\n",
14 | "import pandas as pd\n",
15 | "import numpy as np\n",
16 | "import statsmodels.api as sm\n",
17 | "import statsmodels\n",
18 | "\n",
19 | "import scipy\n",
20 | "from scipy.stats import pearsonr\n",
21 | "\n",
22 | "from pandas.plotting import register_matplotlib_converters\n",
23 | "register_matplotlib_converters()"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "print(matplotlib.__version__)\n",
33 | "print(pd.__version__)\n",
34 | "print(np.__version__)\n",
35 | "print(statsmodels.__version__)\n",
36 | "print(scipy.__version__)\n"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Obtain and visualize data"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "## data obtained from https://datahub.io/core/global-temp#data\n",
53 | "df = pd.read_csv(\"global_temps.csv\")\n",
54 | "df.head()"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "df.Mean[:100].plot()"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## Exercise: what is wrong with the data and plot above? How can we fix this?"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | " df = df.pivot(index='Date', columns='Source', values='Mean')"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "df.head()"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {},
95 | "outputs": [],
96 | "source": [
97 | "df.GCAG.plot()"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "type(df.index)"
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 | "## Exercise: how can we make the index more time aware?"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "df.index = pd.to_datetime(df.index)"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "type(df.index)"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "df.GCAG.plot()"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "df['1880']"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "plt.plot(df['1880':'1950'][['GCAG', 'GISTEMP']])"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": null,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "plt.plot(df['1950':][['GISTEMP']])"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "## Exercise: How strongly do these measurements correlate contemporaneously? What about with a time lag?"
175 | ]
176 | },
177 | {
178 | "cell_type": "code",
179 | "execution_count": null,
180 | "metadata": {},
181 | "outputs": [],
182 | "source": [
183 | "plt.scatter(df['1880':'1900'][['GCAG']], df['1880':'1900'][['GISTEMP']])"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": [
192 | "plt.scatter(df['1880':'1899'][['GCAG']], df['1881':'1900'][['GISTEMP']])"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": null,
198 | "metadata": {},
199 | "outputs": [],
200 | "source": [
201 | "pearsonr(df['1880':'1899'].GCAG, df['1881':'1900'].GISTEMP)"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {},
208 | "outputs": [],
209 | "source": [
210 | "df['1880':'1899'][['GCAG']].head()"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": null,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "df['1881':'1900'][['GISTEMP']].head()"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": null,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "min(df.index)"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "max(df.index)"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "## Unobserved component model"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": null,
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "train = df['1960':]"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "metadata": {},
259 | "source": [
260 | "### model parameters"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "metadata": {},
267 | "outputs": [],
268 | "source": [
269 | "# smooth trend model without seasonal or cyclical components\n",
270 | "model = {\n",
271 | " 'level': 'smooth trend', 'cycle': False, 'seasonal': None, \n",
272 | "}\n"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "### fitting a model"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {
286 | "scrolled": true
287 | },
288 | "outputs": [],
289 | "source": [
290 | "# https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.structural.UnobservedComponents.html\n",
291 | "gcag_mod = sm.tsa.UnobservedComponents(train['GCAG'], **model)\n",
292 | "gcag_res = gcag_mod.fit()"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "fig = gcag_res.plot_components(legend_loc='lower right', figsize=(15, 9));"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "## Plotting predictions"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "# Perform rolling prediction and multistep forecast\n",
318 | "num_steps = 20\n",
319 | "predict_res = gcag_res.get_prediction(dynamic=train['GCAG'].shape[0] - num_steps)\n",
320 | "\n",
321 | "predict = predict_res.predicted_mean\n",
322 | "ci = predict_res.conf_int()"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "plt.plot(predict)"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": null,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "plt.scatter(train['GCAG'], predict)"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "fig, ax = plt.subplots()\n",
350 | "# Plot the results\n",
351 | "ax.plot(train['GCAG'], 'k.', label='Observations');\n",
352 | "ax.plot(train.index[:-num_steps], predict[:-num_steps], label='One-step-ahead Prediction');\n",
353 | "\n",
354 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n",
355 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n",
356 | "\n",
357 | "# Cleanup the image\n",
358 | "legend = ax.legend(loc='upper left');"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "fig, ax = plt.subplots()\n",
368 | "# Plot the results\n",
369 | "ax.plot(train.index[-40:], train['GCAG'][-40:], 'k.', label='Observations');\n",
370 | "ax.plot(train.index[-40:-num_steps], predict[-40:-num_steps], label='One-step-ahead Prediction');\n",
371 | "\n",
372 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n",
373 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n",
374 | "\n",
375 | "# Cleanup the image\n",
376 | "legend = ax.legend(loc='upper left');"
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
383 | "## Exercise: consider adding a seasonal term for 12 periods for the model fit above. Does this improve the fit of the model?"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": null,
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "seasonal_model = {\n",
393 | " 'level': 'local linear trend',\n",
394 | " 'seasonal': 12\n",
395 | "}\n",
396 | "mod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n",
397 | "res = mod.fit(method='powell', disp=False)"
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": null,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "fig = res.plot_components(legend_loc='lower right', figsize=(15, 9));"
407 | ]
408 | },
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "## How does this compare to the original model?"
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": null,
419 | "metadata": {},
420 | "outputs": [],
421 | "source": [
422 | "pearsonr(gcag_res.predict(), train['GCAG'])"
423 | ]
424 | },
425 | {
426 | "cell_type": "code",
427 | "execution_count": null,
428 | "metadata": {},
429 | "outputs": [],
430 | "source": [
431 | "np.mean(np.abs(gcag_res.predict() - train['GCAG']))"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": [
440 | "np.mean(np.abs(res.predict() - train['GCAG']))"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "## Explore the seasonality more"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "seasonal_model = {\n",
457 | " 'level': 'local level',\n",
458 | " 'seasonal': 12\n",
459 | "}\n",
460 | "llmod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n",
461 | "ll_level_res = llmod.fit(method='powell', disp=False)"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "metadata": {},
468 | "outputs": [],
469 | "source": [
470 | "fig = ll_level_res.plot_components(legend_loc='lower right', figsize=(15, 9));"
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": null,
476 | "metadata": {},
477 | "outputs": [],
478 | "source": [
479 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))"
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {},
486 | "outputs": [],
487 | "source": [
488 | "train[:48].GCAG.plot()"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": null,
494 | "metadata": {},
495 | "outputs": [],
496 | "source": []
497 | },
498 | {
499 | "cell_type": "markdown",
500 | "metadata": {},
501 | "source": [
502 | "## Exercise: a common null model for time series is to predict the value at time t-1 for the value at time t. How does such a model compare to the models we fit here?"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "### Consider correlation"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": null,
515 | "metadata": {},
516 | "outputs": [],
517 | "source": [
518 | "pearsonr(ll_level_res.predict(), train['GCAG'])"
519 | ]
520 | },
521 | {
522 | "cell_type": "code",
523 | "execution_count": null,
524 | "metadata": {},
525 | "outputs": [],
526 | "source": [
527 | "pearsonr(train['GCAG'].iloc[:-1, ], train['GCAG'].iloc[1:, ])"
528 | ]
529 | },
530 | {
531 | "cell_type": "markdown",
532 | "metadata": {},
533 | "source": [
534 | "### What about mean absolute error?"
535 | ]
536 | },
537 | {
538 | "cell_type": "code",
539 | "execution_count": null,
540 | "metadata": {},
541 | "outputs": [],
542 | "source": [
543 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))"
544 | ]
545 | },
546 | {
547 | "cell_type": "code",
548 | "execution_count": null,
549 | "metadata": {},
550 | "outputs": [],
551 | "source": [
552 | "np.mean(np.abs(train['GCAG'].iloc[:-1, ].values, train['GCAG'].iloc[1:, ].values))"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": null,
558 | "metadata": {},
559 | "outputs": [],
560 | "source": []
561 | }
562 | ],
563 | "metadata": {
564 | "kernelspec": {
565 | "display_name": "Python 3",
566 | "language": "python",
567 | "name": "python3"
568 | },
569 | "language_info": {
570 | "codemirror_mode": {
571 | "name": "ipython",
572 | "version": 3
573 | },
574 | "file_extension": ".py",
575 | "mimetype": "text/x-python",
576 | "name": "python",
577 | "nbconvert_exporter": "python",
578 | "pygments_lexer": "ipython3",
579 | "version": "3.6.8"
580 | }
581 | },
582 | "nbformat": 4,
583 | "nbformat_minor": 2
584 | }
585 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_STUDENT.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "%matplotlib inline\n",
10 | "import matplotlib\n",
11 | "matplotlib.rcParams['figure.figsize'] = [8, 3]\n",
12 | "import matplotlib.pyplot as plt\n",
13 | "\n",
14 | "import pandas as pd\n",
15 | "import numpy as np\n",
16 | "import statsmodels.api as sm\n",
17 | "import statsmodels\n",
18 | "\n",
19 | "import scipy\n",
20 | "from scipy.stats import pearsonr\n",
21 | "\n",
22 | "from pandas.plotting import register_matplotlib_converters\n",
23 | "register_matplotlib_converters()"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": null,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "print(matplotlib.__version__)\n",
33 | "print(pd.__version__)\n",
34 | "print(np.__version__)\n",
35 | "print(statsmodels.__version__)\n",
36 | "print(scipy.__version__)\n"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "## Obtain and visualize data"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": null,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "## data obtained from https://datahub.io/core/global-temp#data\n",
53 | "df = pd.read_csv(\"global_temps.csv\")\n",
54 | "df.head()"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "df.Mean[:100].plot()"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## Exercise: what is wrong with the data and plot above? How can we fix this?"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": null,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": []
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "## Exercise: how can we make the index more time aware?"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": []
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## Exercise: How strongly do these measurements correlate contemporaneously? What about with a time lag?"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "## Unobserved component model"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": null,
118 | "metadata": {},
119 | "outputs": [],
120 | "source": [
121 | "train = df['1960':]"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "### model parameters"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "# smooth trend model without seasonal or cyclical components\n",
138 | "model = {\n",
139 | " 'level': 'smooth trend', 'cycle': False, 'seasonal': None, \n",
140 | "}\n"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "### fitting a model"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {
154 | "scrolled": true
155 | },
156 | "outputs": [],
157 | "source": [
158 | "# https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.structural.UnobservedComponents.html\n",
159 | "gcag_mod = sm.tsa.UnobservedComponents(train['GCAG'], **model)\n",
160 | "gcag_res = gcag_mod.fit()"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "fig = gcag_res.plot_components(legend_loc='lower right', figsize=(15, 9));"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "## Plotting predictions"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "metadata": {},
183 | "outputs": [],
184 | "source": [
185 | "# Perform rolling prediction and multistep forecast\n",
186 | "num_steps = 20\n",
187 | "predict_res = gcag_res.get_prediction(dynamic=train['GCAG'].shape[0] - num_steps)\n",
188 | "\n",
189 | "predict = predict_res.predicted_mean\n",
190 | "ci = predict_res.conf_int()"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 | "plt.plot(predict)"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": null,
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "plt.scatter(train['GCAG'], predict)"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "fig, ax = plt.subplots()\n",
218 | "# Plot the results\n",
219 | "ax.plot(train['GCAG'], 'k.', label='Observations');\n",
220 | "ax.plot(train.index[:-num_steps], predict[:-num_steps], label='One-step-ahead Prediction');\n",
221 | "\n",
222 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n",
223 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n",
224 | "\n",
225 | "# Cleanup the image\n",
226 | "legend = ax.legend(loc='upper left');"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": null,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "fig, ax = plt.subplots()\n",
236 | "# Plot the results\n",
237 | "ax.plot(train.index[-40:], train['GCAG'][-40:], 'k.', label='Observations');\n",
238 | "ax.plot(train.index[-40:-num_steps], predict[-40:-num_steps], label='One-step-ahead Prediction');\n",
239 | "\n",
240 | "ax.plot(train.index[-num_steps:], predict[-num_steps:], 'r', label='Multistep Prediction');\n",
241 | "ax.plot(train.index[-num_steps:], ci.iloc[-num_steps:], 'k--');\n",
242 | "\n",
243 | "# Cleanup the image\n",
244 | "legend = ax.legend(loc='upper left');"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "## Exercise: consider adding a seasonal term for 12 periods for the model fit above. Does this improve the fit of the model?"
252 | ]
253 | },
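254 | {
255 | "cell_type": "markdown",
256 | "metadata": {},
257 | "source": [
258 | "One possible sketch (an illustration, not the only answer): keep the smooth trend specification and add a 12-period seasonal component, then compare the fits."
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": null,
264 | "metadata": {},
265 | "outputs": [],
266 | "source": [
267 | "# sketch: smooth trend model plus a 12-period seasonal component\n",
268 | "seasonal_smooth = {\n",
269 | "    'level': 'smooth trend', 'cycle': False, 'seasonal': 12,\n",
270 | "}\n",
271 | "gcag_seasonal_mod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_smooth)\n",
272 | "gcag_seasonal_res = gcag_seasonal_mod.fit()"
273 | ]
274 | },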
254 | {
255 | "cell_type": "code",
256 | "execution_count": null,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": []
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "metadata": {},
264 | "source": [
265 | "## How does this compare to the original model?"
266 | ]
267 | },
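268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "One possible comparison (a sketch, using the names from the cells above): lower AIC and lower in-sample mean absolute error both favor a model."
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "# sketch: compare the original and seasonal fits on AIC and in-sample MAE\n",
282 | "print('smooth trend AIC:', gcag_res.aic)\n",
283 | "print('plus seasonal AIC:', gcag_seasonal_res.aic)\n",
284 | "print('smooth trend MAE:', np.mean(np.abs(gcag_res.predict() - train['GCAG'])))\n",
285 | "print('plus seasonal MAE:', np.mean(np.abs(gcag_seasonal_res.predict() - train['GCAG'])))"
286 | ]
287 | },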
268 | {
269 | "cell_type": "code",
270 | "execution_count": null,
271 | "metadata": {},
272 | "outputs": [],
273 | "source": []
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "## Let's explore the seasonality more"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": null,
285 | "metadata": {},
286 | "outputs": [],
287 | "source": [
288 | "seasonal_model = {\n",
289 | " 'level': 'local level',\n",
290 | " 'seasonal': 12\n",
291 | "}\n",
292 | "llmod = sm.tsa.UnobservedComponents(train['GCAG'], **seasonal_model)\n",
293 | "ll_level_res = llmod.fit(method='powell', disp=False)"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": null,
299 | "metadata": {},
300 | "outputs": [],
301 | "source": [
302 | "fig = ll_level_res.plot_components(legend_loc='lower right', figsize=(15, 9));"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "np.mean(np.abs(ll_level_res.predict() - train['GCAG']))"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": null,
317 | "metadata": {},
318 | "outputs": [],
319 | "source": [
320 | "train[:48].GCAG.plot()"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {},
326 | "source": [
327 | "## Exercise: a common null model for time series is to predict the value at time t-1 for the value at time t. How does such a model compare to the models we fit here?"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "metadata": {},
333 | "source": [
334 | "### Consider correlation"
335 | ]
336 | },
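337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "One possible sketch: correlate both the persistence (t-1) forecast and the fitted model's predictions against the observed series."
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": null,
347 | "metadata": {},
348 | "outputs": [],
349 | "source": [
350 | "# sketch: the null model predicts the value at t with the value at t-1\n",
351 | "persistence = train['GCAG'].shift(1)\n",
352 | "print(np.corrcoef(train['GCAG'][1:], persistence[1:])[0, 1])\n",
353 | "print(np.corrcoef(train['GCAG'], ll_level_res.predict())[0, 1])"
354 | ]
355 | },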
337 | {
338 | "cell_type": "code",
339 | "execution_count": null,
340 | "metadata": {},
341 | "outputs": [],
342 | "source": []
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "### What about mean absolute error?"
349 | ]
350 | },
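351 | {
352 | "cell_type": "markdown",
353 | "metadata": {},
354 | "source": [
355 | "One possible sketch: compare the mean absolute error of the persistence forecast against the fitted model."
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": null,
361 | "metadata": {},
362 | "outputs": [],
363 | "source": [
364 | "# sketch: MAE of the t-1 persistence forecast vs. the seasonal local level fit\n",
365 | "persistence = train['GCAG'].shift(1)\n",
366 | "print(np.mean(np.abs(persistence[1:] - train['GCAG'][1:])))\n",
367 | "print(np.mean(np.abs(ll_level_res.predict() - train['GCAG'])))"
368 | ]
369 | },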
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "metadata": {},
355 | "outputs": [],
356 | "source": []
357 | }
358 | ],
359 | "metadata": {
360 | "kernelspec": {
361 | "display_name": "Python 3",
362 | "language": "python",
363 | "name": "python3"
364 | },
365 | "language_info": {
366 | "codemirror_mode": {
367 | "name": "ipython",
368 | "version": 3
369 | },
370 | "file_extension": ".py",
371 | "mimetype": "text/x-python",
372 | "name": "python",
373 | "nbconvert_exporter": "python",
374 | "pygments_lexer": "ipython3",
375 | "version": "3.6.8"
376 | }
377 | },
378 | "nbformat": 4,
379 | "nbformat_minor": 2
380 | }
381 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/ModernTimeSeriesAnalysis/StateSpaceModels/Nile.csv:
--------------------------------------------------------------------------------
1 | "","year","val"
2 | "1",1871,1120
3 | "2",1872,1160
4 | "3",1873,963
5 | "4",1874,1210
6 | "5",1875,1160
7 | "6",1876,1160
8 | "7",1877,813
9 | "8",1878,1230
10 | "9",1879,1370
11 | "10",1880,1140
12 | "11",1881,995
13 | "12",1882,935
14 | "13",1883,1110
15 | "14",1884,994
16 | "15",1885,1020
17 | "16",1886,960
18 | "17",1887,1180
19 | "18",1888,799
20 | "19",1889,958
21 | "20",1890,1140
22 | "21",1891,1100
23 | "22",1892,1210
24 | "23",1893,1150
25 | "24",1894,1250
26 | "25",1895,1260
27 | "26",1896,1220
28 | "27",1897,1030
29 | "28",1898,1100
30 | "29",1899,774
31 | "30",1900,840
32 | "31",1901,874
33 | "32",1902,694
34 | "33",1903,940
35 | "34",1904,833
36 | "35",1905,701
37 | "36",1906,916
38 | "37",1907,692
39 | "38",1908,1020
40 | "39",1909,1050
41 | "40",1910,969
42 | "41",1911,831
43 | "42",1912,726
44 | "43",1913,456
45 | "44",1914,824
46 | "45",1915,702
47 | "46",1916,1120
48 | "47",1917,1100
49 | "48",1918,832
50 | "49",1919,764
51 | "50",1920,821
52 | "51",1921,768
53 | "52",1922,845
54 | "53",1923,864
55 | "54",1924,862
56 | "55",1925,698
57 | "56",1926,845
58 | "57",1927,744
59 | "58",1928,796
60 | "59",1929,1040
61 | "60",1930,759
62 | "61",1931,781
63 | "62",1932,865
64 | "63",1933,845
65 | "64",1934,944
66 | "65",1935,984
67 | "66",1936,897
68 | "67",1937,822
69 | "68",1938,1010
70 | "69",1939,771
71 | "70",1940,676
72 | "71",1941,649
73 | "72",1942,846
74 | "73",1943,812
75 | "74",1944,742
76 | "75",1945,801
77 | "76",1946,1040
78 | "77",1947,860
79 | "78",1948,874
80 | "79",1949,848
81 | "80",1950,890
82 | "81",1951,744
83 | "82",1952,749
84 | "83",1953,838
85 | "84",1954,1050
86 | "85",1955,918
87 | "86",1956,986
88 | "87",1957,797
89 | "88",1958,923
90 | "89",1959,975
91 | "90",1960,815
92 | "91",1961,1020
93 | "92",1962,906
94 | "93",1963,901
95 | "94",1964,1170
96 | "95",1965,912
97 | "96",1966,746
98 | "97",1967,919
99 | "98",1968,718
100 | "99",1969,714
101 | "100",1970,740
102 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/README.md:
--------------------------------------------------------------------------------
1 | # Modern Time Series Analysis
2 |
3 | ## Install and Setup
4 |
5 | Okay, I wanted to streamline this. So you can just install everything from this repo:
6 |
7 | pip install -r requirements.txt
8 |
9 | OR, since this entire tutorial is run from Jupyter notebooks, you probably want to install the requirements into the same environment Jupyter runs from:
10 |
11 | ```shell
12 | $ which jupyter
13 | /home/my_user_name/stuff/bin/jupyter
14 | $ /home/my_user_name/stuff/bin/pip install -r requirements.txt
15 |
16 | $ cd /full/path/to/modern_time_series_analysis/
17 | $ jupyter notebook
18 | ```
19 |
20 | ## Syllabus
21 |
22 | And all the data and notebooks you need will be [here](https://github.com/theJollySin/scipy_con_2019/tree/master/modern_time_series_analysis/ModernTimeSeriesAnalysis):
23 |
24 | 1. [Structural Time Series](ModernTimeSeriesAnalysis/StateSpaceModels/1_Structural_Time_Series_INSTRUCTOR.ipynb )
25 | 2. [Gaussian HMM](ModernTimeSeriesAnalysis/StateSpaceModels/2_Gaussian_HMM_INSTRUCTOR.ipynb)
26 | 3. [Trees for Classification and Prediction](ModernTimeSeriesAnalysis/MachineLearning/3_Trees_for_Classification_and_Prediction_INSTRUCTOR.ipynb)
27 | 4. [Clustering](ModernTimeSeriesAnalysis/MachineLearning/4_Clustering_INSTRUCTOR.ipynb)
28 | 5. [Forecasting electricity use with mxnet](ModernTimeSeriesAnalysis/DeepLearning/Electricity/5_Forecasting_electric_use_with_mxnet_INSTRUCTOR.ipynb)
29 | 6. [Stocks](ModernTimeSeriesAnalysis/DeepLearning/Stocks/6_Stocks_INSTRUCTOR.ipynb)
30 |
31 |
32 | ## Notes
33 |
34 | Don't expect anything exciting here. These are literally just my notes.
35 |
36 |
37 | ### Structural Time Series
38 |
39 | * [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) models are old school, but if your machine learning toys don't do better than this, why bother?
40 | * [Kalman Filtering!](https://en.wikipedia.org/wiki/Kalman_filter) - Love me a Kalman Filter (see the sketch below)
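41 | 
42 | For reference, a minimal sketch of the kind of structural model the first notebook fits, using statsmodels' `UnobservedComponents` (which runs a Kalman filter under the hood). The CSV path and column name are placeholders:
43 | 
44 | ```python
45 | import pandas as pd
46 | import statsmodels.api as sm
47 | 
48 | # placeholder data: any monthly series with a DatetimeIndex will do
49 | df = pd.read_csv('my_monthly_series.csv', index_col=0, parse_dates=True)
50 | 
51 | # smooth trend plus a 12-period (annual) seasonal component
52 | mod = sm.tsa.UnobservedComponents(df['y'], level='smooth trend', seasonal=12)
53 | res = mod.fit()
54 | print(res.summary())
55 | ```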
41 |
42 |
43 | ### Machine Learning - Time Series Trees for Classification and Prediction
44 |
45 | * Most machine learning on time series uses tools that weren't designed for time series data. We always have to use other tools and "make 'em work".
46 | * Doctors and nurses do "feature detection" on "time series data" when they look at heart rates from ECGs. Everything old is new again.
47 | * We're going to cover [Random Forests](https://en.wikipedia.org/wiki/Random_forest) and [Gradient Boosted Trees](https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting) with [xgboost](https://xgboost.readthedocs.io/en/latest/).
48 | * We're going to look at different time series data sets and figure out which ones look the most similar. If we had little snippets of time series data, the human eye might be able to find similar sets, but feature detection by human labor is costly and slow. ...It'll probably still be computationally intensive.
49 | * [Cesium](https://github.com/cesium-ml/cesium) is a feature generation library - Mostly just for initial exploration.
50 | * Are we chopping up our time series into blocks because it is cyclical in nature? Or would we do that anyway?
51 | * We used a sliding window; we DID NOT chop it into blocks (see the sketch after this list).
52 | * We used a machine learning method on a problem that had very little data. This was not a good choice.
53 | * The important lessons here were supposed to be:
54 | 1. Chop up your time series data to get more data.
55 | 2. If you don't have enough data, machine learning is the wrong approach.
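56 | 
57 | Since I keep mentioning the sliding window: a tiny sketch of that featurization plus xgboost (synthetic data and made-up parameters, not the notebook's code):
58 | 
59 | ```python
60 | import numpy as np
61 | import xgboost as xgb
62 | 
63 | # synthetic series: a noisy sine wave stands in for real data
64 | rng = np.random.RandomState(0)
65 | series = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.randn(500)
66 | 
67 | # sliding window: each example is the previous `window` values -> the next value
68 | window = 12
69 | X = np.array([series[i:i + window] for i in range(len(series) - window)])
70 | y = series[window:]
71 | 
72 | model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
73 | model.fit(X[:-50], y[:-50])             # train on all but the last 50 points
74 | pred = model.predict(X[-50:])
75 | print(np.mean(np.abs(pred - y[-50:])))  # MAE on the held-out tail
76 | ```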
56 |
57 |
58 | ### Machine Learning - Clustering
59 |
60 | * [Dynamic Time Warping](https://en.wikipedia.org/wiki/Dynamic_time_warping)
61 | * [Here](https://github.com/wannesm/dtaidistance) is one little library for DTW (see the sketch below)
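62 | 
63 | A tiny sketch of dtaidistance in action (synthetic series, just to show the API):
64 | 
65 | ```python
66 | import numpy as np
67 | from dtaidistance import dtw
68 | 
69 | # two similar series, out of phase: DTW shrugs off the shift,
70 | # plain Euclidean distance does not
71 | s1 = np.sin(np.linspace(0, 2 * np.pi, 40))
72 | s2 = np.sin(np.linspace(0, 2 * np.pi, 40) + 0.5)
73 | print(dtw.distance(s1, s2))
74 | print(np.linalg.norm(s1 - s2))
75 | ```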
62 |
63 |
64 | ### Deep Learning - Electric Use
65 |
66 | * Typically if you want to put time series data into a neural network, you use an [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network) (see the sketch after this list).
67 | * Research later: [GRU vs LSTM](https://datascience.stackexchange.com/questions/14581/when-to-use-gru-over-lstm)
68 | * You can also use a [CNN](https://en.wikipedia.org/wiki/Convolutional_neural_network)
69 | * Compared to RNNs, CNNs might be a little better for classification than for prediction
70 | * This example finally processes parallel time series signals together; a *MUCH* more interesting problem.
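71 | 
72 | A minimal sketch of the GRU shape-plumbing in mxnet's gluon API (random data; the hidden size and lengths are made up):
73 | 
74 | ```python
75 | import mxnet as mx
76 | from mxnet.gluon import rnn
77 | 
78 | # GRU over a batch of univariate series; the default layout is TNC:
79 | # (time steps, batch size, channels/features)
80 | layer = rnn.GRU(hidden_size=16, num_layers=1)
81 | layer.initialize()
82 | 
83 | x = mx.nd.random.randn(24, 8, 1)  # 24 time steps, batch of 8, 1 feature
84 | out = layer(x)
85 | print(out.shape)                  # (24, 8, 16): one hidden vector per step
86 | ```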
71 |
72 |
73 | ### Deep Learning - Stocks
74 |
75 | * Do this on your own later.
76 |
77 |
--------------------------------------------------------------------------------
/modern_time_series_analysis/SciPyModernTimeSeries.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/john-science/scipy_con_2019/7280bc1949f90b151048c0ed127bdd656064c2cb/modern_time_series_analysis/SciPyModernTimeSeries.pdf
--------------------------------------------------------------------------------
/modern_time_series_analysis/requirements.txt:
--------------------------------------------------------------------------------
1 | absl-py==0.7.1
2 | astor==0.8.0
3 | cesium==0.9.9
4 | certifi==2024.7.4
5 | chardet==3.0.4
6 | hmmlearn==0.2.2
7 | gast==0.2.2
8 | google-pasta==0.1.7
9 | graphviz==0.8.4
10 | grpcio==1.53.2
11 | h5py==2.9.0
12 | idna==3.7
13 | joblib>=1.2.0
14 | Keras-Applications==1.0.8
15 | Keras-Preprocessing==1.1.0
16 | Markdown==3.1.1
17 | mxnet==1.9.1
18 | numpy==1.22.0
19 | pandas==0.24.2
20 | patsy==0.5.1
21 | protobuf==3.18.3
22 | python-dateutil==2.8.0
23 | pytz==2019.1
24 | requests==2.32.0
25 | scikit-learn==1.5.0
26 | scipy==1.10.0
27 | six==1.12.0
28 | # sklearn==0.0 removed: the "sklearn" PyPI package is an empty shim; scikit-learn above is the real dependency
29 | statsmodels==0.10.0
30 | tensorboard==1.14.0
31 | tensorflow==2.12.1
32 | tensorflow-estimator==1.14.0
33 | termcolor==1.1.0
34 | urllib3==1.26.19
35 | Werkzeug==3.0.6
36 | wrapt==1.11.2
37 | xgboost==0.90
38 |
--------------------------------------------------------------------------------