DataScienceInteractivePython: Interactive Educational Data Science Python Dashboards Repository (0.0.1)
6 |
7 |
Interactive dashboards to help you over the intellectual hurdles of data science!
8 |
9 | *To support my students in my **Data Analytics and Geostatistics**, **Spatial Data Analytics** and **Machine Learning** courses and anyone else learning data analytics and machine learning, I have developed a set of Python interactive dashboards. When students struggle with a concept I make a new interactive dashboard so they can learn by playing with the statistics, models or theoretical concepts!*
10 |
11 | ### Michael Pyrcz, Professor, The University of Texas at Austin, Data Analytics, Geostatistics and Machine Learning
12 | #### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)
13 |
14 | ***
15 |
16 | ### Cite As:
17 |
18 | Pyrcz, Michael J. (2021). DataScienceInteractivePython: Educational Data Science Interactive Python Dashboards Repository (0.0.1). Zenodo. https://doi.org/10.5281/zenodo.5564966
19 |
20 | [](https://zenodo.org/doi/10.5281/zenodo.5564966)
21 |
22 | ***
23 |
24 | #### Binder
25 |
26 | To further support my students, I'm using [Binder](https://mybinder.readthedocs.io/en/latest/index.html) to host some of my **interactive Python spatial data analytics, geostatistics and machine learning demonstration workflows** online. Some of my students are having issues with setting up their local computing environments and instantiating the interactive workflows.
27 |
28 | * I hope this will assist these students and remove barriers for these educational tools to invite a wider audience that may benefit from experiential learning - playing with the systems and machines in real-time.
29 |
30 | [](https://mybinder.org/v2/gh/GeostatsGuy/DataScience_Interactive_Python/HEAD)
31 |
32 | Click on the link above to launch binder with container to run the included workflow.
33 |
34 | #### Setup
35 |
36 | A minimum environment includes:
37 |
38 | * Python 3.7.10 - due to the depdendency of GeostatsPy on the Numba package for code acceleration
39 | * MatPlotLib - plotting
40 | * NumPy - gridded data and array math
41 | * Pandas - tabulated data
42 | * SciPy - statistics module
43 | * ipywidgets - for plot interactivity
44 | * [GeostatsPy](https://pypi.org/project/geostatspy/) - geostatistical algorithms and functions (Pyrcz et al., 2021)
45 |
46 | The required datasets are available in the [GeoDataSets](https://github.com/GeostatsGuy/GeoDataSets) repository and linked in the workflows.
47 |
48 | #### Repository Summary
49 |
50 | The interative Python examples include a variety of topics like:
51 |
52 | * Bayesian and frequentist statistics
53 | * univariate and bivariate statistics
54 | * confidence intervals and hypothesis testing
55 | * Monte Carlo methods and bootstrap
56 | * inferential machine learning, principal component and cluster analysis
57 | * predictive machine learning, norms, model parameter training and hyperparameter tuning, overfit models
58 | * uncertainty modeling checking
59 | * spatial data debiasing
60 | * variogram calculation and modeling
61 | * spatial estimation, issues and trend modeling
62 | * spatial simulation and summarization over realizations
63 | * decision making in the presence of uncertainty
64 |
65 | If you want to see all my shared educational content check out:
66 | * [**Resources Inventory**](https://github.com/GeostatsGuy/Resources)
67 | * [**GeostatsGuy Lectures**](www.youtube.com/GeostatsGuyLectures)
68 |
69 | I hope this is helpful to anyone interested to learn about spatial data analytics, geostatistics and machine learning. I'm all about remoing barriers to education and encouraging folks to learn coding and data-driven modeling!
70 |
71 | Sincerely,
72 |
73 | Michael
74 |
75 | #### The Author:
76 |
77 | ### Michael Pyrcz, Professor, The University of Texas at Austin
78 | *Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*
79 |
80 | With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development.
81 |
82 | For more about Michael check out these links:
83 |
84 | #### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)
85 |
86 | #### Want to Work Together?
87 |
88 | I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
89 |
90 | * Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you!
91 |
92 | * Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
93 |
94 | * I can be reached at mpyrcz@austin.utexas.edu.
95 |
96 | I'm always happy to discuss,
97 |
98 | *Michael*
99 |
100 | Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin
101 |
102 | #### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)
103 |
104 |
105 |
--------------------------------------------------------------------------------
/Interactive_QQ_Plot.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": true
7 | },
8 | "source": [
9 | "
\n",
10 | " \n",
11 | "\n",
12 | "
\n",
13 | "\n",
14 | "## QQ-Plot Interactive Demonstration\n",
15 | "\n",
16 | "### QQ Plots in Python \n",
17 | "\n",
18 | "\n",
19 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
20 | "\n",
21 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "### Data Analytics: QQ Plots\n",
29 | "\n",
30 | "Here's a demonstration of calculation of QQ-plots in Python. This demonstration is part of the resources that I include for my courses in Spatial / Subsurface Data Analytics and Geostatistics at the Cockrell School of Engineering and Jackson School of Goesciences at the University of Texas at Austin. \n",
31 | "\n",
32 | "We will cover the following statistics:\n",
33 | "\n",
34 | "#### QQ-Plot\n",
35 | "* Convenient plot to compare distributions\n",
36 | "\n",
37 | "I have a lecture on QQ-plots available on [YouTube](https://www.youtube.com/watch?v=RETZus4XBNM). \n",
38 | "\n",
39 | "#### Getting Started\n",
40 | "\n",
41 | "Here's the steps to get setup in Python with the GeostatsPy package:\n",
42 | "\n",
43 | "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
44 | "2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n",
45 | "3. In the terminal type: pip install geostatspy. \n",
46 | "4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
47 | "\n",
48 | "You will need to copy the data file to your working directory. The dataset is available on my GitHub account in my GeoDataSets repository at:\n",
49 | "\n",
50 | "* Tabular data - [2D_MV_200wells.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/2D_MV_200wells.csv)\n",
51 | "\n",
52 | "#### Importing Packages\n",
53 | "\n",
54 | "We will need some standard packages. These should have been installed with Anaconda 3."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 1,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "%matplotlib inline\n",
64 | "from ipywidgets import interactive # widgets and interactivity\n",
65 | "from ipywidgets import widgets \n",
66 | "from ipywidgets import Layout\n",
67 | "from ipywidgets import Label\n",
68 | "from ipywidgets import VBox, HBox\n",
69 | "import numpy as np # ndarrys for gridded data\n",
70 | "import pandas as pd # DataFrames for tabular data\n",
71 | "import os # set working directory, run executables\n",
72 | "import matplotlib.pyplot as plt # plotting\n",
73 | "import matplotlib.gridspec as gridspec\n",
74 | "import matplotlib\n",
75 | "plt.rc('axes', axisbelow=True)"
76 | ]
77 | },
78 | {
79 | "cell_type": "markdown",
80 | "metadata": {},
81 | "source": [
82 | "#### Set the Working Directory\n",
83 | "\n",
84 | "I always like to do this so I don't lose files and to simplify subsequent read and writes (avoid including the full address each time). Set this to your working directory, with the above mentioned data file."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 2,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
94 | "l = widgets.Text(value=' Interactive QQ-Plot, Michael Pyrcz, Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'),continuous_update=True)\n",
95 | "\n",
96 | "n1 = widgets.IntSlider(min=0, max = 1000, value = 100, step = 10, description = '$n_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
97 | "n1.style.handle_color = 'red'\n",
98 | "\n",
99 | "m1 = widgets.FloatSlider(min=0.2, max = 0.8, value = 0.3, step = 0.1, description = '$\\overline{x}_{1}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
100 | "m1.style.handle_color = 'red'\n",
101 | "\n",
102 | "s1 = widgets.FloatSlider(min=0.0, max = 0.2, value = 0.03, step = 0.005, description = '$s_1$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
103 | "s1.style.handle_color = 'red'\n",
104 | "\n",
105 | "ui1 = widgets.VBox([n1,m1,s1],) # basic widget formatting \n",
106 | "\n",
107 | "n2 = widgets.IntSlider(min=0, max = 1000, value = 100, step = 10, description = '$n_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
108 | "n2.style.handle_color = 'blue'\n",
109 | "\n",
110 | "m2 = widgets.FloatSlider(min=0.2, max = 0.8, value = 0.2, step = 0.1, description = '$\\overline{x}_{2}$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
111 | "m2.style.handle_color = 'blue'\n",
112 | "\n",
113 | "s2 = widgets.FloatSlider(min=0, max = 0.2, value = 0.03, step = 0.005, description = '$s_2$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
114 | "s2.style.handle_color = 'blue'\n",
115 | "\n",
116 | "ui2 = widgets.VBox([n2,m2,s2],) # basic widget formatting \n",
117 | "\n",
118 | "nq = widgets.IntSlider(min=10, max = 1000, value = 100, step = 1, description = '$n_q$',orientation='horizontal',layout=Layout(width='300px', height='30px'),continuous_update=True)\n",
119 | "nq.style.handle_color = 'gray'\n",
120 | "\n",
121 | "plot = widgets.Checkbox(value=False,description='Make Plot')\n",
122 | "\n",
123 | "ui3 = widgets.VBox([nq,plot],) # basic widget formatting \n",
124 | "\n",
125 | "ui4 = widgets.HBox([ui1,ui2,ui3],) # basic widget formatting \n",
126 | "\n",
127 | "ui2 = widgets.VBox([l,ui4],)\n",
128 | "\n",
129 | "def f_make(n1, m1, s1, n2, m2, s2, nq,plot): # function to take parameters, make sample and plot\n",
130 | "\n",
131 | " seed = 73073; np.random.seed(seed=seed)\n",
132 | " X1 = np.random.normal(loc=m1,scale=s1,size=n1)\n",
133 | " X2 = np.random.normal(loc=m2,scale=s2,size=n2)\n",
134 | "\n",
135 | " xmin=0.0; xmax=0.6 \n",
136 | " \n",
137 | " cumul_prob = np.linspace(1,99,nq)\n",
138 | " X1_percentiles = np.percentile(X1,cumul_prob)\n",
139 | " X2_percentiles = np.percentile(X2,cumul_prob)\n",
140 | "\n",
141 | " fig = plt.figure()\n",
142 | " spec = fig.add_gridspec(2, 3)\n",
143 | " \n",
144 | " ax0 = fig.add_subplot(spec[:, 1:])\n",
145 | " ax0.scatter(X1_percentiles,X2_percentiles,color='darkorange',edgecolor='black',s=10,label='QQ-plot')\n",
146 | " ax0.plot([0,1],[0,1],ls='--',color='red')\n",
147 | " plt.grid(); plt.xlim([xmin,xmax]); plt.ylim([xmin,xmax]); plt.xlabel('X1 - Porosity (fraction)'); plt.ylabel('X2 - Porosity (fraction)'); \n",
148 | " plt.title('QQ-Plot'); plt.legend(loc='upper right')\n",
149 | " \n",
150 | " ax10 = fig.add_subplot(spec[0, 0])\n",
151 | " ax10.hist(X1,bins=np.linspace(xmin,xmax,30),color='red',alpha=0.3,edgecolor='black',label='X1',density=True)\n",
152 | " ax10.hist(X2,bins=np.linspace(xmin,xmax,30),color='blue',alpha=0.3,edgecolor='black',label='X2',density=True)\n",
153 | " ax10.grid(); plt.xlim([xmin,xmax]); ax10.set_ylim([0,15]); ax10.set_xlabel('Porosity (fraction)'); ax10.set_ylabel('Density')\n",
154 | " ax10.set_title('Histograms'); ax10.legend(loc='upper right')\n",
155 | " \n",
156 | " ax11 = fig.add_subplot(spec[1, 0])\n",
157 | " ax11.scatter(np.sort(X1),np.linspace(0,1,len(X1)),color='red',edgecolor='black',s=10,label='X1')\n",
158 | " ax11.scatter(np.sort(X2),np.linspace(0,1,len(X2)),color='blue',edgecolor='black',s=10,label='X2')\n",
159 | " ax11.grid(); ax11.set_xlim([xmin,xmax]); ax11.set_ylim([0,1]); ax11.set_xlabel('Porosity (fraction)'); ax11.set_ylabel('Cumulative Probability')\n",
160 | " ax11.set_title('CDFs'); ax11.legend(loc='upper left')\n",
161 | " \n",
162 | " if plot:\n",
163 | " fig.savefig('QQ_plot.png',dpi=600,facecolor='white')\n",
164 | " \n",
165 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.4, wspace=0.3, hspace=0.3); plt.show()\n",
166 | " \n",
167 | " \n",
168 | "# connect the function to make the samples and plot to the widgets \n",
169 | "interactive_plot = widgets.interactive_output(f_make, {'n1': n1, 'm1': m1, 's1': s1, 'n2': n2, 'm2': m2, 's2': s2, 'nq': nq, 'plot':plot})\n",
170 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "metadata": {},
176 | "source": [
177 | "### QQ-Plot, Comparing Distributions\n",
178 | "\n",
179 | "* demonstration of QQ-plots to compare distributions while varying the distributions\n",
180 | "\n",
181 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
182 | "\n",
183 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
184 | "\n",
185 | "### The Problem\n",
186 | "\n",
187 | "Let's make 2 random datasets, $\\color{red}{X_1}$ and $\\color{blue}{X_2}$.\n",
188 | "\n",
189 | "* **$n_1$**, **$n_2$** number of samples, **$\\overline{x}_1$**, **$\\overline{x}_2$** means and **$s_1$**, **$s_2$** standard deviation of the 2 sample sets\n",
190 | "* **$L$**: number of bootstrap realizations\n",
191 | "* **$\\alpha$**: alpha level"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 3,
197 | "metadata": {},
198 | "outputs": [
199 | {
200 | "data": {
201 | "application/vnd.jupyter.widget-view+json": {
202 | "model_id": "50c6357d5de24a68a43a2463e2515e59",
203 | "version_major": 2,
204 | "version_minor": 0
205 | },
206 | "text/plain": [
207 | "VBox(children=(Text(value=' Interactive QQ-Plot, Michael Pyrcz, Professor, The University of Texas at Austin',…"
208 | ]
209 | },
210 | "metadata": {},
211 | "output_type": "display_data"
212 | },
213 | {
214 | "data": {
215 | "application/vnd.jupyter.widget-view+json": {
216 | "model_id": "9ada9709a2b247a19f44a21da7460f6d",
217 | "version_major": 2,
218 | "version_minor": 0
219 | },
220 | "text/plain": [
221 | "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
222 | ]
223 | },
224 | "metadata": {},
225 | "output_type": "display_data"
226 | }
227 | ],
228 | "source": [
229 | "display(ui2, interactive_plot) # display the interactive plot"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "#### Comments\n",
237 | "\n",
238 | "This was a basic demonstration of QQ-plot in Python.\n",
239 | "\n",
240 | "I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at [Python Demos](https://github.com/GeostatsGuy/PythonNumericalDemos) and a Python package for data analytics and geostatistics at [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy). \n",
241 | " \n",
242 | "I hope this was helpful,\n",
243 | "\n",
244 | "*Michael*\n",
245 | "\n",
246 | "#### The Author:\n",
247 | "\n",
248 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
249 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
250 | "\n",
251 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
252 | "\n",
253 | "For more about Michael check out these links:\n",
254 | "\n",
255 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
256 | "\n",
257 | "#### Want to Work Together?\n",
258 | "\n",
259 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
260 | "\n",
261 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
262 | "\n",
263 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
264 | "\n",
265 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
266 | "\n",
267 | "I'm always happy to discuss,\n",
268 | "\n",
269 | "*Michael*\n",
270 | "\n",
271 | "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
272 | "\n",
273 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [],
281 | "source": []
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": []
289 | }
290 | ],
291 | "metadata": {
292 | "kernelspec": {
293 | "display_name": "Python 3 (ipykernel)",
294 | "language": "python",
295 | "name": "python3"
296 | },
297 | "language_info": {
298 | "codemirror_mode": {
299 | "name": "ipython",
300 | "version": 3
301 | },
302 | "file_extension": ".py",
303 | "mimetype": "text/x-python",
304 | "name": "python",
305 | "nbconvert_exporter": "python",
306 | "pygments_lexer": "ipython3",
307 | "version": "3.9.12"
308 | }
309 | },
310 | "nbformat": 4,
311 | "nbformat_minor": 2
312 | }
313 |
--------------------------------------------------------------------------------
/Interactive_Sampling_Methods.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "7a524981",
6 | "metadata": {},
7 | "source": [
8 | "
\n",
9 | " \n",
10 | "\n",
11 | "
\n",
12 | "\n",
13 | "## Sampling Methods Demonstration\n",
14 | "\n",
15 | "\n",
16 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
17 | "\n",
18 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
19 | "\n",
20 | "\n",
21 | "### The Interactive Workflow\n",
22 | "\n",
23 | "Here's a simple workflow for comparing random and orthogonal sampling. \n",
24 | "\n",
25 | "* we use a 'toy problem' to demonstrate and compare these sampling methods \n",
26 | "\n",
27 | "#### Sampling\n",
28 | "\n",
29 | "While statistical theory supports random sampling, the fluctuation in sample statistics is quite extreme for small samples sizes. For example, the variance in the sample mean is calculated by standard error:\n",
30 | "\n",
31 | "\\begin{equation}\n",
32 | "\\sigma_{\\overline{x}}^2 = \\frac{\\sigma_s^2}{n}\n",
33 | "\\end{equation}\n",
34 | "\n",
35 | "To suppress these statistical fluctuations, alternative sampling methods are available:\n",
36 | "\n",
37 | "1. **Random Sampling** - next sample is drawn without consideration of the previously drawn samples\n",
38 | "2. **Latin Hypercube Sampling** - apply $k$ equiprobability bins to each feature, $X_m^k, m=1,...,M$. Then draw one sample from each bin, $n(X_m^k)=1$. \n",
39 | "3. **Orthogonal Sampling** - divide the joint probability density function into $k$ equal probability subspaces and then randomly draw an equal number of samples, $\\frac{n}{k}$ from each subspace. \n",
40 | "\n",
41 | "#### Objective \n",
42 | "\n",
43 | "An interactive exercise to try out and compare random and orthgonal sampling.\n",
44 | "\n",
45 | "* observe the stabilization of the sample statistics\n",
46 | "* observe the impact of number of regions on the results for orthogonal sampling\n",
47 | "\n",
48 | "#### Getting Started\n",
49 | "\n",
50 | "Here's the steps to get setup in Python with the GeostatsPy package:\n",
51 | "\n",
52 | "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
53 | "2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. \n",
54 | "3. In the terminal type: pip install geostatspy. \n",
55 | "4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
56 | "\n",
57 | "You will need to copy the data file to your working directory. They are available here:\n",
58 | "\n",
59 | "* Tabular data - sample_data.csv at https://git.io/fh4gm.\n",
60 | "\n",
61 | "There are exampled below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, other example workflows and source code. \n",
62 | "\n",
63 | "#### Load the required libraries\n",
64 | "\n",
65 | "The following code loads the required libraries."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 1,
71 | "id": "c4f2c0f3",
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "import numpy as np # arrays and array math\n",
76 | "import pandas as pd # tabular data and tabular data math\n",
77 | "import matplotlib.pyplot as plt # data visualization\n",
78 | "from matplotlib.cm import colors \n",
79 | "import scipy.stats as stats # Gaussian PDF and random sampling\n",
80 | "from ipywidgets import interactive # widgets and interactivity\n",
81 | "from ipywidgets import widgets \n",
82 | "from ipywidgets import Layout\n",
83 | "from ipywidgets import Label\n",
84 | "from ipywidgets import VBox, HBox"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "id": "31ceda7a",
90 | "metadata": {},
91 | "source": [
92 | "#### Interactive Sampling Methods\n",
93 | "\n",
94 | "The following code includes:\n",
95 | "\n",
96 | "* dashboard with data and orthogonal sampling parameters, number of samples and number of subspaces\n",
97 | "\n",
98 | "* plots of the data distribution and random and orthogonal samples"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 2,
104 | "id": "18f16228",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "# interactive calculation of the sample set (control of source parametric distribution and number of samples)\n",
109 | "style = {'description_width': 'initial'}\n",
110 | "l = widgets.Text(value=' Sampling Methods, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
111 | "nsamp = widgets.IntSlider(min = 1, max = 1000, value = 10, step = 5, description = '$n_{sample}$',orientation='horizontal',\n",
112 | " layout=Layout(width='500px', height='30px'),continuous_update = False)\n",
113 | "nsamp.style.handle_color = 'darkorange'\n",
114 | "npart = widgets.IntSlider(min = 1, max = 20, value = 4, step = 1, description = '$n_{subspace}$',orientation='horizontal',\n",
115 | " layout=Layout(width='500px', height='30px'),continuous_update = False)\n",
116 | "npart.style.handle_color = 'darkorange'\n",
117 | "\n",
118 | "uipars = widgets.HBox([nsamp,npart],) \n",
119 | "uik = widgets.VBox([l,uipars],)\n",
120 | "\n",
121 | "def f_make_sample(nsamp,npart): # function to take parameters, make sample and plot\n",
122 | " mean = 10.0; stdev = 2.0\n",
123 | " npop = 100000\n",
124 | " parts = []\n",
125 | " np.random.seed(seed = 79079)\n",
126 | " nbin = 70\n",
127 | " hbins = np.linspace(0,20,nbin)\n",
128 | " shbins = np.linspace(0,20,nbin*100)\n",
129 | " \n",
130 | " cmap = plt.cm.hot; norm = colors.Normalize(vmin=1, vmax=npart+1)\n",
131 | " \n",
132 | " x = np.random.normal(loc=mean,scale=stdev,size=npop)\n",
133 | " xs = np.random.choice(x,nsamp,replace = False)\n",
134 | " yhat = stats.norm.pdf(shbins,loc=mean,scale=stdev)\n",
135 | " \n",
136 | " ax1 = plt.subplot(121)\n",
137 | " ax1.plot(shbins,yhat,color='black',lw=2.0,zorder=1)\n",
138 | " \n",
139 | " ax2 = ax1.twinx()\n",
140 | " hist2ax,_,_ = ax2.hist(xs,bins=hbins,color='grey',alpha=1.0,edgecolor='black',zorder=10,density=True,\n",
141 | " histtype=u'step',linewidth=2,label='Samples'); ax1.set_xlabel('Porosity (%)')\n",
142 | " ax2.hist(xs,bins=hbins,color='grey',alpha=0.2,zorder=20,density=True)\n",
143 | " ax1.fill_between(shbins,0,yhat,color='darkorange',alpha=0.8,zorder=1)\n",
144 | " ax1.set_xlabel('Porosity (%)'); ax1.set_ylabel('Population Density'); ax1.set_title('Population and Random Sample'); \n",
145 | " ax1.set_ylim([0,0.3]); ax1.set_xlim([2,18])\n",
146 | " plt.legend(loc='upper right')\n",
147 | " ax2.set_ylabel('Sample Density',rotation=270,labelpad=20);\n",
148 | " \n",
149 | " pbins = np.percentile(x,np.linspace(0,100,npart+1))\n",
150 | " int_values = pd.cut(x, pbins,labels = np.arange(1,npart+1,1))\n",
151 | " \n",
152 | " for i in range(0,npart):\n",
153 | " parts.append(x[int_values == int(i+1)])\n",
154 | " \n",
155 | " latin_samples = np.zeros(nsamp)\n",
156 | " ipart = 0\n",
157 | " for isamp in range(0,nsamp):\n",
158 | " latin_samples[isamp] = np.random.choice(parts[ipart],1,replace = False) \n",
159 | " ipart = ipart + 1\n",
160 | " if ipart >= npart:\n",
161 | " ipart = 0\n",
162 | " \n",
163 | " ax3 = plt.subplot(122)\n",
164 | " ax3.plot(shbins,yhat,color='black',lw=2.0,zorder=1)\n",
165 | " \n",
166 | " ax4 = ax3.twinx()\n",
167 | " hist2bx,_,_ = ax4.hist(latin_samples,bins=hbins,color='grey',alpha=1.0,edgecolor='black',zorder=10,density=True,\n",
168 | " histtype=u'step',linewidth=2,label='Samples')\n",
169 | " ax4.hist(latin_samples,bins=hbins,color='grey',alpha=0.2,zorder=20,density=True) \n",
170 | " ax3.set_xlabel('Porosity (%)'); ax3.set_title('Population and Orthogonal Samples')\n",
171 | " ax3.set_ylabel('Population Density'); ax3.set_ylim([0,0.3]); ax3.set_xlim([2,18])\n",
172 | " plt.legend(loc='upper right'); ax4.set_ylabel('Sample Density',rotation=270,labelpad=20)\n",
173 | " \n",
174 | " i = 0\n",
175 | " for fbin in pbins[1:]:\n",
176 | " ax3.vlines(fbin,0,stats.norm.pdf(fbin,loc=mean,scale=stdev),color='black',lw=1.0)\n",
177 | " ax3.fill_between(shbins,0,yhat,color=plt.cm.inferno(i/(npart+1)),alpha=0.8,where=(np.logical_and([shbins > pbins[i]],[shbins < pbins[i+1]])[0]),zorder=1)\n",
178 | " i = i + 1\n",
179 | " \n",
180 | " ylim_24 = max(np.max(hist2ax)*(1.1 + (1.25-1.1)/1000 * nsamp),np.max(hist2bx)*(1.1 + (1.25-1.1)/1000 * nsamp))\n",
181 | " ax2.set_ylim([0.0,ylim_24])\n",
182 | " ax4.set_ylim([0.0,ylim_24])\n",
183 | " \n",
184 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.3, hspace=0.2); plt.show()\n",
185 | " \n",
186 | "# connect the function to make the samples and plot to the widgets \n",
187 | "interactive_plot = widgets.interactive_output(f_make_sample, {'nsamp':nsamp, 'npart':npart})\n",
188 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "id": "4fc3b000",
194 | "metadata": {},
195 | "source": [
196 | "### Interactive Sampling Demonstration\n",
197 | "\n",
198 | "Compare random and orthogonal sampling. Select the number of samples and number of subspaces and observe the sample distribution.\n",
199 | "\n",
200 | "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
201 | "\n",
202 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
203 | "\n",
204 | "### The Inputs\n",
205 | "\n",
206 | "* **$n_{sample}$** - the number of samples, **$n_{subspace}$** - the number of orthogonal subspaces"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 3,
212 | "id": "29c114d5",
213 | "metadata": {},
214 | "outputs": [
215 | {
216 | "data": {
217 | "application/vnd.jupyter.widget-view+json": {
218 | "model_id": "befce3a17b46493d99095e1e5c42bbce",
219 | "version_major": 2,
220 | "version_minor": 0
221 | },
222 | "text/plain": [
223 | "VBox(children=(Text(value=' Sampling Methods, Michael Pyrcz, Asso…"
224 | ]
225 | },
226 | "metadata": {},
227 | "output_type": "display_data"
228 | },
229 | {
230 | "data": {
231 | "application/vnd.jupyter.widget-view+json": {
232 | "model_id": "7065cb80ca4946f2b3247fa088cdd8ef",
233 | "version_major": 2,
234 | "version_minor": 0
235 | },
236 | "text/plain": [
237 | "Output()"
238 | ]
239 | },
240 | "metadata": {},
241 | "output_type": "display_data"
242 | }
243 | ],
244 | "source": [
245 | "display(uik, interactive_plot) # display the interactive plot"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "id": "2b0d85f7",
251 | "metadata": {},
252 | "source": [
253 | "#### Comments\n",
254 | "\n",
255 | "This was an interactive demonstration of sampling for data analytics. Much more could be done, I have other demonstrations on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
256 | " \n",
257 | "#### The Author:\n",
258 | "\n",
259 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
260 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
261 | "\n",
262 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
263 | "\n",
264 | "For more about Michael check out these links:\n",
265 | "\n",
266 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
267 | "\n",
268 | "#### Want to Work Together?\n",
269 | "\n",
270 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
271 | "\n",
272 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
273 | "\n",
274 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
275 | "\n",
276 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
277 | "\n",
278 | "I'm always happy to discuss,\n",
279 | "\n",
280 | "*Michael*\n",
281 | "\n",
282 | "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
283 | "\n",
284 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) "
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "id": "090362ad",
291 | "metadata": {},
292 | "outputs": [],
293 | "source": []
294 | }
295 | ],
296 | "metadata": {
297 | "kernelspec": {
298 | "display_name": "Python 3 (ipykernel)",
299 | "language": "python",
300 | "name": "python3"
301 | },
302 | "language_info": {
303 | "codemirror_mode": {
304 | "name": "ipython",
305 | "version": 3
306 | },
307 | "file_extension": ".py",
308 | "mimetype": "text/x-python",
309 | "name": "python",
310 | "nbconvert_exporter": "python",
311 | "pygments_lexer": "ipython3",
312 | "version": "3.11.4"
313 | }
314 | },
315 | "nbformat": 4,
316 | "nbformat_minor": 5
317 | }
318 |
--------------------------------------------------------------------------------
/Interactive_PP_Plot.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "76df9c91",
6 | "metadata": {},
7 | "source": [
8 | "
\n",
12 | "\n",
13 | "## Interactive Gibbs Sampler \n",
14 | "\n",
15 | "### Michael J. Pyrcz, Professor, The University of Texas at Austin \n",
16 | "\n",
17 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "id": "2e15b023",
23 | "metadata": {},
24 | "source": [
25 | "#### Gibbs Sampler\n",
26 | "\n",
27 | "I teach the Gibbs Sampler as part of my lecture on Markov chain Monte Carlo (McMC) methods. This is critical to understand solution methods Bayesian machine learning methods. See my lectures:\n",
28 | "\n",
29 | "* [Bayesian linear regression lecture](https://youtu.be/LzZ5b3wdZQk?si=3Uu2pvCjsl1fH5qU)\n",
30 | "* [Markov chain Monte Carlo](https://youtu.be/7QX-yVboLhk?si=o7CSimpgFhjT1Vxo)\n",
31 | "* [Bayesian Linear Regression Example](https://youtu.be/JG69fxKzwt8?si=ywn9xC_Pe8YQwR2f)\n",
32 | "\n",
33 | "Gibbs sampler is one of the most intuitive methods for McMC.\n",
34 | "\n",
35 | "* as usual we don't have access to the joint distribution, but we have access to the conditional distributions. \n",
36 | "* instead of sampleing directly from the joint distribution (not available), we sequentially sample from the conditional distribution! \n",
37 | "\n",
38 | "For a bivariate example, features $X_1$ and $X_2$, we proceed as follows:\n",
39 | "\n",
40 | "1. Assign random values for $𝑋_1^{\\ell=0}$, $X_2^{\\ell=0}$\n",
41 | "\n",
42 | "2. Sample from $𝑓(𝑋_1|X_2^{\\ell=0})$ to get $𝑋_1^{\\ell=1}$ \n",
43 | "\n",
44 | "3. Sample from $𝑓(𝑋_2|X_1^{\\ell=1})$ to get $𝑋_2^{\\ell=1}$ \n",
45 | "\n",
46 | "4. Repeat for the next steps / samples, $\\ell = 1,\\ldots,𝐿$\n",
47 | "\n",
48 | "Although we only applied the conditional distribution, the resulting samples will have the correct joint distribution.\n",
49 | "\n",
50 | "\\begin{equation}\n",
51 | "f(X_1,X_2)\n",
52 | "\\end{equation}\n",
53 | "\n",
54 | "We never needed to use the joint distribution, we only needed the conditionals!\n",
55 | "\n",
56 | "* Bayesian Linear Regression - we apply Gibbs sampler to sample the posterior distributions of the model parameters given the data.\n",
57 | "\n",
58 | "#### Gibbs Sampler for Bivariate Gaussian Distribution\n",
59 | "\n",
60 | "Below I build out an interactive Gibbs sampler to sample the bivariate joint Gaussian distribution from only the conditional distributions!\n",
61 | "\n",
62 | "#### Load and Configure the Required Libraries\n",
63 | "\n",
64 | "The following code loads the required libraries and sets a plotting default."
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 1,
70 | "id": "da837ef7",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "%matplotlib inline\n",
75 | "supress_warnings = False\n",
76 | "import os # to set current working directory \n",
77 | "import sys # supress output to screen for interactive variogram modeling\n",
78 | "import numpy as np # arrays and matrix math\n",
79 | "import pandas as pd # DataFrames\n",
80 | "from scipy.stats import norm # Gaussian PDF\n",
81 | "import matplotlib.pyplot as plt # plotting\n",
82 | "import seaborn as sns # plot PDF\n",
83 | "from sklearn.model_selection import train_test_split # train and test split\n",
84 | "from sklearn import tree # tree program from scikit learn (package for machine learning)\n",
85 | "from sklearn import metrics # measures to check our models\n",
86 | "import scipy.spatial as spatial #search for neighbours\n",
87 | "from matplotlib.patches import Rectangle # build a custom legend\n",
88 | "from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks\n",
89 | "import math # sqrt operator\n",
90 | "from ipywidgets import interactive # widgets and interactivity\n",
91 | "from ipywidgets import widgets \n",
92 | "from ipywidgets import Layout\n",
93 | "from ipywidgets import Label\n",
94 | "from ipywidgets import VBox, HBox\n",
95 | "cmap = plt.cm.inferno # default color bar, no bias and friendly for color vision defeciency\n",
96 | "plt.rc('axes', axisbelow=True) # grid behind plotting elements\n",
97 | "if supress_warnings == True:\n",
98 | " import warnings # supress any warnings for this demonstration\n",
99 | " warnings.filterwarnings('ignore') "
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "id": "2b57659e",
105 | "metadata": {},
106 | "source": [
107 | "#### Declare Functions\n",
108 | "\n",
109 | "The following functions for clean code. \n",
110 | "\n",
111 | "* Just a improved grid for the plot."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 2,
117 | "id": "a333fd85",
118 | "metadata": {},
119 | "outputs": [],
120 | "source": [
121 | "def add_grid():\n",
122 | " plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids\n",
123 | " plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)\n",
124 | " plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks"
125 | ]
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "id": "e382a4e2",
130 | "metadata": {},
131 | "source": [
132 | "#### Interactive Gibbs Sampler to Sample the Bivariate Gausian Distribution Dashboard\n",
133 | "\n",
134 | "Here's a dashboard with a cool visualization for my interactive Gibbs sampler."
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 3,
140 | "id": "90be8276",
141 | "metadata": {},
142 | "outputs": [],
143 | "source": [
144 | "l = widgets.Text(value=' Interactive Gibbs Sampler Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
145 | " layout=Layout(width='750px', height='30px'))\n",
146 | "\n",
147 | "nsample = widgets.IntSlider(min=1, max = 101, value=10, step = 1, description = '$n_{sample}$',orientation='horizontal', \n",
148 | " style = {'description_width': 'initial'},layout=Layout(width='370px', height='30px'),continuous_update=False)\n",
149 | "rho = widgets.FloatSlider(min=-1.0, max = 1.0, value=0.7, step = 0.1, description = r'$\\rho_{X_1,X_2}$',orientation='horizontal',\n",
150 | " style = {'description_width': 'initial'},layout=Layout(width='370px', height='30px'),continuous_update=False)\n",
151 | "\n",
152 | "ui = widgets.HBox([nsample,rho],)\n",
153 | "ui2 = widgets.VBox([l,ui],)\n",
154 | "\n",
155 | "def run_plot(nsample,rho):\n",
156 | " mu1 = 0.0; sig1 = 1.0; mu2 = 0.0; sig2 = 1.0; seed = 73073; nc = 200\n",
157 | " \n",
158 | " L = nsample\n",
159 | " np.random.seed(seed=seed)\n",
160 | " x1 = np.zeros(L); x2 = np.zeros(L); x = np.linspace(-3,3,nc)\n",
161 | " \n",
162 | " x1[0] = np.random.rand(1) * 6.0 - 3.0; x2[0] = np.random.rand(1) * 6.0 - 3.0; \n",
163 | " \n",
164 | " plt.subplot(111)\n",
165 | " plt.scatter(x1[0],x2[0],color='grey',edgecolor='black',s=15,zorder=4)\n",
166 | " \n",
167 | " case = 0\n",
168 | " \n",
169 | " for l in range(1,L):\n",
170 | " if case == 0: # update x2\n",
171 | " x1[l] = x1[l-1]\n",
172 | " lmu = mu2 + rho * (sig2/sig1) * (x1[l] - mu1); lstd = 1 - rho**2\n",
173 | " x2[l] = np.random.normal(loc = lmu,scale = lstd,size = 1)\n",
174 | " case = 1\n",
175 | " plt.scatter(x1[l],x2[l],color='blue',edgecolor='black',s=15,alpha=1.0,zorder=100)\n",
176 | " plt.plot([x1[l-1],x1[l]],[x2[l-1],x2[l]],color='black',lw=1,alpha = max((l-(L-20))/20,0),zorder=4)\n",
177 | " plt.plot([x1[l-1],x1[l]],[x2[l-1],x2[l]],color='white',lw=3,alpha = max((l-(L-20))/20,0),zorder=3)\n",
178 | " if l == L-1:\n",
179 | " #plt.plot([x1[l],x1[l]],[-3,3],color='blue',alpha=0.7,zorder=10)\n",
180 | " pdf = norm.pdf(x, loc=lmu, scale=lstd)*0.5\n",
181 | " mask = pdf > np.percentile(pdf,q=40)\n",
182 | " plt.fill_betweenx(x[mask],x1[l]+pdf[mask],np.full(len(x[mask]),x1[l]),color='blue',alpha=0.2,zorder=2)\n",
183 | " plt.plot(x1[l]+pdf[mask],x[mask],color='blue',alpha=0.7,zorder=1)\n",
184 | " plt.arrow(x1[l-1],x2[l-1],0,x2[l]-x2[l-1],color='black',lw=0.5,head_width=0.05,length_includes_head=True,zorder=100)\n",
185 | " plt.scatter(x1[l],x2[l],color='white',edgecolor='blue',s=30,linewidth=1,alpha=1.0,zorder=100)\n",
186 | " plt.annotate(r'$f_{X_2|X_1}$ = ' + str(np.round(x1[l],2)),xy=[x1[l]+0.02,max(x[mask])-0.2],color='blue',rotation=-90)\n",
187 | " elif case == 1: # update x1\n",
188 | " x2[l] = x2[l-1]\n",
189 | " lmu = mu1 + rho * (sig1/sig2) * (x2[l] - mu2); lstd = 1 - rho**2\n",
190 | " x1[l] = np.random.normal(loc = lmu,scale = lstd,size = 1)\n",
191 | " case = 0\n",
192 | " plt.scatter(x1[l],x2[l],color='red',edgecolor='black',s=15,alpha=1.0,zorder=100)\n",
193 | " plt.plot([x1[l-1],x1[l]],[x2[l-1],x2[l]],color='black',lw=1,alpha = max((l-(L-20))/20,0),zorder=4)\n",
194 | " plt.plot([x1[l-1],x1[l]],[x2[l-1],x2[l]],color='white',lw=3,alpha = max((l-(L-20))/20,0),zorder=3)\n",
195 | " if l == L-1:\n",
196 | " #plt.plot([-3,3],[x2[l],x2[l]],color='red',alpha=0.7,zorder=10)\n",
197 | " pdf = norm.pdf(x, loc=lmu, scale=lstd)*0.5\n",
198 | " mask = pdf > np.percentile(pdf,q=40)\n",
199 | " plt.fill_between(x[mask],x2[l]+pdf[mask],np.full(len(x[mask]),x2[l]),color='red',alpha=0.2,zorder=2)\n",
200 | " plt.plot(x[mask],x2[l]+pdf[mask],color='red',alpha=0.7,zorder=1)\n",
201 | " plt.arrow(x1[l-1],x2[l-1],x1[l]-x1[l-1],0,color='black',lw=0.5,head_width=0.05,length_includes_head=True,zorder=100)\n",
202 | " plt.scatter(x1[l],x2[l],color='white',edgecolor='red',s=30,linewidth=1,alpha=1.0,zorder=100)\n",
203 | " plt.annotate(r'$f_{X_1|X_2}$ = ' + str(np.round(x2[l],2)),xy=[min(x[mask])-0.5,x2[l]+0.1],color='red')\n",
204 | " \n",
205 | " df = pd.DataFrame(np.vstack([x1,x2]).T, columns= ['x1','x2'])\n",
206 | " if L > 20:\n",
207 | " sns.kdeplot(data=df,x='x1',y='x2',color='grey',linewidths=1.0,alpha=min(((l-20)/20),1.0),levels=5,zorder=1)\n",
208 | " add_grid()\n",
209 | " plt.xlim([-3.5,3.5]); plt.ylim([-3.5,3.5]); plt.xlabel(r'$X_1$'); plt.ylabel(r'$X_2$'); plt.title('Gibbs Sampler - Bivariate Joint Gaussian Distribution')\n",
210 | " plt.subplots_adjust(left=0.0,bottom=0.0,right=1.0,top=1.1); plt.show() # set plot size \n",
211 | " \n",
212 | "# connect the function to make the samples and plot to the widgets \n",
213 | "interactive_plot = widgets.interactive_output(run_plot, {'nsample':nsample,'rho':rho})\n",
214 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating "
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "id": "faaceed1",
220 | "metadata": {},
221 | "source": [
222 | "### Interactive Gibbs Sampler Demonstation \n",
223 | "\n",
224 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
225 | "\n",
226 | "Set the number of samples and correlation coefficient and observe the Gibbs sampler.\n",
227 | "\n",
228 | "### The Inputs\n",
229 | "\n",
230 | "* **$n_{sample}$** - number of samples, **$\\rho_{X_1,X_2}$** - correlation coefficient"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": 4,
236 | "id": "899c4fa6",
237 | "metadata": {},
238 | "outputs": [
239 | {
240 | "data": {
241 | "application/vnd.jupyter.widget-view+json": {
242 | "model_id": "b530fb44f2a843f1960cddbbc6aef73d",
243 | "version_major": 2,
244 | "version_minor": 0
245 | },
246 | "text/plain": [
247 | "VBox(children=(Text(value=' Interactive Gibbs Sampler Demo, Prof. Michael Pyr…"
248 | ]
249 | },
250 | "metadata": {},
251 | "output_type": "display_data"
252 | },
253 | {
254 | "data": {
255 | "application/vnd.jupyter.widget-view+json": {
256 | "model_id": "e0e8fc2fdb43415697a1eaf74174806f",
257 | "version_major": 2,
258 | "version_minor": 0
259 | },
260 | "text/plain": [
261 | "Output()"
262 | ]
263 | },
264 | "metadata": {},
265 | "output_type": "display_data"
266 | }
267 | ],
268 | "source": [
269 | "display(ui2, interactive_plot) # display the interactive plot"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "id": "07eb83a5",
275 | "metadata": {},
276 | "source": [
277 | "#### Comments\n",
278 | "\n",
279 | "This was a basic demonstration of the Gibbs sampler for McMC. I have many other demonstrations and even basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
280 | " \n",
281 | "#### The Author:\n",
282 | "\n",
283 | "### Michael J. Pyrcz, Professor, The University of Texas at Austin \n",
284 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
285 | "\n",
286 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
287 | "\n",
288 | "For more about Michael check out these links:\n",
289 | "\n",
290 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
291 | "\n",
292 | "#### Want to Work Together?\n",
293 | "\n",
294 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
295 | "\n",
296 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
297 | "\n",
298 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
299 | "\n",
300 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
301 | "\n",
302 | "I'm always happy to discuss,\n",
303 | "\n",
304 | "*Michael*\n",
305 | "\n",
306 | "Michael Pyrcz, Ph.D., P.Eng. Professor, The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, Jackson School of Geosciences, The University of Texas at Austin\n",
307 | "\n",
308 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n",
309 | " "
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "id": "f51344db",
316 | "metadata": {},
317 | "outputs": [],
318 | "source": []
319 | }
320 | ],
321 | "metadata": {
322 | "kernelspec": {
323 | "display_name": "Python 3 (ipykernel)",
324 | "language": "python",
325 | "name": "python3"
326 | },
327 | "language_info": {
328 | "codemirror_mode": {
329 | "name": "ipython",
330 | "version": 3
331 | },
332 | "file_extension": ".py",
333 | "mimetype": "text/x-python",
334 | "name": "python",
335 | "nbconvert_exporter": "python",
336 | "pygments_lexer": "ipython3",
337 | "version": "3.11.4"
338 | }
339 | },
340 | "nbformat": 4,
341 | "nbformat_minor": 5
342 | }
343 |
--------------------------------------------------------------------------------
/Interactive_PCA_Eigen.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "56d479f5",
6 | "metadata": {},
7 | "source": [
8 | "
\n",
9 | " \n",
10 | "\n",
11 | "
\n",
12 | "\n",
13 | "### Interactive Workflow of Principal Component Analysis, Eigen Values and Eigen Vectors\n",
14 | " \n",
15 | "#### Michael Pyrcz, Associate Professor, University of Texas at Austin,\n",
16 | " \n",
17 | " ##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy),\n",
18 | "\n",
19 | "#### Introduction\n",
20 | "\n",
21 | "This semester I had students asking about how Eigen vectors and values behave in PCA. They wanted to see how they response to structure in covariance matrix. So I made this interactivity to demonstrate and visualize this! \n",
22 | "\n",
23 | "For more check out my YouTube channel [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig). For the walkthrough video of this workflow go here [walkthrough](TBD). Here's some basic concepts on Principal Component Analysis.\n",
24 | "\n",
25 | "#### Principal Component Analysis\n",
26 | "\n",
27 | "Principal Component Analysis one of a variety of methods for dimensional reduction:\n",
28 | "\n",
29 | "Dimensional reduction transforms the data to a lower dimension\n",
30 | "\n",
31 | "* Given features, $𝑋_1,\\dots,𝑋_𝑚$ we would require ${m \\choose 2}=\\frac{𝑚 \\cdot (𝑚−1)}{2}$ scatter plots to visualize just the two-dimensional scatter plots.\n",
32 | "\n",
33 | "* Once we have 4 or more variables understanding our data gets very hard.\n",
34 | "* Recall the curse of dimensionality, impact inference, modeling and visualization. \n",
35 | "\n",
36 | "One solution, is to find a good lower dimensional, $𝑝$, representation of the original dimensions $𝑚$\n",
37 | "\n",
38 | "Benefits of Working in a Reduced Dimensional Representation:\n",
39 | "\n",
40 | "1. Data storage / Computational Time\n",
41 | "2. Easier visualization\n",
42 | "3. Also takes care of multicollinearity \n",
43 | "\n",
44 | "#### Orthogonal Transformation \n",
45 | "\n",
46 | "Convert a set of observations into a set of linearly uncorrelated variables known as principal components\n",
47 | "\n",
48 | "* The number of principal components ($k$) available are min($𝑛−1,𝑚$) \n",
49 | "\n",
50 | "* Limited by the variables/features, $𝑚$, and the number of data\n",
51 | "\n",
52 | "Components are ordered\n",
53 | "\n",
54 | "* First component describes the larges possible variance / accounts for as much variability as possible\n",
55 | "* Next component describes the largest possible remaining variance \n",
56 | "* Up to the maximum number of principal components\n",
57 | "\n",
58 | "Eigen Values / Eigen Vectors\n",
59 | "\n",
60 | "* The Eigen values are the variance explained for each component\n",
61 | "* The Eigen vectors of the data covariance matrix are the principal components' loadings\n",
62 | "\n",
63 | "#### Install Packages\n",
64 | "\n",
65 | "For this interactive workflow to work, we need to install several packages relating to display features, widgets and data analysis interpretation."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 1,
71 | "id": "b9458018",
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "import pandas as pd # DataFrames and plotting\n",
76 | "import numpy as np\n",
77 | "import matplotlib.pyplot as plt # plotting\n",
78 | "from matplotlib.colors import ListedColormap # custom color maps\n",
79 | "import matplotlib.ticker as mtick\n",
80 | "from matplotlib.patches import Rectangle\n",
81 | "import matplotlib as mpl\n",
82 | "from mpl_toolkits.axes_grid1 import make_axes_locatable\n",
83 | "from numpy.linalg import eig # Eigen values and Eigen vectors\n",
84 | "from sklearn.decomposition import PCA # PCA program from scikit learn (package for machine learning)\n",
85 | "from sklearn.preprocessing import StandardScaler # normalize synthetic data\n",
86 | "from ipywidgets import interactive # widgets and interactivity\n",
87 | "from ipywidgets import widgets \n",
88 | "from ipywidgets import Layout\n",
89 | "from ipywidgets import Label\n",
90 | "from ipywidgets import VBox, HBox\n",
91 | "import warnings\n",
92 | "warnings.filterwarnings('ignore') # ignore warnings\n",
93 | "plt.rc('axes', axisbelow=True) # grids behind plot elements"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "id": "014a3dd1",
99 | "metadata": {},
100 | "source": [
101 | "#### Make the Dashbaord\n",
102 | "\n",
103 | "The numerical methods for this dashboard are:\n",
104 | "\n",
105 | "1. make a covariance matrix\n",
106 | "2. sample jointly from the multiGaussian distribution based on this covariance matrix\n",
107 | "3. standardize the samples to correct the mean and variance to 0.0 and 1.0 repsectively\n",
108 | "4. calculate the actual covariance matrix / this ensures that the covariance matrix is positive semidefiniate (because it is based on actual data)\n",
109 | "5. calculate the Eigen values and vectors, and sort by descending Eigen values\n",
110 | "6. plot the original feature variances, Eigen vectors and values."
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 2,
116 | "id": "36698d45",
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "l = widgets.Text(value=' PCA Eigen Vector / Component Loadings Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
121 | " layout=Layout(width='900px', height='30px'))\n",
122 | "# P_happening_label = widgets.Text(value='Probability of Happening',layout=Layout(width='50px',height='30px',line-size='0 px'))\n",
123 | "cstr = widgets.FloatSlider(min=0.0, max = 1.0, value=0.0, step = 0.1, description = r'$\\rho_{strength}$',orientation='horizontal', \n",
124 | " style = {'description_width':'initial','button_color':'green'},layout=Layout(width='600px',height='40px'),continuous_update=False,readout_format='.3f')\n",
125 | "\n",
126 | "ui_summary = widgets.HBox([cstr],)\n",
127 | "ui_summary1 = widgets.VBox([l,ui_summary],)\n",
128 | "\n",
129 | "def run_plot_summary(cstr):\n",
130 | " \n",
131 | " m = 4;\n",
132 | " \n",
133 | " mean = np.zeros((m)) # make inputs for multivariate dataset\n",
134 | " #cov = np.zeros((m,m))\n",
135 | " cov = np.full((m,m),0.0)\n",
136 | " for i in range(0,m):\n",
137 | " cov[i,i] = 1.0\n",
138 | " cov[0,1] = cov[1,0] = 0.99*cstr; cov[1,2] = cov[2,1] = -0.9*cstr; cov[0,2] = cov[2,0] = -0.7*cstr;\n",
139 | " \n",
140 | " data = np.random.multivariate_normal(mean = mean, cov = cov, size = 1000) # draw samples from MV Gaussian\n",
141 | " data = StandardScaler(copy=True, with_mean=True, with_std=True).fit(data).transform(data)\n",
142 | " \n",
143 | " cov_actual = np.cov(data,rowvar = False)\n",
144 | " \n",
145 | " eigen_values,eigen_vectors = eig(cov_actual) # Eigen values and vectors \n",
146 | " sorted_indices = np.argsort(-eigen_values)\n",
147 | " sorted_eigen_vectors = eigen_vectors[:, sorted_indices]\n",
148 | " sorted_eigen_values = np.sort(-eigen_values)*-1\n",
149 | " \n",
150 | " fig = plt.figure(figsize=(6, 6))\n",
151 | " gs = fig.add_gridspec(2,2 ,width_ratios=(1.0, 1.0))\n",
152 | " \n",
153 | " plt_center = fig.add_subplot(gs[1, 1])\n",
154 | " plt_x = fig.add_subplot(gs[1, 0],sharey=plt_center) \n",
155 | " plt_y = fig.add_subplot(gs[0, 1],sharex=plt_center) \n",
156 | " plt_extra = fig.add_subplot(gs[0, 0]) \n",
157 | " \n",
158 | " for i in range(0,m):\n",
159 | " for j in range(0,m):\n",
160 | " color = (sorted_eigen_vectors[j,i] + 1.0)/(2.0)\n",
161 | " plt_center.add_patch(Rectangle((i-0.5,j-0.5), 1, 1,color = plt.cm.RdGy_r(color),fill=True))\n",
162 | " \n",
163 | " if abs(sorted_eigen_vectors[j,i]) > 0.5:\n",
164 | " plt_center.annotate(np.round(sorted_eigen_vectors[j,i],1),(i-0.1,j-0.05),color='white')\n",
165 | " else:\n",
166 | " plt_center.annotate(np.round(sorted_eigen_vectors[j,i],1),(i-0.1,j-0.05))\n",
167 | " \n",
168 | " plt_center.set_xlim([-0.5,3.5]); plt_center.set_ylim([-0.5,3.5])\n",
169 | " plt_center.set_xticks([0,1, 2, 3],[1,2,3,4]); plt_center.set_yticks([0,1, 2, 3],[1,2,3,4])\n",
170 | " for x in np.arange(0.5,3.5,1.0):\n",
171 | " plt_center.plot([x,x],[-0.5,3.5],c='black',lw=3)\n",
172 | " plt_center.plot([-0.5,3.5],[x,x],c='black',lw=1,ls='--')\n",
173 | " plt_center.set_title('Eigen Vectors / Principal Component Loadings') \n",
174 | " plt_center.set_xlabel('Eigen Vector'); plt_center.set_ylabel('Feature')\n",
175 | " \n",
176 | " plt_x.barh(y=np.array([0,1,2,3],dtype='float'),width=np.var(data,axis=0),color='darkorange',edgecolor='black')\n",
177 | " plt_x.set_xlim([3.0,0]); plt_x.set_yticks([0,1, 2, 3],[1,2,3,4])\n",
178 | " plt_x.plot([1,1],[-0.5,3.5],c='black',ls='--'); plt_x.annotate('Equal Variance',(1.13,2.6),rotation=90.0,size=9)\n",
179 | " plt_x.set_ylabel('Feature'); plt_x.set_xlabel('Variance')\n",
180 | " plt_x.set_title('Original Feature Variance') \n",
181 | " plt_x.grid(axis='x',which='minor', color='#EEEEEE', linestyle=':', linewidth=0.5)\n",
182 | " plt_x.grid(axis='x',which='major', color='#DDDDDD', linewidth=0.8); plt_x.minorticks_on()\n",
183 | " for x in np.arange(0.5,3.5,1.0):\n",
184 | " plt_x.plot([-0.5,3.5],[x,x],c='black',lw=1,ls='--')\n",
185 | " \n",
186 | " plt_y.bar(x=np.array([0,1,2,3],dtype='float'),height=sorted_eigen_values,color='darkorange',edgecolor='black')\n",
187 | " plt_y.set_ylim([0,3.0]); plt_y.set_xticks([0,1, 2, 3],[1,2,3,4]); \n",
188 | " plt_y.plot([-0.5,3.5],[1,1],c='black',ls='--'); plt_y.annotate('Equal Variance',(2.55,1.05),size=9)\n",
189 | " plt_y.set_xlabel('Eigen Value'); plt_y.set_ylabel('Variance')\n",
190 | " plt_y.set_title('Sorted, Projected Feature Variance') \n",
191 | " plt_y.grid(axis='y',which='minor', color='#EEEEEE', linestyle=':', linewidth=0.5)\n",
192 | " plt_y.grid(axis='y',which='major', color='#DDDDDD', linewidth=0.8); plt_y.minorticks_on() \n",
193 | " for x in np.arange(0.5,3.5,1.0):\n",
194 | " plt_y.plot([x,x],[-0.5,3.5],c='black',lw=3)\n",
195 | "\n",
196 | " for i in range(0,m):\n",
197 | " for j in range(0,m):\n",
198 | " color = (cov_actual[j,i] + 1.0)/(2.0)\n",
199 | " plt_extra.add_patch(Rectangle((i-0.5,j-0.5), 1, 1,color = plt.cm.BrBG(color),fill=True))\n",
200 | " \n",
201 | " plt_extra.set_xlim([-0.5,3.5]); plt_extra.set_ylim([3.5,-0.5])\n",
202 | " plt_extra.set_xticks([0,1, 2, 3],[1,2,3,4]); plt_extra.set_yticks([0,1, 2, 3],[1,2,3,4])\n",
203 | " for x in np.arange(0.5,3.5,1.0):\n",
204 | " plt_extra.plot([x,x],[-0.5,3.5],c='black',lw=2)\n",
205 | " plt_extra.plot([-0.5,3.5],[x,x],c='black',lw=2)\n",
206 | " plt_extra.set_title('Covariance Matrix') \n",
207 | " \n",
208 | " cplt_extra = make_axes_locatable(plt_extra).append_axes('left', size='5%', pad=0.3)\n",
209 | " fig.colorbar(mpl.cm.ScalarMappable(norm=mpl.colors.Normalize(vmin=-1.0, vmax=1.0), cmap=plt.cm.BrBG),\n",
210 | " cax=cplt_extra, orientation='vertical')\n",
211 | " cplt_extra.yaxis.set_ticks_position('left')\n",
212 | " \n",
213 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.51, top=1.50, wspace=0.2, hspace=0.2); plt.show()\n",
214 | " \n",
215 | "interactive_plot_summary = widgets.interactive_output(run_plot_summary, {'cstr':cstr,})\n",
216 | "interactive_plot_summary.clear_output(wait = True) "
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "id": "9f1f60e1",
222 | "metadata": {},
223 | "source": [
224 | "### Interactive Principal Components Analysis, Component Loadings & Variance Explained Demostration\n",
225 | "\n",
226 | "* add data correlation / redundancy and observe the impact on the component loadings (Eigen vectors) and variance explained (Eigen values).\n",
227 | "\n",
228 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
229 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
230 | "\n",
231 | "The Inputs: **$\\rho_{strenth}$**: the strength of the correlation between features, scaler applied to $X_1$, $X_2$, & $X_3$ correlation."
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 3,
237 | "id": "b2108939",
238 | "metadata": {},
239 | "outputs": [
240 | {
241 | "data": {
242 | "application/vnd.jupyter.widget-view+json": {
243 | "model_id": "408af31c27ac462baa3f1779d3c164ca",
244 | "version_major": 2,
245 | "version_minor": 0
246 | },
247 | "text/plain": [
248 | "VBox(children=(Text(value=' PCA Eigen Vector / Component Loadings Demo…"
249 | ]
250 | },
251 | "metadata": {},
252 | "output_type": "display_data"
253 | },
254 | {
255 | "data": {
256 | "application/vnd.jupyter.widget-view+json": {
257 | "model_id": "e78bad5d43544f04a40fc3dc26cc7b03",
258 | "version_major": 2,
259 | "version_minor": 0
260 | },
261 | "text/plain": [
262 | "Output()"
263 | ]
264 | },
265 | "metadata": {},
266 | "output_type": "display_data"
267 | }
268 | ],
269 | "source": [
270 | "display(ui_summary1, interactive_plot_summary) # display the interactive plot"
271 | ]
272 | },
273 | {
274 | "cell_type": "markdown",
275 | "id": "15b997fe",
276 | "metadata": {},
277 | "source": [
278 | "#### Comments\n",
279 | "\n",
280 | "This was an interactive demonstrations of the Eigen values (variance explained) and Eigen vectors (component loadings) for Principal Components Analysis (PCA) with varianble between feature correlation. \n",
281 | "\n",
282 | "I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
283 | " \n",
284 | "I hope this was helpful,\n",
285 | "\n",
286 | "*Michael*\n",
287 | "\n",
288 | "#### The Author:\n",
289 | "\n",
290 | "### Michael J Pyrcz, Professor, The University of Texas at Austin \n",
291 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
292 | "\n",
293 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
294 | "\n",
295 | "For more about Michael check out these links:\n",
296 | "\n",
297 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
298 | "\n",
299 | "#### Want to Work Together?\n",
300 | "\n",
301 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
302 | "\n",
303 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
304 | "\n",
305 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
306 | "\n",
307 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
308 | "\n",
309 | "I'm always happy to discuss,\n",
310 | "\n",
311 | "*Michael*\n",
312 | "\n",
313 | "Michael Pyrcz, Ph.D., P.Eng. Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
314 | "\n",
315 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "execution_count": null,
321 | "id": "d23f7326",
322 | "metadata": {},
323 | "outputs": [],
324 | "source": []
325 | }
326 | ],
327 | "metadata": {
328 | "kernelspec": {
329 | "display_name": "Python 3 (ipykernel)",
330 | "language": "python",
331 | "name": "python3"
332 | },
333 | "language_info": {
334 | "codemirror_mode": {
335 | "name": "ipython",
336 | "version": 3
337 | },
338 | "file_extension": ".py",
339 | "mimetype": "text/x-python",
340 | "name": "python",
341 | "nbconvert_exporter": "python",
342 | "pygments_lexer": "ipython3",
343 | "version": "3.11.4"
344 | }
345 | },
346 | "nbformat": 4,
347 | "nbformat_minor": 5
348 | }
349 |
--------------------------------------------------------------------------------
/Interactive_Spurious_Correlations.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "
\n",
8 | " \n",
9 | "\n",
10 | "
\n",
11 | "\n",
12 | "## Interactive Spurious Correlations Demonstration\n",
13 | "\n",
14 | "### Too Few Samples May Result in Spurious Correlations\n",
15 | "\n",
16 | "* in class I bring in 3 red balls, 2 green balls and my cowboy hat, yes I have one, recall I was a farmhand in Alberta, Canada\n",
17 | "\n",
18 | "* then I have students volunteer, one holds the hat, one draws balls with replacement and one records the results on the board\n",
19 | "\n",
20 | "* through multiple bootstrap sample sets we demonstrate the use of bootstrap to calculate uncertainty in the proportion from the sample itself through sampling with replacement\n",
21 | "\n",
22 | "* with this workflow we all provide an interactive plot demonstration with matplotlib and ipywidget packages to demonstrate this virtually\n",
23 | "\n",
24 | "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
25 | "\n",
26 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
27 | "\n",
28 | "#### Source of Spurious Correlations\n",
29 | "\n",
30 | "Let's explore the source of spurious correlations:\n",
31 | "\n",
32 | "* too few sample data\n",
33 | "\n",
34 | "* this issue can be exagerated when sampling from skewed distributions with possibility for extreme values \n",
35 | "\n",
36 | "What's the issue?\n",
37 | "\n",
38 | "* anomalously large absolute correlations between independent features\n",
39 | "\n",
40 | "We 'data mine' relationships that don't exist! Great examples are available at the [Spurious Correlations](https://www.tylervigen.com/spurious-correlations) website.\n",
41 | "\n",
42 | "#### The Correlation Coefficient\n",
43 | "\n",
44 | "Pearson’s Product‐Moment Correlation Coefficient\n",
45 | "* Provides a measure of the degree of linear relationship.\n",
46 | "* We refer to it as the 'correlation coefficient'\n",
47 | "\n",
48 | "Let's review the sample variance of variable $x$. Of course, I'm truncating our notation as $x$ is a set of samples a locations in our modeling space, $x(\\bf{u_\\alpha}), \\, \\forall \\, \\alpha = 0, 1, \\dots, n - 1$.\n",
49 | "\n",
50 | "\\begin{equation}\n",
51 | "\\sigma^2_{x} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})^2}{(n-1)}\n",
52 | "\\end{equation}\n",
53 | "\n",
54 | "We can expand the the squared term and replace on of them with $y$, another variable in addition to $x$.\n",
55 | "\n",
56 | "\\begin{equation}\n",
57 | "C_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)}\n",
58 | "\\end{equation}\n",
59 | "\n",
60 | "We now have a measure that represents the manner in which variables $x$ and $y$ co-vary or vary together. We can standardized the covariance by the product of the standard deviations of $x$ and $y$ to calculate the correlation coefficent. \n",
61 | "\n",
62 | "\\begin{equation}\n",
63 | "\\rho_{xy} = \\frac{\\sum_{i=1}^{n} (x_i - \\overline{x})(y_i - \\overline{y})}{(n-1)\\sigma_x \\sigma_y}, \\, -1.0 \\le \\rho_{xy} \\le 1.0\n",
64 | "\\end{equation}\n",
65 | "\n",
66 | "In summary we can state that the correlation coefficient is related to the covariance as:\n",
67 | "\n",
68 | "\\begin{equation}\n",
69 | "\\rho_{xy} = \\frac{C_{xy}}{\\sigma_x \\sigma_y}\n",
70 | "\\end{equation}\n",
71 | "\n",
72 | "\n",
73 | "#### Objective \n",
74 | "\n",
75 | "Provide an example and demonstration for:\n",
76 | "\n",
77 | "1. interactive plotting in Jupyter Notebooks with Python packages matplotlib and ipywidgets\n",
78 | "2. provide an intuitive hands-on example to explore spurious correlations \n",
79 | "\n",
80 | "#### Getting Started\n",
81 | "\n",
82 | "Here's the steps to get setup in Python with the GeostatsPy package:\n",
83 | "\n",
84 | "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
85 | "2. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. \n",
86 | "\n",
87 | "#### Load the Required Libraries\n",
88 | "\n",
89 | "The following code loads the required libraries."
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 1,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "%matplotlib inline\n",
99 | "from ipywidgets import interactive # widgets and interactivity\n",
100 | "from ipywidgets import widgets \n",
101 | "from ipywidgets import Layout\n",
102 | "from ipywidgets import Label\n",
103 | "from ipywidgets import VBox, HBox\n",
104 | "import matplotlib.pyplot as plt # plotting\n",
105 | "from matplotlib.colors import ListedColormap\n",
106 | "import numpy as np # working with arrays\n",
107 | "import pandas as pd # working with DataFrames\n",
108 | "import seaborn as sns # for matrix scatter plots\n",
109 | "from scipy.stats import triang # parametric distributions\n",
110 | "from scipy.stats import binom\n",
111 | "from scipy.stats import norm\n",
112 | "from scipy.stats import uniform\n",
113 | "from scipy.stats import triang\n",
114 | "from scipy.stats import lognorm\n",
115 | "from scipy import stats # statistical calculations\n",
116 | "import random # random drawing / bootstrap realizations of the data\n",
117 | "from matplotlib.gridspec import GridSpec # control of subplots \n",
118 | "import seaborn as sns # for matrix scatter plots"
119 | ]
120 | },
121 | {
122 | "cell_type": "markdown",
123 | "metadata": {},
124 | "source": [
125 | "#### Make a Synthetic Dataset\n",
126 | "\n",
127 | "This is an interactive method to:\n",
128 | "\n",
129 | "* select a parametric distribution\n",
130 | "\n",
131 | "* select the distribution parameters\n",
132 | "\n",
133 | "* select the number of samples\n",
134 | "\n",
135 | "* select the number of features \n",
136 | "\n",
137 | "Then we will view the lower triangular correlation matrix\n",
138 | "\n",
139 | "* we will color the correlations that are large (in absolute value $\\gt 0.8$)"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 2,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "bins = np.linspace(-1,1,100) # set histogram bins\n",
149 | "\n",
150 | "# interactive calculation of the random sample set (control of source parametric distribution and number of samples)\n",
151 | "l = widgets.Text(value=' Spurious Correlation Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
152 | "dist = widgets.Dropdown(\n",
153 | " options=['Triangular', 'Uniform', 'Gaussian', 'LogNorm'],\n",
154 | " value='Gaussian',\n",
155 | " description='Dataset Distribution:',\n",
156 | " disabled=False,\n",
157 | " layout=Layout(width='200px', height='30px')\n",
158 | ")\n",
159 | "a = widgets.FloatSlider(min=0.0, max = 100.0, value = 0.5, description = 'Sample: Mean/Mode',orientation='vertical',layout=Layout(width='170px', height='200px'))\n",
160 | "a.style.handle_color = 'blue'\n",
161 | "d = widgets.FloatSlider(min=0.01, max = 30.0, value = 5.0, step = 1.0, description = 'Sample: St.Dev.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n",
162 | "d.style.handle_color = 'green'\n",
163 | "b = widgets.FloatSlider(min = 0, max = 100.0, value = 0.5, description = 'Sample: Min.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n",
164 | "b.style.handle_color = 'red'\n",
165 | "c = widgets.IntSlider(min = 0, max = 100, value = 100, description = 'Sample: Max.',orientation='vertical',layout=Layout(width='110px', height='200px'))\n",
166 | "c.style.handle_color = 'orange'\n",
167 | "n = widgets.IntSlider(min = 2, max = 1000, value = 4, description = 'Number Samples',orientation='vertical',layout=Layout(width='110px', height='200px'))\n",
168 | "n.style.handle_color = 'gray'\n",
169 | "m = widgets.IntSlider(min = 2, max = 20, value = 10, description = 'Number Features',orientation='vertical',layout=Layout(width='110px', height='200px'))\n",
170 | "m.style.handle_color = 'gray'\n",
171 | "\n",
172 | "uia = widgets.HBox([dist,a,d,b,c,n,m],kwargs = {'justify_content':'center'}) # basic widget formatting\n",
173 | "#uib = widgets.HBox([n, m],kwargs = {'justify_content':'center'}) # basic widget formatting \n",
174 | "ui2 = widgets.VBox([l,uia],)\n",
175 | "\n",
176 | "def f_make(dist,a, b, c, d, n, m): # function to take parameters, make sample and plot\n",
177 | " dataset = make_data(dist,a, b, c, d, n, m)\n",
178 | " df = pd.DataFrame(data = dataset)\n",
179 | " corr = df.corr()\n",
180 | "\n",
181 | "# build a mask to remove the upper triangle\n",
182 | " mask = np.triu(np.ones_like(corr, dtype=bool))\n",
183 | " corr_values = corr.values\n",
184 | " corr_values2 = corr_values[mask != True]\n",
185 | " \n",
186 | "# make a custom colormap\n",
187 | " my_colormap = plt.cm.get_cmap('RdBu_r', 256)\n",
188 | " newcolors = my_colormap(np.linspace(0, 1, 256))\n",
189 | " white = np.array([256/256, 256/256, 256/256, 1])\n",
190 | " newcolors[26:230, :] = white # mask all correlations less than abs(0.8)\n",
191 | " newcmp = ListedColormap(newcolors)\n",
192 | "\n",
193 | "# Draw the heatmap with the mask and correct aspect ratio\n",
194 | " fig, (ax1) = plt.subplots(1, 1)\n",
195 | " sns.set(font_scale = 0.8)\n",
196 | " sns.heatmap(corr, ax = ax1, annot = True, mask=mask, cmap=newcmp, vmin = -1.0, vmax=1.0, center=0,\n",
197 | " square=True, linewidths=.5, linecolor = 'white', linewidth = 1, cbar_kws={'shrink': .5, 'label': 'Correlation Coefficents'})\n",
198 | " ax1.set_xlabel('Random Independent Features'); ax1.set_ylabel('Random Independent Features')\n",
199 | " ax1.set_title('Lower Triangular Correlation Matrix Heat Map')\n",
200 | " \n",
201 | "# ax2.hist(corr_values2, alpha=0.2,color=\"red\",edgecolor=\"black\", bins = bins)\n",
202 | "# ax2.set_title('Lower Triangular Correlation Coefficent Distribution'); ax2.set_xlabel('Correlation Coefficent'); ax2.set_ylabel('Frequency') \n",
203 | "# ax2.set_facecolor('white'); ax2.grid(True);\n",
204 | " \n",
205 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.2, top=3.2, wspace=0.2, hspace=0.2)\n",
206 | " plt.show()\n",
207 | "\n",
208 | "def make_data(dist,a, b, c, d, n, m): # function to check parameters and make sample \n",
209 | " if dist == 'Uniform':\n",
210 | " if b >= c:\n",
211 | " print('Invalid uniform distribution parameters')\n",
212 | " return None\n",
213 | " dataset = uniform.rvs(size=[n,m], loc = b, scale = c, random_state = 73073).tolist()\n",
214 | " return dataset\n",
215 | " elif dist == 'Triangular':\n",
216 | " interval = c - b\n",
217 | " if b >= a or a >= c or interval <= 0:\n",
218 | " print('Invalid triangular distribution parameters')\n",
219 | " return None \n",
220 | " dataset = triang.rvs(size=[n,m], loc = b, c = (a-b)/interval, scale = interval, random_state = 73073).tolist()\n",
221 | " return dataset\n",
222 | " elif dist == 'Gaussian':\n",
223 | " dataset = norm.rvs(size=[n,m], loc = a, scale = d, random_state = 73073).tolist()\n",
224 | " return dataset\n",
225 | " elif dist == 'LogNorm':\n",
226 | " dataset = lognorm.rvs(size=[n,m], loc = a, scale = np.exp(a), s = d, random_state = 73073).tolist()\n",
227 | " return dataset\n",
228 | " \n",
229 | "# connect the function to make the samples and plot to the widgets \n",
230 | "interactive_plot = widgets.interactive_output(f_make, {'dist': dist,'a': a, 'd': d, 'b': b, 'c': c, 'n': n, 'm': m})\n",
231 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
232 | ]
233 | },
234 | {
235 | "cell_type": "markdown",
236 | "metadata": {},
237 | "source": [
238 | "### Spurious Correlations Demonstration\n",
239 | "\n",
240 | "* spurious correlations due to a combination of too few samples and skewed distribution\n",
241 | "\n",
242 | "* interactive plot demonstration with ipywidget, matplotlib packages\n",
243 | "\n",
244 | "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
245 | "\n",
246 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
247 | "\n",
248 | "### The Problem\n",
249 | "\n",
250 | "Let's simulate bootstrap, resampling with replacement from a hat with $n_{red}$ and $n_{green}$ balls\n",
251 | "\n",
252 | "* **$n_{red}$**: number of red balls in the sample (placed in the hat)\n",
253 | "\n",
254 | "* **$n_{green}$**: number of green balls in the sample (placed in the hat)\n",
255 | "\n",
256 | "* **$L$**: number of bootstrap realizations"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 3,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "data": {
266 | "application/vnd.jupyter.widget-view+json": {
267 | "model_id": "fa7e36176895433b8064d693754bf2f9",
268 | "version_major": 2,
269 | "version_minor": 0
270 | },
271 | "text/plain": [
272 | "VBox(children=(Text(value=' Spurious Correlation Demonstration, Michael P…"
273 | ]
274 | },
275 | "metadata": {},
276 | "output_type": "display_data"
277 | },
278 | {
279 | "data": {
280 | "application/vnd.jupyter.widget-view+json": {
281 | "model_id": "0961920d80ef49548349b9e21407f084",
282 | "version_major": 2,
283 | "version_minor": 0
284 | },
285 | "text/plain": [
286 | "Output()"
287 | ]
288 | },
289 | "metadata": {},
290 | "output_type": "display_data"
291 | }
292 | ],
293 | "source": [
294 | "display(ui2, interactive_plot) # display the interactive plot"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {},
300 | "source": [
301 | "#### Observations\n",
302 | "\n",
303 | "Some observations:\n",
304 | "\n",
305 | "* spurious correlations due to a combination of too few samples and skewed distribution\n",
306 | "\n",
307 | "* interactive plot demonstration with ipywidget, matplotlib packages\n",
308 | "\n",
309 | "\n",
310 | "#### Comments\n",
311 | "\n",
312 | "This was a simple demonstration of interactive plots in Jupyter Notebook Python with the ipywidgets and matplotlib packages. \n",
313 | "\n",
314 | "I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
315 | " \n",
316 | "I hope this was helpful,\n",
317 | "\n",
318 | "*Michael*\n",
319 | "\n",
320 | "#### The Author:\n",
321 | "\n",
322 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
323 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
324 | "\n",
325 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
326 | "\n",
327 | "For more about Michael check out these links:\n",
328 | "\n",
329 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
330 | "\n",
331 | "#### Want to Work Together?\n",
332 | "\n",
333 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
334 | "\n",
335 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
336 | "\n",
337 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
338 | "\n",
339 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
340 | "\n",
341 | "I'm always happy to discuss,\n",
342 | "\n",
343 | "*Michael*\n",
344 | "\n",
345 | "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
346 | "\n",
347 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": null,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": []
356 | }
357 | ],
358 | "metadata": {
359 | "kernelspec": {
360 | "display_name": "Python 3 (ipykernel)",
361 | "language": "python",
362 | "name": "python3"
363 | },
364 | "language_info": {
365 | "codemirror_mode": {
366 | "name": "ipython",
367 | "version": 3
368 | },
369 | "file_extension": ".py",
370 | "mimetype": "text/x-python",
371 | "name": "python",
372 | "nbconvert_exporter": "python",
373 | "pygments_lexer": "ipython3",
374 | "version": "3.9.12"
375 | }
376 | },
377 | "nbformat": 4,
378 | "nbformat_minor": 2
379 | }
380 |
--------------------------------------------------------------------------------
/Interactive_Variogram_Nugget_Effect.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "collapsed": true
7 | },
8 | "source": [
9 | "\n",
10 | "
\n",
11 | " \n",
12 | "\n",
13 | "
\n",
14 | "\n",
15 | "## Spatial Data Analytics \n",
16 | "\n",
17 | "### Interactive Demonstration of the Variogram Nugget Effect \n",
18 | "\n",
19 | "#### Michael Pyrcz, Associate Professor, The University of Texas at Austin \n",
20 | "\n",
21 | "##### Contacts: [Twitter/@GeostatsGuy](https://twitter.com/geostatsguy) | [GitHub/GeostatsGuy](https://github.com/GeostatsGuy) | [www.michaelpyrcz.com](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446)\n",
22 | "\n",
23 | "This a simple demonstration of the variogram nugget effect structure for a 1D datasets with variable spatial continuity and visualization.\n",
24 | "\n",
25 | "* we will see that the nugget effect results from random error\n",
26 | "\n",
27 | "* we will perform the calculations in 1D for fast run times and ease of visualization.\n",
28 | "\n",
29 | "#### Load the required libraries\n",
30 | "\n",
31 | "The following code loads the required libraries.\n"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 1,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import os # to set current working directory \n",
41 | "import numpy as np # arrays and matrix math\n",
42 | "import matplotlib.pyplot as plt # for plotting\n",
43 | "from matplotlib.gridspec import GridSpec # custom matrix plots\n",
44 | "plt.rc('axes', axisbelow=True) # set axes and grids in the background for all plots\n",
45 | "from ipywidgets import interactive # widgets and interactivity\n",
46 | "from ipywidgets import widgets \n",
47 | "from ipywidgets import Layout\n",
48 | "from ipywidgets import Label\n",
49 | "from ipywidgets import VBox, HBox\n",
50 | "import math # for square root\n",
51 | "from geostatspy import GSLIB # affine correction\n",
52 | "seed = 73073"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. "
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "#### Set the working directory\n",
67 | "\n",
68 | "I always like to do this so I don't lose files and to simplify subsequent read and writes (avoid including the full address each time). Also, in this case make sure to place the required (see below) data file in this working directory. "
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 2,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "#os.chdir(\"C:\\PGE337\") # set the working directory"
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "#### Declare Functions\n",
85 | "\n",
86 | "We need a variogram calculator that is fast and works well with 1D.\n",
87 | "\n",
88 | "* I have modified the gam function from GeostatsPy below.\n",
89 | "\n",
90 | "References:\n",
91 | "\n",
92 | "Pyrcz, M.J., Jo. H., Kupenko, A., Liu, W., Gigliotti, A.E., Salomaki, T., and Santos, J., 2021, GeostatsPy Python Package, PyPI, Python Package Index, https://pypi.org/project/geostatspy/."
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 3,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "def gam(array, tmin, tmax, xsiz, ysiz, ixd, iyd, nlag, isill):\n",
102 | " \"\"\"GSLIB's GAM program (Deutsch and Journel, 1998) converted from the\n",
103 | " original Fortran to Python by Michael Pyrcz, the University of Texas at\n",
104 | " Austin (Jan, 2019).\n",
105 | " :param array: 2D gridded data / model\n",
106 | " :param tmin: property trimming limit\n",
107 | " :param tmax: property trimming limit\n",
108 | " :param xsiz: grid cell extents in x direction\n",
109 | " :param ysiz: grid cell extents in y direction\n",
110 | " :param ixd: lag offset in grid cells\n",
111 | " :param iyd: lag offset in grid cells\n",
112 | " :param nlag: number of lags to calculate\n",
113 | " :param isill: 1 for standardize sill\n",
114 | " :return: TODO\n",
115 | " \"\"\"\n",
116 | " if array.ndim == 2:\n",
117 | " ny, nx = array.shape\n",
118 | " elif array.ndim == 1:\n",
119 | " ny, nx = len(array),1\n",
120 | " array = array.reshape((ny,1))\n",
121 | "\n",
122 | " nvarg = 1 # for multiple variograms repeat the program\n",
123 | " nxy = nx * ny # TODO: not used\n",
124 | " mxdlv = nlag\n",
125 | "\n",
126 | " # Allocate the needed memory\n",
127 | " lag = np.zeros(mxdlv)\n",
128 | " vario = np.zeros(mxdlv)\n",
129 | " hm = np.zeros(mxdlv)\n",
130 | " tm = np.zeros(mxdlv)\n",
131 | " hv = np.zeros(mxdlv) # TODO: not used\n",
132 | " npp = np.zeros(mxdlv)\n",
133 | " ivtail = np.zeros(nvarg + 2)\n",
134 | " ivhead = np.zeros(nvarg + 2)\n",
135 | " ivtype = np.zeros(nvarg + 2)\n",
136 | " ivtail[0] = 0\n",
137 | " ivhead[0] = 0\n",
138 | " ivtype[0] = 0\n",
139 | "\n",
140 | " # Summary statistics for the data after trimming\n",
141 | " inside = (array > tmin) & (array < tmax)\n",
142 | " avg = array[(array > tmin) & (array < tmax)].mean() # TODO: not used\n",
143 | " stdev = array[(array > tmin) & (array < tmax)].std()\n",
144 | " var = stdev ** 2.0\n",
145 | " vrmin = array[(array > tmin) & (array < tmax)].min() # TODO: not used\n",
146 | " vrmax = array[(array > tmin) & (array < tmax)].max() # TODO: not used\n",
147 | " num = ((array > tmin) & (array < tmax)).sum() # TODO: not used\n",
148 | "\n",
149 | " # For the fixed seed point, loop through all directions\n",
150 | " for iy in range(0, ny):\n",
151 | " for ix in range(0, nx):\n",
152 | " if inside[iy, ix]:\n",
153 | " vrt = array[iy, ix]\n",
154 | " ixinc = ixd\n",
155 | " iyinc = iyd\n",
156 | " ix1 = ix\n",
157 | " iy1 = iy\n",
158 | " for il in range(0, nlag):\n",
159 | " ix1 = ix1 + ixinc\n",
160 | " if 0 <= ix1 < nx:\n",
161 | " iy1 = iy1 + iyinc\n",
162 | " if 1 <= iy1 < ny:\n",
163 | " if inside[iy1, ix1]:\n",
164 | " vrh = array[iy1, ix1]\n",
165 | " npp[il] = npp[il] + 1\n",
166 | " tm[il] = tm[il] + vrt\n",
167 | " hm[il] = hm[il] + vrh\n",
168 | " vario[il] = vario[il] + ((vrh - vrt) ** 2.0)\n",
169 | "\n",
170 | " # Get average values for gam, hm, tm, hv, and tv, then compute the correct\n",
171 | " # \"variogram\" measure\n",
172 | " for il in range(0, nlag):\n",
173 | " if npp[il] > 0:\n",
174 | " rnum = npp[il]\n",
175 | " lag[il] = np.sqrt((ixd * xsiz * il) ** 2 + (iyd * ysiz * il) ** 2)\n",
176 | " vario[il] = vario[il] / float(rnum)\n",
177 | " hm[il] = hm[il] / float(rnum)\n",
178 | " tm[il] = tm[il] / float(rnum)\n",
179 | "\n",
180 | " # Standardize by the sill\n",
181 | " if isill == 1:\n",
182 | " vario[il] = vario[il] / var\n",
183 | "\n",
184 | " # Semivariogram\n",
185 | " vario[il] = 0.5 * vario[il]\n",
186 | " return lag, vario, npp"
187 | ]
188 | },
189 | {
190 | "cell_type": "markdown",
191 | "metadata": {},
192 | "source": [
193 | "#### Interactive Interface\n",
194 | "\n",
195 | "Here's the interactive interface. I make a correlated 1D data set, add noise and then calculate the histogram and variogram with and without noise. \n",
196 | "\n",
197 | "* the user specifies the proportion of noise and the spatial continuity range of the original data."
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": 6,
203 | "metadata": {
204 | "scrolled": false
205 | },
206 | "outputs": [],
207 | "source": [
208 | "\n",
209 | "n = 200; mean = 0.20; stdev = 0.03; nlag = 100; pnoise = 0.5; \n",
210 | "\n",
211 | "l = widgets.Text(value=' Variogram Nugger Effect Demonstration, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
212 | " layout=Layout(width='930px', height='30px'))\n",
213 | "\n",
214 | "pnoise = widgets.FloatSlider(min=0.0,max = 1.0,value=0.0,step = 0.05,description = 'Noise %',orientation='horizontal',style = {'description_width': 'initial'},layout=Layout(width='500px',height='30px'),continuous_update=False)\n",
215 | "vrange = widgets.IntSlider(min=1,max = 100,value=30,step = 5,description = 'Spatial Continuity Range',orientation='horizontal',style = {'description_width': 'initial'},layout=Layout(width='500px',height='30px'),continuous_update=False)\n",
216 | "\n",
217 | "ui = widgets.HBox([pnoise,vrange],)\n",
218 | "ui2 = widgets.VBox([l,ui],)\n",
219 | "\n",
220 | "def run_plot(pnoise,vrange):\n",
221 | "\n",
222 | " psignal = 1 - pnoise\n",
223 | "\n",
224 | " np.random.seed(seed = seed)\n",
225 | " data0 = np.random.normal(loc=0.20,scale=0.03,size=n+1000)\n",
226 | " \n",
227 | " kern1 = np.ones(vrange)\n",
228 | " data1 = np.convolve(data0,kern1,mode='same')\n",
229 | " data1_sub = GSLIB.affine(data1[500:n+500],mean,stdev)\n",
230 | " \n",
231 | " data1_sub_rescale = GSLIB.affine(data1[500:n+500],mean,stdev*math.sqrt(psignal))\n",
232 | " data1_sub_noise = data1_sub_rescale + np.random.normal(loc=0.0,scale = stdev*math.sqrt(pnoise),size=n)\n",
233 | " data1_sub_noise = GSLIB.affine(data1_sub_noise,mean,stdev)\n",
234 | " \n",
235 | " #fig, axs = plt.subplots(2,3, gridspec_kw={'width_ratios': [2, 1, 1, 1]})\n",
236 | " \n",
237 | " fig = plt.figure()\n",
238 | " spec = fig.add_gridspec(2, 3)\n",
239 | " \n",
240 | " ax1 = fig.add_subplot(spec[0, :])\n",
241 | " plt.plot(np.arange(1,n+1),data1_sub,color='blue',alpha=0.3,lw=3,label='Original')\n",
242 | " plt.plot(np.arange(1,n+1),data1_sub_noise,color='red',alpha=0.3,lw=3,label='Original + Noise')\n",
243 | " plt.xlim([0,n]); plt.ylim([mean-4*stdev,mean+4*stdev])\n",
244 | " plt.xlabel('Location (m)'); plt.ylabel('Porosity (%)'); plt.title('Porosity Over Location, Original and with Random Noise')\n",
245 | " plt.grid(); plt.legend(loc='upper right')\n",
246 | " \n",
247 | " ax2 = fig.add_subplot(spec[1, 0])\n",
248 | " plt.hist(data1_sub,color='blue',alpha=0.3,edgecolor='black',bins=np.linspace(mean-4*stdev,mean+4*stdev,30),\n",
249 | " label='Original')\n",
250 | " plt.hist(data1_sub_noise,color='red',alpha=0.3,edgecolor='black',bins=np.linspace(mean-4*stdev,mean+4*stdev,30),\n",
251 | " label='Original + Noise')\n",
252 | " plt.xlim([mean-4*stdev,mean+4*stdev]); plt.ylim([0,30])\n",
253 | " plt.xlabel('Porosity (%)'); plt.ylabel('Frequency'); plt.title('Histogram')\n",
254 | " plt.grid(); plt.legend(loc='upper right')\n",
255 | " \n",
256 | " ax3 = fig.add_subplot(spec[1, 1])\n",
257 | " labels = ['Signal','Noise',]\n",
258 | " plt.pie([psignal, pnoise,],radius = 1, autopct='%1.1f%%', \n",
259 | " colors = ['#0000FF','#FF0000'], explode = [.02,.02],wedgeprops = {\"edgecolor\":\"k\",'linewidth':1,\"alpha\":0.3},)\n",
260 | " plt.title('Variance of Signal and Noise')\n",
261 | " plt.legend(labels,loc='lower left')\n",
262 | " \n",
263 | " ax4 = fig.add_subplot(spec[1, 2])\n",
264 | " data1_sub_reshape = data1_sub.reshape((n,1))\n",
265 | " lag,gamma,npp = gam(data1_sub,-9999,9999,1.0,1.0,0,1,nlag,1)\n",
266 | " _,gamma_noise,_ = gam(data1_sub_noise,-9999,9999,1.0,1.0,0,1,nlag,1)\n",
267 | " plt.scatter(lag,gamma,s=30,color='blue',alpha=0.3,edgecolor='black',label='Original')\n",
268 | " plt.scatter(lag,gamma_noise,s=30,color='red',alpha=0.3,edgecolor='black',label='Original + Noise')\n",
269 | " plt.plot([0,nlag],[1.0,1.0],color='black',ls='--')\n",
270 | " plt.xlim([0,nlag]); plt.ylim([0,2.0]); plt.grid(); plt.legend(loc='upper right')\n",
271 | " plt.xlabel('Lag Distance (h)'); plt.ylabel('Variogram'); plt.title('Experimental Variogram')\n",
272 | "\n",
273 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.6, wspace=0.1, hspace=0.3); plt.show()\n",
274 | "\n",
275 | "# connect the function to make the samples and plot to the widgets \n",
276 | "interactive_plot = widgets.interactive_output(run_plot, {'pnoise':pnoise,'vrange':vrange})\n",
277 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "Take some time to observe a random phenomenon. \n",
285 | "\n",
286 | "* see any patterns, e.g., strings of low or high values, increasing or decreasing trends?\n",
287 | "\n",
288 | "#### Add Spatial Correlation\n",
289 | "\n",
290 | "We can use convolution to add spatial continuity to a random set of values\n",
291 | "\n",
292 | "* we won't go into the details, but the convolution kernel can actually be related to the variogram in sequential Gaussian simulation.\n",
293 | "\n",
294 | "* we apply an affine correction to ensure that we don't change the mean or standard deviation with the convolution, we just change the spatial continuity\n",
295 | "\n",
296 | "* since we are using convolution, it is likely that there will be edge artifacts, so we have 'cut off' the edges of the model (500 m on each side)."
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 7,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "data": {
306 | "application/vnd.jupyter.widget-view+json": {
307 | "model_id": "100c9cb7acbd4ae7b830fc4c82f8bf54",
308 | "version_major": 2,
309 | "version_minor": 0
310 | },
311 | "text/plain": [
312 | "VBox(children=(Text(value=' Variogram Nugger Effect Demons…"
313 | ]
314 | },
315 | "metadata": {},
316 | "output_type": "display_data"
317 | },
318 | {
319 | "data": {
320 | "application/vnd.jupyter.widget-view+json": {
321 | "model_id": "836cee2e2cd84072b1db197fe1481830",
322 | "version_major": 2,
323 | "version_minor": 0
324 | },
325 | "text/plain": [
326 | "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
327 | ]
328 | },
329 | "metadata": {},
330 | "output_type": "display_data"
331 | }
332 | ],
333 | "source": [
334 | "display(ui2, interactive_plot) # display the interactive plot"
335 | ]
336 | },
337 | {
338 | "cell_type": "markdown",
339 | "metadata": {},
340 | "source": [
341 | "#### Comments\n",
342 | "\n",
343 | "This was an interactive demonstration of the variogram nugget effect structure resulting from the addition of random noise to spatial data. \n",
344 | "\n",
345 | "I have many other demonstrations on simulation to build spatial models with spatial continuity and many other workflows available [here](https://github.com/GeostatsGuy/PythonNumericalDemos), along with a package for geostatistics in Python called [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy). \n",
346 | " \n",
347 | "We hope this was helpful,\n",
348 | "\n",
349 | "*Michael*\n",
350 | "\n",
351 | "***\n",
352 | "\n",
353 | "#### More on Michael Pyrcz and the Texas Center for Geostatistics:\n",
354 | "\n",
355 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
356 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
357 | "\n",
358 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
359 | "\n",
360 | "For more about Michael check out these links:\n",
361 | "\n",
362 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
363 | "\n",
364 | "#### Want to Work Together?\n",
365 | "\n",
366 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
367 | "\n",
368 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
369 | "\n",
370 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
371 | "\n",
372 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
373 | "\n",
374 | "I'm always happy to discuss,\n",
375 | "\n",
376 | "*Michael*\n",
377 | "\n",
378 | "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
379 | "\n",
380 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": null,
386 | "metadata": {},
387 | "outputs": [],
388 | "source": []
389 | }
390 | ],
391 | "metadata": {
392 | "kernelspec": {
393 | "display_name": "Python 3 (ipykernel)",
394 | "language": "python",
395 | "name": "python3"
396 | },
397 | "language_info": {
398 | "codemirror_mode": {
399 | "name": "ipython",
400 | "version": 3
401 | },
402 | "file_extension": ".py",
403 | "mimetype": "text/x-python",
404 | "name": "python",
405 | "nbconvert_exporter": "python",
406 | "pygments_lexer": "ipython3",
407 | "version": "3.9.12"
408 | }
409 | },
410 | "nbformat": 4,
411 | "nbformat_minor": 2
412 | }
413 |
--------------------------------------------------------------------------------
/Interactive_Overfit.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "
\n",
9 | " \n",
10 | "\n",
11 | "
\n",
12 | "\n",
13 | "## Subsurface Data Analytics \n",
14 | "\n",
15 | "## Interactive Demonstration of Machine Learning Model Tuning, Generalization & Overfit\n",
16 | "\n",
17 | "#### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
18 | "\n",
19 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
20 | "\n",
21 | "### PGE 383 Exercise: Interactive Predictive Model Complexity Tuning, Generalization & Overfit\n",
22 | "\n",
23 | "Here's a simple workflow, demonstration of predictive machine learning model training and testing for overfit. We use a:\n",
24 | "\n",
25 | "* simple polynomial model\n",
26 | "\n",
27 | "* 1 preditor feature and 1 response feature\n",
28 | "\n",
29 | "for an high interpretability model/ simple illustration.\n",
30 | "\n",
31 | "#### Train / Test Split\n",
32 | "\n",
33 | "The available data is split into training and testing subsets.\n",
34 | "\n",
35 | "* in general 15-30% of the data is withheld from training to apply as testing data\n",
36 | "\n",
37 | "* testing data selection should be fair, the same difficulty of predictions (offset/different from the training dat\n",
38 | "\n",
39 | "#### Machine Learning Model Traing\n",
40 | "\n",
41 | "The training data is applied to train the model parameters such that the model minimizes mismatch with the training data\n",
42 | "\n",
43 | "* it is common to use **mean square error** (known as a **L2 norm**) as a loss function summarizing the model mismatch\n",
44 | "\n",
45 | "* **miminizing the loss function** for simple models an anlytical solution may be available, but for most machine this requires an iterative optimization method to find the best model parameters\n",
46 | "\n",
47 | "This process is repeated over a range of model complexities specified by hyperparameters. \n",
48 | "\n",
49 | "#### Machine Learning Model Tuning\n",
50 | "\n",
51 | "The withheld testing data is retrieved and loss function (usually the **L2 norm** again) is calculated to summarize the error over the testing data\n",
52 | "\n",
53 | "* this is repeated over over the range of specified hypparameters\n",
54 | "\n",
55 | "* the model complexity / hyperparameters that minimize the loss function / error summary in testing is selected\n",
56 | "\n",
57 | "This is known are model hypparameter tuning.\n",
58 | "\n",
59 | "#### Machine Learning Model Overfit\n",
60 | "\n",
61 | "More model complexity/flexibility than can be justified with the available data, data accuracy, frequency and coverage\n",
62 | "\n",
63 | "* Model explains “idiosyncrasies” of the data, capturing data noise/error in the model\n",
64 | "\n",
65 | "* High accuracy in training, but low accuracy in testing / real-world use away from training data cases – poor ability of the model to generalize\n",
66 | "\n",
67 | "\n",
68 | "#### Workflow Goals\n",
69 | "\n",
70 | "Learn the basics of machine learning training, tuning for model generalization while avoiding model overfit.\n",
71 | "\n",
72 | "This includes:\n",
73 | "\n",
74 | "* Demonstrate model training and tuning by hand with an interactive exercies\n",
75 | "\n",
76 | "* Demonstrate the role of data error in leading to model overfit with complicated models\n",
77 | "\n",
78 | "#### Getting Started\n",
79 | "\n",
80 | "You will need to copy the following data files to your working directory. They are available [here](https://github.com/GeostatsGuy/GeoDataSets):\n",
81 | "\n",
82 | "* Tabular data - [Stochastic_1D_por_perm_demo.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/Stochastic_1D_por_perm_demo.csv)\n",
83 | "* Tabular data - [Random_Parabola.csv](https://github.com/GeostatsGuy/GeoDataSets/blob/master/Random_Parabola.csv)\n",
84 | "\n",
85 | "These datasets are available in the folder: https://github.com/GeostatsGuy/GeoDataSets.\n",
86 | "\n",
87 | "\n",
88 | "#### Import Required Packages\n",
89 | "\n",
90 | "We will also need some standard packages. These should have been installed with Anaconda 3."
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 1,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "import geostatspy.GSLIB as GSLIB # GSLIB utilies, visualization and wrapper\n",
100 | "import geostatspy.geostats as geostats # GSLIB methods convert to Python "
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "We will also need some standard packages. These should have been installed with Anaconda 3."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 2,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "%matplotlib inline\n",
117 | "import os # to set current working directory \n",
118 | "import sys # supress output to screen for interactive variogram modeling\n",
119 | "import io\n",
120 | "import numpy as np # arrays and matrix math\n",
121 | "import pandas as pd # DataFrames\n",
122 | "import matplotlib.pyplot as plt # plotting\n",
123 | "from sklearn.model_selection import train_test_split # train and test split\n",
124 | "from sklearn.metrics import mean_squared_error # model error calculation\n",
125 | "import scipy # kernel density estimator for PDF plot\n",
126 | "from matplotlib.pyplot import cm # color maps\n",
127 | "from ipywidgets import interactive # widgets and interactivity\n",
128 | "from ipywidgets import widgets \n",
129 | "from ipywidgets import Layout\n",
130 | "from ipywidgets import Label\n",
131 | "from ipywidgets import VBox, HBox\n",
132 | "import warnings\n",
133 | "warnings.filterwarnings('ignore') # supress warnings"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs. "
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "#### Build the Interactive Dashboard\n",
148 | "\n",
149 | "The following code:\n",
150 | "\n",
151 | "* makes a random dataset, change the random number seed and number of data for a different dataset\n",
152 | "* loops over polygonal fits of 1st-12th order, loops over mulitple realizations and calculates the average MSE and P10 and P90 vs. order\n",
153 | "* calculates a specific model example\n",
154 | "* plots the example model with training and testing data, the error distributions and the MSE envelopes vs. complexity"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 3,
160 | "metadata": {},
161 | "outputs": [],
162 | "source": [
163 | "l = widgets.Text(value=' Machine Learning Overfit/Generalization Demo, Prof. Michael Pyrcz, The University of Texas at Austin',\n",
164 | " layout=Layout(width='950px', height='30px'))\n",
165 | "\n",
166 | "n = widgets.IntSlider(min=15, max = 80, value=30, step = 1, description = 'n',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
167 | "split = widgets.FloatSlider(min=0.05, max = .95, value=0.20, step = 0.05, description = 'Test %',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
168 | "std = widgets.FloatSlider(min=0, max = 50, value=0, step = 1.0, description = 'Noise StDev',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
169 | "degree = widgets.IntSlider(min=1, max = 12, value=1, step = 1, description = 'Model Order',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
170 | "\n",
171 | "ui = widgets.HBox([n,split,std,degree],)\n",
172 | "ui2 = widgets.VBox([l,ui],)\n",
173 | "\n",
174 | "def run_plot(n,split,std,degree):\n",
175 | " seed = 13014; nreal = 20\n",
176 | " np.random.seed(seed) # seed the random number generator\n",
177 | " \n",
178 | " # make the datastet\n",
179 | " X_seq = np.linspace(0,20,100)\n",
180 | " X = np.random.rand(n)*20\n",
181 | " y = X*X + 50.0 # fit a parabola\n",
182 | " y = y + np.random.normal(loc = 0.0,scale=std,size=n) # add noise\n",
183 | " \n",
184 | " # calculate the MSE train and test over a range of complexity over multiple realizations of test/train split\n",
185 | " cdegrees = np.arange(1,13)\n",
186 | " cmse_train = np.zeros([len(cdegrees),nreal]); cmse_test = np.zeros([len(cdegrees),nreal])\n",
187 | " for j in range(0,nreal):\n",
188 | " for i, cdegree in enumerate(cdegrees):\n",
189 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed+j)\n",
190 | " ccoefs = np.polyfit(X_train,y_train,cdegree)\n",
191 | " y_pred_train = np.polyval(ccoefs, X_train)\n",
192 | " y_pred_test = np.polyval(ccoefs, X_test)\n",
193 | " cmse_train[i,j] = mean_squared_error(y_train, y_pred_train)\n",
194 | " cmse_test[i,j] = mean_squared_error(y_test, y_pred_test)\n",
195 | " # summarize over the realizations\n",
196 | " cmse_train_avg = cmse_train.mean(axis=1)\n",
197 | " cmse_test_avg = cmse_test.mean(axis=1)\n",
198 | " cmse_train_high = np.percentile(cmse_train,q=90,axis=1)\n",
199 | " cmse_train_low = np.percentile(cmse_train,q=10,axis=1) \n",
200 | " cmse_test_high = np.percentile(cmse_test,q=90,axis=1)\n",
201 | " cmse_test_low = np.percentile(cmse_test,q=10,axis=1)\n",
202 | " \n",
203 | "# cmse_train_high = np.amax(cmse_train,axis=1)\n",
204 | "# cmse_train_low = np.amin(cmse_train,axis=1) \n",
205 | "# cmse_test_high = np.amax(cmse_test,axis=1)\n",
206 | "# cmse_test_low = np.amin(cmse_test,axis=1)\n",
207 | " \n",
208 | " # build the one model example to show\n",
209 | " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split, random_state=seed)\n",
210 | " coefs = np.polyfit(X_train,y_train,degree) \n",
211 | " \n",
212 | " # calculate error\n",
213 | " error_seq = np.linspace(-100.0,100.0,100)\n",
214 | " error_train = np.polyval(coefs, X_train) - y_train\n",
215 | " #print(np.polyval(coefs, X_train))\n",
216 | " #print('truth')\n",
217 | " #print(X_train)\n",
218 | " error_test = np.polyval(coefs, X_test) - y_test\n",
219 | " \n",
220 | " mse_train = mean_squared_error(y_train, np.polyval(coefs, X_train))\n",
221 | " mse_test = mean_squared_error(y_test, np.polyval(coefs, X_test))\n",
222 | " \n",
223 | " error_train_std = np.std(error_train)\n",
224 | " error_test_std = np.std(error_test)\n",
225 | " \n",
226 | " kde_error_train = scipy.stats.gaussian_kde(error_train)\n",
227 | " kde_error_test = scipy.stats.gaussian_kde(error_test)\n",
228 | " \n",
229 | " plt.subplot(131)\n",
230 | " plt.plot(X_seq, np.polyval(coefs, X_seq), color=\"black\")\n",
231 | " plt.title(\"Polynomial Model of Degree = \"+str(degree))\n",
232 | " plt.scatter(X_train,y_train,c =\"red\",alpha=0.2,edgecolors=\"black\")\n",
233 | " plt.scatter(X_test,y_test,c =\"blue\",alpha=0.2,edgecolors=\"black\")\n",
234 | " plt.ylim([0,500]); plt.xlim([0,20]); plt.grid()\n",
235 | " plt.xlabel('Porosity (%)'); plt.ylabel('Permeability (mD)')\n",
236 | " \n",
237 | " plt.subplot(132)\n",
238 | " plt.hist(error_train, facecolor='red',bins=np.linspace(-50.0,50.0,10),alpha=0.2,density=True,edgecolor='black',label='Train')\n",
239 | " plt.hist(error_test, facecolor='blue',bins=np.linspace(-50.0,50.0,10),alpha=0.2,density=True,edgecolor='black',label='Test')\n",
240 | " #plt.plot(error_seq,kde_error_train(error_seq),lw=2,label='Train',c='red')\n",
241 | " #plt.plot(error_seq,kde_error_test(error_seq),lw=2,label='Test',c='blue') \n",
242 | " plt.xlim([-55.0,55.0]); plt.ylim([0,0.1])\n",
243 | " plt.xlabel('Model Error'); plt.ylabel('Frequency'); plt.title('Training and Testing Error, Model of Degree = '+str((degree)))\n",
244 | " plt.legend(loc='upper left')\n",
245 | " plt.grid(True)\n",
246 | " \n",
247 | " plt.subplot(133); ax = plt.gca()\n",
248 | " plt.plot(cdegrees,cmse_train_avg,lw=2,label='Train',c='red')\n",
249 | " ax.fill_between(cdegrees,cmse_train_high,cmse_train_low,facecolor='red',alpha=0.05)\n",
250 | " \n",
251 | " plt.plot(cdegrees,cmse_test_avg,lw=2,label='Test',c='blue') \n",
252 | " ax.fill_between(cdegrees,cmse_test_high,cmse_test_low,facecolor='blue',alpha=0.05)\n",
253 | " plt.xlim([1,12]); plt.yscale('log'); plt.ylim([0.0000001,10000])\n",
254 | " plt.xlabel('Complexity - Polynomial Order'); plt.ylabel('Mean Square Error'); plt.title('Training and Testing Error vs. Model Complexity')\n",
255 | " plt.legend(loc='upper left')\n",
256 | " plt.grid(True)\n",
257 | " \n",
258 | " plt.plot([degree,degree],[.0000001,100000],c = 'black',linewidth=3,alpha = 0.8)\n",
259 | " \n",
260 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.6, wspace=0.2, hspace=0.3)\n",
261 | " plt.show()\n",
262 | " \n",
263 | "# connect the function to make the samples and plot to the widgets \n",
264 | "interactive_plot = widgets.interactive_output(run_plot, {'n':n,'split':split,'std':std,'degree':degree})\n",
265 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating\n"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "### Interactive Predictive Machine Learning Overfitting Demonstation \n",
273 | "\n",
274 | "#### Michael Pyrcz, Associate Professor, The University of Texas at Austin \n",
275 | "\n",
276 | "Change the number of sample data, train/test split and the data noise and observe overfit! Change the model order to observe a specific model example.\n",
277 | "\n",
278 | "### The Inputs\n",
279 | "\n",
280 | "* **n** - number of data\n",
281 | "* **Test %** - percentage of sample data withheld as testing data\n",
282 | "* **Noise StDev** - standard deviation of random Gaussian error added to the data\n",
283 | "* **Model Order** - the order of the "
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 4,
289 | "metadata": {
290 | "scrolled": false
291 | },
292 | "outputs": [
293 | {
294 | "data": {
295 | "application/vnd.jupyter.widget-view+json": {
296 | "model_id": "",
297 | "version_major": 2,
298 | "version_minor": 0
299 | },
300 | "text/plain": [
301 | "VBox(children=(Text(value=' Machine Learning Overfit/Generalization Demo…"
302 | ]
303 | },
304 | "metadata": {},
305 | "output_type": "display_data"
306 | },
307 | {
308 | "data": {
309 | "application/vnd.jupyter.widget-view+json": {
310 | "model_id": "",
311 | "version_major": 2,
312 | "version_minor": 0
313 | },
314 | "text/plain": [
315 | "Output()"
316 | ]
317 | },
318 | "metadata": {},
319 | "output_type": "display_data"
320 | }
321 | ],
322 | "source": [
323 | "display(ui2, interactive_plot) # display the interactive plot"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "#### Comments\n",
331 | "\n",
332 | "This was a basic demonstration of machine learning model training and tuning, model generalization and complexity. I have many other demonstrations and even basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
333 | " \n",
334 | "#### The Author:\n",
335 | "\n",
336 | "### Michael Pyrcz, Associate Professor, University of Texas at Austin \n",
337 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
338 | "\n",
339 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
340 | "\n",
341 | "For more about Michael check out these links:\n",
342 | "\n",
343 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
344 | "\n",
345 | "#### Want to Work Together?\n",
346 | "\n",
347 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
348 | "\n",
349 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
350 | "\n",
351 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
352 | "\n",
353 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
354 | "\n",
355 | "I'm always happy to discuss,\n",
356 | "\n",
357 | "*Michael*\n",
358 | "\n",
359 | "Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin\n",
360 | "\n",
361 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) \n",
362 | " "
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": []
371 | },
372 | {
373 | "cell_type": "code",
374 | "execution_count": null,
375 | "metadata": {},
376 | "outputs": [],
377 | "source": []
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": null,
382 | "metadata": {},
383 | "outputs": [],
384 | "source": []
385 | }
386 | ],
387 | "metadata": {
388 | "kernelspec": {
389 | "display_name": "Python 3 (ipykernel)",
390 | "language": "python",
391 | "name": "python3"
392 | },
393 | "language_info": {
394 | "codemirror_mode": {
395 | "name": "ipython",
396 | "version": 3
397 | },
398 | "file_extension": ".py",
399 | "mimetype": "text/x-python",
400 | "name": "python",
401 | "nbconvert_exporter": "python",
402 | "pygments_lexer": "ipython3",
403 | "version": "3.9.12"
404 | }
405 | },
406 | "nbformat": 4,
407 | "nbformat_minor": 2
408 | }
409 |
--------------------------------------------------------------------------------
/Interactive_Norms.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "
\n",
9 | " \n",
10 | "\n",
11 | "
\n",
12 | "\n",
13 | "## Interactive Demonstration of Machine Learning Norms\n",
14 | "\n",
15 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
16 | "\n",
17 | "##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)\n",
18 | "\n",
19 | "### Norms, Vector Norms\n",
20 | "\n",
21 | "Here is an interactive workflows demonstrationing the impact of choice on norm on a simple predictive machine learning model, linear regression, that should help you efficiently learn modeling parameter training, central for predictive machine learning.\n",
22 | "\n",
23 | "I have recorded a walk-through of this interactive dashboard in my [Data Science Interactive Python Demonstrations](https://www.youtube.com/playlist?list=PLG19vXLQHvSDy26fM3hDLg3VCU7U5BGZl) series on my [YouTube](https://www.youtube.com/@GeostatsGuyLectures) channel.\n",
24 | "\n",
25 | "* Join me for walk-through of this dashboard [04 Data Science Interactive: Norms](TBD). I'm stoked to guide you and share observations and things to try out! \n",
26 | "\n",
27 | "* I have a lecture on [Norms](https://www.youtube.com/watch?v=JmxGlrurQp0&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=20) as part of my [Machine Learning](https://www.youtube.com/playlist?list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf) course. Note, for all my recorded lecture the interactive and well-documented workflow demononstrations are available on my GitHub repository [GeostatsGuy's Python Numerical Demos](https://github.com/GeostatsGuy/PythonNumericalDemos).\n",
28 | "\n",
29 | "* Also, I have lecture with a summary of [Machine Learning](https://www.youtube.com/watch?v=zOUM_AnI1DQ&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=11).\n",
30 | "\n",
31 | "* Finally, I have lecture predictive machine learning wiwth [Linear Regression](https://www.youtube.com/watch?v=0fzbyhWiP84&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=21).\n",
32 | "\n",
33 | "#### Norms\n",
34 | "\n",
35 | "When we are training a machine learning model, or other statistical model, to training data we calculate the error at all training data, $\\Delta y_{\\alpha} = \\hat{y}_{\\alpha} - y_{\\alpha}, \\ \\alpha = 1,. \\ldots ,n_{train}$. Yet, for the purpose finding the best set of model parameters we need:\n",
36 | "\n",
37 | "1. to convert the error into a measure of loss, in other words, assign a cost to error\n",
38 | "2. summarize the errors over all training data as a single value to support optimization\n",
39 | "\n",
40 | "Here is our vector of errors over all of the training data:\n",
41 | "\n",
42 | "\\begin{equation}\n",
43 | "\\begin{pmatrix} \\Delta y_1 \\\\ \\Delta y_2 \\\\ \\Delta y_3 \\\\ \\vdots \\\\ \\Delta y_n \\end{pmatrix}\n",
44 | "\\end{equation}\n",
45 | "\n",
46 | "Firstly, we can convert the error to a loss by adding a power to the errors. We use absolute value to avoid negative loss for odd $p$.\n",
47 | "\n",
48 | "\\begin{equation}\n",
49 | "\\begin{pmatrix} |\\Delta y_1|^p \\\\ |\\Delta y_2|^p \\\\ |\\Delta y_3|^p \\\\ \\vdots \\\\ |\\Delta y_n|^p \\end{pmatrix}\n",
50 | "\\end{equation}\n",
51 | "\n",
52 | "where $p$ is the power. The higher the $p$ the greater the sensitivity to large errors, e.g., outliers.\n",
53 | "\n",
54 | "Next we take this vector and we summarize as a single value, known as a **norm**, or as a **vector norm**.\n",
55 | "\n",
56 | "\\begin{equation}\n",
57 | "\\begin{pmatrix} |\\Delta y_1|^p \\\\ |\\Delta y_2|^p \\\\ |\\Delta y_3|^p \\\\ \\vdots \\\\ |\\Delta y_n|^p \\end{pmatrix} \\rightarrow ||\\Delta y||_p \\quad ||\\Delta y||_p = \\left( \\sum_{\\alpha=1}^{n_{train}} | \\Delta y_{\\alpha} |^p \\right)^{\\frac{1}{p}}\n",
58 | "\\end{equation}\n",
59 | "\n",
60 | "such that our norm of our error vector maps to a value $\\rightarrow [0,\\infty)$.\n",
61 | "\n",
62 | "#### Common Norms, Manhattan, Euclidean and the General p-Norm\n",
63 | "\n",
64 | "These are the common choices for norm.\n",
65 | "\n",
66 | "**Manhattan Norm**, known as the **L1 Norm**, $L^1$, where $p=1$ is defined as:\n",
67 | "\n",
68 | "\\begin{equation}\n",
69 | "||\\Delta y||_1 = \\sum_{\\alpha=1}^{n_{train}} |\\Delta y_{\\alpha}| \n",
70 | "\\end{equation}\n",
71 | "\n",
72 | "**Euclidean Norm**, known as the **L2 Norm**, $L^2$, where $p=2$ is defined as:\n",
73 | "\n",
74 | "\\begin{equation}\n",
75 | "||\\Delta y||_2 = \\sqrt{ \\sum_{\\alpha=1}^{n_{train}} \\left( \\Delta y_{\\alpha} \\right)^2 }\n",
76 | "\\end{equation}\n",
77 | "\n",
78 | "**p-Norm**, $L^p$, is defined as:\n",
79 | "\n",
80 | "\\begin{equation}\n",
81 | "||\\Delta y||_p = \\left( \\sum_{\\alpha=1}^{n_{train}} | \\Delta y_{\\alpha} |^p \\right)^{\\frac{1}{p}}\n",
82 | "\\end{equation}\n",
83 | "\n",
84 | "I provide more information in my [Norms](https://www.youtube.com/watch?v=JmxGlrurQp0&list=PLG19vXLQHvSC2ZKFIkgVpI9fCjkN38kwf&index=20) lecture, but it is good to mention that there are important differences between norms, e.g., L1 norm and L2 norm.\n",
85 | "\n",
86 | "| L1 Norm | L2 Norm |\n",
87 | "| :-: | :-: |\n",
88 | "| Robust | Not Very Robust |\n",
89 | "| Unstable | Stable |\n",
90 | "| Possibly Mulitple Solutions | Always a Single Solution |\n",
91 | "| Feature Selection Built-in | No Feature Selection |\n",
92 | "| Sparse Outputs | Non-sparse Outputs |\n",
93 | "| No Analytics Solutions | Analytical Solutions Possible |\n",
94 | "\n",
95 | "A couple of definitions that will assist with understanding the differences above that you may observe in the interactivity:\n",
96 | "\n",
97 | "* **Robust**: resistant to outliers. \n",
98 | "* **Unstable**: for small changes in training the trained model predictions may ‘jump’\n",
99 | "* **Multiple Solutions**: multiple paths same lengths in a city (Manhattan distance)\n",
100 | "* **Sparse Output**: model coefficients tend to 0.0.\n",
101 | "\n",
102 | "#### Norm Dashboard\n",
103 | "\n",
104 | "To demonstrate the impact of the choice of norms I wrote a linear regression algorithm that allows us to choose any $p$-norm! Yes, you can actually use fractional norms!\n",
105 | "\n",
106 | "* let's change the norm with and without an outlier and observe the impact on the linear regression model.\n",
107 | "\n",
108 | "#### Getting Started\n",
109 | "\n",
110 | "Here's the steps to get setup in Python with the GeostatsPy package:\n",
111 | "\n",
112 | "1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). \n",
113 | "\n",
114 | "That's all!\n",
115 | "\n",
116 | "#### Load the Required Libraries\n",
117 | "\n",
118 | "We will also need some standard Python packages. These should have been installed with Anaconda 3."
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 5,
124 | "metadata": {},
125 | "outputs": [],
126 | "source": [
127 | "%matplotlib inline\n",
128 | "import sys # supress output to screen for interactive variogram modeling\n",
129 | "import io\n",
130 | "import numpy as np # arrays and matrix math\n",
131 | "import pandas as pd # DataFrames\n",
132 | "import matplotlib.pyplot as plt # plotting\n",
133 | "from matplotlib.ticker import (MultipleLocator, AutoMinorLocator) # control of axes ticks\n",
134 | "from scipy.optimize import minimize # linear regression training by-hand with variable norms\n",
135 | "from ipywidgets import interactive # widgets and interactivity\n",
136 | "from ipywidgets import widgets \n",
137 | "from ipywidgets import Layout\n",
138 | "from ipywidgets import Label\n",
139 | "from ipywidgets import VBox, HBox"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "#### Declare Functions\n",
147 | "\n",
148 | "We have functions to perform linear regression for any norm. The code was modified from [N. Wouda](https://stackoverflow.com/questions/51883058/l1-norm-instead-of-l2-norm-for-cost-function-in-regression-model).\n",
149 | "* I modified the original functions for a general p-norm linear regression method"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 6,
155 | "metadata": {
156 | "scrolled": true
157 | },
158 | "outputs": [],
159 | "source": [
160 | "def predict(X, params): # linear prediction\n",
161 | " return X.dot(params)\n",
162 | "\n",
163 | "def loss_function(params, X, y, p): # custom p-norm, linear regression cost function\n",
164 | " return np.sum(np.power(np.abs(y - predict(X, params)),p))\n",
165 | "\n",
166 | "def add_grid():\n",
167 | " plt.gca().grid(True, which='major',linewidth = 1.0); plt.gca().grid(True, which='minor',linewidth = 0.2) # add y grids\n",
168 | " plt.gca().tick_params(which='major',length=7); plt.gca().tick_params(which='minor', length=4)\n",
169 | " plt.gca().xaxis.set_minor_locator(AutoMinorLocator()); plt.gca().yaxis.set_minor_locator(AutoMinorLocator()) # turn on minor ticks "
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "#### Interactive Dashboard\n",
177 | "\n",
178 | "This code designed the interactive dashboard, prediction model and plots"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 7,
184 | "metadata": {},
185 | "outputs": [],
186 | "source": [
187 | "# widgets and dashboard\n",
188 | "l = widgets.Text(value=' Machine Learning Norms Demonstration, Prof. Michael Pyrcz, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))\n",
189 | "\n",
190 | "p_norm = widgets.FloatSlider(min=0.1, max = 10, value=1.0, step = 0.2, description = '$L^{p}$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
191 | "n = widgets.IntSlider(min=15, max = 80, value=30, step = 1, description = '$n$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
192 | "std = widgets.FloatSlider(min=0.0, max = .95, value=0.00, step = 0.05, description = 'Error ($\\sigma$)',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
193 | "xn = widgets.FloatSlider(min=0, max = 1.0, value=0.5, step = 0.05, description = '$X_{n+1}$',orientation='horizontal',style = {'description_width': 'initial'}, continuous_update=False)\n",
194 | "yn = widgets.FloatSlider(min=0, max = 1.0, value=0.5, step = 0.05, description = '$Y_{n+1}$',orientation='horizontal', style = {'description_width': 'initial'}, continuous_update=False)\n",
195 | "\n",
196 | "ui1 = widgets.HBox([p_norm,n,std],)\n",
197 | "ui2 = widgets.HBox([xn,yn],)\n",
198 | "ui = widgets.VBox([l,ui1,ui2],)\n",
199 | "\n",
200 | "def run_plot(p_norm,n,std,xn,yn): # make data, fit models and plot\n",
201 | "\n",
202 | " np.random.seed(73073) # set random number seed for repeatable results\n",
203 | "\n",
204 | " X_seq = np.linspace(0,100.0,1000) # make data and add noise\n",
205 | " X_seq = np.asarray([np.ones((len(X_seq),)), X_seq]).T\n",
206 | " X = np.random.rand(n)*0.5\n",
207 | " y = X*X + 0.0 # fit a parabola\n",
208 | " y = y + np.random.normal(loc = 0.0,scale=std,size=n) # add noise\n",
209 | " X = np.asarray([np.ones((n,)), X]).T # concatenate a vector of 1's for the constant term\n",
210 | " \n",
211 | " X = np.vstack([X,[1,xn]]); y = np.append(y,yn) # add the user specified data value to X and y\n",
212 | " \n",
213 | " x0 = [0.5,0.5] # initial guess of model parameters\n",
214 | " p = 2.0\n",
215 | " output_l2 = minimize(loss_function, x0, args=(X, y, p)) # train the L2 norm linear regression model\n",
216 | " p = 1.0\n",
217 | " output_l1 = minimize(loss_function, x0, args=(X, y, p)) # train the L1 norm linear regression model\n",
218 | " p = 3.0\n",
219 | " output_l3 = minimize(loss_function, x0, args=(X, y, p)) # train the L3 norm linear regression model\n",
220 | " \n",
221 | " p = p_norm\n",
222 | " output_lcust = minimize(loss_function, x0, args=(X, y, p)) # train the p-norm linear regression model\n",
223 | "\n",
224 | " y_hat_l1 = predict(X_seq, output_l1.x) # predict over the range of X for all models\n",
225 | " y_hat_l2 = predict(X_seq, output_l2.x)\n",
226 | " y_hat_l3 = predict(X_seq, output_l3.x)\n",
227 | " y_hat_lcust = predict(X_seq, output_lcust.x)\n",
228 | " \n",
229 | " plt.subplot(111) # plot the results\n",
230 | " plt.scatter(X[:(n-1),1],y[:(n-1)],s=40,facecolor = 'white',edgecolor = 'black',alpha = 1.0,zorder=100)\n",
231 | " plt.scatter(X[n,1],y[n],s=40,marker='x',color = 'black',alpha = 1.0,zorder=100)\n",
232 | " plt.scatter(X[n,1],y[n],s=200,marker='o',lw=1.0,edgecolor = 'black',facecolor = 'white',alpha = 1.0,zorder=98)\n",
233 | " plt.annotate(r'$n+1$',[X[n,1]+0.02,y[n]+0.02])\n",
234 | " plt.plot(X_seq[:,1],y_hat_l1,c = 'blue',lw=7,alpha = 1.0,label = \"L1 Norm\",zorder=10)\n",
235 | " plt.plot(X_seq[:,1],y_hat_l2,c = 'red',lw=7,alpha = 1.0,label = \"L2 Norm\",zorder=10)\n",
236 | " plt.plot(X_seq[:,1],y_hat_l3,c = 'green',lw=7,alpha = 1.0,label = \"L3 Norm\",zorder=10)\n",
237 | " plt.plot(X_seq[:,1],y_hat_lcust,c = 'white',lw=4,alpha = 1.0,zorder=18)\n",
238 | " plt.plot(X_seq[:,1],y_hat_lcust,c = 'black',lw=2,alpha = 1.0,label = \"L\"+ str(p_norm) + \" Norm\",zorder=20)\n",
239 | " plt.xlabel(r'Predictor Feature, $X_{1}$'); plt.ylabel(r'Response Feature, $y$'); plt.title('Linear Regression with Variable Norm')\n",
240 | " plt.xlim([0.0,1.0]); plt.ylim([0.0,1.0])\n",
241 | " plt.legend(loc = 'upper left'); add_grid()\n",
242 | " \n",
243 | " plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.2, wspace=0.9, hspace=0.3)\n",
244 | " plt.show()\n",
245 | " \n",
246 | "# connect the function to make the samples and plot to the widgets \n",
247 | "interactive_plot = widgets.interactive_output(run_plot, {'p_norm':p_norm,'n':n,'std':std,'xn':xn,'yn':yn})\n",
248 | "interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating"
249 | ]
250 | },
251 | {
252 | "cell_type": "markdown",
253 | "metadata": {},
254 | "source": [
255 | "### Interactive Machine Learning Norms Demonstation \n",
256 | "\n",
257 | "#### Michael Pyrcz, Professor, The University of Texas at Austin \n",
258 | "\n",
259 | "Observe the impact of choice of norm with variable number of sample data, the data noise, and an outlier! \n",
260 | "\n",
261 | "### The Inputs\n",
262 | "\n",
263 | "* **p-norm** - 1 = Manhattan norm, 2 = Euclidean norm, etc., **n** - number of data, **Error** - random error in standard deviations\n",
264 | "* **$x_{n+1}$**, **$y_{n+1}$** - x and y location of an additional data value, potentially an outlier"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 8,
270 | "metadata": {
271 | "scrolled": false
272 | },
273 | "outputs": [
274 | {
275 | "data": {
276 | "application/vnd.jupyter.widget-view+json": {
277 | "model_id": "bcc568fddf18475a810f87504430ac93",
278 | "version_major": 2,
279 | "version_minor": 0
280 | },
281 | "text/plain": [
282 | "VBox(children=(Text(value=' Machine Learning Norms Demonstration, Prof. …"
283 | ]
284 | },
285 | "metadata": {},
286 | "output_type": "display_data"
287 | },
288 | {
289 | "data": {
290 | "application/vnd.jupyter.widget-view+json": {
291 | "model_id": "1bd5c2acf117461abba7994efec7897e",
292 | "version_major": 2,
293 | "version_minor": 0
294 | },
295 | "text/plain": [
296 | "Output(outputs=({'output_type': 'display_data', 'data': {'text/plain': '', 'i…"
297 | ]
298 | },
299 | "metadata": {},
300 | "output_type": "display_data"
301 | }
302 | ],
303 | "source": [
304 | "display(ui, interactive_plot) # display the interactive plot"
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "#### Comments\n",
312 | "\n",
313 | "This was a basic demonstration of machining learning norms. I have many other demonstrations and even basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy. \n",
314 | " \n",
315 | "#### The Author:\n",
316 | "\n",
317 | "### Michael Pyrcz, Professor, The University of Texas at Austin \n",
318 | "*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*\n",
319 | "\n",
320 | "With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. \n",
321 | "\n",
322 | "For more about Michael check out these links:\n",
323 | "\n",
324 | "#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)\n",
325 | "\n",
326 | "#### Want to Work Together?\n",
327 | "\n",
328 | "I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.\n",
329 | "\n",
330 | "* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! \n",
331 | "\n",
332 | "* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!\n",
333 | "\n",
334 | "* I can be reached at mpyrcz@austin.utexas.edu.\n",
335 | "\n",
336 | "I'm always happy to discuss,\n",
337 | "\n",
338 | "*Michael*\n",
339 | "\n",
340 | "Michael Pyrcz, Ph.D., P.Eng. Professor, Cockrell School of Engineering and The Jackson School of Geosciences, The University of Texas at Austin\n",
341 | "\n",
342 | "#### More Resources Available at: [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig) | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) "
343 | ]
344 | },
345 | {
346 | "cell_type": "code",
347 | "execution_count": null,
348 | "metadata": {},
349 | "outputs": [],
350 | "source": []
351 | }
352 | ],
353 | "metadata": {
354 | "kernelspec": {
355 | "display_name": "Python 3 (ipykernel)",
356 | "language": "python",
357 | "name": "python3"
358 | },
359 | "language_info": {
360 | "codemirror_mode": {
361 | "name": "ipython",
362 | "version": 3
363 | },
364 | "file_extension": ".py",
365 | "mimetype": "text/x-python",
366 | "name": "python",
367 | "nbconvert_exporter": "python",
368 | "pygments_lexer": "ipython3",
369 | "version": "3.11.4"
370 | }
371 | },
372 | "nbformat": 4,
373 | "nbformat_minor": 2
374 | }
375 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Attribution-ShareAlike 4.0 International
2 |
3 | =======================================================================
4 |
5 | Creative Commons Corporation ("Creative Commons") is not a law firm and
6 | does not provide legal services or legal advice. Distribution of
7 | Creative Commons public licenses does not create a lawyer-client or
8 | other relationship. Creative Commons makes its licenses and related
9 | information available on an "as-is" basis. Creative Commons gives no
10 | warranties regarding its licenses, any material licensed under their
11 | terms and conditions, or any related information. Creative Commons
12 | disclaims all liability for damages resulting from their use to the
13 | fullest extent possible.
14 |
15 | Using Creative Commons Public Licenses
16 |
17 | Creative Commons public licenses provide a standard set of terms and
18 | conditions that creators and other rights holders may use to share
19 | original works of authorship and other material subject to copyright
20 | and certain other rights specified in the public license below. The
21 | following considerations are for informational purposes only, are not
22 | exhaustive, and do not form part of our licenses.
23 |
24 | Considerations for licensors: Our public licenses are
25 | intended for use by those authorized to give the public
26 | permission to use material in ways otherwise restricted by
27 | copyright and certain other rights. Our licenses are
28 | irrevocable. Licensors should read and understand the terms
29 | and conditions of the license they choose before applying it.
30 | Licensors should also secure all rights necessary before
31 | applying our licenses so that the public can reuse the
32 | material as expected. Licensors should clearly mark any
33 | material not subject to the license. This includes other CC-
34 | licensed material, or material used under an exception or
35 | limitation to copyright. More considerations for licensors:
36 | wiki.creativecommons.org/Considerations_for_licensors
37 |
38 | Considerations for the public: By using one of our public
39 | licenses, a licensor grants the public permission to use the
40 | licensed material under specified terms and conditions. If
41 | the licensor's permission is not necessary for any reason--for
42 | example, because of any applicable exception or limitation to
43 | copyright--then that use is not regulated by the license. Our
44 | licenses grant only permissions under copyright and certain
45 | other rights that a licensor has authority to grant. Use of
46 | the licensed material may still be restricted for other
47 | reasons, including because others have copyright or other
48 | rights in the material. A licensor may make special requests,
49 | such as asking that all changes be marked or described.
50 | Although not required by our licenses, you are encouraged to
51 | respect those requests where reasonable. More_considerations
52 | for the public:
53 | wiki.creativecommons.org/Considerations_for_licensees
54 |
55 | =======================================================================
56 |
57 | Creative Commons Attribution-ShareAlike 4.0 International Public
58 | License
59 |
60 | By exercising the Licensed Rights (defined below), You accept and agree
61 | to be bound by the terms and conditions of this Creative Commons
62 | Attribution-ShareAlike 4.0 International Public License ("Public
63 | License"). To the extent this Public License may be interpreted as a
64 | contract, You are granted the Licensed Rights in consideration of Your
65 | acceptance of these terms and conditions, and the Licensor grants You
66 | such rights in consideration of benefits the Licensor receives from
67 | making the Licensed Material available under these terms and
68 | conditions.
69 |
70 |
71 | Section 1 -- Definitions.
72 |
73 | a. Adapted Material means material subject to Copyright and Similar
74 | Rights that is derived from or based upon the Licensed Material
75 | and in which the Licensed Material is translated, altered,
76 | arranged, transformed, or otherwise modified in a manner requiring
77 | permission under the Copyright and Similar Rights held by the
78 | Licensor. For purposes of this Public License, where the Licensed
79 | Material is a musical work, performance, or sound recording,
80 | Adapted Material is always produced where the Licensed Material is
81 | synched in timed relation with a moving image.
82 |
83 | b. Adapter's License means the license You apply to Your Copyright
84 | and Similar Rights in Your contributions to Adapted Material in
85 | accordance with the terms and conditions of this Public License.
86 |
87 | c. BY-SA Compatible License means a license listed at
88 | creativecommons.org/compatiblelicenses, approved by Creative
89 | Commons as essentially the equivalent of this Public License.
90 |
91 | d. Copyright and Similar Rights means copyright and/or similar rights
92 | closely related to copyright including, without limitation,
93 | performance, broadcast, sound recording, and Sui Generis Database
94 | Rights, without regard to how the rights are labeled or
95 | categorized. For purposes of this Public License, the rights
96 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
97 | Rights.
98 |
99 | e. Effective Technological Measures means those measures that, in the
100 | absence of proper authority, may not be circumvented under laws
101 | fulfilling obligations under Article 11 of the WIPO Copyright
102 | Treaty adopted on December 20, 1996, and/or similar international
103 | agreements.
104 |
105 | f. Exceptions and Limitations means fair use, fair dealing, and/or
106 | any other exception or limitation to Copyright and Similar Rights
107 | that applies to Your use of the Licensed Material.
108 |
109 | g. License Elements means the license attributes listed in the name
110 | of a Creative Commons Public License. The License Elements of this
111 | Public License are Attribution and ShareAlike.
112 |
113 | h. Licensed Material means the artistic or literary work, database,
114 | or other material to which the Licensor applied this Public
115 | License.
116 |
117 | i. Licensed Rights means the rights granted to You subject to the
118 | terms and conditions of this Public License, which are limited to
119 | all Copyright and Similar Rights that apply to Your use of the
120 | Licensed Material and that the Licensor has authority to license.
121 |
122 | j. Licensor means the individual(s) or entity(ies) granting rights
123 | under this Public License.
124 |
125 | k. Share means to provide material to the public by any means or
126 | process that requires permission under the Licensed Rights, such
127 | as reproduction, public display, public performance, distribution,
128 | dissemination, communication, or importation, and to make material
129 | available to the public including in ways that members of the
130 | public may access the material from a place and at a time
131 | individually chosen by them.
132 |
133 | l. Sui Generis Database Rights means rights other than copyright
134 | resulting from Directive 96/9/EC of the European Parliament and of
135 | the Council of 11 March 1996 on the legal protection of databases,
136 | as amended and/or succeeded, as well as other essentially
137 | equivalent rights anywhere in the world.
138 |
139 | m. You means the individual or entity exercising the Licensed Rights
140 | under this Public License. Your has a corresponding meaning.
141 |
142 |
143 | Section 2 -- Scope.
144 |
145 | a. License grant.
146 |
147 | 1. Subject to the terms and conditions of this Public License,
148 | the Licensor hereby grants You a worldwide, royalty-free,
149 | non-sublicensable, non-exclusive, irrevocable license to
150 | exercise the Licensed Rights in the Licensed Material to:
151 |
152 | a. reproduce and Share the Licensed Material, in whole or
153 | in part; and
154 |
155 | b. produce, reproduce, and Share Adapted Material.
156 |
157 | 2. Exceptions and Limitations. For the avoidance of doubt, where
158 | Exceptions and Limitations apply to Your use, this Public
159 | License does not apply, and You do not need to comply with
160 | its terms and conditions.
161 |
162 | 3. Term. The term of this Public License is specified in Section
163 | 6(a).
164 |
165 | 4. Media and formats; technical modifications allowed. The
166 | Licensor authorizes You to exercise the Licensed Rights in
167 | all media and formats whether now known or hereafter created,
168 | and to make technical modifications necessary to do so. The
169 | Licensor waives and/or agrees not to assert any right or
170 | authority to forbid You from making technical modifications
171 | necessary to exercise the Licensed Rights, including
172 | technical modifications necessary to circumvent Effective
173 | Technological Measures. For purposes of this Public License,
174 | simply making modifications authorized by this Section 2(a)
175 | (4) never produces Adapted Material.
176 |
177 | 5. Downstream recipients.
178 |
179 | a. Offer from the Licensor -- Licensed Material. Every
180 | recipient of the Licensed Material automatically
181 | receives an offer from the Licensor to exercise the
182 | Licensed Rights under the terms and conditions of this
183 | Public License.
184 |
185 | b. Additional offer from the Licensor -- Adapted Material.
186 | Every recipient of Adapted Material from You
187 | automatically receives an offer from the Licensor to
188 | exercise the Licensed Rights in the Adapted Material
189 | under the conditions of the Adapter's License You apply.
190 |
191 | c. No downstream restrictions. You may not offer or impose
192 | any additional or different terms or conditions on, or
193 | apply any Effective Technological Measures to, the
194 | Licensed Material if doing so restricts exercise of the
195 | Licensed Rights by any recipient of the Licensed
196 | Material.
197 |
198 | 6. No endorsement. Nothing in this Public License constitutes or
199 | may be construed as permission to assert or imply that You
200 | are, or that Your use of the Licensed Material is, connected
201 | with, or sponsored, endorsed, or granted official status by,
202 | the Licensor or others designated to receive attribution as
203 | provided in Section 3(a)(1)(A)(i).
204 |
205 | b. Other rights.
206 |
207 | 1. Moral rights, such as the right of integrity, are not
208 | licensed under this Public License, nor are publicity,
209 | privacy, and/or other similar personality rights; however, to
210 | the extent possible, the Licensor waives and/or agrees not to
211 | assert any such rights held by the Licensor to the limited
212 | extent necessary to allow You to exercise the Licensed
213 | Rights, but not otherwise.
214 |
215 | 2. Patent and trademark rights are not licensed under this
216 | Public License.
217 |
218 | 3. To the extent possible, the Licensor waives any right to
219 | collect royalties from You for the exercise of the Licensed
220 | Rights, whether directly or through a collecting society
221 | under any voluntary or waivable statutory or compulsory
222 | licensing scheme. In all other cases the Licensor expressly
223 | reserves any right to collect such royalties.
224 |
225 |
226 | Section 3 -- License Conditions.
227 |
228 | Your exercise of the Licensed Rights is expressly made subject to the
229 | following conditions.
230 |
231 | a. Attribution.
232 |
233 | 1. If You Share the Licensed Material (including in modified
234 | form), You must:
235 |
236 | a. retain the following if it is supplied by the Licensor
237 | with the Licensed Material:
238 |
239 | i. identification of the creator(s) of the Licensed
240 | Material and any others designated to receive
241 | attribution, in any reasonable manner requested by
242 | the Licensor (including by pseudonym if
243 | designated);
244 |
245 | ii. a copyright notice;
246 |
247 | iii. a notice that refers to this Public License;
248 |
249 | iv. a notice that refers to the disclaimer of
250 | warranties;
251 |
252 | v. a URI or hyperlink to the Licensed Material to the
253 | extent reasonably practicable;
254 |
255 | b. indicate if You modified the Licensed Material and
256 | retain an indication of any previous modifications; and
257 |
258 | c. indicate the Licensed Material is licensed under this
259 | Public License, and include the text of, or the URI or
260 | hyperlink to, this Public License.
261 |
262 | 2. You may satisfy the conditions in Section 3(a)(1) in any
263 | reasonable manner based on the medium, means, and context in
264 | which You Share the Licensed Material. For example, it may be
265 | reasonable to satisfy the conditions by providing a URI or
266 | hyperlink to a resource that includes the required
267 | information.
268 |
269 | 3. If requested by the Licensor, You must remove any of the
270 | information required by Section 3(a)(1)(A) to the extent
271 | reasonably practicable.
272 |
273 | b. ShareAlike.
274 |
275 | In addition to the conditions in Section 3(a), if You Share
276 | Adapted Material You produce, the following conditions also apply.
277 |
278 | 1. The Adapter's License You apply must be a Creative Commons
279 | license with the same License Elements, this version or
280 | later, or a BY-SA Compatible License.
281 |
282 | 2. You must include the text of, or the URI or hyperlink to, the
283 | Adapter's License You apply. You may satisfy this condition
284 | in any reasonable manner based on the medium, means, and
285 | context in which You Share Adapted Material.
286 |
287 | 3. You may not offer or impose any additional or different terms
288 | or conditions on, or apply any Effective Technological
289 | Measures to, Adapted Material that restrict exercise of the
290 | rights granted under the Adapter's License You apply.
291 |
292 |
293 | Section 4 -- Sui Generis Database Rights.
294 |
295 | Where the Licensed Rights include Sui Generis Database Rights that
296 | apply to Your use of the Licensed Material:
297 |
298 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
299 | to extract, reuse, reproduce, and Share all or a substantial
300 | portion of the contents of the database;
301 |
302 | b. if You include all or a substantial portion of the database
303 | contents in a database in which You have Sui Generis Database
304 | Rights, then the database in which You have Sui Generis Database
305 | Rights (but not its individual contents) is Adapted Material,
306 |
307 | including for purposes of Section 3(b); and
308 | c. You must comply with the conditions in Section 3(a) if You Share
309 | all or a substantial portion of the contents of the database.
310 |
311 | For the avoidance of doubt, this Section 4 supplements and does not
312 | replace Your obligations under this Public License where the Licensed
313 | Rights include other Copyright and Similar Rights.
314 |
315 |
316 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
317 |
318 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
319 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
320 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
321 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
322 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
323 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
324 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
325 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
326 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
327 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
328 |
329 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
330 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
331 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
332 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
333 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
334 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
335 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
336 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
337 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
338 |
339 | c. The disclaimer of warranties and limitation of liability provided
340 | above shall be interpreted in a manner that, to the extent
341 | possible, most closely approximates an absolute disclaimer and
342 | waiver of all liability.
343 |
344 |
345 | Section 6 -- Term and Termination.
346 |
347 | a. This Public License applies for the term of the Copyright and
348 | Similar Rights licensed here. However, if You fail to comply with
349 | this Public License, then Your rights under this Public License
350 | terminate automatically.
351 |
352 | b. Where Your right to use the Licensed Material has terminated under
353 | Section 6(a), it reinstates:
354 |
355 | 1. automatically as of the date the violation is cured, provided
356 | it is cured within 30 days of Your discovery of the
357 | violation; or
358 |
359 | 2. upon express reinstatement by the Licensor.
360 |
361 | For the avoidance of doubt, this Section 6(b) does not affect any
362 | right the Licensor may have to seek remedies for Your violations
363 | of this Public License.
364 |
365 | c. For the avoidance of doubt, the Licensor may also offer the
366 | Licensed Material under separate terms or conditions or stop
367 | distributing the Licensed Material at any time; however, doing so
368 | will not terminate this Public License.
369 |
370 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
371 | License.
372 |
373 |
374 | Section 7 -- Other Terms and Conditions.
375 |
376 | a. The Licensor shall not be bound by any additional or different
377 | terms or conditions communicated by You unless expressly agreed.
378 |
379 | b. Any arrangements, understandings, or agreements regarding the
380 | Licensed Material not stated herein are separate from and
381 | independent of the terms and conditions of this Public License.
382 |
383 |
384 | Section 8 -- Interpretation.
385 |
386 | a. For the avoidance of doubt, this Public License does not, and
387 | shall not be interpreted to, reduce, limit, restrict, or impose
388 | conditions on any use of the Licensed Material that could lawfully
389 | be made without permission under this Public License.
390 |
391 | b. To the extent possible, if any provision of this Public License is
392 | deemed unenforceable, it shall be automatically reformed to the
393 | minimum extent necessary to make it enforceable. If the provision
394 | cannot be reformed, it shall be severed from this Public License
395 | without affecting the enforceability of the remaining terms and
396 | conditions.
397 |
398 | c. No term or condition of this Public License will be waived and no
399 | failure to comply consented to unless expressly agreed to by the
400 | Licensor.
401 |
402 | d. Nothing in this Public License constitutes or may be interpreted
403 | as a limitation upon, or waiver of, any privileges and immunities
404 | that apply to the Licensor or You, including from the legal
405 | processes of any jurisdiction or authority.
406 |
407 |
408 | =======================================================================
409 |
410 | Creative Commons is not a party to its public
411 | licenses. Notwithstanding, Creative Commons may elect to apply one of
412 | its public licenses to material it publishes and in those instances
413 | will be considered the “Licensor.” The text of the Creative Commons
414 | public licenses is dedicated to the public domain under the CC0 Public
415 | Domain Dedication. Except for the limited purpose of indicating that
416 | material is shared under a Creative Commons public license or as
417 | otherwise permitted by the Creative Commons policies published at
418 | creativecommons.org/policies, Creative Commons does not authorize the
419 | use of the trademark "Creative Commons" or any other trademark or logo
420 | of Creative Commons without its prior written consent including,
421 | without limitation, in connection with any unauthorized modifications
422 | to any of its public licenses or any other arrangements,
423 | understandings, or agreements concerning use of licensed material. For
424 | the avoidance of doubt, this paragraph does not form part of the
425 | public licenses.
426 |
427 | Creative Commons may be contacted at creativecommons.org.
--------------------------------------------------------------------------------