├── hello.png ├── .gitignore └── .gitignore ├── README.md ├── 05.01-What-Is-Machine-Learning.ipynb ├── Pandas-Operations.ipynb ├── Matplotlib-Errorbars.ipynb ├── Pandas-Performance-Eval-and-Query.ipynb ├── Pandas-Missing-Values.ipynb └── Pandas-Vectorized-String-Ops.ipynb /hello.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thinknewtech/Data-Science/HEAD/hello.png -------------------------------------------------------------------------------- /.gitignore/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Handbook Cribsheet 2 | Intro to Numpy, Pandas, Matplotlib and Scikit-Learn. Condensed from Python Data Science Handbook (@jakevdp) 3 | 4 | ![book cover](/figures/PDSH-cover.png) 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 |

iPython	NumPy	Pandas	Matplotlib	Scikit-Learn
Help & docs	Datatypes	Pandas objects	Line plots	Intro to ML
Keyboard shortcuts	Arrays	Indexing & selection	Scatter plots	Intro to Scikit-Learn
Magic commands	Universal functions	Data operations	Error Bars	Hyperparameters & Model validation
Input & output history	Aggregations	Missing data	Density & contour plots	Feature engineering
Shell commands	Broadcasting	Hierarchical indexing	Histograms, bins & densities	Naive Bayes
Errors & debugging	Comparisons, masks, boolean logic	Concat & append	Plot legends	Linear regression
Profiling & timing	Fancy indexing	Merge & join	Colorbars	Support vector machines (SVM)
Resources	Sorting arrays	Aggregation & grouping	Subplots	Decision trees & Random forests
	Structured data	Pivot tables	Text & Annotation	Principal component analysis (PCA)
		Vectorized strings	Custom Tickmarks	Manifold learning
		Time series	Configs & stylesheets	K-Means clustering
		High-performance ops: eval(), query()	3D plotting	Gaussian mixtures
			Geo data with Basemap	TO DO:Kernel density estimation (KDE)
			Visualization with Seaborn	TO DO:Face detection pipeline
				Resources

105 | -------------------------------------------------------------------------------- /05.01-What-Is-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# What Is Machine Learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Categories of Machine Learning\n", 15 | "\n", 16 | "- __Supervised__ learning models the relationship between measured features and label associated with some data. Once the model is built, it can be used to apply labels to new, unknown data.\n", 17 | "- This is further subdivided into __classification__ and __regression__ tasks: classification labels are __discrete__ categories; regression labels are __continuous__ quantities.\n", 18 | "\n", 19 | "- __Unsupervised__ learning models the features of a dataset without reference to any label. It is often described as \"letting the dataset speak for itself.\"\n", 20 | "- These models include tasks such as __clustering__ and __dimensionality reduction.__\n", 21 | "- Clustering algorithms identify __distinct__ groups of data, while dimensionality reduction algorithms search for __more succinct representations__ of the data.\n", 22 | "- In addition, there are so-called __semi-supervised__ learning methods. They are often useful when labels are incomplete or misleading." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Classification: Predicting discrete labels\n", 30 | "\n", 31 | "- We will first take a look at a simple __classification__ task, in which you are given a set of labeled points and want to use these to classify some unlabeled points." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "![](figures/05.01-classification-1.png)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "- Here we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane. We also have one of two *class labels* for each point represented by the point's color.\n", 46 | "- We would like to create a model that will let us decide whether a new point should be labeled \"blue\" or \"red.\"\n", 47 | "- Let's assume the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group.\n", 48 | "- Here the *model* is a mathematical equivalent of \"a straight line separates the classes\", while the *model parameters* are the values describing the location and orientation of that line.\n", 49 | "- The optimal values for these model parameters are learned from the data. This is often called *training the model*.\n", 50 | "- Here's what the trained model looks like:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "![](figures/05.01-classification-2.png)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "- This trained model can now be applied to new, unlabeled data. This is called *prediction*. See the following figure:" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "![](figures/05.01-classification-3.png)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Regression: Predicting continuous labels\n", 79 | "\n", 80 | "- Consider the data shown in the following figure, which consists of a set of points with continuous labels:" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "![](figures/05.01-regression-1.png)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "- The color of each point represents the continuous label for that point. We will use a simple linear regression to predict the points.\n", 95 | "- This model assumes if we treat the label as a third spatial dimension, we can fit a plane to the data." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "![](figures/05.01-regression-2.png)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Notice that the *feature 1-feature 2* plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position.\n", 110 | "From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters.\n", 111 | "Returning to the two-dimensional projection, when we fit such a plane we get the result shown in the following figure:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "![](figures/05.01-regression-3.png)\n", 119 | "[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-3)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "This plane of fit gives us what we need to predict labels for new points.\n", 127 | "Visually, we find the results shown in the following figure:" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "![](figures/05.01-regression-4.png)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "As with the classification example, this may seem rather trivial in a low number of dimensions.\n", 142 | "But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.\n", 143 | "\n", 144 | "For example, this is similar to the task of computing the distance to galaxies observed through a telescope—in this case, we might use the following features and labels:\n", 145 | "\n", 146 | "- *feature 1*, *feature 2*, etc. $\\to$ brightness of each galaxy at one of several wave lengths or colors\n", 147 | "- *label* $\\to$ distance or redshift of the galaxy\n", 148 | "\n", 149 | "The distances for a small number of these galaxies might be determined through an independent set of (typically more expensive) observations.\n", 150 | "Distances to remaining galaxies could then be estimated using a suitable regression model, without the need to employ the more expensive observation across the entire set.\n", 151 | "In astronomy circles, this is known as the \"photometric redshift\" problem." 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Clustering: Inferring labels on unlabeled data\n", 159 | "\n", 160 | "The classification and regression illustrations we just looked at are examples of supervised learning algorithms, in which we are trying to build a model that will predict labels for new data.\n", 161 | "Unsupervised learning involves models that describe data without reference to any known labels.\n", 162 | "\n", 163 | "One common case of unsupervised learning is \"clustering,\" in which data is automatically assigned to some number of discrete groups.\n", 164 | "For example, we might have some two-dimensional data like that shown in the following figure:" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "![](figures/05.01-clustering-1.png)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "By eye, it is clear that each of these points is part of a distinct group.\n", 179 | "Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related.\n", 180 | "The *k*-means algorithm yields the clusters shown in the following figure:" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "![](figures/05.01-clustering-2.png)\n", 188 | "[figure source in Appendix](06.00-Figure-Code.ipynb#Clustering-Example-Figure-2)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "*k*-means fits a model consisting of *k* cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.\n", 196 | "Again, this might seem like a trivial exercise in two dimensions, but as our data becomes larger and more complex, such clustering algorithms can be employed to extract useful information from the dataset." 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### Dimensionality reduction: Inferring structure of unlabeled data\n", 204 | "\n", 205 | "Dimensionality reduction is another example of an unsupervised algorithm, in which labels or other information are inferred from the structure of the dataset itself.\n", 206 | "Dimensionality reduction is a bit more abstract than the examples we looked at before, but generally it seeks to pull out some low-dimensional representation of data that in some way preserves relevant qualities of the full dataset.\n", 207 | "\n", 208 | "As an example of this, consider the data shown in the following figure:" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "![](figures/05.01-dimesionality-1.png)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space.\n", 223 | "In a sense, you could say that this data is \"intrinsically\" only one dimensional, though this one-dimensional data is embedded in higher-dimensional space.\n", 224 | "A suitable dimensionality reduction model in this case would be sensitive to this nonlinear embedded structure, and be able to pull out this lower-dimensionality representation.\n", 225 | "\n", 226 | "The following figure shows a visualization of the results of the Isomap algorithm, a manifold learning algorithm that does exactly this:" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "![](figures/05.01-dimesionality-2.png)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "Notice that the colors (which represent the extracted one-dimensional latent variable) change uniformly along the spiral, which indicates that the algorithm did in fact detect the structure we saw by eye.\n", 241 | "As with the previous examples, the power of dimensionality reduction algorithms becomes clearer in higher-dimensional cases.\n", 242 | "For example, we might wish to visualize important relationships within a dataset that has 100 or 1,000 features.\n", 243 | "Visualizing 1,000-dimensional data is a challenge, and one way we can make this more manageable is to use a dimensionality reduction technique to reduce the data to two or three dimensions." 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "## Summary\n", 251 | "\n", 252 | "Here we have seen a few simple examples of some of the basic types of machine learning approaches.\n", 253 | "Needless to say, there are a number of important practical details that we have glossed over, but I hope this section was enough to give you a basic idea of what types of problems machine learning approaches can solve.\n", 254 | "\n", 255 | "In short, we saw the following:\n", 256 | "\n", 257 | "- *Supervised learning*: Models that can predict labels based on labeled training data\n", 258 | "\n", 259 | " - *Classification*: Models that predict labels as two or more discrete categories\n", 260 | " - *Regression*: Models that predict continuous labels\n", 261 | " \n", 262 | "- *Unsupervised learning*: Models that identify structure in unlabeled data\n", 263 | "\n", 264 | " - *Clustering*: Models that detect and identify distinct groups in the data\n", 265 | " - *Dimensionality reduction*: Models that detect and identify lower-dimensional structure in higher-dimensional data\n", 266 | " \n", 267 | "In the following sections we will go into much greater depth within these categories, and see some more interesting examples of where these concepts can be useful." 268 | ] 269 | } 270 | ], 271 | "metadata": { 272 | "anaconda-cloud": {}, 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.6.3" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 1 293 | } 294 | -------------------------------------------------------------------------------- /Pandas-Operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Operating on Data in Pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "- NumPy provides quick element-wise operations for basic arithmetic and sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).\n", 15 | "\n", 16 | "- Pandas inherits this functionality and adds useful twists:\n", 17 | "\n", 18 | " - unary operations like negation and trigonometric functions *preserve index and column labels* in the output.\n", 19 | " - binary operations such as addition and multiplication automatically *align indices* when passing the objects to the ufunc." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Ufuncs: Index Preservation\n", 27 | "\n", 28 | "- Any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects." 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": { 35 | "collapsed": true 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "import pandas as pd\n", 40 | "import numpy as np" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/plain": [ 51 | "0 6\n", 52 | "1 3\n", 53 | "2 7\n", 54 | "3 4\n", 55 | "dtype: int64" 56 | ] 57 | }, 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "rng = np.random.RandomState(42)\n", 65 | "ser = pd.Series(rng.randint(0, 10, 4))\n", 66 | "ser" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "data": { 76 | "text/html": [ 77 | "

\n", 78 | "\n", 91 | "\n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | "

	A	B	C	D
0	6	9	2	6
1	7	4	3	7
2	7	2	5	4

\n", 125 | "

" 126 | ], 127 | "text/plain": [ 128 | " A B C D\n", 129 | "0 6 9 2 6\n", 130 | "1 7 4 3 7\n", 131 | "2 7 2 5 4" 132 | ] 133 | }, 134 | "execution_count": 3, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n", 141 | " columns=['A', 'B', 'C', 'D'])\n", 142 | "df" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "- If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 4, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "0 403.428793\n", 161 | "1 20.085537\n", 162 | "2 1096.633158\n", 163 | "3 54.598150\n", 164 | "dtype: float64" 165 | ] 166 | }, 167 | "execution_count": 4, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "np.exp(ser)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/html": [ 184 | "

\n", 185 | "\n", 198 | "\n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | "

	A	B	C	D
0	-1.000000	7.071068e-01	1.000000	-1.000000e+00
1	-0.707107	1.224647e-16	0.707107	-7.071068e-01
2	-0.707107	1.000000e+00	-0.707107	1.224647e-16

\n", 232 | "

" 233 | ], 234 | "text/plain": [ 235 | " A B C D\n", 236 | "0 -1.000000 7.071068e-01 1.000000 -1.000000e+00\n", 237 | "1 -0.707107 1.224647e-16 0.707107 -7.071068e-01\n", 238 | "2 -0.707107 1.000000e+00 -0.707107 1.224647e-16" 239 | ] 240 | }, 241 | "execution_count": 5, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "np.sin(df * np.pi / 4)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "### UFuncs: Index Alignment\n", 255 | "\n", 256 | "- For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices while performing the operation. This is very convenient when working with incomplete data." 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "### Index alignment in Series\n", 264 | "\n", 265 | "- Suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 6, 271 | "metadata": { 272 | "collapsed": true 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "area = pd.Series({'Alaska': 1723337, 'Texas': 695662,\n", 277 | " 'California': 423967}, name='area')\n", 278 | "population = pd.Series({'California': 38332521, 'Texas': 26448193,\n", 279 | " 'New York': 19651127}, name='population')" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "- What happens when we divide these to compute the population density:" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 7, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/plain": [ 297 | "Alaska NaN\n", 298 | "California 90.413926\n", 299 | "New York NaN\n", 300 | "Texas 38.018740\n", 301 | "dtype: float64" 302 | ] 303 | }, 304 | "execution_count": 7, 305 | "metadata": {}, 306 | "output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "population / area" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "- The result contains the *union* of indices of the two input arrays." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 8, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')" 329 | ] 330 | }, 331 | "execution_count": 8, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "area.index | population.index" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "- Any item for which one or the other does not have an entry is marked with ``NaN``, or \"Not a Number,\" which is how Pandas marks missing data. Index matching is implemented this way for any built-in arithmetic expressions." 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "0 NaN\n", 356 | "1 5.0\n", 357 | "2 9.0\n", 358 | "3 NaN\n", 359 | "dtype: float64" 360 | ] 361 | }, 362 | "execution_count": 9, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "A = pd.Series([2, 4, 6], index=[0, 1, 2])\n", 369 | "B = pd.Series([1, 3, 5], index=[1, 2, 3])\n", 370 | "A + B" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "- The fill value can be modified." 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 10, 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "data": { 387 | "text/plain": [ 388 | "0 2.0\n", 389 | "1 5.0\n", 390 | "2 9.0\n", 391 | "3 5.0\n", 392 | "dtype: float64" 393 | ] 394 | }, 395 | "execution_count": 10, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "A.add(B, fill_value=0)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### Index alignment in DataFrame\n", 409 | "\n", 410 | "- A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 13, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/html": [ 421 | "

\n", 422 | "\n", 435 | "\n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | "

	A	B
0	8	6
1	17	3

\n", 456 | "

" 457 | ], 458 | "text/plain": [ 459 | " A B\n", 460 | "0 8 6\n", 461 | "1 17 3" 462 | ] 463 | }, 464 | "execution_count": 13, 465 | "metadata": {}, 466 | "output_type": "execute_result" 467 | } 468 | ], 469 | "source": [ 470 | "A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))\n", 471 | "B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))\n", 472 | "A" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 14, 478 | "metadata": {}, 479 | "outputs": [ 480 | { 481 | "data": { 482 | "text/html": [ 483 | "

\n", 484 | "\n", 497 | "\n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | "

	B	A	C
0	8	1	9
1	8	9	4
2	1	3	6

\n", 527 | "

" 528 | ], 529 | "text/plain": [ 530 | " B A C\n", 531 | "0 8 1 9\n", 532 | "1 8 9 4\n", 533 | "2 1 3 6" 534 | ] 535 | }, 536 | "execution_count": 14, 537 | "metadata": {}, 538 | "output_type": "execute_result" 539 | } 540 | ], 541 | "source": [ 542 | "B" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 15, 548 | "metadata": {}, 549 | "outputs": [ 550 | { 551 | "data": { 552 | "text/html": [ 553 | "

\n", 554 | "\n", 567 | "\n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | "

	A	B	C
0	9.0	14.0	NaN
1	26.0	11.0	NaN
2	NaN	NaN	NaN

\n", 597 | "

" 598 | ], 599 | "text/plain": [ 600 | " A B C\n", 601 | "0 9.0 14.0 NaN\n", 602 | "1 26.0 11.0 NaN\n", 603 | "2 NaN NaN NaN" 604 | ] 605 | }, 606 | "execution_count": 15, 607 | "metadata": {}, 608 | "output_type": "execute_result" 609 | } 610 | ], 611 | "source": [ 612 | "A + B" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "- Indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.\n", 620 | "- We can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries. \n", 621 | "- Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``)." 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 16, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/html": [ 632 | "

\n", 633 | "\n", 646 | "\n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | "

	A	B	C
0	9.0	14.0	17.5
1	26.0	11.0	12.5
2	11.5	9.5	14.5

\n", 676 | "

" 677 | ], 678 | "text/plain": [ 679 | " A B C\n", 680 | "0 9.0 14.0 17.5\n", 681 | "1 26.0 11.0 12.5\n", 682 | "2 11.5 9.5 14.5" 683 | ] 684 | }, 685 | "execution_count": 16, 686 | "metadata": {}, 687 | "output_type": "execute_result" 688 | } 689 | ], 690 | "source": [ 691 | "fill = A.stack().mean()\n", 692 | "A.add(B, fill_value=fill)" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "\n", 700 | "| Python Operator | Pandas Method(s) |\n", 701 | "|-----------------|---------------------------------------|\n", 702 | "| ``+`` | ``add()`` |\n", 703 | "| ``-`` | ``sub()``, ``subtract()`` |\n", 704 | "| ``*`` | ``mul()``, ``multiply()`` |\n", 705 | "| ``/`` | ``truediv()``, ``div()``, ``divide()``|\n", 706 | "| ``//`` | ``floordiv()`` |\n", 707 | "| ``%`` | ``mod()`` |\n", 708 | "| ``**`` | ``pow()`` |\n" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "### Ufuncs: Operations Between DataFrame and Series\n", 716 | "\n", 717 | "- Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array." 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 17, 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "data": { 727 | "text/plain": [ 728 | "array([[7, 2, 0, 3],\n", 729 | " [1, 7, 3, 1],\n", 730 | " [5, 5, 9, 3]])" 731 | ] 732 | }, 733 | "execution_count": 17, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "A = rng.randint(10, size=(3, 4))\n", 740 | "A" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": 18, 746 | "metadata": {}, 747 | "outputs": [ 748 | { 749 | "data": { 750 | "text/plain": [ 751 | "array([[ 0, 0, 0, 0],\n", 752 | " [-6, 5, 3, -2],\n", 753 | " [-2, 3, 9, 0]])" 754 | ] 755 | }, 756 | "execution_count": 18, 757 | "metadata": {}, 758 | "output_type": "execute_result" 759 | } 760 | ], 761 | "source": [ 762 | "A - A[0]" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "- According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise. Pandas has the same default convention." 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 19, 775 | "metadata": {}, 776 | "outputs": [ 777 | { 778 | "data": { 779 | "text/html": [ 780 | "

\n", 781 | "\n", 794 | "\n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | "

	Q	R	S	T
0	0	0	0	0
1	-6	5	3	-2
2	-2	3	9	0

\n", 828 | "

" 829 | ], 830 | "text/plain": [ 831 | " Q R S T\n", 832 | "0 0 0 0 0\n", 833 | "1 -6 5 3 -2\n", 834 | "2 -2 3 9 0" 835 | ] 836 | }, 837 | "execution_count": 19, 838 | "metadata": {}, 839 | "output_type": "execute_result" 840 | } 841 | ], 842 | "source": [ 843 | "df = pd.DataFrame(A, columns=list('QRST'))\n", 844 | "df - df.iloc[0]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "- Use the ``axis`` keyword to operate column-wise." 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": 20, 857 | "metadata": {}, 858 | "outputs": [ 859 | { 860 | "data": { 861 | "text/html": [ 862 | "

\n", 863 | "\n", 876 | "\n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | "

	Q	S	T
0	5	-2	1
1	-6	-4	-6
2	0	4	-2

\n", 910 | "

" 911 | ], 912 | "text/plain": [ 913 | " Q R S T\n", 914 | "0 5 0 -2 1\n", 915 | "1 -6 0 -4 -6\n", 916 | "2 0 0 4 -2" 917 | ] 918 | }, 919 | "execution_count": 20, 920 | "metadata": {}, 921 | "output_type": "execute_result" 922 | } 923 | ], 924 | "source": [ 925 | "df.subtract(df['R'], axis=0)" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": {}, 931 | "source": [ 932 | "- These ``DataFrame``/``Series`` operations, like above, will automatically align indices between the two elements." 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 21, 938 | "metadata": {}, 939 | "outputs": [ 940 | { 941 | "data": { 942 | "text/plain": [ 943 | "Q 7\n", 944 | "S 0\n", 945 | "Name: 0, dtype: int64" 946 | ] 947 | }, 948 | "execution_count": 21, 949 | "metadata": {}, 950 | "output_type": "execute_result" 951 | } 952 | ], 953 | "source": [ 954 | "halfrow = df.iloc[0, ::2]\n", 955 | "halfrow" 956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": 22, 961 | "metadata": {}, 962 | "outputs": [ 963 | { 964 | "data": { 965 | "text/html": [ 966 | "

\n", 967 | "\n", 980 | "\n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | "

	Q	R	S	T
0	0.0	NaN	0.0	NaN
1	-6.0	NaN	3.0	NaN
2	-2.0	NaN	9.0	NaN

\n", 1014 | "

" 1015 | ], 1016 | "text/plain": [ 1017 | " Q R S T\n", 1018 | "0 0.0 NaN 0.0 NaN\n", 1019 | "1 -6.0 NaN 3.0 NaN\n", 1020 | "2 -2.0 NaN 9.0 NaN" 1021 | ] 1022 | }, 1023 | "execution_count": 22, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "df - halfrow" 1030 | ] 1031 | } 1032 | ], 1033 | "metadata": { 1034 | "anaconda-cloud": {}, 1035 | "kernelspec": { 1036 | "display_name": "Python 3", 1037 | "language": "python", 1038 | "name": "python3" 1039 | }, 1040 | "language_info": { 1041 | "codemirror_mode": { 1042 | "name": "ipython", 1043 | "version": 3 1044 | }, 1045 | "file_extension": ".py", 1046 | "mimetype": "text/x-python", 1047 | "name": "python", 1048 | "nbconvert_exporter": "python", 1049 | "pygments_lexer": "ipython3", 1050 | "version": "3.6.3" 1051 | } 1052 | }, 1053 | "nbformat": 4, 1054 | "nbformat_minor": 1 1055 | } 1056 | -------------------------------------------------------------------------------- /Matplotlib-Errorbars.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Visualizing Errors" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Basic Errorbars\n", 15 | "\n", 16 | "- A basic errorbar can be created with a single Matplotlib function call:" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "collapsed": true 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "%matplotlib inline\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "plt.style.use('seaborn-whitegrid')\n", 30 | "import numpy as np" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAF7NJREFUeJzt3X+MFOUdx/HPcT8KK2YPD+8qprYk\nJnajoVepNqdirh5tk7axP8z9kGhNbWoaUlJCGsXa1CZYkqNtUqktZ4Uz5AB7t1iRP0ygWM7wx+Kl\n2wBBt0pbEXoocAJ76t7KwV7/aO+8g929m9mZnXlm3q+/YH/Mfp/b2c8++zzPzFSMjY2NCQDge7O8\nLgAAMDMENgAYgsAGAEMQ2ABgCAIbAAxBYAOAIarc2nAymXRr0wAQaIsXL857u2uBXexFp5NKpRSL\nxRyuxt9oczjQ5nAopc3FOrsMiQCAIWz1sEdGRrR69Wq99957+uijj7R8+XJ96Utfcro2AMAktgJ7\n7969uummm/SDH/xAg4ODevDBBwlsAHCZrcD+2te+NvHvd955Rw0NDY4VBADIr6RJx46ODr377rvq\n6upyqh4AQAEVpZ6tL5VK6eGHH9bOnTtVUVExcXsymVQkErG1zWw2q9mzZ5dSlnFoczjQ5nAopc2Z\nTMbZZX2HDx9WXV2drrnmGsViMV28eFFnzpxRXV3dlMfZXdbCMqBwoM3hQJutcXxZ39/+9jd1d3dL\nkoaGhpTJZDRv3jxbxQEAZsZWYHd0dOjMmTNatmyZHnroIf385z/XrFn+W9Ld3Nys5uZmr8sAAEfY\nGhKZPXu2fvOb3zhdCwCgCP91iwEAeRHYAGAIAhsADEFgA/A1Fg98jMAGAEMQ2ABgCAIbAAxBYAOA\nIQhsIISYyDMTgQ0AhiCwAcAQBDYAGILABgBDENgAYIhQBjYz5ABMFMrABgATEdgAyopfuPYR2ABg\nCAIbAAxBYAOAIQjsAGKMEAgmAhsADBHowE6n0zp27JgSicSMHk/PFPAOn7/pBTawE4mEDh06pLfe\nekstLS0zDm0A8KuSAnvdunVqb2/XPffco927dztVkyP6+/uVy+UkSefPn1d/f7+3BQFAiarsPnH/\n/v06cuSIent7dfbsWX3729/WV77yFSdrK0lzc7NmzZqlXC6nmpoafmoBMJ7twL7lllu0aNEiSVI0\nGtXIyIguXryoyspKx4orRVNTkxYtWqR0Oq2tW7eqqanJ65IAoCS2A7uyslKRSESSFI/Hdeedd/om\nrMdFo1FFo1FPwnq8R89QDACn2A7scXv27NH27dvV3d192X2pVMrWNrPZrO3nTpbJZPLWYfV2J1+7\nEKfabOe1veJkm03hlzaXcx+5tM1WP3+nTp3S8PCwnnvuOTU2NrpcrTPcep9LCux9+/apq6tLGzdu\n1JVXXnnZ/bFYzNZ2U6mU7edONv4L4NJtWb29mEI9aavbcqrNdl7bK0622RR+aXM595FL22zl85dI\nJPTGG28ol8vp+9//vl5++WUjhjdLeZ+TyWTB+2yvEnn//fe1bt06Pf3006qtrbW7GQAoiNVeU9nu\nYb/00ks6e/asVq5cOXFbZ2enFixY4EhhQcXYNqbDPvIxVntNZTuw29vb1d7e7mQtAMoknU4rnU4r\nkUj4eoiB1V5TBfZIRwD5mXYUcDQa1XXXXRf6sJYIbCB0GBc2F4EdIpxcB9LH48KSGBc2DIENhMz4\nuPDChQuNWSaH/yn5wBkA5vHyKGDYF8oettXzZAOAH4QusE2bIQeAcaELbGbIAZgqdIHNDDngLYYk\n7QtdYDNDDpRHvmWkDEmWJnSBLXHkFOAVhiRLE8rADjp+csKvGJIsDYEdMPzkhJ8xJFkaAnuSIPRM\n+ckJv2NI0j6jAtvNc2H4uWdqpd385EQprH7GnPxMBqHD5DajAttNQemZ8pMTJvJzh8lPCOz/C1LP\nlJ+cME1QOkxuI7D/j54p4J0gdZjcFOiz9Vn9luYMZoA3uBTYzNDDdgkTKHBa0C9AwVDe9AhsFzCB\nAsANgQhsL3se+XrSTKAAcEMgAtsrhXrSTKCgFAynoRACuwSFetLFVpzwYUQxDKehGAK7BMV60vkm\nUPgwYjoMp12uv7+fv8P/lRTYb775ppYuXaotW7Y4VY9RrK7d5sOI6TCchmJsB3Ymk9GaNWtCvwTH\nylIkrz+MDMf4HwdwoRjbgV1TU6NnnnlG9fX1TtYTaF5+GBmOMYfV9chBX5+Nj9k+0rGqqkpVVcWf\nnkqlbG07m83mfW4mk8m73UK3F+LUduxsq7q6WvPnz1dtbe2U+wq12U5d+R7f19c3ZTimr69PtbW1\nM9qeW4q1Oahm0mar+5Sd/XbDhg2ObKvY48fvu7TNTrbDr9zat109ND0Wi9l6XiqVyvvcSCSSd7uF\nbi/Eqe3Y2Vah2wu12U5d+R7f1tam9evXK5fLqaamRm1tbbbfH6cUa3NQzaTNTu1Tdjj1WZp83+zZ\ns6fcX452eK2UfTuZTBa8j1UiIcHYKGCd34abAn3yJ0zFya0A54wHeTlXe9kO7MOHD6uzs1ODg4Oq\nqqrSrl279Lvf/c7zcVGTebEDADCH7cC+6aab1NPT42QtAAIknU4rnU7rwIEDgRiX9oNQDonQgwXc\nNb6MNJfL6cEHH9RnPvMZhuIcwKQjEGBOTZpZ3c7ko3pHR0fpJDkklD1sAO4aP6o3l8upurp6StgT\n3vbRw0ZRflvWBDNMXkba3d3NcIhDCGwAthU7P834IfaNjY0eVBZMRgV2WE9eFNZ2O4VfCTNnZV/j\n/DTlZ0xgh3XnsNNuzh8MO6zua5wuuPyMCeyw7hymtZverLms7mteny44jIxZJTJ51tmtncOPYViO\ndgOS9X1tfGIxnU5r69atJU8s+vHz5zfG9LDDevKisLYb5WdnX7N67m6UxpgethTekxeFtd2YavxQ\n70Qi4dq+wL7mb8b0sIEwC+ukO6YKRGCHddkbE3zhYdrkM9xhfGDT84CprHzhsiIDkmFj2Pnk63n4\nYfyNHhCc5PSKDJjJ+MBm2RvCgglBGB/YXvc86EkDKBfjA1ui54FgsdoJKMdyP78qdFk9q5fbM+Xy\nfIEIbJP4fYeAWSZf2aWlpYWDqwLO+FUiQJix3C9cCGzAYCz3CxeGREKE3lfweD3pHnR+mx+ghw0Y\njhMwuWO6g/K8OMKawPaRsB5i7xQO1cd0rOwjxeYHvDrC2nZgr127Vu3t7ero6NChQ4ecrCmUDhw4\nwCH2gI8Umx/warLXVmAPDAzo7bffVm9vr5544gmtWbPG6bpCZ2BggNl+wEeKnR/cq8leW4GdSCS0\ndOlSSdL111+v4eFhffDBB44WFja33nors/1wnFPDbGEdris0P+DVhUVsBfbQ0JDmzZs38f+6ujqd\nPn3asaLCqLGx0ZdXlgnrBzUInBpn5YyY+Xkx2WtrWd/Y2Nhl/6+oqLjscalUyvK2H3jgAeVyOfX0\n9Fx2XyaTybvdQrebJJvNqrq6WvPnz1dtbe2UtnjV7gMHDujgwYMaGxvTXXfdpe7ubjU2NhZ9jpWa\nstmso7WbsH9MbrPVuqy2r6+vb8owW19fn2pray2/RrHtFGvDhg0bJDn/Pk9Xbzlun+4+t9psK7Ab\nGho0NDQ08f9Tp05p/vz5lz0uFotZ3nYkElEmk8n73Egkkne7hW53ktvnGkilUpbb53a7d+zYMfHl\nPDo6qqNHj+ree+8t+hwrNaVSKUdrL/Tao6OjSqfTOnfunOe/XCa32er7Z3U/aGtr0/r16yfOZNnW\n1jbta+XbVrHtzKQNTr/P09Vbjtunu6+UNieTyYL32RoSuf3227Vr1y5J0uuvv676+nrNnTvXVnHw\nr2ITK6YsoZvu57wp7bDDqXFWLgTtH7Z62DfffLNuvPFGdXR0qKKiQo8//rjTdcEHgnAUnV8vcFEu\nTp3JkjNi+oPtQ9N/8pOfOFkHfMrqB9Vvh/I6fYELU07DiWDiXCJwjB9P9RmEXwnlZPWLiC+u8jIq\nsMO6c5jS7mLDD172TO38nKcnDbuam5uVyWQ0MDDg+LY5lwgcY/for0ITf0GeEATsILDhGFYTAO4i\nsA3mx6MQOdUn4B4C21AcLmw+P37hwt8IbENxLT+z2fnC7e/v530OOQLbUFzLz2x84cIOo5b1FRLG\nnZ31xWZz+oCesPLbgVpuo4dtMCb4zMWKmtIVG1ayOj9gynwCgQ14hC/c0hQaVrI6P2DSBD6BDcBI\nheZxrM4PmDSfQGADLuOITXcUGlayOiFv0gR+ICYdgTDzc4/QbfnOE2N1Qt6kCXwCG0DgWD3hlynn\n+2ZIBIFhykw/vGP6PkJgIxBMmumHN4KwjzAkgkAI+6XAMD07+0ix+QEv5g7oYSMQTJrphzeCsI/Q\nw/YRL2f7Tb/Citcz/ab//cLA633ECb7rYafTaZ04ccLI8SV4iyMHMR3T9xFfBfb4pMDg4KDvJgVM\nn11GfkF/Xzkla7D4KrD9eohoEGaXnRaEoON9nTmC3x98Fdh+nRTw6xdJOeT7oAYl6ML8vsJMtgN7\nYGBATU1N2rt3r2PFjE8KXHvttb465aRfv0i8EpSgs/O+BuGXBcxla5XIsWPH9Oyzz2rx4sVO16No\nNKrq6mrfhLUUjNllJ5l28v1CXyhW39fxXxa5XE4tLS2+6lQgHGz1sK+++mo99dRTmjt3rtP1+Jbp\ns8tOsnPyfb/2TAu9r/nqDcovC7jLzZVutgJ7zpw5qqysdLoWGKRQ0NkZ8y4U5l6FfKF6GRrDdNxe\n6TbtkEg8Hlc8Hp9y24oVK7RkyZJpN55KpSwXlMlklMvlbD3XTZlMRpK9Ns1ENpu1vG0na7K6LSuP\n7+vrm9Iz7evrU21trbLZrJ577jkdPHhQY2Njuuuuu9Td3a3GxkYdOHAg7+1OtqHQcwrVW1tbqxtu\nuEHDw8P61a9+pdra2onnFXvtyfdNfp/d3qf8ws6+PVOF/oZO3W5VoX3HKdMGdmtrq1pbW21tPBaL\nWX5OJBJRJpOx9Vw3RSIRSfbaNBOpVMrytp2syeq2rDy+ra1N69evnxjzbmtrUywWUyqV0tGjRzU2\nNiZJGh0d1dGjR3Xvvfdqx44deW93sg2FnlOoXkmqr69XfX39ZbUUe+3J901+n93ep/zCzr49U4X+\nhk7dblWxfWemkslkwft8tawPwVRszLvQMIOXww9cIBd2ub3SzdYqkf7+fm3atEn//ve/9dprr6mn\np0fd3d2OFoZgKXSC+EIrNbxemWPKCe3hP26udLMV2FyjDk4qFI6EJjAVZ+szGMvKzMb7B6sYwwYA\nQxDYAGAIAhsADMEYNuCydDqtdDqtRCLh6EEUKDwPENT5AQIbcNGlJ4zatGlT4A+UMZEpAc+QCOCi\nS08YNTAw4HFFMBmBDbjo0iM2b731Vo8rgskYEgFcdOkRm4xhoxT0sGEc04605VzqcAo9bEiaupLB\n1GAxZeIIsIseNgJzUV0v+fWKOpjK9Ku/E9jg0lcl4gsP5UJgg0tflYgvPJQLgQ1O2F8ivvBQLkw6\nQhLnni6F1xdbQHj4LrD7+/sDf1FSBA9feCgH3wW2XzEuGUy8rzAJgQ1bCDqg/AhswAK+qOAlVokA\ngCHoYaMs6JkiLNxcOEFgw7e8DHm+YOBHBDY8RzgCM8MYNgAYwlYP+8KFC3rsscd0/PhxXbhwQQ8/\n/LC+8IUvOF0bAGASW4H94osvas6cOdq2bZuOHDmiRx99VNu3b3e6NgDAJLYC++6779Y3vvENSdJV\nV12lc+fOOVoUAOBytgK7urp64t+bN2+eCO9L2V3aks1mQ3c+Ea/bnMlkJNl/z+yw22Yvap1OsZom\n3+f1++wF2uycaQM7Ho8rHo9PuW3FihVasmSJtm7dqtdee01dXV15nxuLxWwVlUqlbD/XVF63ORKJ\nSLL/ntlht81e1DqdYjVNvs/r99kLtNmaZDJZ8L5pA7u1tVWtra2X3R6Px/XXv/5Vf/jDH6b0uAEA\n7rA1JHL8+HH96U9/0pYtW/SJT3zC6ZoAAHnYCux4PK5z587poYcemrht06ZNqqmpcawwAMBUtgJ7\n1apVWrVqldO1AIHEkZxwCkc6wjjpdFrHjh3j6uQIHQIbRkkkEjp06JDeeusttbS0ENoIFQIbRunv\n71cul5MknT9/nuEGhAqBDaM0Nzdr1qz/7bY1NTVqbm72tiCgjAhsGKWpqUmLFi3SwoUL9fLLL3OV\ncoQK58OGJLNWMkSjUUWjUV+FtUl/P5iLHjYAGILABgBDENgAYAgCGwAMQWADgCEIbAAwBIENAIYg\nsAHAEAQ2ABiCwAYAQxDYAGAIAhsADEFgA4AhCGwAMASBDQCGILABwBAENgAYgsAGAEPYukTYe++9\np0ceeUQfffSRRkdH9eijj+pzn/uc07UBACax1cPeuXOnvvnNb6qnp0erVq3Sk08+6XRdAIBL2Oph\nf+9735v49zvvvKOGhgbHCgIA5Gf7qumnT5/WD3/4Q3344YfavHmzkzUBAPKoGBsbGyv2gHg8rng8\nPuW2FStWaMmSJZKkV155RZs3b1Z3d/eUxySTSUUiEVtFZbNZzZ4929ZzTUWbw4E2h0Mpbc5kMlq8\neHHe+6YN7HwGBgZ0ww03KBqNSpK++MUv6tVXX53ymGQyWfBFp5NKpRSLxWw911S0ORxocziU0uZi\n2Wlr0nH37t164YUXJElvvPGGrrnmGluFAQBmztYY9vLly7V69Wr95S9/0fnz5/WLX/zC4bIAAJey\nFdhXXXWV/vjHPzpdCwCgCI50BABDENgAYAgCGwAMQWADgCEIbAAwhK0DZ2YimUy6sVkACDxHj3QE\nAJQfQyIAYAgCGwAM4bvAXrt2rdrb29XR0aFDhw55XU5ZrFu3Tu3t7brnnnu0e/dur8spm2w2q5aW\nFv35z3/2upSy2Llzp+6++2595zvf0SuvvOJ1Oa778MMP9aMf/Uj333+/Ojo6tG/fPq9Lcs2bb76p\npUuXasuWLZL+d52A+++/X8uWLdOPf/xjnT9/3pHX8VVgDwwM6O2331Zvb6+eeOIJrVmzxuuSXLd/\n/34dOXJEvb292rhxo9auXet1SWWzYcMG1dbWel1GWZw9e1a///3vtW3bNnV1dWnPnj1el+S6F154\nQQsXLlRPT4+efPJJ/fKXv/S6JFdkMhmtWbNGTU1NE7etX79ey5Yt07Zt23Tttddq+/btjryWrwI7\nkUho6dKlkqTrr79ew8PD+uCDDzyuyl233HLLxCXWotGoRkZGdPHiRY+rct+//vUv/fOf/1Rzc7PX\npZRFIpFQU1OT5s6dq/r6+lB0RubNm6dz585JkoaHhzVv3jyPK3JHTU2NnnnmGdXX10/c9uqrr6ql\npUWS1NLSokQi4chr+Sqwh4aGprypdXV1On36tIcVua+ysnLiQg/xeFx33nmnKisrPa7KfZ2dnVq9\nerXXZZTNf/7zH42NjWnlypVatmyZYx9gP/v617+uEydO6Mtf/rLuu+8+PfLII16X5IqqqqrLLlYw\nMjKimpoaSdLVV1/tWI7ZvkSYGy5dYTg2NqaKigqPqimvPXv2aPv27ZdduSeIduzYocbGRn3qU5/y\nupSyOnnypJ566imdOHFC3/3ud7V3795A798vvviiFixYoE2bNukf//iHHnvsMT3//PNel1UWk99X\nJ1dO+yqwGxoaNDQ0NPH/U6dOaf78+R5WVB779u1TV1eXNm7cqCuvvNLrclzX39+v48ePq7+/X+++\n+65qamr0yU9+UrfddpvXpbmmrq5On//851VVVaXrrrtOV1xxhc6cOaO6ujqvS3PN3//+d91xxx2S\npM9+9rM6efKkLly4oKoqX8WOK+bMmTNxmbCTJ09OGS4pha+GRG6//Xbt2rVLkvT666+rvr5ec+fO\n9bgqd73//vtat26dnn766dBMwP32t7/V888/r76+PrW2tmr58uWBDmtJuuOOO7R//37lcjmdOXNG\nmUwmsGO64z796U/r4MGDkqTBwUFdccUVoQhrSbrtttsmsmz37t0T18Atla/+ejfffLNuvPFGdXR0\nqKKiQo8//rjXJbnupZde0tmzZ7Vy5cqJ2zo7O7VgwQIPq4LTGhoa9NWvflUPPPCARkZG9LOf/Uyz\nZvmqv+S49vZ2/fSnP9V9992nCxcuBPbKVIcPH1ZnZ6cGBwdVVVWlXbt26de//rVWr16t3t5eLViw\nQN/61rcceS0OTQcAQwT7Kx4AAoTABgBDENgAYAgCGwAMQWADgCEIbAAwBIENAIYgsAHAEP8FHNJC\no0qnIQoAAAAASUVORK5CYII=\n", 41 | "text/plain": [ 42 | "" 43 | ] 44 | }, 45 | "metadata": {}, 46 | "output_type": "display_data" 47 | } 48 | ], 49 | "source": [ 50 | "x = np.linspace(0, 10, 50)\n", 51 | "dy = 0.8\n", 52 | "y = np.sin(x) + dy * np.random.randn(50)\n", 53 | "\n", 54 | "plt.errorbar(x, y, yerr=dy, fmt='.k');" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "- ``fmt`` is line-and-point format code. It uses the same syntax as the shorthand used in ``plt.plot``, outlined in [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) and [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb).\n", 62 | "\n", 63 | "- The ``errorbar`` function has many options to fine-tune the outputs. Example (good for crowded plots): to make the errorbars lighter than the points themselves:" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAG0lJREFUeJzt3X9sE+cZB/Cv4ywzKRWkgYRkGlCp\nUmcFrVuzaqKUigpGtS5qt1ZRMkI2rdLYxIYKaGvTpaKI1EywrV3WdqOUsiEwWmbargghkdFBhSYo\nmicNmWYrmzRgkKQJAdHWcUsS748pbn7cnX3n9+7e997vR0IK59h+Ln7vudfv8957oWw2mwUREUmv\nxO8AiIioMEzYRESKYMImIlIEEzYRkSKYsImIFMGETUSkiFK3XjiZTLr10kREgVZfX2+43bWEbfWm\n+fT09CAajQqORm7cZz1wn/VQzD5bdXY5JEJEpAhHPezh4WG0tbXhypUr+Oijj7B27Vrcd999omMj\nIqIJHCXsY8eOYdGiRfjud7+LS5cu4dFHH2XCJiJymaOE/cADD+R+7u3tRXV1tbCAiIjIWFFFx+bm\nZvT19WHHjh2i4iEiIhOhYlfr6+npweOPP46DBw8iFArltieTSZSXlzt6zUwmg0gkUkxYyuE+64H7\nrIdi9jmdToud1pdKpVBZWYmamhpEo1GMjo5iaGgIlZWVk37P6bQWTgPSA/dZD9xne4RP6/vrX/+K\n3bt3AwAGBweRTqdRUVHhKDgiIiqMox52c3Mz2tvbsWrVKmQyGWzatAklJfJN6e7v78/9zMIoEanO\nUcKORCL4xS9+IToW4QYGBnI/M2ETkerk6xYTEZEhJmwiIkUwYRMRKcLV1fqIiIrFyQOfYMImIqlx\n8sAnOCRCRKQIJmwiIkUwYRMRKYJj2EQaYiFPTUzYRBpiIU9NHBIhIlIEEzYRkSKYsImIFMGETUSk\nCC2LjqyQE5GKtEzYrJATkYq0TNhE5B9+w3WOCZuIPMVvuM6x6EhEpAgmbCIiRXBIJIA4RkgUTIHt\nYcfjcaxcuRKf//znsXLlSsTjcb9D8szAwEDuHxEFRyB72PF4HGvWrEE6nQYA9Pb2Ys2aNQCAlpYW\n0+exZ0rkHx5/+QWyh93e3p5L1uPS6TTa29stn8eeKZF/ePzlV1QPe/v27UgmkxgZGcH3vvc9rFy5\nUlRcRblw4YKt7UREKnCcsE+dOoVz586hq6sLV69exTe+8Q1pEvb8+fNx/vx5w+1ERKpyPCRy1113\nobOzEwAwa9YsDA8PY3R0VFhgxYjFYigvL5+0rby8HLFYzKeIiIiK5zhhh8PhXFJMJBK49957EQ6H\nhQVWjJaWFuzcuRM1NTUIhUKoqanBzp07LQuOovX39+f+EZEzOs/2MhLKZrPZYl7g6NGjeOmll7B7\n927cfPPNue3JZHJaL7dQmUwGkUikmLAAYFKPf+LJxO52ke9tRtQ+O3lvv4jcZ1XIss9etpGp+1zo\n8Xfo0CFs2rQJmUwmtz0SiWDLli1oaGhwNeZiFfM5p9Np1NfXGz5WVMI+ceIEOjs7sWvXLsyePXvS\nY8lk0vRN8+np6UE0GnUaVk4qlcr9vGjRIsfbrZhNRbL7WqL22cl7+0XkPqtCln32so1M3edCj7+F\nCxca1qIWLFiA//znP+4EK0gxn7NV7nRcdHz//fexfft2/O53v5uWrHXChWyI3MHZXtM5HsM+fPgw\nrl69ivXr16O1tRWtra24fPmyyNgCiWPblI/bbUSVcWGzWV06z/Zy3MNuampCU1OTyFi0wB455eNm\nG3F6FbAfYrHYpFgBzvYK5JWORGTM6VXAfpBhtpdsArmWCBEZU21cuKWlBXfccUfu/zIX0b3AHrZG\nOH5OHBdWGxO2Rri4DvEqYLVpl7BVqZATuYHjwmrTagxbpQo5kVs4LqwurXrYKlXIiYim0iphq1Yh\nJwoaDkkWR6uEzQo5kX/GhyR7e3uRzWZzQ5JM2oXTKmGzQk7knfEppGNjYwA4JCmCVglbhwo5v3KS\nLMankI4vCMohyeJpNUsECHaFnLNgSGa8dV/xtOphWwlCz5RfOUlmHJIsHhM2glMM4VdOkpnVkGQQ\nOkxeUGpIxOzuLsWy6pnKMJRQ6H7zKycVy+4xZvf3jYYkOZRXOKV62G6thSF7z7TQ/eZXTiqW3WNM\nxDHJobzCKZWw3RKU+dk6zIKh4JG9wyQTpYZE3BKkO1sEeRYMBROH8goX6B723Llzc/+siO6ZsoBC\nVDgO5RUu0D1sO4VJUT1TFlDILW4V3f02flz8+Mc/Rl9fH+bNm4ef/exnPF4MBDphuy0ej09rZLLP\nOCF1BfkGzhzKK0wghkT8uPWV2dxto7E4gAUUKgyH08hKIBK2H7e+MutJh8Nhw9+fP38+D0ayFJQL\nuMg9gUjYfjDrMY+OjhoWUB544AEejGSJ85EpHyZsh8ymHC1YsMBwxsnhw4d5MJIlzkc2VuhsLx0U\nlbDfffddrFixAvv27RMVjzKspiK1tLSgu7sbZ86cQXd3N1paWnw/GDkcI7+gXMAlWnV1de6f7hwn\n7HQ6jY6ODixevFhkPMqwO3fbz4ORY6Nq4Hxkysdxwi4rK8PLL7+MqqoqkfEoxagnbcbPg5Fjo2pw\negGXH7OkyB+h7PjtIBx6/vnnUVFRgdWrV0/ankwmpyWoQmUyGUQikWnbR0dHcz9PnI1htt2MqNex\n+1qHDh3Cc889l5u3vWHDBjQ0NAAw32cncU39/bq6Ohh9zKFQCGfPns37em6x2uegKmSf7bZPJ+12\n/LZdAFBS8km/TdSx5CReJ/shq2LadjqdRn19veFjrl44E41GHT2vp6fH8LmpVMrwtc22mxH1OnZf\nKxqN4qtf/Wpu+8SLA8z22UlcU3/faq0Gp5+RCFb7HFSF7LPd9umk3dp9bye/7+d++K2Ytp1MJk0f\n4ywRDXBslMgZ2YabeGm6BrhWA5EzVssB+LG2i+OEnUqlsG3bNly6dAmlpaU4cuQInn/+ecyePVtk\nfFpxswFwrQbyktE6O0HrIPixtovjhL1o0SLs3btXZCzaC/LiPqQPrljpHi3HsHnlFJF7OI3UPVqO\nYbP3SroQNcxm53X8vqo3yLRM2ES6EDXMZud18t3yi99sndNySIQKJ9u0JpKH2fo0+aaRcm0Q55RJ\n2LouXuT3fvux1jj5w05bs1qfRvQ9UukTSgyJ6Fp11nW/RQvqvRBFstvW8t0Kj9NI3aFED1vXqrPT\n/eYsmMn4LSE/u22NhUV/KNHD1rVxON1vP3uR7M2qyW5by1dYJHco0cP2ai1p2XqmKi5oz96smuy2\nNTfWp5Ht+JOREgnbq8WLZKtec9EmGud28dluW3OjsCjb8ScjJYZEdF28SNf9psm8KD47aWssLHpP\niYQNmDeOoC8yw4OC8s3IEIVtTX7KJGwjuk97Y4FPD7oW3Wk6Jcawzeg63W8cC3xqK/QqUhWLz+QO\npRO2zD0PVrwpn0JPuCw+0zilE7bMPQ9WvEkUXupN45Qew47FYpPGsAHvex7sQZMXdC2602RKJ2wZ\npr2xB02iFdoJ0L3oDpgX3u0W5FUp4CudsAH1piKxR075FJowvJruJzOzdbrtrgOuyu35lE/YqpG5\nMZBaZC66kzuULjoS6UzmonsQ+L0WvREmbI1wqmGwcLqfe6xu0DD+uB/JnEMikjh06BCee+45V4un\nHI4JFhmK7kGV76I8v4q9TNgSiMfj2LRpEzKZDAA9q/0iqFLpF0m1orvfCm0jVvUBP4u9jodEtm7d\niqamJjQ3N+PMmTMiY9JOe3t7LlmP0+kSe1F4qT7lU2gbsaoP+FnsdZSwT58+jfPnz6OrqwvPPPMM\nOjo6RMelFVb7yQ2ixlllLL65zao+4Gex11HCPnnyJFasWAEAuO2223D9+nV88MEHQgPTiazVfh0P\n1KDIVzTz+nVUY7UcgJ/FXkcJe3BwEBUVFbn/V1ZW8mtoEWKxGCKRyKRtflf7dT1Qg0LUSpY6r4jZ\n0tKC7u5unDlzBt3d3bnxaT/Xdglls9ms3Sc99dRTWLZsWa6X/c1vfhM//elPsXDhwtzvJJPJaWeh\nQoyNjWFkZASlpaUoKZl8PhkdHc39HA6H825XyWuvvYYXX3wxV+3fsGEDGhoaAPiz38uXL0dvb++0\n7TU1NXjzzTdNn2cnpkwmM+1EVQyj9546+2bi39UPE/fZ7udnpx3U1dXB6NAOhUI4e/Zswe+R73Ws\n9mFsbAwAMDIygrKyMtP3LIbdY0PU9nyPFdO20+k06uvrDR9zNEukuroag4ODuf+/9957mDNnzrTf\ni0ajtl87lUohHA4jm81Oe34qlTJ8bbPtIrk9A+Hhhx/GQw89lPv/xGq/H/vd19dnut3qvezE1NPT\nIzTuqe8dj8exefPmSdOvNm/ejNraWt9m30zcZ7ufn512YHVXczufX77XKWQfRH/OE9k9NkRtz/dY\nMfucTCZNH3M0JLJkyRIcOXIEAPDOO++gqqoKM2fOdBScKnSbgZBvXL3Qxff9VMjXeRX2wwlR46y8\nOEcujhL2nXfeibq6OjQ3N6OjowNPP/206LjIZ/kOVKMTmGxFykJm3wT1RCxqnJVrccvF8YUzP/rR\nj0TGQZKxexWdjEt9Wn2dd0qli3NEXVTDi3PkwSsdyZSdA1XGpT7duMGFKstwOmV3nRmuS+MtpRK2\nro1Dhf3ON/zgR8+0mLU2VOpJi2R3X3X62xSqv78fY2Nj6O/vF/73USph69o4VNjvfMMPVj1TUXcN\nMeL063zQe9LknvG2MzAwILztcHlVRclW4CtmNoFZ4S+oBUEip5iwFSTjVYicTWCPbCdcUoNSQyL0\nfzIW+ADOJiiUjDNqSA3sYSuIq/upzen6HLxjEDFhK0jW1f2oME5PuNXV1bl/pOewUiAStm49D14u\nrDaecItnVcexm8hVSvyBGMPWrcfBe/mpzY0LenRjNqz02GOPYXh4uOD6gGr1hED0sHVktlYvyY8z\naopnNnx05coVW/UB1db7DkQPm0g1nFFTHLMLtczYrRvIWsBnD5vIZUFdwtVPZnWcyspKw9+3WzeQ\ntZ7AhE3kMrev2NSt6A6YDyt1dnbaKsirVsDnkAgFQjwe17YIq1vRfZzVsJJRW7BqI6q0HSZsUp5q\nlX5yl1Eiz9dGVKkncEiElKdapZ+8F5Q2wh42KU+1Sj95z2kbsaoL+FEzYMIm5blxKzAKFqdtxKo+\n4EftQKohEZUuEXWDn9V+laeeyVDpV/nvpwMZ2ogI0iRsGdd4nhibFycSPxf3UflmAU6vHBT5uar8\n99NBUK4ulWZIRNY1njkDYToZp9DZrfTr8rnqNDc7H5Vmg5iRpocta+EoKNVlUWT+JmSHLp+rqG9t\nOl6cIyNpErasl4jKeiLxytQDNSiJzsnnqnONhWtxy8Fxwj59+jQWL16MY8eOCQlE1qKArCcSr0w9\nUINyArP7uQblmwWpzVHCvnDhAn7729+ivr5eWCCyFgVkPZH4xckJzO+eqdHXeavP1SjeoHyzIHe5\n3dYdJey5c+fihRdewMyZM4UGI+Maz7KeSPxi9wSWr2dq1sBFNnyjr/NmnysAw3jNlvJU7ZsFuceL\nb2GOZonMmDFDWAAqCEJ1WZR8i+VMLUpZ9Ux/8IMfYPPmzdNmavzlL3/Bnj17XJ/BYfS5Lly40DDe\ncDiM0dHRaa+hy9AY5efFTLdQNpvNWv1CIpFAIpGYtG3dunVYunQp2tracP/99+O+++6b9rxkMjmt\nJ1aIiQdFOBy2/Xy3uB1XJpNBJBKx9RyRMdl9rUJ/v66uDkZNLBQKYd68eejt7Z32WElJCcbGxqZt\nr6mpwZtvvll0TFbPMYsXACKRCDKZzKT/b9myBQ0NDZbvPfGxGzdu5D5nWdu6aE7adqHM/oaittth\n1dbPnj1b8Ouk02nT4ea8PezGxkY0NjYW/GYTRaNR289JpVJFPd8tbsfV09Nj+3VFxmT3tQr9fatL\ngs2GE4ySNQD09fVZvpeTv8fU55jFu2DBAsRiMdNvFlbvPfGxSCSSe1zWti6ak7ZdKLO/oajtdli1\ndTuvmUwmTR+TZlofBZPVmPe8efMMn2PWw/Fi+MEqXhlrLCQPLyYoOErYx48fR2trK06cOIFnn30W\njz76qLCAKFisirYbNmwwbOBr1qzxbWYOi8zklBdtx1HRcdmyZVi2bJmwICjYzIq2DQ0NqK2tNRxm\nWLJkiW+Xv7PITE653XakWUuE7AvCZcJmDVyHpBmEz4+8xYStMF4mrDZ+fmQXi45ERIpgwiZy0dQr\nNg8dOuR3SKQwDokQucRoze1NmzahtraWs04EMasDBLU+wIRN5BKjS5UzmYzvN+UIErM6gN36gCoJ\nngmbyCVBWYpWB6oUgDmGTeQS3ddSJ/GYsIlcYnSpciQS0XYtdSoeEzb5foMBu/r7+3P/ZGZ0qfKW\nLVs4fk2OcQxbcyrePXxgYCD388SxR78KR1Z3kZ96xWaQl1El9zFha86LRde94kfhSMUTns5UmQ1i\nhkMimuNMhuLwXo9qUf3u70zYmuNMhuLwhEdeYsLWHO8KXxye8MhLTNia44L9xeEJj7wkXdFx7ty5\nGBwcxJw5c/wORRs6rD3tlnx3kScSSbqEXV1djaGhIemKAqpXl8mYiM+VJzzyinQJW1aynUD8FpQT\nGD9XUgkTNjmia6ILyomK1MSETWSDricqkgMTNnmCPVPShZsTJ5iwyRNOeqZ+JnmeYMgpNydOMGGT\n78ySo5/DDxz6IBkxYZPvmByJCuMoYY+MjKC9vR0XL17EyMgIHn/8cXzpS18SHRsREU3gKGG/8cYb\nmDFjBvbv349z587hySefxIEDB0THRkREEzhK2A8++CAaGhoAALfccguuXbsmNCgiIpoulM1ms8W8\nwLPPPouSkhKsX79+0vZkMjltUZxCZTIZRCKRYsJSjt/7PDo6mvvZq7uiON1nP2LNxyqmiY/duHGD\nbVsDxexzOp1GfX294WN5e9iJRAKJRGLStnXr1mHp0qWIx+M4e/YsduzYYfjcaDTqIFygp6fH8XNV\n5fc+p1Kp3M9exeF0n/2INR+rmCY+FolEpInZK363bT8Us8/JZNL0sbwJu7GxEY2NjdO2JxIJ/PnP\nf8avf/1rfOpTn3IUGBERFc7RGPbFixfx+9//Hvv27cOnP/1p0TEREZEBRwk7kUjg2rVruZuNAsAr\nr7yCsrIyYYEREdFkjhL2xo0bsXHjRtGxEOUVj8eVu1nAxCs5h4aGfIyEVMcrHUkZ8Xgca9asyd2l\nvLe3N/ctT+akPfFKTiZsKgbv6UjKaG9vzyXrcel0Gu3t7T5FROQtJmxSxoULF2xtJwoaJmxSxvz5\n821tJwoaJmxSRiwWm3b1bHl5OWKxmE8REXmLRUcCoMaC/eOFRRlniajw9yP1MWETAHXWpG5pacEd\nd9yR+/+iRYt8jOYTqvz9SG0cEiEiUgQTNhGRIpiwiYgUwYRNRKQIJmwiIkUwYRMRKYIJm4hIEUzY\nRESKYMImIlIEEzYRkSKYsImIFMGETUSkCCZsIiJFMGETESmCCZuISBFM2EREimDCJiJShKM7zly5\ncgVPPPEEPvroI9y4cQNPPvnkpLuAEBGReI562AcPHsRDDz2EvXv3YuPGjejs7BQdFxERTeGoh/2d\n73wn93Nvby/vZ0dE5AHHN+EdGBjA97//fXz44YfYs2ePyJiIiMhAKJvNZq1+IZFIIJFITNq2bt06\nLF26FADw1ltvYc+ePdi9e/ek30kmkygvL3cUVCaTQSQScfRcVXGfCzc2Npb7uaRErbo5P2c9FLPP\n6XQa9fX1ho/lTdhGTp8+jdtvvx2zZs0CAHz5y1/G22+/Pel3ksmk6Zvm09PTg2g06ui5quI+64H7\nrIdi9tkqdzrqnnR3d+P1118HAPzzn/9ETU2No8CIiKhwjsaw165di7a2NvzpT3/Cxx9/jM2bNwsO\ni4iIpnKUsG+55Rbs3LlTdCxERGRBrYoNEZHGmLCJiBTBhE1EpAgmbCIiRTBhExEpwtGFM4VIJpNu\nvCwRUeAJvdKRiIi8xyERIiJFMGETESlCuoS9detWNDU1obm5GWfOnPE7HE9s374dTU1NeOSRR9Dd\n3e13OJ7JZDJYvnw5XnvtNb9D8cTBgwfx4IMP4uGHH8Zbb73ldziu+/DDD/HDH/4Qra2taG5uxokT\nJ/wOyTXvvvsuVqxYgX379gH4/30CWltbsWrVKjz22GP4+OOPhbyPVAn79OnTOH/+PLq6uvDMM8+g\no6PD75Bcd+rUKZw7dw5dXV3YtWsXtm7d6ndInvnNb36D2bNn+x2GJ65evYoXX3wR+/fvx44dO3D0\n6FG/Q3Ld66+/jltvvRV79+5FZ2cnYrGY3yG5Ip1Oo6OjA4sXL85t+9WvfoVVq1Zh//79+MxnPoMD\nBw4IeS+pEvbJkyexYsUKAMBtt92G69ev44MPPvA5KnfddddduVuszZo1C8PDwxgdHfU5Kvf9+9//\nxr/+9S8sW7bM71A8cfLkSSxevBgzZ85EVVWVFp2RiooKXLt2DQBw/fp1VFRU+ByRO8rKyvDyyy+j\nqqoqt+3tt9/G8uXLAQDLly/HyZMnhbyXVAl7cHBw0odaWVmJgYEBHyNyXzgczt3oIZFI4N5770U4\nHPY5Kvdt27YNbW1tfofhmf/+97/IZrNYv349Vq1aJewAltnXvvY1XL58GV/5ylewevVqPPHEE36H\n5IrS0tJpNysYHh5GWVkZAGDu3LnC8pjjW4S5YeoMw2w2i1Ao5FM03jp69CgOHDgw7c49QfTHP/4R\nX/jCF/DZz37W71A81d/fjxdeeAGXL1/Gt771LRw7dizQ7fuNN95AbW0tXnnlFfzjH/9Ae3s7Xn31\nVb/D8sTEz1XkzGmpEnZ1dTUGBwdz/3/vvfcwZ84cHyPyxokTJ7Bjxw7s2rULN998s9/huO748eO4\nePEijh8/jr6+PpSVlWHevHm4++67/Q7NNZWVlfjiF7+I0tJSzJ8/HzfddBOGhoZQWVnpd2iu+dvf\n/oZ77rkHAPC5z30O/f39GBkZQWmpVGnHFTNmzMjdJqy/v3/ScEkxpBoSWbJkCY4cOQIAeOedd1BV\nVYWZM2f6HJW73n//fWzfvh0vvfSSNgW4X/7yl3j11Vfxhz/8AY2NjVi7dm2gkzUA3HPPPTh16hTG\nxsYwNDSEdDod2DHdcQsWLMDf//53AMClS5dw0003aZGsAeDuu+/O5bLu7u7cPXCLJdVf784770Rd\nXR2am5sRCoXw9NNP+x2S6w4fPoyrV69i/fr1uW3btm1DbW2tj1GRaNXV1bj//vvx7W9/G8PDw3jq\nqaeUu4GwXU1NTfjJT36C1atXY2RkJLB3pkqlUti2bRsuXbqE0tJSHDlyBD//+c/R1taGrq4u1NbW\n4utf/7qQ9+Kl6UREigj2KZ6IKECYsImIFMGETUSkCCZsIiJFMGETESmCCZuISBFM2EREimDCJiJS\nxP8AGkE7uHLup/wAAAAASUVORK5CYII=\n", 74 | "text/plain": [ 75 | "" 76 | ] 77 | }, 78 | "metadata": {}, 79 | "output_type": "display_data" 80 | } 81 | ], 82 | "source": [ 83 | "plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0);" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "- You can also specify horizontal errorbars (``xerr``), one-sided errorbars, and many other variants." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Continuous Errors\n", 98 | "\n", 99 | "- Sometimes you want to show errorbars on continuous quantities. You can combine ``plt.plot`` and ``plt.fill_between`` to do this.\n", 100 | "- Example: a simple *Gaussian process regression* using the Scikit-Learn API. This fits a non-parametric function to data with a continuous measure of the uncertainty.\n", 101 | "\n", 102 | "## Need to debug GaussianProcessRegressor in sklearn 0.19" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 6, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "ename": "TypeError", 112 | "evalue": "__init__() got an unexpected keyword argument 'corr'", 113 | "output_type": "error", 114 | "traceback": [ 115 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 116 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 117 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m# Compute the Gaussian process fit\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mgp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mGaussianProcessRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcorr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'cubic'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtheta0\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1e-2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mthetaL\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1e-4\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mthetaU\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1E-1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrandom_start\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m100\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 11\u001b[0m \u001b[0mgp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnewaxis\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mydata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 118 | "\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'corr'" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "#from sklearn.gaussian_process import GaussianProcess -- deprecated in sklearn v0.18\n", 124 | "from sklearn.gaussian_process import GaussianProcessRegressor \n", 125 | "\n", 126 | "# define the model and draw some data\n", 127 | "model = lambda x: x * np.sin(x)\n", 128 | "xdata = np.array([1, 3, 5, 6, 8])\n", 129 | "ydata = model(xdata)\n", 130 | "\n", 131 | "# Compute the Gaussian process fit\n", 132 | "gp = GaussianProcessRegressor(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1, random_start=100)\n", 133 | "gp.fit(xdata[:, np.newaxis], ydata)\n", 134 | "\n", 135 | "xfit = np.linspace(0, 10, 1000)\n", 136 | "yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)\n", 137 | "dyfit = 2 * np.sqrt(MSE) # 2*sigma ~ 95% confidence region" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "- We now have ``xfit``, ``yfit``, and ``dyfit``, which sample the continuous fit to our data.\n", 145 | "- We could pass these to the ``plt.errorbar`` function but we don't really want to plot 1,000 points with 1,000 errorbars.\n", 146 | "- Instead, we can use the ``plt.fill_between`` function with a light color to visualize this continuous error:" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 7, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "ename": "NameError", 156 | "evalue": "name 'xfit' is not defined", 157 | "output_type": "error", 158 | "traceback": [ 159 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 160 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 161 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Visualize the result\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mplot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mydata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'or'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mplt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mplot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxfit\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0myfit\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'-'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'gray'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,\n", 162 | "\u001b[0;31mNameError\u001b[0m: name 'xfit' is not defined" 163 | ] 164 | }, 165 | { 166 | "data": { 167 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAEc9JREFUeJzt3V1sFGX/xvFrba21rUKpWsDQlvgS\nGgxBKwdgQY0vJA/GpCRNG5ZYRYUEYqyCUmlV8hCMbTyAYCMgGjWWgCuK5AmCIYLxoJWwGE3DomIM\nxaKV0kKR7QqFeQ76t38L9G12+sz+4PtJCDv37MxeB8219947uxtwHMcRACDhXeV3AADA4FDYAGAE\nhQ0ARlDYAGAEhQ0ARlDYAGBE8nCdOBwOD9epAeCyVlBQcMnxYSvs/h50IJFIRPn5+R6nGT6W8lrK\nKtnKaymrZCuvpaxSfHn7m+yyJAIARlDYAGAEhQ0ARrhewz59+rSWLl2qkydP6uzZs1q0aJGmT5/u\nZTYAwD+4LuxPP/1U48eP1+LFi9XS0qKysjLt2LHDy2wAgH9wvSSSmZmpEydOSJI6OjqUmZnpWSgA\nMKmuTsrL04SJE6W8vO5tDwXi+XrVJ598Uk1NTero6NC6des0efLknn3hcFhpaWmuzhuLxZSamuo2\n1v+cpbyWskq28lrKKtnKayHr9f/5j8a88oquisV6xs6npuq3f/9bHY88MujzRKPRvi+JdlzaunWr\nU1VV5TiO40QiEWf27Nm99u/bt8/tqZ0DBw64PtYPlvJayuo4tvJayuo4tvKayJqb6zjSxf9yc4d0\nmv660/WSyP79+1VYWChJmjBhglpaWtTV1eX2dABgW1PT0MZdcF3Yubm5+u677yRJzc3NSk9PV3Ly\nsH5wEgASV07O0MZdcF3YJSUlam5u1ty5c7V48WItX77cs1AAYM7KldKF79ulpXWPe8T1lDg9PV2r\nV6/2LAgAmBYMdv9fWSmnqUmBnJzusv573AOsYQCAV4JBKRjUwWH6sio+mg4ARlDYAGAEhQ0ARlDY\nAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAE\nhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARsRV2Nu2bdOjjz6q2bNn66uvvvIqEwDgElwXdnt7u2pra7Vx\n40atXbtWu3bt8jIXAOACyW4PrK+v19SpU5WRkaGMjAytWLHCy1wAgAu4nmH/+uuvchxH5eXlmjNn\njurr673MBQC4QMBxHMfNgevXr9f+/fv15ptv6ujRo3rssce0e/duBQIBSVI4HFZaWpqrULFYTKmp\nqa6O9YOlvJaySrbyWsoq2cprKasUX95oNKqCgoJL7nO9JJKVlaU777xTycnJysnJUXp6utra2pSV\nldVzn/z8fFfnjkQiro/1g6W8lrJKtvJayirZymspqxRf3nA43Oc+10sihYWFamho0Pnz59XW1qZo\nNKrMzEy3pwMADMD1DDs7O1szZ85UWVmZOjs7VVVVpauu4rJuABgurgtbkkpLS1VaWupVFgBAP5gS\nA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4AR\nFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGBFXYcdiMT3wwAP65JNPvMoD\nAOhDXIX91ltvaeTIkV5lAQD0w3Vh//zzzzp06JDuu+8+D+MAAPriurCrq6tVUVHhZRYAQD+S3Ry0\ndetWTZ48WePGjev3fpFIxFWoWCzm+lg/WMprKatkK6+lrJKtvJaySsOX11Vh79mzR0eOHNGePXv0\n+++/KyUlRaNHj9a0adN63S8/P99VqEgk4vpYP1jKaymrZCuvpaySrbyWskrx5Q2Hw33uc1XYq1at\n6rm9Zs0a3XzzzReVNQDAW1yHDQBGuJph/9MzzzzjRQ4AwACYYQOAERQ2ABhBYQOAERQ2ABhBYQOA\nERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2\nABhBYQOAERQ2ABhBYQOAERQ2ABiRHM/BNTU1CofD6urq0oIFC/Twww97lQsAcAHXhd3Q0KCffvpJ\nmzdvVnt7u4qKiihsABhGrgt7ypQpmjRpkiRpxIgR6uzs1Llz55SUlORZOADA/3O9hp2UlKS0tDRJ\nUigU0owZMyhrABhGAcdxnHhOsGvXLq1bt07vvvuurrvuup7xcDjcU+hDFYvFlJqaGk+s/ylLeS1l\nlWzltZRVspXXUlYpvrzRaFQFBQWX3BfXm45ff/211q5dqw0bNvQq67/l5+e7Om8kEnF9rB8s5bWU\nVbKV11JWyVZeS1ml+PKGw+E+97ku7FOnTqmmpkbvvfeeRo4c6fY0AIBBcl3Y27dvV3t7u8rLy3vG\nqqurNXbsWE+CAQB6c13YJSUlKikp8TILAKAffNIRAIygsAHACAobAIygsAHACAobAIygsAHACAob\nAIygsAHACAobAIygsAHAiMQq7Lo6KS9PEyZOlPLyurcBAJLi/HpVT9XVSfPnS9GoApJ0+HD3tiQF\ng34mA4CEkDgz7MpKKRrtPRaNdo8DABKosJuahjYOAFeYxCnsnJyhjQPAFSZxCnvlSunC34BMS+se\nBwAkUGEHg9L69VJurpxAQMrN7d7mDUdvcAUOYF7iXCUidZdzMKiDxn5wM+FxBQ5wWUicGTaGD1fg\nAJcFCvtKwBU4wGWBwr4ScAUOcFmgsK8EXIEDXBZcF/Zrr72mkpISlZaW6vvvv/cyE7zGFTjAZcHV\nVSJ79+7V4cOHtXnzZh06dEgvvfSSQqGQ19ngJa7AAcxzNcOur6/Xgw8+KEm69dZb1dHRoT///NPT\nYACA3lwVdmtrqzIzM3u2s7KydOzYMc9CAQAu5mpJxHGci7YDgcBF94tEIq5CxWIx18f6wVJeS1kl\nW3ktZZVs5bWUVRq+vK4KOzs7W62trT3bf/zxh2644YaL7ud2rTRibJ3VUl5LWSVbeS1llWzltZRV\nii9vOBzuc5+rJZF77rlHO3fulCQdOHBAN910kzIyMlyFAwAMjqsZ9l133aWJEyeqtLRUgUBAr776\nqte5AAAXcP3lT0uWLPEyBwBgAHzSEQCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgK\nGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKG7jS1NVJeXmaMHGi\nlJfXvQ0TXP+mIwCD6uqk+fOlaFQBSTp8uHtbkoJBP5NhEJhhA1eSykopGu09Fo12jyPhUdjAlaSp\naWjjSCiulkS6urpUWVmpI0eOqKurSy+++KLuvvtur7MB8FpOTvcyyKXGkfBczbA/++wzXXvttdq4\ncaNWrlyp119/3etcAIbDypVSWlrvsbS07nEkPFcz7EcffVSPPPKIJGnUqFE6ceKEp6EADJO/31is\nrJTT1KRATk53WfOGowmuCvvqq6/uuf3+++/3lDcAA4JBKRjUwUhE+fn5fqfBEAQcx3H6u0MoFFIo\nFOo19swzz2j69Omqq6vTl19+qbVr1/YqcUkKh8NKu/Cl1yDFYjGlpqa6OtYPlvJayirZymspq2Qr\nr6WsUnx5o9GoCgoKLr3Tcemjjz5y5s2b58RisUvu37dvn9tTOwcOHHB9rB8s5bWU1XFs5bWU1XFs\n5bWU1XHiy9tfd7paEjly5Ig2bdqkDz/8UNdcc42rZxEAwNC4KuxQKKQTJ05o/t+fkJL0zjvvKCUl\nxbNgAIDeXBX2888/r+eff97rLACAfvBJRwAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIG\nACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMo\nbAAwgsIGACPiKuzW1lZNmTJF33zzjVd5AAB9iKuwa2pqNG7cOK+yAAD64bqw6+vrlZ6erttvv93L\nPACAPrgq7DNnzqi2tlbPPfec13kAAH1IHugOoVBIoVCo19iMGTNUXFys66+/vt9jI5GIq1CxWMz1\nsX6wlNdSVslWXktZJVt5LWWVhi9vwHEcZ6gHlZaW6vz585KkpqYmjRo1SqtXr9Ztt93Wc59wOKyC\nggJXoSKRiPLz810d6wdLeS1llWzltZRVspXXUlYpvrz9deeAM+xL2bRpU8/tiooKFRUV9SprAID3\nuA4bAIxwNcP+p9dff92LHACAATDDBgAjKGwAMILCBgAjKGwAMILCBuJVVyfl5WnCxIlSXl73NjAM\n4r5KBLii1dVJ8+dL0agCknT4cPe2JAWDfibDZYgZNhCPykopGu09Fo12jwMeo7CBeDQ1DW0ciAOF\nDcQjJ2do40AcKGwgHitXSmlpvcfS0rrHAY9R2EA8gkFp/XopN1dOICDl5nZv84YjhgFXiQDxCgal\nYFAHjX0FKOxhhg0ARlDYAGAEhQ0ARlDYAGAEhQ0ARrj6Ed7BCIfDw3FaALjs9fUjvMNW2AAAb7Ek\nAgBGUNgAYETCfdLxxx9/1MKFC/X4449r7ty5fscZUE1NjcLhsLq6urRgwQI9/PDDfke6pM7OTlVU\nVOj48eP666+/tHDhQt1///1+x+pXLBbTrFmztGjRIs2ePdvvOH1qbGzUwoULlZubK0m6/fbb9fLL\nL/ucqn/btm3Thg0blJycrGeffVb33nuv35EuKRQKadu2bT3bjY2N+vbbb31M1LfTp09r6dKlOnny\npM6ePatFixZp+vTpnj5GQhV2NBrVihUrNHXqVL+jDEpDQ4N++uknbd68We3t7SoqKkrYwt69e7fu\nuOMOPf3002pubta8efMSvrDfeustjRw50u8YA4pGo5o5c6YqjXwHdnt7u2pra7VlyxZFo1GtWbMm\nYQu7uLhYxcXFkqS9e/fq888/9zlR3z799FONHz9eixcvVktLi8rKyrRjxw5PHyOhCjslJUVvv/22\n3n77bb+jDMqUKVM0adIkSdKIESPU2dmpc+fOKSkpyedkF/vXv/7Vc/u3335Tdna2j2kG9vPPP+vQ\noUO67777/I4yoNOnT/sdYUjq6+s1depUZWRkKCMjQytWrPA70qDU1tbqjTfe8DtGnzIzM/XDDz9I\nkjo6OpSZmen5YyTUGnZycrJSU1P9jjFoSUlJSvu/r9YMhUKaMWNGQpb1P5WWlmrJkiVatmyZ31H6\nVV1drYqKCr9jDEo0GlU4HNZTTz2lYDCohoYGvyP169dff5XjOCovL9ecOXNUX1/vd6QBff/99xoz\nZoxuvPFGv6P0adasWTp69KgeeughzZ07V0uXLvX8MRJqhm3Vrl279PHHH+vdd9/1O8qANm3apEgk\nohdeeEHbtm1TIBDwO9JFtm7dqsmTJ2vcuHF+RxmUCRMmaNGiRXrggQf0yy+/6IknntAXX3yhlJQU\nv6P1qaWlRW+++aaOHj2qxx57TLt3707Iv4W/ffzxxyoqKvI7Rr8+++wzjR07Vu+8844OHjyoyspK\nbdmyxdPHoLDj9PXXX2vt2rXasGGDrrvuOr/j9KmxsVFZWVkaM2aM8vPzde7cObW1tSkrK8vvaBfZ\ns2ePjhw5oj179uj3339XSkqKRo8erWnTpvkd7ZJuueUW3XLLLZKk8ePH64YbblBLS0vCPuFkZWXp\nzjvvVHJysnJycpSenp6wfwt/++abb1RVVeV3jH7t379fhYWFkrqfxFtaWtTV1aXkZO9qNqGWRKw5\ndeqUampqtG7duoR/c2zfvn09rwBaW1sVjUaHZY3NC6tWrdKWLVv00Ucfqbi4WAsXLkzYspa6Z38f\nfPCBJOnYsWM6fvx4Qr9HUFhYqIaGBp0/f15tbW0J/bcgdb8aSE9PT+hXLJKUm5ur7777TpLU3Nys\n9PR0T8taSrAZdmNjo6qrq9Xc3Kzk5GTt3LlTa9asSdgy3L59u9rb21VeXt4zVl1drbFjx/qY6tJK\nS0tVWVmpOXPmKBaL6ZVXXtFVV/F87YWHHnpIS5Ys0c6dO3XmzBktX748ocslOztbM2fOVFlZmTo7\nO1VVVZXQfwvHjh3TqFGj/I4xoJKSEi1btkxz585VV1eXli9f7vlj8NF0ADAicZ9WAQC9UNgAYASF\nDQBGUNgAYASFDQBGUNgAYASFDQBGUNgAYMR/AaKqu6/TxdMiAAAAAElFTkSuQmCC\n", 168 | "text/plain": [ 169 | "" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "output_type": "display_data" 174 | } 175 | ], 176 | "source": [ 177 | "# Visualize the result\n", 178 | "plt.plot(xdata, ydata, 'or')\n", 179 | "plt.plot(xfit, yfit, '-', color='gray')\n", 180 | "\n", 181 | "plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,\n", 182 | " color='gray', alpha=0.2)\n", 183 | "plt.xlim(0, 10);" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "- This gives an intuitive view into what the Gaussian process regression algorithm is doing: in regions near a measured data point, the model is strongly constrained and this is reflected in the small model errors. In regions far from a measured data point, the model is not strongly constrained, and the model errors increase.\n", 191 | "\n", 192 | "- For more information on the options available in ``plt.fill_between()`` (and the closely related ``plt.fill()`` function), see the function docstring or the Matplotlib documentation." 193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "anaconda-cloud": {}, 198 | "kernelspec": { 199 | "display_name": "Python 3", 200 | "language": "python", 201 | "name": "python3" 202 | }, 203 | "language_info": { 204 | "codemirror_mode": { 205 | "name": "ipython", 206 | "version": 3 207 | }, 208 | "file_extension": ".py", 209 | "mimetype": "text/x-python", 210 | "name": "python", 211 | "nbconvert_exporter": "python", 212 | "pygments_lexer": "ipython3", 213 | "version": "3.6.3" 214 | } 215 | }, 216 | "nbformat": 4, 217 | "nbformat_minor": 1 218 | } 219 | -------------------------------------------------------------------------------- /Pandas-Performance-Eval-and-Query.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# High-Performance Pandas: eval() and query()" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "- NumPy and Pandas rely on the creation of temporary intermediate objects. They can cause increased computational time and memory use.\n", 15 | "- Pandas v0.13 includes ``eval()`` and ``query()`` functions, which allow you to directly access C-speed operations without the need for intermediate arrays. They rely on the [Numexpr](https://github.com/pydata/numexpr) package." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### How Compound Expressions cause slowdowns\n", 23 | "- Consider adding the elements of two arrays in Numpy vs Python:" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "1.73 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "import numpy as np\n", 41 | "rng = np.random.RandomState(42)\n", 42 | "x = rng.rand(1000000)\n", 43 | "y = rng.rand(1000000)\n", 44 | "%timeit x + y" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "168 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "- Now consider what happens when computing compound expressions. NumPy evaluates each subexpression, which looks like this:" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "mask = (x > 0.5) & (y < 0.5)\n", 80 | "tmp1 = (x > 0.5)\n", 81 | "tmp2 = (y < 0.5)\n", 82 | "mask = tmp1 & tmp2" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "- *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this is a problem.\n", 90 | "- The Numexpr library allows you to eval this compound expression element by element, without the need to allocate full intermediate arrays. The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/plain": [ 101 | "True" 102 | ] 103 | }, 104 | "execution_count": 4, 105 | "metadata": {}, 106 | "output_type": "execute_result" 107 | } 108 | ], 109 | "source": [ 110 | "import numexpr\n", 111 | "mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')\n", 112 | "\n", 113 | "# NumPy allclose(): returns True if two arrays are element-wise equal within a tolerance.\n", 114 | "np.allclose(mask, mask_numexpr)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### pandas.eval() for Efficient Operations\n", 122 | "- ``eval()`` uses string expressions to compute operations using ``DataFrame``s." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 9, 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# build 4 dataframes, all 100x100K\n", 134 | "import pandas as pd\n", 135 | "nrows, ncols = 100000, 100\n", 136 | "rng = np.random.RandomState(42)\n", 137 | "df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))\n", 138 | " for i in range(4))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "- How fast using pd.eval() provides ~50% speedup, and uses less memory." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 10, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "92.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "%timeit df1 + df2 + df3 + df4" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 11, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "37.3 ms ± 699 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "%timeit pd.eval('df1 + df2 + df3 + df4')" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 12, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "True" 191 | ] 192 | }, 193 | "execution_count": 12, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "### Operations supported by pd.eval()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 13, 212 | "metadata": {}, 213 | "outputs": [ 214 | { 215 | "data": { 216 | "text/html": [ 217 | "

\n", 218 | "\n", 231 | "\n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | "

	0	1	2
0	180	112	748
1	447	205	487
2	656	100	98
3	90	450	613
4	529	224	530

\n", 273 | "

" 274 | ], 275 | "text/plain": [ 276 | " 0 1 2\n", 277 | "0 180 112 748\n", 278 | "1 447 205 487\n", 279 | "2 656 100 98\n", 280 | "3 90 450 613\n", 281 | "4 529 224 530" 282 | ] 283 | }, 284 | "execution_count": 13, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))\n", 291 | " for i in range(5))\n", 292 | "df1.head()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "#### pd.eval() supports all arithmetic operators:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 14, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "True" 311 | ] 312 | }, 313 | "execution_count": 14, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "result1 = -df1 * df2 / (df3 + df4) - df5\n", 320 | "result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')\n", 321 | "np.allclose(result1, result2)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "#### ``pd.eval()`` supports all comparison operators, including chained expressions:" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 15, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "data": { 338 | "text/plain": [ 339 | "True" 340 | ] 341 | }, 342 | "execution_count": 15, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)\n", 349 | "result2 = pd.eval('df1 < df2 <= df3 != df4')\n", 350 | "np.allclose(result1, result2)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "#### ``pd.eval()`` supports the ``&`` and ``|`` bitwise operators:" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 16, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "True" 369 | ] 370 | }, 371 | "execution_count": 16, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)\n", 378 | "result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')\n", 379 | "np.allclose(result1, result2)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "#### ``pd(eval()`` also supports literal ``and`` and ``or`` in Boolean expressions:" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 17, 392 | "metadata": {}, 393 | "outputs": [ 394 | { 395 | "data": { 396 | "text/plain": [ 397 | "True" 398 | ] 399 | }, 400 | "execution_count": 17, 401 | "metadata": {}, 402 | "output_type": "execute_result" 403 | } 404 | ], 405 | "source": [ 406 | "result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')\n", 407 | "np.allclose(result1, result3)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "#### ``pd.eval()`` supports access to object attributes via ``obj.attr``, and indexes via ``obj[index]``:" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 18, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "data": { 424 | "text/plain": [ 425 | "True" 426 | ] 427 | }, 428 | "execution_count": 18, 429 | "metadata": {}, 430 | "output_type": "execute_result" 431 | } 432 | ], 433 | "source": [ 434 | "result1 = df2.T[0] + df3.iloc[1]\n", 435 | "result2 = pd.eval('df2.T[0] + df3.iloc[1]')\n", 436 | "np.allclose(result1, result2)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "#### Other operations\n", 444 | "- function calls, conditional statements, loops, and more complex constructs are __currently *not* implemented__ in ``pd.eval()``." 445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "### ``DataFrame.eval()`` for Column-Wise Operations\n", 452 | "\n", 453 | "- Just as Pandas uses ``pd.eval()``, ``DataFrame``s have a similar ``eval()`` method. The benefit of the ``eval()`` method is that columns can be referred to *by name*." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 19, 459 | "metadata": {}, 460 | "outputs": [ 461 | { 462 | "data": { 463 | "text/html": [ 464 | "

\n", 465 | "\n", 478 | "\n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | "

	A	B	C
0	0.375506	0.406939	0.069938
1	0.069087	0.235615	0.154374
2	0.677945	0.433839	0.652324
3	0.264038	0.808055	0.347197
4	0.589161	0.252418	0.557789

\n", 520 | "

" 521 | ], 522 | "text/plain": [ 523 | " A B C\n", 524 | "0 0.375506 0.406939 0.069938\n", 525 | "1 0.069087 0.235615 0.154374\n", 526 | "2 0.677945 0.433839 0.652324\n", 527 | "3 0.264038 0.808055 0.347197\n", 528 | "4 0.589161 0.252418 0.557789" 529 | ] 530 | }, 531 | "execution_count": 19, 532 | "metadata": {}, 533 | "output_type": "execute_result" 534 | } 535 | ], 536 | "source": [ 537 | "df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])\n", 538 | "df.head()" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "- Now we can compute expressions using the column names:" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 20, 551 | "metadata": {}, 552 | "outputs": [ 553 | { 554 | "data": { 555 | "text/plain": [ 556 | "True" 557 | ] 558 | }, 559 | "execution_count": 20, 560 | "metadata": {}, 561 | "output_type": "execute_result" 562 | } 563 | ], 564 | "source": [ 565 | "result1 = (df['A'] + df['B']) / (df['C'] - 1)\n", 566 | "result2 = pd.eval(\"(df.A + df.B) / (df.C - 1)\")\n", 567 | "np.allclose(result1, result2)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "- ``DataFrame.eval()`` allows more succinct evaluation of expressions with the columns. We can treat *column names as variables*." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 21, 580 | "metadata": {}, 581 | "outputs": [ 582 | { 583 | "data": { 584 | "text/plain": [ 585 | "True" 586 | ] 587 | }, 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "result3 = df.eval('(A + B) / (C - 1)')\n", 595 | "np.allclose(result1, result3)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "### Assignment in DataFrame.eval()\n", 603 | "- ``DataFrame.eval()`` also allows assignment to any column." 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 22, 609 | "metadata": {}, 610 | "outputs": [ 611 | { 612 | "data": { 613 | "text/html": [ 614 | "

\n", 615 | "\n", 628 | "\n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | "

	A	B	C
0	0.375506	0.406939	0.069938
1	0.069087	0.235615	0.154374
2	0.677945	0.433839	0.652324
3	0.264038	0.808055	0.347197
4	0.589161	0.252418	0.557789

\n", 670 | "

" 671 | ], 672 | "text/plain": [ 673 | " A B C\n", 674 | "0 0.375506 0.406939 0.069938\n", 675 | "1 0.069087 0.235615 0.154374\n", 676 | "2 0.677945 0.433839 0.652324\n", 677 | "3 0.264038 0.808055 0.347197\n", 678 | "4 0.589161 0.252418 0.557789" 679 | ] 680 | }, 681 | "execution_count": 22, 682 | "metadata": {}, 683 | "output_type": "execute_result" 684 | } 685 | ], 686 | "source": [ 687 | "df.head()" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 23, 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "data": { 697 | "text/html": [ 698 | "

\n", 699 | "\n", 712 | "\n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | "

	A	B	C	D
0	0.375506	0.406939	0.069938	11.187620
1	0.069087	0.235615	0.154374	1.973796
2	0.677945	0.433839	0.652324	1.704344
3	0.264038	0.808055	0.347197	3.087857
4	0.589161	0.252418	0.557789	1.508776

\n", 760 | "

" 761 | ], 762 | "text/plain": [ 763 | " A B C D\n", 764 | "0 0.375506 0.406939 0.069938 11.187620\n", 765 | "1 0.069087 0.235615 0.154374 1.973796\n", 766 | "2 0.677945 0.433839 0.652324 1.704344\n", 767 | "3 0.264038 0.808055 0.347197 3.087857\n", 768 | "4 0.589161 0.252418 0.557789 1.508776" 769 | ] 770 | }, 771 | "execution_count": 23, 772 | "metadata": {}, 773 | "output_type": "execute_result" 774 | } 775 | ], 776 | "source": [ 777 | "df.eval('D = (A + B) / C', inplace=True)\n", 778 | "df.head()" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "- Any existing column can be modified:" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 24, 791 | "metadata": {}, 792 | "outputs": [ 793 | { 794 | "data": { 795 | "text/html": [ 796 | "

\n", 797 | "\n", 810 | "\n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | "

	A	B	C	D
0	0.375506	0.406939	0.069938	-0.449425
1	0.069087	0.235615	0.154374	-1.078728
2	0.677945	0.433839	0.652324	0.374209
3	0.264038	0.808055	0.347197	-1.566886
4	0.589161	0.252418	0.557789	0.603708

\n", 858 | "

" 859 | ], 860 | "text/plain": [ 861 | " A B C D\n", 862 | "0 0.375506 0.406939 0.069938 -0.449425\n", 863 | "1 0.069087 0.235615 0.154374 -1.078728\n", 864 | "2 0.677945 0.433839 0.652324 0.374209\n", 865 | "3 0.264038 0.808055 0.347197 -1.566886\n", 866 | "4 0.589161 0.252418 0.557789 0.603708" 867 | ] 868 | }, 869 | "execution_count": 24, 870 | "metadata": {}, 871 | "output_type": "execute_result" 872 | } 873 | ], 874 | "source": [ 875 | "df.eval('D = (A - B) / C', inplace=True)\n", 876 | "df.head()" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "### Local variables in DataFrame.eval()\n", 884 | "- ``DataFrame.eval()`` local Python variables via the '@' syntax:" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 25, 890 | "metadata": {}, 891 | "outputs": [ 892 | { 893 | "data": { 894 | "text/plain": [ 895 | "True" 896 | ] 897 | }, 898 | "execution_count": 25, 899 | "metadata": {}, 900 | "output_type": "execute_result" 901 | } 902 | ], 903 | "source": [ 904 | "column_mean = df.mean(1) # local variable\n", 905 | "\n", 906 | "result1 = df['A'] + column_mean\n", 907 | "result2 = df.eval('A + @column_mean')\n", 908 | "np.allclose(result1, result2)" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "- The ``@`` character marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving __two \"namespaces\"__: the namespace of columns, and the namespace of Python objects.\n", 916 | "\n", 917 | "- Note: this ``@`` character is only supported by ``DataFrame.eval()``, not by ``pandas.eval()`` - ``pandas.eval()`` only has access to one (Python) namespace." 918 | ] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "metadata": {}, 923 | "source": [ 924 | "## DataFrame.query() Method\n", 925 | "\n", 926 | "Consider the following:" 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "execution_count": 26, 932 | "metadata": {}, 933 | "outputs": [ 934 | { 935 | "data": { 936 | "text/plain": [ 937 | "True" 938 | ] 939 | }, 940 | "execution_count": 26, 941 | "metadata": {}, 942 | "output_type": "execute_result" 943 | } 944 | ], 945 | "source": [ 946 | "result1 = df[(df.A < 0.5) & (df.B < 0.5)]\n", 947 | "result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')\n", 948 | "np.allclose(result1, result2)" 949 | ] 950 | }, 951 | { 952 | "cell_type": "markdown", 953 | "metadata": {}, 954 | "source": [ 955 | "- These expressions involve columns of ``DataFrame``s. But it cannot be evaluated using ``DataFrame.eval()``... instead, for this type of filtering operation, you can use the ``query()`` method:" 956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": 27, 961 | "metadata": {}, 962 | "outputs": [ 963 | { 964 | "data": { 965 | "text/plain": [ 966 | "True" 967 | ] 968 | }, 969 | "execution_count": 27, 970 | "metadata": {}, 971 | "output_type": "execute_result" 972 | } 973 | ], 974 | "source": [ 975 | "result2 = df.query('A < 0.5 and B < 0.5')\n", 976 | "np.allclose(result1, result2)" 977 | ] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": {}, 982 | "source": [ 983 | "- This is much more efficient than the masking expression above, plus it's much easier to read. Note that ``query()`` also accepts the ``@`` flag to mark local variables:" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 28, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "data": { 993 | "text/plain": [ 994 | "True" 995 | ] 996 | }, 997 | "execution_count": 28, 998 | "metadata": {}, 999 | "output_type": "execute_result" 1000 | } 1001 | ], 1002 | "source": [ 1003 | "Cmean = df['C'].mean()\n", 1004 | "result1 = df[(df.A < Cmean) & (df.B < Cmean)]\n", 1005 | "result2 = df.query('A < @Cmean and B < @Cmean')\n", 1006 | "np.allclose(result1, result2)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "metadata": {}, 1012 | "source": [ 1013 | "### Performance: When to Use These Functions\n", 1014 | "\n", 1015 | "- Memory use is the most predictable aspect. Every compound expression involving NumPy arrays or Pandas ``DataFrame``s will result in implicit creation of temporary arrays:" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "code", 1020 | "execution_count": 29, 1021 | "metadata": { 1022 | "collapsed": true 1023 | }, 1024 | "outputs": [], 1025 | "source": [ 1026 | "x = df[(df.A < 0.5) & (df.B < 0.5)]" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "markdown", 1031 | "metadata": {}, 1032 | "source": [ 1033 | "- Is roughly equivalent to:" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "code", 1038 | "execution_count": 30, 1039 | "metadata": { 1040 | "collapsed": true 1041 | }, 1042 | "outputs": [], 1043 | "source": [ 1044 | "tmp1 = df.A < 0.5\n", 1045 | "tmp2 = df.B < 0.5\n", 1046 | "tmp3 = tmp1 & tmp2\n", 1047 | "x = df[tmp3]" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "markdown", 1052 | "metadata": {}, 1053 | "source": [ 1054 | "- If the size of the temporary ``DataFrame``s is significant compared to available system memory (typically several GB) then consider using ``eval()`` or ``query()``. You can check the approximate size of your array in bytes using this:" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 31, 1060 | "metadata": {}, 1061 | "outputs": [ 1062 | { 1063 | "data": { 1064 | "text/plain": [ 1065 | "32000" 1066 | ] 1067 | }, 1068 | "execution_count": 31, 1069 | "metadata": {}, 1070 | "output_type": "execute_result" 1071 | } 1072 | ], 1073 | "source": [ 1074 | "df.values.nbytes" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "markdown", 1079 | "metadata": {}, 1080 | "source": [ 1081 | "- ``eval()`` can be faster even when you are not maxing-out your system memory.\n", 1082 | "- The issue is how your temporary ``DataFrame``s compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes); if they are much bigger, then ``eval()`` can avoid some potentially slow movement of values between the different memory caches.\n", 1083 | "- See [\"Enhancing Performance\" section](http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html) for more detail." 1084 | ] 1085 | } 1086 | ], 1087 | "metadata": { 1088 | "anaconda-cloud": {}, 1089 | "kernelspec": { 1090 | "display_name": "Python 3", 1091 | "language": "python", 1092 | "name": "python3" 1093 | }, 1094 | "language_info": { 1095 | "codemirror_mode": { 1096 | "name": "ipython", 1097 | "version": 3 1098 | }, 1099 | "file_extension": ".py", 1100 | "mimetype": "text/x-python", 1101 | "name": "python", 1102 | "nbconvert_exporter": "python", 1103 | "pygments_lexer": "ipython3", 1104 | "version": "3.6.3" 1105 | } 1106 | }, 1107 | "nbformat": 4, 1108 | "nbformat_minor": 1 1109 | } 1110 | -------------------------------------------------------------------------------- /Pandas-Missing-Values.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Handling Missing Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Trade-Offs\n", 15 | "\n", 16 | "- There are multiple schemes to indicate missing data in a table or DataFrame. They use one of two strategies: using a *mask* that globally indicates missing values, or choosing a *sentinel value* that indicates a missing entry.\n", 17 | "\n", 18 | "- A mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.\n", 19 | "\n", 20 | "- A sentinel value could be a data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).\n", 21 | "\n", 22 | "- Using a separate mask array requires allocation of an additional Boolean array; a sentinel value reduces the range of valid values that can be represented and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like NaN are not available for all data types." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Missing Data in Pandas\n", 30 | "\n", 31 | "- Pandas is constrained by its reliance on NumPy, which does not have a built-in notion of NA values for non-floating-point data types.\n", 32 | "\n", 33 | "- Pandas could have specified bit patterns for each individual data type to indicate nullness, but NumPy supports (for example) *fourteen* basic integer types once you account for available precisions, signedness, and endianness of the encoding.\n", 34 | "\n", 35 | "- NumPy supports masked arrays – that is, arrays that have a separate Boolean mask array attached for marking data as \"good\" or \"bad.\" Pandas could have used this approach, but it used too much overhead.\n", 36 | "\n", 37 | "- Therefore Pandas chose to use sentinels for missing data and to use two already-existing Python null values: the special floating-point ``NaN`` value, and the Python ``None`` object." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### ``None``: Pythonic missing data\n", 45 | "\n", 46 | "- ``None`` is a singleton object that is often used for missing data in Python code. Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 1, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "import numpy as np\n", 58 | "import pandas as pd" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "array([1, None, 3, 4], dtype=object)" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "vals1 = np.array([1, None, 3, 4])\n", 79 | "vals1" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "- While an object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead." 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "dtype = object\n", 99 | "48.4 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", 100 | "\n", 101 | "dtype = int\n", 102 | "1.45 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", 103 | "\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "for dtype in ['object', 'int']:\n", 109 | " print(\"dtype =\", dtype)\n", 110 | " %timeit np.arange(1E6, dtype=dtype).sum()\n", 111 | " print()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "- The use of Python objects in an array also means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error:" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "ename": "TypeError", 128 | "evalue": "unsupported operand type(s) for +: 'int' and 'NoneType'", 129 | "output_type": "error", 130 | "traceback": [ 131 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 132 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 133 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mvals1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 134 | "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 30\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 31\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 32\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 33\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 34\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_prod\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 135 | "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "vals1.sum()" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### ``NaN``: Missing numerical data\n", 148 | "\n", 149 | "- ``NaN`` (acronym for *Not a Number*), is different; it is a special floating-point value recognized by all systems that use the standard IEEE format." 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 5, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "dtype('float64')" 161 | ] 162 | }, 163 | "execution_count": 5, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "vals2 = np.array([1, np.nan, 3, 4]) \n", 170 | "vals2.dtype" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "- Be aware that ``NaN`` is a bit like a data virus–it infects any other object it touches. Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 6, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "nan" 189 | ] 190 | }, 191 | "execution_count": 6, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "1 + np.nan" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 7, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "nan" 209 | ] 210 | }, 211 | "execution_count": 7, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "0 * np.nan" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 8, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "(nan, nan, nan)" 229 | ] 230 | }, 231 | "execution_count": 8, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "vals2.sum(), vals2.min(), vals2.max()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "- NumPy provides some special aggregations that will ignore these missing values:" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 9, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "(8.0, 1.0, 4.0)" 256 | ] 257 | }, 258 | "execution_count": 9, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "### NaN and None in Pandas\n", 272 | "\n", 273 | "- Pandas is built to handle both ``NaN`` and ``None`` interchangeably, converting between them where appropriate." 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 10, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "data": { 283 | "text/plain": [ 284 | "0 1.0\n", 285 | "1 NaN\n", 286 | "2 2.0\n", 287 | "3 NaN\n", 288 | "dtype: float64" 289 | ] 290 | }, 291 | "execution_count": 10, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "pd.Series([1, np.nan, 2, None])" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "- For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to ``np.nan``, it will automatically be upcast to a floating-point type to accommodate the NA." 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 11, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "text/plain": [ 315 | "0 0\n", 316 | "1 1\n", 317 | "dtype: int64" 318 | ] 319 | }, 320 | "execution_count": 11, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "x = pd.Series(range(2), dtype=int)\n", 327 | "x" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 12, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "0 NaN\n", 339 | "1 1.0\n", 340 | "dtype: float64" 341 | ] 342 | }, 343 | "execution_count": 12, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "x[0] = None\n", 350 | "x" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "- Upcasting conventions in Pandas when NA values are introduced:\n", 358 | "\n", 359 | "|Typeclass | Conversion When Storing NAs | NA Sentinel Value |\n", 360 | "|--------------|-----------------------------|------------------------|\n", 361 | "| ``floating`` | No change | ``np.nan`` |\n", 362 | "| ``object`` | No change | ``None`` or ``np.nan`` |\n", 363 | "| ``integer`` | Cast to ``float64`` | ``np.nan`` |\n", 364 | "| ``boolean`` | Cast to ``object`` | ``None`` or ``np.nan`` |\n", 365 | "\n", 366 | "Keep in mind that in Pandas, string data is always stored with an ``object`` dtype." 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "### Operating on Null Values\n", 374 | "\n", 375 | "- There are several methods for detecting, removing, and replacing null values in Pandas data structures:\n", 376 | "\n", 377 | "- ``isnull()``: Generate a boolean mask indicating missing values\n", 378 | "- ``notnull()``: Opposite of ``isnull()``\n", 379 | "- ``dropna()``: Return a filtered version of the data\n", 380 | "- ``fillna()``: Return a copy of the data with missing values filled or imputed" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### Detecting null values\n", 388 | "- Pandas data structures have two methods for detecting null data: ``isnull()`` and ``notnull()``. Either one will return a Boolean mask over the data." 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 13, 394 | "metadata": { 395 | "collapsed": true 396 | }, 397 | "outputs": [], 398 | "source": [ 399 | "data = pd.Series([1, np.nan, 'hello', None])" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 14, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "0 False\n", 411 | "1 True\n", 412 | "2 False\n", 413 | "3 True\n", 414 | "dtype: bool" 415 | ] 416 | }, 417 | "execution_count": 14, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "data.isnull()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "- Boolean masks can be used directly as a ``Series`` or ``DataFrame`` index:" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 15, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/plain": [ 441 | "0 1\n", 442 | "2 hello\n", 443 | "dtype: object" 444 | ] 445 | }, 446 | "execution_count": 15, 447 | "metadata": {}, 448 | "output_type": "execute_result" 449 | } 450 | ], 451 | "source": [ 452 | "data[data.notnull()]" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "- ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s." 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "### Dropping null values\n", 467 | "\n", 468 | "- More convenience methods are available:\n", 469 | " - ``dropna()`` (which removes NA values)\n", 470 | " - ``fillna()`` (which fills in NA values)\n", 471 | "- For a ``Series``, the result is straightforward." 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 16, 477 | "metadata": {}, 478 | "outputs": [ 479 | { 480 | "data": { 481 | "text/plain": [ 482 | "0 1\n", 483 | "2 hello\n", 484 | "dtype: object" 485 | ] 486 | }, 487 | "execution_count": 16, 488 | "metadata": {}, 489 | "output_type": "execute_result" 490 | } 491 | ], 492 | "source": [ 493 | "data.dropna()" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "-``DataFrame``s have more options.\n", 501 | "\n", 502 | "- We cannot drop single values from a ``DataFrame`` - only full rows or full columns. Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options.\n", 503 | "\n", 504 | "- By default, ``dropna()`` will drop all rows in which *any* null value is present:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 18, 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "data": { 514 | "text/html": [ 515 | "

\n", 516 | "\n", 529 | "\n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | "

	0	1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6

\n", 559 | "

" 560 | ], 561 | "text/plain": [ 562 | " 0 1 2\n", 563 | "0 1.0 NaN 2\n", 564 | "1 2.0 3.0 5\n", 565 | "2 NaN 4.0 6" 566 | ] 567 | }, 568 | "execution_count": 18, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "df = pd.DataFrame([[1, np.nan, 2],\n", 575 | " [2, 3, 5],\n", 576 | " [np.nan, 4, 6]])\n", 577 | "df" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 19, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/html": [ 588 | "

\n", 589 | "\n", 602 | "\n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | "

	0	1	2
1	2.0	3.0	5

\n", 620 | "

" 621 | ], 622 | "text/plain": [ 623 | " 0 1 2\n", 624 | "1 2.0 3.0 5" 625 | ] 626 | }, 627 | "execution_count": 19, 628 | "metadata": {}, 629 | "output_type": "execute_result" 630 | } 631 | ], 632 | "source": [ 633 | "df.dropna()" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "- Or you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value." 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 20, 646 | "metadata": {}, 647 | "outputs": [ 648 | { 649 | "data": { 650 | "text/html": [ 651 | "

\n", 652 | "\n", 665 | "\n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | "

	2
0	2
1	5
2	6

\n", 687 | "

" 688 | ], 689 | "text/plain": [ 690 | " 2\n", 691 | "0 2\n", 692 | "1 5\n", 693 | "2 6" 694 | ] 695 | }, 696 | "execution_count": 20, 697 | "metadata": {}, 698 | "output_type": "execute_result" 699 | } 700 | ], 701 | "source": [ 702 | "df.dropna(axis='columns')" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "- This drops good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values. This can be specified through the ``how`` or ``thresh`` parameters.\n", 710 | "\n", 711 | "- The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.\n", 712 | "- You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 21, 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "data": { 722 | "text/html": [ 723 | "

\n", 724 | "\n", 737 | "\n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | "

	0	1	2	3
0	1.0	NaN	2	NaN
1	2.0	3.0	5	NaN
2	NaN	4.0	6	NaN

\n", 771 | "

" 772 | ], 773 | "text/plain": [ 774 | " 0 1 2 3\n", 775 | "0 1.0 NaN 2 NaN\n", 776 | "1 2.0 3.0 5 NaN\n", 777 | "2 NaN 4.0 6 NaN" 778 | ] 779 | }, 780 | "execution_count": 21, 781 | "metadata": {}, 782 | "output_type": "execute_result" 783 | } 784 | ], 785 | "source": [ 786 | "df[3] = np.nan\n", 787 | "df" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 22, 793 | "metadata": {}, 794 | "outputs": [ 795 | { 796 | "data": { 797 | "text/html": [ 798 | "

\n", 799 | "\n", 812 | "\n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | "

	0	1	2
0	1.0	NaN	2
1	2.0	3.0	5
2	NaN	4.0	6

\n", 842 | "

" 843 | ], 844 | "text/plain": [ 845 | " 0 1 2\n", 846 | "0 1.0 NaN 2\n", 847 | "1 2.0 3.0 5\n", 848 | "2 NaN 4.0 6" 849 | ] 850 | }, 851 | "execution_count": 22, 852 | "metadata": {}, 853 | "output_type": "execute_result" 854 | } 855 | ], 856 | "source": [ 857 | "df.dropna(axis='columns', how='all')" 858 | ] 859 | }, 860 | { 861 | "cell_type": "markdown", 862 | "metadata": {}, 863 | "source": [ 864 | "- The ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 23, 870 | "metadata": {}, 871 | "outputs": [ 872 | { 873 | "data": { 874 | "text/html": [ 875 | "

\n", 876 | "\n", 889 | "\n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | "

	0	1	2	3
1	2.0	3.0	5	NaN

\n", 909 | "

" 910 | ], 911 | "text/plain": [ 912 | " 0 1 2 3\n", 913 | "1 2.0 3.0 5 NaN" 914 | ] 915 | }, 916 | "execution_count": 23, 917 | "metadata": {}, 918 | "output_type": "execute_result" 919 | } 920 | ], 921 | "source": [ 922 | "df.dropna(axis='rows', thresh=3)" 923 | ] 924 | }, 925 | { 926 | "cell_type": "markdown", 927 | "metadata": {}, 928 | "source": [ 929 | "- The first and last row have been dropped because they contain only two non-null values." 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "metadata": {}, 935 | "source": [ 936 | "### Filling null values\n", 937 | "\n", 938 | "- Sometimes you'd rather replace NA values with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.\n", 939 | "\n", 940 | "- You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced." 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 24, 946 | "metadata": {}, 947 | "outputs": [ 948 | { 949 | "data": { 950 | "text/plain": [ 951 | "a 1.0\n", 952 | "b NaN\n", 953 | "c 2.0\n", 954 | "d NaN\n", 955 | "e 3.0\n", 956 | "dtype: float64" 957 | ] 958 | }, 959 | "execution_count": 24, 960 | "metadata": {}, 961 | "output_type": "execute_result" 962 | } 963 | ], 964 | "source": [ 965 | "data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", 966 | "data" 967 | ] 968 | }, 969 | { 970 | "cell_type": "markdown", 971 | "metadata": {}, 972 | "source": [ 973 | "- Fill NA entries with a single value, such as zero:" 974 | ] 975 | }, 976 | { 977 | "cell_type": "code", 978 | "execution_count": 25, 979 | "metadata": {}, 980 | "outputs": [ 981 | { 982 | "data": { 983 | "text/plain": [ 984 | "a 1.0\n", 985 | "b 0.0\n", 986 | "c 2.0\n", 987 | "d 0.0\n", 988 | "e 3.0\n", 989 | "dtype: float64" 990 | ] 991 | }, 992 | "execution_count": 25, 993 | "metadata": {}, 994 | "output_type": "execute_result" 995 | } 996 | ], 997 | "source": [ 998 | "data.fillna(0)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "markdown", 1003 | "metadata": {}, 1004 | "source": [ 1005 | "- Forward-fill to propagate the previous value forward:" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 26, 1011 | "metadata": {}, 1012 | "outputs": [ 1013 | { 1014 | "data": { 1015 | "text/plain": [ 1016 | "a 1.0\n", 1017 | "b 1.0\n", 1018 | "c 2.0\n", 1019 | "d 2.0\n", 1020 | "e 3.0\n", 1021 | "dtype: float64" 1022 | ] 1023 | }, 1024 | "execution_count": 26, 1025 | "metadata": {}, 1026 | "output_type": "execute_result" 1027 | } 1028 | ], 1029 | "source": [ 1030 | "# forward-fill\n", 1031 | "data.fillna(method='ffill')" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "markdown", 1036 | "metadata": {}, 1037 | "source": [ 1038 | "- Back-fill to propagate the next values backward:" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | "execution_count": 27, 1044 | "metadata": {}, 1045 | "outputs": [ 1046 | { 1047 | "data": { 1048 | "text/plain": [ 1049 | "a 1.0\n", 1050 | "b 2.0\n", 1051 | "c 2.0\n", 1052 | "d 3.0\n", 1053 | "e 3.0\n", 1054 | "dtype: float64" 1055 | ] 1056 | }, 1057 | "execution_count": 27, 1058 | "metadata": {}, 1059 | "output_type": "execute_result" 1060 | } 1061 | ], 1062 | "source": [ 1063 | "# back-fill\n", 1064 | "data.fillna(method='bfill')" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": { 1070 | "collapsed": true 1071 | }, 1072 | "source": [ 1073 | "- We can also specify an ``axis`` along which the fills take place in a ``DataFrame``." 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": 28, 1079 | "metadata": {}, 1080 | "outputs": [ 1081 | { 1082 | "data": { 1083 | "text/html": [ 1084 | "

\n", 1085 | "\n", 1098 | "\n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | "

	0	1	2	3
0	1.0	NaN	2	NaN
1	2.0	3.0	5	NaN
2	NaN	4.0	6	NaN

\n", 1132 | "

" 1133 | ], 1134 | "text/plain": [ 1135 | " 0 1 2 3\n", 1136 | "0 1.0 NaN 2 NaN\n", 1137 | "1 2.0 3.0 5 NaN\n", 1138 | "2 NaN 4.0 6 NaN" 1139 | ] 1140 | }, 1141 | "execution_count": 28, 1142 | "metadata": {}, 1143 | "output_type": "execute_result" 1144 | } 1145 | ], 1146 | "source": [ 1147 | "df" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "code", 1152 | "execution_count": 29, 1153 | "metadata": {}, 1154 | "outputs": [ 1155 | { 1156 | "data": { 1157 | "text/html": [ 1158 | "

\n", 1159 | "\n", 1172 | "\n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | "

	0	1	2	3
0	1.0	1.0	2.0	2.0
1	2.0	3.0	5.0	5.0
2	NaN	4.0	6.0	6.0

\n", 1206 | "

" 1207 | ], 1208 | "text/plain": [ 1209 | " 0 1 2 3\n", 1210 | "0 1.0 1.0 2.0 2.0\n", 1211 | "1 2.0 3.0 5.0 5.0\n", 1212 | "2 NaN 4.0 6.0 6.0" 1213 | ] 1214 | }, 1215 | "execution_count": 29, 1216 | "metadata": {}, 1217 | "output_type": "execute_result" 1218 | } 1219 | ], 1220 | "source": [ 1221 | "df.fillna(method='ffill', axis=1)" 1222 | ] 1223 | } 1224 | ], 1225 | "metadata": { 1226 | "anaconda-cloud": {}, 1227 | "kernelspec": { 1228 | "display_name": "Python 3", 1229 | "language": "python", 1230 | "name": "python3" 1231 | }, 1232 | "language_info": { 1233 | "codemirror_mode": { 1234 | "name": "ipython", 1235 | "version": 3 1236 | }, 1237 | "file_extension": ".py", 1238 | "mimetype": "text/x-python", 1239 | "name": "python", 1240 | "nbconvert_exporter": "python", 1241 | "pygments_lexer": "ipython3", 1242 | "version": "3.6.3" 1243 | } 1244 | }, 1245 | "nbformat": 4, 1246 | "nbformat_minor": 1 1247 | } 1248 | -------------------------------------------------------------------------------- /Pandas-Vectorized-String-Ops.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Vectorized String Operations" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 4, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/plain": [ 25 | "array([ 4, 6, 10, 14, 22, 26])" 26 | ] 27 | }, 28 | "execution_count": 4, 29 | "metadata": {}, 30 | "output_type": "execute_result" 31 | } 32 | ], 33 | "source": [ 34 | "import numpy as np\n", 35 | "x = np.array([2, 3, 5, 7, 11, 13])\n", 36 | "x * 2" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "- NumPy does not do vectorization for string arrays - thus you're stuck using a loop syntax:" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 5, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "['Peter', 'Paul', 'Mary', 'Guido']" 55 | ] 56 | }, 57 | "execution_count": 5, 58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "data = ['peter', 'Paul', 'MARY', 'gUIDO']\n", 64 | "[s.capitalize() for s in data]" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "- This will break if there are any missing values." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 6, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "ename": "AttributeError", 81 | "evalue": "'NoneType' object has no attribute 'capitalize'", 82 | "output_type": "error", 83 | "traceback": [ 84 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 85 | "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", 86 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 87 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 88 | "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'capitalize'" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "data = ['peter', 'Paul', None, 'MARY', 'gUIDO']\n", 94 | "[s.capitalize() for s in data]" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "- Pandas solves this problem via the ``str`` attribute of Pandas Series and Index objects containing strings:" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "0 peter\n", 113 | "1 Paul\n", 114 | "2 None\n", 115 | "3 MARY\n", 116 | "4 gUIDO\n", 117 | "dtype: object" 118 | ] 119 | }, 120 | "execution_count": 7, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "import pandas as pd\n", 127 | "names = pd.Series(data)\n", 128 | "names" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 8, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "0 Peter\n", 140 | "1 Paul\n", 141 | "2 None\n", 142 | "3 Mary\n", 143 | "4 Guido\n", 144 | "dtype: object" 145 | ] 146 | }, 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "names.str.capitalize()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "- Reminder: using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Tables of Pandas String Methods" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 9, 173 | "metadata": { 174 | "collapsed": true 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',\n", 179 | " 'Eric Idle', 'Terry Jones', 'Michael Palin'])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "### Methods similar to Python string methods\n", 187 | "Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:\n", 188 | "\n", 189 | "| | | | |\n", 190 | "|-------------|------------------|------------------|------------------|\n", 191 | "|``len()`` | ``lower()`` | ``translate()`` | ``islower()`` | \n", 192 | "|``ljust()`` | ``upper()`` | ``startswith()`` | ``isupper()`` | \n", 193 | "|``rjust()`` | ``find()`` | ``endswith()`` | ``isnumeric()`` | \n", 194 | "|``center()`` | ``rfind()`` | ``isalnum()`` | ``isdecimal()`` | \n", 195 | "|``zfill()`` | ``index()`` | ``isalpha()`` | ``split()`` | \n", 196 | "|``strip()`` | ``rindex()`` | ``isdigit()`` | ``rsplit()`` | \n", 197 | "|``rstrip()`` | ``capitalize()`` | ``isspace()`` | ``partition()`` | \n", 198 | "|``lstrip()`` | ``swapcase()`` | ``istitle()`` | ``rpartition()`` |\n", 199 | "\n", 200 | "- Notice that these have various return values. Some, like ``lower()``, return a series of strings:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 10, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "0 graham chapman\n", 212 | "1 john cleese\n", 213 | "2 terry gilliam\n", 214 | "3 eric idle\n", 215 | "4 terry jones\n", 216 | "5 michael palin\n", 217 | "dtype: object" 218 | ] 219 | }, 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "monte.str.lower()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "- Some return numbers:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 11, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "0 14\n", 245 | "1 11\n", 246 | "2 13\n", 247 | "3 9\n", 248 | "4 11\n", 249 | "5 13\n", 250 | "dtype: int64" 251 | ] 252 | }, 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "monte.str.len()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "- Or Boolean values:" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 12, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/plain": [ 277 | "0 False\n", 278 | "1 False\n", 279 | "2 True\n", 280 | "3 False\n", 281 | "4 True\n", 282 | "5 False\n", 283 | "dtype: bool" 284 | ] 285 | }, 286 | "execution_count": 12, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "monte.str.startswith('T')" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "- Others return lists or other compound values for each element:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 13, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "0 [Graham, Chapman]\n", 311 | "1 [John, Cleese]\n", 312 | "2 [Terry, Gilliam]\n", 313 | "3 [Eric, Idle]\n", 314 | "4 [Terry, Jones]\n", 315 | "5 [Michael, Palin]\n", 316 | "dtype: object" 317 | ] 318 | }, 319 | "execution_count": 13, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "monte.str.split()" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "### Methods using regular expressions\n", 333 | "\n", 334 | "- Several methods accept regular expressions. They follow the API conventions of Python's built-in ``re`` module:\n", 335 | "\n", 336 | "| Method | Description |\n", 337 | "|--------|-------------|\n", 338 | "| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |\n", 339 | "| ``extract()`` | Call ``re.match()`` on each element, returning matched groups as strings.|\n", 340 | "| ``findall()`` | Call ``re.findall()`` on each element |\n", 341 | "| ``replace()`` | Replace occurrences of pattern with some other string|\n", 342 | "| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |\n", 343 | "| ``count()`` | Count occurrences of pattern|\n", 344 | "| ``split()`` | Equivalent to ``str.split()``, but accepts regexps |\n", 345 | "| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "- For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element." 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 14, 358 | "metadata": {}, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "0 Graham\n", 364 | "1 John\n", 365 | "2 Terry\n", 366 | "3 Eric\n", 367 | "4 Terry\n", 368 | "5 Michael\n", 369 | "dtype: object" 370 | ] 371 | }, 372 | "execution_count": 14, 373 | "metadata": {}, 374 | "output_type": "execute_result" 375 | } 376 | ], 377 | "source": [ 378 | "monte.str.extract('([A-Za-z]+)', expand=False)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "- Find all names that start and end with a consonant, using the start-of-string (``^``) and end-of-string (``$``) regular expression characters:" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 15, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "data": { 395 | "text/plain": [ 396 | "0 [Graham Chapman]\n", 397 | "1 []\n", 398 | "2 [Terry Gilliam]\n", 399 | "3 []\n", 400 | "4 [Terry Jones]\n", 401 | "5 [Michael Palin]\n", 402 | "dtype: object" 403 | ] 404 | }, 405 | "execution_count": 15, 406 | "metadata": {}, 407 | "output_type": "execute_result" 408 | } 409 | ], 410 | "source": [ 411 | "monte.str.findall(r'^[^AEIOU].*[^aeiou]$')" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "### Miscellaneous methods\n", 419 | "\n", 420 | "| Method | Description |\n", 421 | "|--------|-------------|\n", 422 | "| ``get()`` | Index each element |\n", 423 | "| ``slice()`` | Slice each element|\n", 424 | "| ``slice_replace()`` | Replace slice in each element with passed value|\n", 425 | "| ``cat()`` | Concatenate strings|\n", 426 | "| ``repeat()`` | Repeat values |\n", 427 | "| ``normalize()`` | Return Unicode form of string |\n", 428 | "| ``pad()`` | Add whitespace to left, right, or both sides of strings|\n", 429 | "| ``wrap()`` | Split long strings into lines with length less than a given width|\n", 430 | "| ``join()`` | Join strings in each element of the Series with passed separator|\n", 431 | "| ``get_dummies()`` | extract dummy variables as a dataframe |" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "#### Vectorized item access and slicing\n", 439 | "\n", 440 | "- ``get()`` and ``slice()`` provide vectorized element access from each string array. For example, we can get the first three characters of each array using ``str.slice(0, 3)``. This behavior is also available through Python's normal indexing syntax–for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 16, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/plain": [ 451 | "0 Gra\n", 452 | "1 Joh\n", 453 | "2 Ter\n", 454 | "3 Eri\n", 455 | "4 Ter\n", 456 | "5 Mic\n", 457 | "dtype: object" 458 | ] 459 | }, 460 | "execution_count": 16, 461 | "metadata": {}, 462 | "output_type": "execute_result" 463 | } 464 | ], 465 | "source": [ 466 | "monte.str[0:3]" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "- Indexing via ``df.str.get(i)`` and ``df.str[i]`` works similarly. \n", 474 | "\n", 475 | "- ``get()`` and ``slice()`` also let you access elements of arrays returned by ``split()``. For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 17, 481 | "metadata": {}, 482 | "outputs": [ 483 | { 484 | "data": { 485 | "text/plain": [ 486 | "0 Chapman\n", 487 | "1 Cleese\n", 488 | "2 Gilliam\n", 489 | "3 Idle\n", 490 | "4 Jones\n", 491 | "5 Palin\n", 492 | "dtype: object" 493 | ] 494 | }, 495 | "execution_count": 17, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "monte.str.split().str.get(-1)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": {}, 507 | "source": [ 508 | "#### Indicator variables\n", 509 | "\n", 510 | "- ``get_dummies()`` is useful when your data has a column containing a coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A=\"born in America,\" B=\"born in the United Kingdom,\" C=\"likes cheese,\" D=\"likes spam\":" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 18, 516 | "metadata": {}, 517 | "outputs": [ 518 | { 519 | "data": { 520 | "text/html": [ 521 | "

\n", 522 | "\n", 535 | "\n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | "

	info	name
0	B\|C\|D	Graham Chapman
1	B\|D	John Cleese
2	A\|C	Terry Gilliam
3	B\|D	Eric Idle
4	B\|C	Terry Jones
5	B\|C\|D	Michael Palin

\n", 576 | "

" 577 | ], 578 | "text/plain": [ 579 | " info name\n", 580 | "0 B|C|D Graham Chapman\n", 581 | "1 B|D John Cleese\n", 582 | "2 A|C Terry Gilliam\n", 583 | "3 B|D Eric Idle\n", 584 | "4 B|C Terry Jones\n", 585 | "5 B|C|D Michael Palin" 586 | ] 587 | }, 588 | "execution_count": 18, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "full_monte = pd.DataFrame({'name': monte,\n", 595 | " 'info': ['B|C|D', 'B|D', 'A|C',\n", 596 | " 'B|D', 'B|C', 'B|C|D']})\n", 597 | "full_monte" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "- ``get_dummies()`` lets you split these indicator variables into a ``DataFrame``:" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 19, 610 | "metadata": {}, 611 | "outputs": [ 612 | { 613 | "data": { 614 | "text/html": [ 615 | "

\n", 616 | "\n", 629 | "\n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | "

	A	B	C	D
0	0	1	1	1
1	0	1	0	1
2	1	0	1	0
3	0	1	0	1
4	0	1	1	0
5	0	1	1	1

\n", 684 | "

" 685 | ], 686 | "text/plain": [ 687 | " A B C D\n", 688 | "0 0 1 1 1\n", 689 | "1 0 1 0 1\n", 690 | "2 1 0 1 0\n", 691 | "3 0 1 0 1\n", 692 | "4 0 1 1 0\n", 693 | "5 0 1 1 1" 694 | ] 695 | }, 696 | "execution_count": 19, 697 | "metadata": {}, 698 | "output_type": "execute_result" 699 | } 700 | ], 701 | "source": [ 702 | "full_monte['info'].str.get_dummies('|')" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "### Example: Recipe Database\n", 710 | "\n", 711 | "- Vectorized string operations become most useful when cleaning up messy, real-world data. Let's parse some recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.\n", 712 | "\n", 713 | "- The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.\n", 714 | "\n", 715 | "- See [issue# 62](https://github.com/jakevdp/PythonDataScienceHandbook/issues/62) - updated dataset is [here](https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz)." 716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 20, 721 | "metadata": {}, 722 | "outputs": [], 723 | "source": [ 724 | "#!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz\n", 725 | "#!gunzip recipeitems-latest.json.gz" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": { 731 | "collapsed": true 732 | }, 733 | "source": [ 734 | "- The database is in JSON format - try ``pd.read_json`` to read it:" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 21, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "name": "stdout", 744 | "output_type": "stream", 745 | "text": [ 746 | "ValueError: Trailing data\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "try:\n", 752 | " #recipes = pd.read_json('recipeitems-latest.json')\n", 753 | " recipes = pd.read_json('20170107-061401-recipeitems.json')\n", 754 | "except ValueError as e:\n", 755 | " print(\"ValueError:\", e)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "markdown", 760 | "metadata": {}, 761 | "source": [ 762 | "- Oops! We get a ``ValueError`` mentioning that there is \"trailing data.\"\n", 763 | "Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not. Let's check if this interpretation is true:" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 22, 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "data": { 773 | "text/plain": [ 774 | "(2, 12)" 775 | ] 776 | }, 777 | "execution_count": 22, 778 | "metadata": {}, 779 | "output_type": "execute_result" 780 | } 781 | ], 782 | "source": [ 783 | "with open('20170107-061401-recipeitems.json') as f:\n", 784 | " line = f.readline()\n", 785 | "pd.read_json(line).shape" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "- Yes, apparently each line is a valid JSON, so we'll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": 23, 798 | "metadata": { 799 | "collapsed": true 800 | }, 801 | "outputs": [], 802 | "source": [ 803 | "# read the entire file into a Python array\n", 804 | "with open('20170107-061401-recipeitems.json', 'r') as f:\n", 805 | " # Extract each line\n", 806 | " data = (line.strip() for line in f)\n", 807 | " # Reformat so each line is the element of a list\n", 808 | " data_json = \"[{0}]\".format(','.join(data))\n", 809 | "# read the result as a JSON\n", 810 | "recipes = pd.read_json(data_json)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 24, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "data": { 820 | "text/plain": [ 821 | "(173278, 17)" 822 | ] 823 | }, 824 | "execution_count": 24, 825 | "metadata": {}, 826 | "output_type": "execute_result" 827 | } 828 | ], 829 | "source": [ 830 | "recipes.shape" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "- Examine one row to see what we have:" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 25, 843 | "metadata": {}, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "_id {'$oid': '5160756b96cc62079cc2db15'}\n", 849 | "cookTime PT30M\n", 850 | "creator NaN\n", 851 | "dateModified NaN\n", 852 | "datePublished 2013-03-11\n", 853 | "description Late Saturday afternoon, after Marlboro Man ha...\n", 854 | "image http://static.thepioneerwoman.com/cooking/file...\n", 855 | "ingredients Biscuits\\n3 cups All-purpose Flour\\n2 Tablespo...\n", 856 | "name Drop Biscuits and Sausage Gravy\n", 857 | "prepTime PT10M\n", 858 | "recipeCategory NaN\n", 859 | "recipeInstructions NaN\n", 860 | "recipeYield 12\n", 861 | "source thepioneerwoman\n", 862 | "totalTime NaN\n", 863 | "ts {'$date': 1365276011104}\n", 864 | "url http://thepioneerwoman.com/cooking/2013/03/dro...\n", 865 | "Name: 0, dtype: object" 866 | ] 867 | }, 868 | "execution_count": 25, 869 | "metadata": {}, 870 | "output_type": "execute_result" 871 | } 872 | ], 873 | "source": [ 874 | "recipes.iloc[0]" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "- There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. The ingredient list is in string format; we're going to have to carefully extract the information we're interested in. Describe the ingredients:" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": 26, 887 | "metadata": {}, 888 | "outputs": [ 889 | { 890 | "data": { 891 | "text/plain": [ 892 | "count 173278.000000\n", 893 | "mean 244.617926\n", 894 | "std 146.705285\n", 895 | "min 0.000000\n", 896 | "25% 147.000000\n", 897 | "50% 221.000000\n", 898 | "75% 314.000000\n", 899 | "max 9067.000000\n", 900 | "Name: ingredients, dtype: float64" 901 | ] 902 | }, 903 | "execution_count": 26, 904 | "metadata": {}, 905 | "output_type": "execute_result" 906 | } 907 | ], 908 | "source": [ 909 | "recipes.ingredients.str.len().describe()" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": {}, 915 | "source": [ 916 | "- The ingredient lists averages ~244 characters. Which recipe has the longest ingredient list?" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": 27, 922 | "metadata": {}, 923 | "outputs": [ 924 | { 925 | "data": { 926 | "text/plain": [ 927 | "'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'" 928 | ] 929 | }, 930 | "execution_count": 27, 931 | "metadata": {}, 932 | "output_type": "execute_result" 933 | } 934 | ], 935 | "source": [ 936 | "recipes.name[np.argmax(recipes.ingredients.str.len())]" 937 | ] 938 | }, 939 | { 940 | "cell_type": "markdown", 941 | "metadata": {}, 942 | "source": [ 943 | "- How many of the recipes are for breakfast?" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": 28, 949 | "metadata": {}, 950 | "outputs": [ 951 | { 952 | "data": { 953 | "text/plain": [ 954 | "3524" 955 | ] 956 | }, 957 | "execution_count": 28, 958 | "metadata": {}, 959 | "output_type": "execute_result" 960 | } 961 | ], 962 | "source": [ 963 | "recipes.description.str.contains('[Bb]reakfast').sum()" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "- How many of the recipes list cinnamon as an ingredient?" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": 29, 976 | "metadata": {}, 977 | "outputs": [ 978 | { 979 | "data": { 980 | "text/plain": [ 981 | "10526" 982 | ] 983 | }, 984 | "execution_count": 29, 985 | "metadata": {}, 986 | "output_type": "execute_result" 987 | } 988 | ], 989 | "source": [ 990 | "recipes.ingredients.str.contains('[Cc]innamon').sum()" 991 | ] 992 | }, 993 | { 994 | "cell_type": "markdown", 995 | "metadata": {}, 996 | "source": [ 997 | "- Any recipes misspell the ingredient as \"cinamon\"?" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": 30, 1003 | "metadata": {}, 1004 | "outputs": [ 1005 | { 1006 | "data": { 1007 | "text/plain": [ 1008 | "11" 1009 | ] 1010 | }, 1011 | "execution_count": 30, 1012 | "metadata": {}, 1013 | "output_type": "execute_result" 1014 | } 1015 | ], 1016 | "source": [ 1017 | "recipes.ingredients.str.contains('[Cc]inamon').sum()" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": {}, 1023 | "source": [ 1024 | "### A simple recipe recommender\n", 1025 | "\n", 1026 | "- Design a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.\n", 1027 | "- While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.\n", 1028 | "- So we will cheat by starting with a list of common ingredients and search to see whether they are in each recipe's ingredient list.\n", 1029 | "- For simplicity, start with herbs and spices." 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": 31, 1035 | "metadata": { 1036 | "collapsed": true 1037 | }, 1038 | "outputs": [], 1039 | "source": [ 1040 | "spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',\n", 1041 | " 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "- Build a Boolean ``DataFrame`` consisting of True and False values, indicating whether this ingredient appears in the list." 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": 32, 1054 | "metadata": {}, 1055 | "outputs": [ 1056 | { 1057 | "data": { 1058 | "text/html": [ 1059 | "

\n", 1060 | "\n", 1073 | "\n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | "

	cumin	oregano	paprika	parsley	pepper	rosemary	sage	salt	tarragon	thyme
0	False	False	False	False	False	False	True	False	False	False
1	False	False	False	False	False	False	False	False	False	False
2	True	False	False	False	True	False	False	True	False	False
3	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False

\n", 1157 | "

" 1158 | ], 1159 | "text/plain": [ 1160 | " cumin oregano paprika parsley pepper rosemary sage salt tarragon \\\n", 1161 | "0 False False False False False False True False False \n", 1162 | "1 False False False False False False False False False \n", 1163 | "2 True False False False True False False True False \n", 1164 | "3 False False False False False False False False False \n", 1165 | "4 False False False False False False False False False \n", 1166 | "\n", 1167 | " thyme \n", 1168 | "0 False \n", 1169 | "1 False \n", 1170 | "2 False \n", 1171 | "3 False \n", 1172 | "4 False " 1173 | ] 1174 | }, 1175 | "execution_count": 32, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "import re\n", 1182 | "spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))\n", 1183 | " for spice in spice_list))\n", 1184 | "spice_df.head()" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "markdown", 1189 | "metadata": {}, 1190 | "source": [ 1191 | "- Let's say we'd want a recipe that uses parsley, paprika, and tarragon. Use the ``query()`` method of ``DataFrame``s." 1192 | ] 1193 | }, 1194 | { 1195 | "cell_type": "code", 1196 | "execution_count": 33, 1197 | "metadata": {}, 1198 | "outputs": [ 1199 | { 1200 | "data": { 1201 | "text/plain": [ 1202 | "10" 1203 | ] 1204 | }, 1205 | "execution_count": 33, 1206 | "metadata": {}, 1207 | "output_type": "execute_result" 1208 | } 1209 | ], 1210 | "source": [ 1211 | "selection = spice_df.query('parsley & paprika & tarragon')\n", 1212 | "len(selection)" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "- Use the index returned by this selection to discover the names of the 10 recipes that have this combination." 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 34, 1225 | "metadata": {}, 1226 | "outputs": [ 1227 | { 1228 | "data": { 1229 | "text/plain": [ 1230 | "2069 All cremat with a Little Gem, dandelion and wa...\n", 1231 | "74964 Lobster with Thermidor butter\n", 1232 | "93768 Burton's Southern Fried Chicken with White Gravy\n", 1233 | "113926 Mijo's Slow Cooker Shredded Beef\n", 1234 | "137686 Asparagus Soup with Poached Eggs\n", 1235 | "140530 Fried Oyster Po’boys\n", 1236 | "158475 Lamb shank tagine with herb tabbouleh\n", 1237 | "158486 Southern fried chicken in buttermilk\n", 1238 | "163175 Fried Chicken Sliders with Pickles + Slaw\n", 1239 | "165243 Bar Tartine Cauliflower Salad\n", 1240 | "Name: name, dtype: object" 1241 | ] 1242 | }, 1243 | "execution_count": 34, 1244 | "metadata": {}, 1245 | "output_type": "execute_result" 1246 | } 1247 | ], 1248 | "source": [ 1249 | "recipes.name[selection.index]" 1250 | ] 1251 | } 1252 | ], 1253 | "metadata": { 1254 | "anaconda-cloud": {}, 1255 | "kernelspec": { 1256 | "display_name": "Python 3", 1257 | "language": "python", 1258 | "name": "python3" 1259 | }, 1260 | "language_info": { 1261 | "codemirror_mode": { 1262 | "name": "ipython", 1263 | "version": 3 1264 | }, 1265 | "file_extension": ".py", 1266 | "mimetype": "text/x-python", 1267 | "name": "python", 1268 | "nbconvert_exporter": "python", 1269 | "pygments_lexer": "ipython3", 1270 | "version": "3.6.3" 1271 | } 1272 | }, 1273 | "nbformat": 4, 1274 | "nbformat_minor": 1 1275 | } 1276 | --------------------------------------------------------------------------------