├── hello.png ├── .gitignore ├── README.md ├── 05.01-What-Is-Machine-Learning.ipynb ├── Pandas-Operations.ipynb ├── Matplotlib-Errorbars.ipynb ├── Pandas-Performance-Eval-and-Query.ipynb ├── Pandas-Missing-Values.ipynb └── Pandas-Vectorized-String-Ops.ipynb /hello.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thinknewtech/Data-Science/HEAD/hello.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | env/ 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | local_settings.py 56 | 57 | # Flask stuff: 58 | instance/ 59 | .webassets-cache 60 | 61 | # Scrapy stuff: 62 | .scrapy 63 | 64 | # Sphinx documentation 65 | docs/_build/ 66 | 67 | # PyBuilder 68 | target/ 69 | 70 | # Jupyter Notebook 71 | .ipynb_checkpoints 72 | 73 | # pyenv 74 | .python-version 75 | 76 | # celery beat schedule file 77 | celerybeat-schedule 78 | 79 | # SageMath parsed files 80 | *.sage.py 81 | 82 | # dotenv 83 | .env 84 | 85 | # virtualenv 86 | .venv 87 | venv/ 88 | ENV/ 89 | 90 | # Spyder project settings 91 | .spyderproject 92 | .spyproject 93 | 94 | # Rope project settings 95 | .ropeproject 96 | 97 | # mkdocs documentation 98 | /site 99 | 100 | # mypy 101 | .mypy_cache/ 102 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science Handbook Cribsheet 2 | Intro to NumPy, Pandas, Matplotlib and Scikit-Learn. Condensed from Python Data Science Handbook (@jakevdp) 3 | 4 | ![book cover](/figures/PDSH-cover.png) 
| IPython | NumPy | Pandas | Matplotlib | Scikit-Learn |
|---------|-------|--------|------------|--------------|
| Help & docs | Datatypes | Pandas objects | Line plots | Intro to ML |
| Keyboard shortcuts | Arrays | Indexing & selection | Scatter plots | Intro to Scikit-Learn |
| Magic commands | Universal functions | Data operations | Error bars | Hyperparameters & model validation |
| Input & output history | Aggregations | Missing data | Density & contour plots | Feature engineering |
| Shell commands | Broadcasting | Hierarchical indexing | Histograms, bins & densities | Naive Bayes |
| Errors & debugging | Comparisons, masks, boolean logic | Concat & append | Plot legends | Linear regression |
| Profiling & timing | Fancy indexing | Merge & join | Colorbars | Support vector machines (SVM) |
| Resources | Sorting arrays | Aggregation & grouping | Subplots | Decision trees & random forests |
| | Structured data | Pivot tables | Text & annotation | Principal component analysis (PCA) |
| | | Vectorized strings | Custom tickmarks | Manifold learning |
| | | Time series | Configs & stylesheets | K-means clustering |
| | | High-performance ops: eval(), query() | 3D plotting | Gaussian mixtures |
| | | | Geo data with Basemap (TO DO) | Kernel density estimation (KDE) |
| | | | Visualization with Seaborn (TO DO) | Face detection pipeline |
| | | | | Resources |
105 | -------------------------------------------------------------------------------- /05.01-What-Is-Machine-Learning.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# What Is Machine Learning?" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Categories of Machine Learning\n", 15 | "\n", 16 | "- __Supervised__ learning models the relationship between measured features and the labels associated with the data. Once the model is built, it can be used to apply labels to new, unknown data.\n", 17 | "- This is further subdivided into __classification__ and __regression__ tasks: classification labels are __discrete__ categories; regression labels are __continuous__ quantities.\n", 18 | "\n", 19 | "- __Unsupervised__ learning models the features of a dataset without reference to any label. It is often described as \"letting the dataset speak for itself.\"\n", 20 | "- These models include tasks such as __clustering__ and __dimensionality reduction.__\n", 21 | "- Clustering algorithms identify __distinct__ groups of data, while dimensionality reduction algorithms search for __more succinct representations__ of the data.\n", 22 | "- In addition, there are so-called __semi-supervised__ learning methods. They are often useful when only incomplete labels are available." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Classification: Predicting discrete labels\n", 30 | "\n", 31 | "- We will first take a look at a simple __classification__ task, in which you are given a set of labeled points and want to use these to classify some unlabeled points." 
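This notebook's cells are all prose, so here is a concrete sketch of the train-then-predict workflow just described. A tiny nearest-centroid classifier in plain NumPy plays the role of the straight-line model; the points, class labels, and the centroid rule are all assumptions made for this illustration, not the handbook's figure code:

```python
import numpy as np

# Hypothetical labeled training points: two features per point,
# two classes ("blue" = 0, "red" = 1).
X_train = np.array([[1.0, 1.2], [1.5, 0.8], [0.9, 1.1],   # class 0
                    [3.0, 3.1], [3.4, 2.8], [2.9, 3.3]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

# "Training": here the learned model parameters are simply the two class centroids.
centroids = np.array([X_train[y_train == k].mean(axis=0) for k in (0, 1)])

# "Prediction": label each new, unlabeled point with the class of the nearest centroid.
def predict(X_new):
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

labels = predict(np.array([[1.0, 1.0], [3.2, 3.0]]))
print(labels)  # one predicted class label per new point
```

A real classifier from Scikit-Learn would learn a separating line rather than centroids, but the train/predict split is exactly the same.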
32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "![](figures/05.01-classification-1.png)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "- Here we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane. We also have one of two *class labels* for each point represented by the point's color.\n", 46 | "- We would like to create a model that will let us decide whether a new point should be labeled \"blue\" or \"red.\"\n", 47 | "- Let's assume the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group.\n", 48 | "- Here the *model* is a mathematical equivalent of \"a straight line separates the classes\", while the *model parameters* are the values describing the location and orientation of that line.\n", 49 | "- The optimal values for these model parameters are learned from the data. This is often called *training the model*.\n", 50 | "- Here's what the trained model looks like:" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "![](figures/05.01-classification-2.png)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "- This trained model can now be applied to new, unlabeled data. This is called *prediction*. 
See the following figure:" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "![](figures/05.01-classification-3.png)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### Regression: Predicting continuous labels\n", 79 | "\n", 80 | "- Consider the data shown in the following figure, which consists of a set of points with continuous labels:" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "![](figures/05.01-regression-1.png)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "- The color of each point represents the continuous label for that point. We will use a simple linear regression to predict the labels of the points.\n", 95 | "- This model assumes that if we treat the label as a third spatial dimension, we can fit a plane to the data." 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "![](figures/05.01-regression-2.png)" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": {}, 108 | "source": [ 109 | "Notice that the *feature 1-feature 2* plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position.\n", 110 | "From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters.\n", 111 | "Returning to the two-dimensional projection, when we fit such a plane we get the result shown in the following figure:" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "![](figures/05.01-regression-3.png)\n", 119 | "[figure source in Appendix](06.00-Figure-Code.ipynb#Regression-Example-Figure-3)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | 
"source": [ 126 | "This plane of fit gives us what we need to predict labels for new points.\n", 127 | "Visually, we find the results shown in the following figure:" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "![](figures/05.01-regression-4.png)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "As with the classification example, this may seem rather trivial in a low number of dimensions.\n", 142 | "But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.\n", 143 | "\n", 144 | "For example, this is similar to the task of computing the distance to galaxies observed through a telescope—in this case, we might use the following features and labels:\n", 145 | "\n", 146 | "- *feature 1*, *feature 2*, etc. $\\to$ brightness of each galaxy at one of several wavelengths or colors\n", 147 | "- *label* $\\to$ distance or redshift of the galaxy\n", 148 | "\n", 149 | "The distances for a small number of these galaxies might be determined through an independent set of (typically more expensive) observations.\n", 150 | "Distances to remaining galaxies could then be estimated using a suitable regression model, without the need to employ the more expensive observations across the entire set.\n", 151 | "In astronomy circles, this is known as the \"photometric redshift\" problem." 
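The regression figures are not reproduced here, but the plane-fitting step itself can be sketched in a few lines of NumPy. The synthetic data below (fifty points lying exactly on a plane) stands in for the figure's points and is purely illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 2)                  # columns: feature 1, feature 2
y = X @ np.array([2.0, -1.0]) + 0.5  # continuous labels lying on a plane

# Fit y = w1*f1 + w2*f2 + b by least squares: append a column of
# ones so the intercept b is estimated along with the two slopes.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the label for a new, unlabeled feature pair (0.2, 0.7).
y_new = np.array([0.2, 0.7, 1.0]) @ w
print(w, y_new)  # recovered [slope1, slope2, intercept] and one prediction
```

Because the data here is noiseless, the fit recovers the generating plane essentially exactly; with real data the least-squares plane is only the best linear approximation.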
152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Clustering: Inferring labels on unlabeled data\n", 159 | "\n", 160 | "The classification and regression illustrations we just looked at are examples of supervised learning algorithms, in which we are trying to build a model that will predict labels for new data.\n", 161 | "Unsupervised learning involves models that describe data without reference to any known labels.\n", 162 | "\n", 163 | "One common case of unsupervised learning is \"clustering,\" in which data is automatically assigned to some number of discrete groups.\n", 164 | "For example, we might have some two-dimensional data like that shown in the following figure:" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "![](figures/05.01-clustering-1.png)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "By eye, it is clear that each of these points is part of a distinct group.\n", 179 | "Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related.\n", 180 | "The *k*-means algorithm yields the clusters shown in the following figure:" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "![](figures/05.01-clustering-2.png)\n", 188 | "[figure source in Appendix](06.00-Figure-Code.ipynb#Clustering-Example-Figure-2)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "*k*-means fits a model consisting of *k* cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.\n", 196 | "Again, this might seem like a trivial exercise in two dimensions, but as our data becomes larger and more complex, such clustering algorithms can be employed to extract useful information from the dataset." 
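The assign-then-re-center loop just described can be sketched as a bare-bones Lloyd iteration in NumPy. The two synthetic blobs and the naive initialization below are assumptions made for this illustration; in practice you would reach for `sklearn.cluster.KMeans`, which uses the smarter k-means++ initialization:

```python
import numpy as np

rng = np.random.RandomState(42)
# Two well-separated synthetic blobs in two dimensions.
X = np.vstack([rng.randn(50, 2) + [0, 0],
               rng.randn(50, 2) + [5, 5]])

# Naive deterministic initialization: one data point from each blob.
centers = X[[0, 50]].copy()
for _ in range(10):  # Lloyd's algorithm: assign each point, then re-center
    labels = np.linalg.norm(X[:, None] - centers, axis=2).argmin(axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers, 2))  # one center settles near each blob mean
```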
197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### Dimensionality reduction: Inferring structure of unlabeled data\n", 204 | "\n", 205 | "Dimensionality reduction is another example of an unsupervised algorithm, in which labels or other information are inferred from the structure of the dataset itself.\n", 206 | "Dimensionality reduction is a bit more abstract than the examples we looked at before, but generally it seeks to pull out some low-dimensional representation of data that in some way preserves relevant qualities of the full dataset.\n", 207 | "\n", 208 | "As an example of this, consider the data shown in the following figure:" 209 | ] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "metadata": {}, 214 | "source": [ 215 | "![](figures/05.01-dimesionality-1.png)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space.\n", 223 | "In a sense, you could say that this data is \"intrinsically\" only one dimensional, though this one-dimensional data is embedded in higher-dimensional space.\n", 224 | "A suitable dimensionality reduction model in this case would be sensitive to this nonlinear embedded structure, and be able to pull out this lower-dimensionality representation.\n", 225 | "\n", 226 | "The following figure shows a visualization of the results of the Isomap algorithm, a manifold learning algorithm that does exactly this:" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "![](figures/05.01-dimesionality-2.png)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "Notice that the colors (which represent the extracted one-dimensional latent variable) change uniformly along the 
spiral, which indicates that the algorithm did in fact detect the structure we saw by eye.\n", 241 | "As with the previous examples, the power of dimensionality reduction algorithms becomes clearer in higher-dimensional cases.\n", 242 | "For example, we might wish to visualize important relationships within a dataset that has 100 or 1,000 features.\n", 243 | "Visualizing 1,000-dimensional data is a challenge, and one way we can make this more manageable is to use a dimensionality reduction technique to reduce the data to two or three dimensions." 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "## Summary\n", 251 | "\n", 252 | "Here we have seen a few simple examples of some of the basic types of machine learning approaches.\n", 253 | "Needless to say, there are a number of important practical details that we have glossed over, but I hope this section was enough to give you a basic idea of what types of problems machine learning approaches can solve.\n", 254 | "\n", 255 | "In short, we saw the following:\n", 256 | "\n", 257 | "- *Supervised learning*: Models that can predict labels based on labeled training data\n", 258 | "\n", 259 | " - *Classification*: Models that predict labels as two or more discrete categories\n", 260 | " - *Regression*: Models that predict continuous labels\n", 261 | " \n", 262 | "- *Unsupervised learning*: Models that identify structure in unlabeled data\n", 263 | "\n", 264 | " - *Clustering*: Models that detect and identify distinct groups in the data\n", 265 | " - *Dimensionality reduction*: Models that detect and identify lower-dimensional structure in higher-dimensional data\n", 266 | " \n", 267 | "In the following sections we will go into much greater depth within these categories, and see some more interesting examples of where these concepts can be useful." 
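To round out the unsupervised examples with something runnable: the spiral above calls for a nonlinear method such as Isomap, but the basic mechanics of dimensionality reduction — finding a lower-dimensional representation that preserves most of the variance — can be sketched with linear PCA via NumPy's SVD. The intrinsically one-dimensional synthetic data below is an assumption for the example, not the spiral from the figure:

```python
import numpy as np

rng = np.random.RandomState(1)
# 3D points that are intrinsically 1D: positions along a line, plus tiny noise.
t = rng.rand(100)
X = np.outer(t, [1.0, 2.0, 3.0]) + 0.01 * rng.randn(100, 3)

# PCA by SVD: the top right-singular vector of the centered data
# is the direction of maximum variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()

X_1d = Xc @ Vt[0]              # the recovered one-dimensional representation
print(np.round(explained, 4))  # first component carries almost all variance
```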
268 | ] 269 | } 270 | ], 271 | "metadata": { 272 | "anaconda-cloud": {}, 273 | "kernelspec": { 274 | "display_name": "Python 3", 275 | "language": "python", 276 | "name": "python3" 277 | }, 278 | "language_info": { 279 | "codemirror_mode": { 280 | "name": "ipython", 281 | "version": 3 282 | }, 283 | "file_extension": ".py", 284 | "mimetype": "text/x-python", 285 | "name": "python", 286 | "nbconvert_exporter": "python", 287 | "pygments_lexer": "ipython3", 288 | "version": "3.6.3" 289 | } 290 | }, 291 | "nbformat": 4, 292 | "nbformat_minor": 1 293 | } 294 | -------------------------------------------------------------------------------- /Pandas-Operations.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Operating on Data in Pandas" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "- NumPy provides quick element-wise operations for basic arithmetic and sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).\n", 15 | "\n", 16 | "- Pandas inherits this functionality and adds useful twists:\n", 17 | "\n", 18 | " - unary operations like negation and trigonometric functions *preserve index and column labels* in the output.\n", 19 | " - binary operations such as addition and multiplication automatically *align indices* when passing the objects to the ufunc." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### Ufuncs: Index Preservation\n", 27 | "\n", 28 | "- Any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects." 
29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 1, 34 | "metadata": { 35 | "collapsed": true 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "import pandas as pd\n", 40 | "import numpy as np" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/plain": [ 51 | "0 6\n", 52 | "1 3\n", 53 | "2 7\n", 54 | "3 4\n", 55 | "dtype: int64" 56 | ] 57 | }, 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "rng = np.random.RandomState(42)\n", 65 | "ser = pd.Series(rng.randint(0, 10, 4))\n", 66 | "ser" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 3, 72 | "metadata": {}, 73 | "outputs": [ 74 | { 75 | "data": { 76 | "text/html": [ 77 | "
\n", 78 | "\n", 91 | "\n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | "
ABCD
06926
17437
27254
\n", 125 | "
" 126 | ], 127 | "text/plain": [ 128 | " A B C D\n", 129 | "0 6 9 2 6\n", 130 | "1 7 4 3 7\n", 131 | "2 7 2 5 4" 132 | ] 133 | }, 134 | "execution_count": 3, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "df = pd.DataFrame(rng.randint(0, 10, (3, 4)),\n", 141 | " columns=['A', 'B', 'C', 'D'])\n", 142 | "df" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "- If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 4, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "0 403.428793\n", 161 | "1 20.085537\n", 162 | "2 1096.633158\n", 163 | "3 54.598150\n", 164 | "dtype: float64" 165 | ] 166 | }, 167 | "execution_count": 4, 168 | "metadata": {}, 169 | "output_type": "execute_result" 170 | } 171 | ], 172 | "source": [ 173 | "np.exp(ser)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 5, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/html": [ 184 | "
\n", 185 | "\n", 198 | "\n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | "
ABCD
0-1.0000007.071068e-011.000000-1.000000e+00
1-0.7071071.224647e-160.707107-7.071068e-01
2-0.7071071.000000e+00-0.7071071.224647e-16
\n", 232 | "
" 233 | ], 234 | "text/plain": [ 235 | " A B C D\n", 236 | "0 -1.000000 7.071068e-01 1.000000 -1.000000e+00\n", 237 | "1 -0.707107 1.224647e-16 0.707107 -7.071068e-01\n", 238 | "2 -0.707107 1.000000e+00 -0.707107 1.224647e-16" 239 | ] 240 | }, 241 | "execution_count": 5, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "np.sin(df * np.pi / 4)" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "### UFuncs: Index Alignment\n", 255 | "\n", 256 | "- For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices while performing the operation. This is very convenient when working with incomplete data." 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "### Index alignment in Series\n", 264 | "\n", 265 | "- Suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 6, 271 | "metadata": { 272 | "collapsed": true 273 | }, 274 | "outputs": [], 275 | "source": [ 276 | "area = pd.Series({'Alaska': 1723337, 'Texas': 695662,\n", 277 | " 'California': 423967}, name='area')\n", 278 | "population = pd.Series({'California': 38332521, 'Texas': 26448193,\n", 279 | " 'New York': 19651127}, name='population')" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "- What happens when we divide these to compute the population density:" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 7, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/plain": [ 297 | "Alaska NaN\n", 298 | "California 90.413926\n", 299 | "New York NaN\n", 300 | "Texas 38.018740\n", 301 | "dtype: float64" 302 | ] 303 | }, 304 | "execution_count": 7, 305 | "metadata": {}, 306 | 
"output_type": "execute_result" 307 | } 308 | ], 309 | "source": [ 310 | "population / area" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "- The result contains the *union* of indices of the two input arrays." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 8, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "data": { 327 | "text/plain": [ 328 | "Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')" 329 | ] 330 | }, 331 | "execution_count": 8, 332 | "metadata": {}, 333 | "output_type": "execute_result" 334 | } 335 | ], 336 | "source": [ 337 | "area.index | population.index" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "- Any item for which one or the other does not have an entry is marked with ``NaN``, or \"Not a Number,\" which is how Pandas marks missing data. Index matching is implemented this way for any built-in arithmetic expressions." 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 9, 350 | "metadata": {}, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "0 NaN\n", 356 | "1 5.0\n", 357 | "2 9.0\n", 358 | "3 NaN\n", 359 | "dtype: float64" 360 | ] 361 | }, 362 | "execution_count": 9, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "A = pd.Series([2, 4, 6], index=[0, 1, 2])\n", 369 | "B = pd.Series([1, 3, 5], index=[1, 2, 3])\n", 370 | "A + B" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [ 377 | "- The fill value can be modified." 
378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 10, 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "data": { 387 | "text/plain": [ 388 | "0 2.0\n", 389 | "1 5.0\n", 390 | "2 9.0\n", 391 | "3 5.0\n", 392 | "dtype: float64" 393 | ] 394 | }, 395 | "execution_count": 10, 396 | "metadata": {}, 397 | "output_type": "execute_result" 398 | } 399 | ], 400 | "source": [ 401 | "A.add(B, fill_value=0)" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### Index alignment in DataFrame\n", 409 | "\n", 410 | "- A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 13, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/html": [ 421 | "
\n", 422 | "\n", 435 | "\n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | "
AB
086
1173
\n", 456 | "
" 457 | ], 458 | "text/plain": [ 459 | " A B\n", 460 | "0 8 6\n", 461 | "1 17 3" 462 | ] 463 | }, 464 | "execution_count": 13, 465 | "metadata": {}, 466 | "output_type": "execute_result" 467 | } 468 | ], 469 | "source": [ 470 | "A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))\n", 471 | "B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))\n", 472 | "A" 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "execution_count": 14, 478 | "metadata": {}, 479 | "outputs": [ 480 | { 481 | "data": { 482 | "text/html": [ 483 | "
\n", 484 | "\n", 497 | "\n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | "
BAC
0819
1894
2136
\n", 527 | "
" 528 | ], 529 | "text/plain": [ 530 | " B A C\n", 531 | "0 8 1 9\n", 532 | "1 8 9 4\n", 533 | "2 1 3 6" 534 | ] 535 | }, 536 | "execution_count": 14, 537 | "metadata": {}, 538 | "output_type": "execute_result" 539 | } 540 | ], 541 | "source": [ 542 | "B" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 15, 548 | "metadata": {}, 549 | "outputs": [ 550 | { 551 | "data": { 552 | "text/html": [ 553 | "
\n", 554 | "\n", 567 | "\n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | "
ABC
09.014.0NaN
126.011.0NaN
2NaNNaNNaN
\n", 597 | "
" 598 | ], 599 | "text/plain": [ 600 | " A B C\n", 601 | "0 9.0 14.0 NaN\n", 602 | "1 26.0 11.0 NaN\n", 603 | "2 NaN NaN NaN" 604 | ] 605 | }, 606 | "execution_count": 15, 607 | "metadata": {}, 608 | "output_type": "execute_result" 609 | } 610 | ], 611 | "source": [ 612 | "A + B" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "- Indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.\n", 620 | "- We can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries. \n", 621 | "- Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``)." 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 16, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/html": [ 632 | "
\n", 633 | "\n", 646 | "\n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | "
ABC
09.014.017.5
126.011.012.5
211.59.514.5
\n", 676 | "
" 677 | ], 678 | "text/plain": [ 679 | " A B C\n", 680 | "0 9.0 14.0 17.5\n", 681 | "1 26.0 11.0 12.5\n", 682 | "2 11.5 9.5 14.5" 683 | ] 684 | }, 685 | "execution_count": 16, 686 | "metadata": {}, 687 | "output_type": "execute_result" 688 | } 689 | ], 690 | "source": [ 691 | "fill = A.stack().mean()\n", 692 | "A.add(B, fill_value=fill)" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "\n", 700 | "| Python Operator | Pandas Method(s) |\n", 701 | "|-----------------|---------------------------------------|\n", 702 | "| ``+`` | ``add()`` |\n", 703 | "| ``-`` | ``sub()``, ``subtract()`` |\n", 704 | "| ``*`` | ``mul()``, ``multiply()`` |\n", 705 | "| ``/`` | ``truediv()``, ``div()``, ``divide()``|\n", 706 | "| ``//`` | ``floordiv()`` |\n", 707 | "| ``%`` | ``mod()`` |\n", 708 | "| ``**`` | ``pow()`` |\n" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "metadata": {}, 714 | "source": [ 715 | "### Ufuncs: Operations Between DataFrame and Series\n", 716 | "\n", 717 | "- Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array." 
718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 17, 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "data": { 727 | "text/plain": [ 728 | "array([[7, 2, 0, 3],\n", 729 | " [1, 7, 3, 1],\n", 730 | " [5, 5, 9, 3]])" 731 | ] 732 | }, 733 | "execution_count": 17, 734 | "metadata": {}, 735 | "output_type": "execute_result" 736 | } 737 | ], 738 | "source": [ 739 | "A = rng.randint(10, size=(3, 4))\n", 740 | "A" 741 | ] 742 | }, 743 | { 744 | "cell_type": "code", 745 | "execution_count": 18, 746 | "metadata": {}, 747 | "outputs": [ 748 | { 749 | "data": { 750 | "text/plain": [ 751 | "array([[ 0, 0, 0, 0],\n", 752 | " [-6, 5, 3, -2],\n", 753 | " [-2, 3, 9, 0]])" 754 | ] 755 | }, 756 | "execution_count": 18, 757 | "metadata": {}, 758 | "output_type": "execute_result" 759 | } 760 | ], 761 | "source": [ 762 | "A - A[0]" 763 | ] 764 | }, 765 | { 766 | "cell_type": "markdown", 767 | "metadata": {}, 768 | "source": [ 769 | "- According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise. Pandas has the same default convention." 770 | ] 771 | }, 772 | { 773 | "cell_type": "code", 774 | "execution_count": 19, 775 | "metadata": {}, 776 | "outputs": [ 777 | { 778 | "data": { 779 | "text/html": [ 780 | "
\n", 781 | "\n", 794 | "\n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | "
QRST
00000
1-653-2
2-2390
\n", 828 | "
" 829 | ], 830 | "text/plain": [ 831 | " Q R S T\n", 832 | "0 0 0 0 0\n", 833 | "1 -6 5 3 -2\n", 834 | "2 -2 3 9 0" 835 | ] 836 | }, 837 | "execution_count": 19, 838 | "metadata": {}, 839 | "output_type": "execute_result" 840 | } 841 | ], 842 | "source": [ 843 | "df = pd.DataFrame(A, columns=list('QRST'))\n", 844 | "df - df.iloc[0]" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "- Use the ``axis`` keyword to operate column-wise." 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": 20, 857 | "metadata": {}, 858 | "outputs": [ 859 | { 860 | "data": { 861 | "text/html": [ 862 | "
\n", 863 | "\n", 876 | "\n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | "
QRST
050-21
1-60-4-6
2004-2
\n", 910 | "
" 911 | ], 912 | "text/plain": [ 913 | " Q R S T\n", 914 | "0 5 0 -2 1\n", 915 | "1 -6 0 -4 -6\n", 916 | "2 0 0 4 -2" 917 | ] 918 | }, 919 | "execution_count": 20, 920 | "metadata": {}, 921 | "output_type": "execute_result" 922 | } 923 | ], 924 | "source": [ 925 | "df.subtract(df['R'], axis=0)" 926 | ] 927 | }, 928 | { 929 | "cell_type": "markdown", 930 | "metadata": {}, 931 | "source": [ 932 | "- These ``DataFrame``/``Series`` operations, like above, will automatically align indices between the two elements." 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 21, 938 | "metadata": {}, 939 | "outputs": [ 940 | { 941 | "data": { 942 | "text/plain": [ 943 | "Q 7\n", 944 | "S 0\n", 945 | "Name: 0, dtype: int64" 946 | ] 947 | }, 948 | "execution_count": 21, 949 | "metadata": {}, 950 | "output_type": "execute_result" 951 | } 952 | ], 953 | "source": [ 954 | "halfrow = df.iloc[0, ::2]\n", 955 | "halfrow" 956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": 22, 961 | "metadata": {}, 962 | "outputs": [ 963 | { 964 | "data": { 965 | "text/html": [ 966 | "
\n", 967 | "\n", 980 | "\n", 981 | " \n", 982 | " \n", 983 | " \n", 984 | " \n", 985 | " \n", 986 | " \n", 987 | " \n", 988 | " \n", 989 | " \n", 990 | " \n", 991 | " \n", 992 | " \n", 993 | " \n", 994 | " \n", 995 | " \n", 996 | " \n", 997 | " \n", 998 | " \n", 999 | " \n", 1000 | " \n", 1001 | " \n", 1002 | " \n", 1003 | " \n", 1004 | " \n", 1005 | " \n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | "
QRST
00.0NaN0.0NaN
1-6.0NaN3.0NaN
2-2.0NaN9.0NaN
\n", 1014 | "
" 1015 | ], 1016 | "text/plain": [ 1017 | " Q R S T\n", 1018 | "0 0.0 NaN 0.0 NaN\n", 1019 | "1 -6.0 NaN 3.0 NaN\n", 1020 | "2 -2.0 NaN 9.0 NaN" 1021 | ] 1022 | }, 1023 | "execution_count": 22, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "df - halfrow" 1030 | ] 1031 | } 1032 | ], 1033 | "metadata": { 1034 | "anaconda-cloud": {}, 1035 | "kernelspec": { 1036 | "display_name": "Python 3", 1037 | "language": "python", 1038 | "name": "python3" 1039 | }, 1040 | "language_info": { 1041 | "codemirror_mode": { 1042 | "name": "ipython", 1043 | "version": 3 1044 | }, 1045 | "file_extension": ".py", 1046 | "mimetype": "text/x-python", 1047 | "name": "python", 1048 | "nbconvert_exporter": "python", 1049 | "pygments_lexer": "ipython3", 1050 | "version": "3.6.3" 1051 | } 1052 | }, 1053 | "nbformat": 4, 1054 | "nbformat_minor": 1 1055 | } 1056 | -------------------------------------------------------------------------------- /Matplotlib-Errorbars.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Visualizing Errors" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Basic Errorbars\n", 15 | "\n", 16 | "- A basic errorbar can be created with a single Matplotlib function call:" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "collapsed": true 24 | }, 25 | "outputs": [], 26 | "source": [ 27 | "%matplotlib inline\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "plt.style.use('seaborn-whitegrid')\n", 30 | "import numpy as np" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAF7NJREFUeJzt3X+MFOUdx/HPcT8KK2YPD+8qprYk\nJnajoVepNqdirh5tk7axP8z9kGhNbWoaUlJCGsXa1CZYkqNtUqktZ4Uz5AB7t1iRP0ygWM7wx+Kl\n2wBBt0pbEXoocAJ76t7KwV7/aO+8g929m9mZnXlm3q+/YH/Mfp/b2c8++zzPzFSMjY2NCQDge7O8\nLgAAMDMENgAYgsAGAEMQ2ABgCAIbAAxBYAOAIarc2nAymXRr0wAQaIsXL857u2uBXexFp5NKpRSL\nxRyuxt9oczjQ5nAopc3FOrsMiQCAIWz1sEdGRrR69Wq99957+uijj7R8+XJ96Utfcro2AMAktgJ7\n7969uummm/SDH/xAg4ODevDBBwlsAHCZrcD+2te+NvHvd955Rw0NDY4VBADIr6RJx46ODr377rvq\n6upyqh4AQAEVpZ6tL5VK6eGHH9bOnTtVUVExcXsymVQkErG1zWw2q9mzZ5dSlnFoczjQ5nAopc2Z\nTMbZZX2HDx9WXV2drrnmGsViMV28eFFnzpxRXV3dlMfZXdbCMqBwoM3hQJutcXxZ39/+9jd1d3dL\nkoaGhpTJZDRv3jxbxQEAZsZWYHd0dOjMmTNatmyZHnroIf385z/XrFn+W9Ld3Nys5uZmr8sAAEfY\nGhKZPXu2fvOb3zhdCwCgCP91iwEAeRHYAGAIAhsADEFgA/A1Fg98jMAGAEMQ2ABgCAIbAAxBYAOA\nIQhsIISYyDMTgQ0AhiCwAcAQBDYAGILABgBDENgAYIhQBjYz5ABMFMrABgATEdgAyopfuPYR2ABg\nCAIbAAxBYAOAIQjsAGKMEAgmAhsADBHowE6n0zp27JgSicSMHk/PFPAOn7/pBTawE4mEDh06pLfe\nekstLS0zDm0A8KuSAnvdunVqb2/XPffco927dztVkyP6+/uVy+UkSefPn1d/f7+3BQFAiarsPnH/\n/v06cuSIent7dfbsWX3729/WV77yFSdrK0lzc7NmzZqlXC6nmpoafmoBMJ7twL7lllu0aNEiSVI0\nGtXIyIguXryoyspKx4orRVNTkxYtWqR0Oq2tW7eqqanJ65IAoCS2A7uyslKRSESSFI/Hdeedd/om\nrMdFo1FFo1FPwnq8R89QDACn2A7scXv27NH27dvV3d192X2pVMrWNrPZrO3nTpbJZPLWYfV2J1+7\nEKfabOe1veJkm03hlzaXcx+5tM1WP3+nTp3S8PCwnnvuOTU2NrpcrTPcep9LCux9+/apq6tLGzdu\n1JVXXnnZ/bFYzNZ2U6mU7edONv4L4NJtWb29mEI9aavbcqrNdl7bK0622RR+aXM595FL22zl85dI\nJPTGG28ol8vp+9//vl5++WUjhjdLeZ+TyWTB+2yvEnn//fe1bt06Pf3006qtrbW7GQAoiNVeU9nu\nYb/00ks6e/asVq5cOXFbZ2enFixY4EhhQcXYNqbDPvIxVntNZTuw29vb1d7e7mQtAMoknU4rnU4r\nkUj4eoiB1V5TBfZIRwD5mXYUcDQa1XXXXRf6sJYIbCB0GBc2F4EdIpxcB9LH48KSGBc2DIENhMz4\nuPDChQuNWSaH/yn5wBkA5vHyKGDYF8oettXzZAOAH4QusE2bIQeAcaELbGbIAZgqdIHNDDngLYYk\n7QtdYDNDDpRHvmWkDEmWJnSBLXHkFOAVhiRLE8rADjp+csKvGJIsDYEdMPzkhJ8xJFkaAnuSIPRM\n+ckJv2NI0j6jAtvNc2H4uWdqpd385EQprH7GnPxMBqHD5DajA
ttNQemZ8pMTJvJzh8lPCOz/C1LP\nlJ+cME1QOkxuI7D/j54p4J0gdZjcFOiz9Vn9luYMZoA3uBTYzNDDdgkTKHBa0C9AwVDe9AhsFzCB\nAsANgQhsL3se+XrSTKAAcEMgAtsrhXrSTKCgFAynoRACuwSFetLFVpzwYUQxDKehGAK7BMV60vkm\nUPgwYjoMp12uv7+fv8P/lRTYb775ppYuXaotW7Y4VY9RrK7d5sOI6TCchmJsB3Ymk9GaNWtCvwTH\nylIkrz+MDMf4HwdwoRjbgV1TU6NnnnlG9fX1TtYTaF5+GBmOMYfV9chBX5+Nj9k+0rGqqkpVVcWf\nnkqlbG07m83mfW4mk8m73UK3F+LUduxsq7q6WvPnz1dtbe2U+wq12U5d+R7f19c3ZTimr69PtbW1\nM9qeW4q1Oahm0mar+5Sd/XbDhg2ObKvY48fvu7TNTrbDr9zat109ND0Wi9l6XiqVyvvcSCSSd7uF\nbi/Eqe3Y2Vah2wu12U5d+R7f1tam9evXK5fLqaamRm1tbbbfH6cUa3NQzaTNTu1Tdjj1WZp83+zZ\ns6fcX452eK2UfTuZTBa8j1UiIcHYKGCd34abAn3yJ0zFya0A54wHeTlXe9kO7MOHD6uzs1ODg4Oq\nqqrSrl279Lvf/c7zcVGTebEDADCH7cC+6aab1NPT42QtAAIknU4rnU7rwIEDgRiX9oNQDonQgwXc\nNb6MNJfL6cEHH9RnPvMZhuIcwKQjEGBOTZpZ3c7ko3pHR0fpJDkklD1sAO4aP6o3l8upurp6StgT\n3vbRw0ZRflvWBDNMXkba3d3NcIhDCGwAthU7P834IfaNjY0eVBZMRgV2WE9eFNZ2O4VfCTNnZV/j\n/DTlZ0xgh3XnsNNuzh8MO6zua5wuuPyMCeyw7hymtZverLms7mteny44jIxZJTJ51tmtncOPYViO\ndgOS9X1tfGIxnU5r69atJU8s+vHz5zfG9LDDevKisLYb5WdnX7N67m6UxpgethTekxeFtd2YavxQ\n70Qi4dq+wL7mb8b0sIEwC+ukO6YKRGCHddkbE3zhYdrkM9xhfGDT84CprHzhsiIDkmFj2Pnk63n4\nYfyNHhCc5PSKDJjJ+MBm2RvCgglBGB/YXvc86EkDKBfjA1ui54FgsdoJKMdyP78qdFk9q5fbM+Xy\nfIEIbJP4fYeAWSZf2aWlpYWDqwLO+FUiQJix3C9cCGzAYCz3CxeGREKE3lfweD3pHnR+mx+ghw0Y\njhMwuWO6g/K8OMKawPaRsB5i7xQO1cd0rOwjxeYHvDrC2nZgr127Vu3t7ero6NChQ4ecrCmUDhw4\nwCH2gI8Umx/warLXVmAPDAzo7bffVm9vr5544gmtWbPG6bpCZ2BggNl+wEeKnR/cq8leW4GdSCS0\ndOlSSdL111+v4eFhffDBB44WFja33nors/1wnFPDbGEdris0P+DVhUVsBfbQ0JDmzZs38f+6ujqd\nPn3asaLCqLGx0ZdXlgnrBzUInBpn5YyY+Xkx2WtrWd/Y2Nhl/6+oqLjscalUyvK2H3jgAeVyOfX0\n9Fx2XyaTybvdQrebJJvNqrq6WvPnz1dtbe2UtnjV7gMHDujgwYMaGxvTXXfdpe7ubjU2NhZ9jpWa\nstmso7WbsH9MbrPVuqy2r6+vb8owW19fn2pray2/RrHtFGvDhg0bJDn/Pk9Xbzlun+4+t9psK7Ab\nGho0NDQ08f9Tp05p/vz5lz0uFotZ3nYkElEmk8n73Egkkne7hW53ktvnGkilUpbb53a7d+zYMfHl\nPDo6qqNHj+ree+8t+hwrNaVSKUdrL/Tao6OjSqfTOnfunOe/XCa32er7Z3U/aGtr0/r16yfOZNnW\n1jbta+XbVrHtzKQNTr/P09Vbjtunu6+UNieTyYL32RoSuf3227Vr1y5J0uuvv676+nrNnTvXVnHw\nr2ITK6YsoZvu57wp7bDDq
XFWLgTtH7Z62DfffLNuvPFGdXR0qKKiQo8//rjTdcEHgnAUnV8vcFEu\nTp3JkjNi+oPtQ9N/8pOfOFkHfMrqB9Vvh/I6fYELU07DiWDiXCJwjB9P9RmEXwnlZPWLiC+u8jIq\nsMO6c5jS7mLDD172TO38nKcnDbuam5uVyWQ0MDDg+LY5lwgcY/for0ITf0GeEATsILDhGFYTAO4i\nsA3mx6MQOdUn4B4C21AcLmw+P37hwt8IbENxLT+z2fnC7e/v530OOQLbUFzLz2x84cIOo5b1FRLG\nnZ31xWZz+oCesPLbgVpuo4dtMCb4zMWKmtIVG1ayOj9gynwCgQ14hC/c0hQaVrI6P2DSBD6BDcBI\nheZxrM4PmDSfQGADLuOITXcUGlayOiFv0gR+ICYdgTDzc4/QbfnOE2N1Qt6kCXwCG0DgWD3hlynn\n+2ZIBIFhykw/vGP6PkJgIxBMmumHN4KwjzAkgkAI+6XAMD07+0ix+QEv5g7oYSMQTJrphzeCsI/Q\nw/YRL2f7Tb/Citcz/ab//cLA633ECb7rYafTaZ04ccLI8SV4iyMHMR3T9xFfBfb4pMDg4KDvJgVM\nn11GfkF/Xzkla7D4KrD9eohoEGaXnRaEoON9nTmC3x98Fdh+nRTw6xdJOeT7oAYl6ML8vsJMtgN7\nYGBATU1N2rt3r2PFjE8KXHvttb465aRfv0i8EpSgs/O+BuGXBcxla5XIsWPH9Oyzz2rx4sVO16No\nNKrq6mrfhLUUjNllJ5l28v1CXyhW39fxXxa5XE4tLS2+6lQgHGz1sK+++mo99dRTmjt3rtP1+Jbp\ns8tOsnPyfb/2TAu9r/nqDcovC7jLzZVutgJ7zpw5qqysdLoWGKRQ0NkZ8y4U5l6FfKF6GRrDdNxe\n6TbtkEg8Hlc8Hp9y24oVK7RkyZJpN55KpSwXlMlklMvlbD3XTZlMRpK9Ns1ENpu1vG0na7K6LSuP\n7+vrm9Iz7evrU21trbLZrJ577jkdPHhQY2Njuuuuu9Td3a3GxkYdOHAg7+1OtqHQcwrVW1tbqxtu\nuEHDw8P61a9+pdra2onnFXvtyfdNfp/d3qf8ws6+PVOF/oZO3W5VoX3HKdMGdmtrq1pbW21tPBaL\nWX5OJBJRJpOx9Vw3RSIRSfbaNBOpVMrytp2syeq2rDy+ra1N69evnxjzbmtrUywWUyqV0tGjRzU2\nNiZJGh0d1dGjR3Xvvfdqx44deW93sg2FnlOoXkmqr69XfX39ZbUUe+3J901+n93ep/zCzr49U4X+\nhk7dblWxfWemkslkwft8tawPwVRszLvQMIOXww9cIBd2ub3SzdYqkf7+fm3atEn//ve/9dprr6mn\np0fd3d2OFoZgKXSC+EIrNbxemWPKCe3hP26udLMV2FyjDk4qFI6EJjAVZ+szGMvKzMb7B6sYwwYA\nQxDYAGAIAhsADMEYNuCydDqtdDqtRCLh6EEUKDwPENT5AQIbcNGlJ4zatGlT4A+UMZEpAc+QCOCi\nS08YNTAw4HFFMBmBDbjo0iM2b731Vo8rgskYEgFcdOkRm4xhoxT0sGEc04605VzqcAo9bEiaupLB\n1GAxZeIIsIseNgJzUV0v+fWKOpjK9Ku/E9jg0lcl4gsP5UJgg0tflYgvPJQLgQ1O2F8ivvBQLkw6\nQhLnni6F1xdbQHj4LrD7+/sDf1FSBA9feCgH3wW2XzEuGUy8rzAJgQ1bCDqg/AhswAK+qOAlVokA\ngCHoYaMs6JkiLNxcOEFgw7e8DHm+YOBHBDY8RzgCM8MYNgAYwlYP+8KFC3rsscd0/PhxXbhwQQ8/\n/LC+8IUvOF0bAGASW4H94osvas6cOdq2bZuOHDmiRx99VNu3b3e6NgDAJLYC++6779Y3vvENSdJV\nV12lc+fOOVoUAOBytgK7urp64t+bN2+eCO9L2V3aks1mQ3c+Ea/bnMlkJNl/z+yw22Yvap1
OsZom\n3+f1++wF2uycaQM7Ho8rHo9PuW3FihVasmSJtm7dqtdee01dXV15nxuLxWwVlUqlbD/XVF63ORKJ\nSLL/ntlht81e1DqdYjVNvs/r99kLtNmaZDJZ8L5pA7u1tVWtra2X3R6Px/XXv/5Vf/jDH6b0uAEA\n7rA1JHL8+HH96U9/0pYtW/SJT3zC6ZoAAHnYCux4PK5z587poYcemrht06ZNqqmpcawwAMBUtgJ7\n1apVWrVqldO1AIHEkZxwCkc6wjjpdFrHjh3j6uQIHQIbRkkkEjp06JDeeusttbS0ENoIFQIbRunv\n71cul5MknT9/nuEGhAqBDaM0Nzdr1qz/7bY1NTVqbm72tiCgjAhsGKWpqUmLFi3SwoUL9fLLL3OV\ncoQK58OGJLNWMkSjUUWjUV+FtUl/P5iLHjYAGILABgBDENgAYAgCGwAMQWADgCEIbAAwBIENAIYg\nsAHAEAQ2ABiCwAYAQxDYAGAIAhsADEFgA4AhCGwAMASBDQCGILABwBAENgAYgsAGAEPYukTYe++9\np0ceeUQfffSRRkdH9eijj+pzn/uc07UBACax1cPeuXOnvvnNb6qnp0erVq3Sk08+6XRdAIBL2Oph\nf+9735v49zvvvKOGhgbHCgIA5Gf7qumnT5/WD3/4Q3344YfavHmzkzUBAPKoGBsbGyv2gHg8rng8\nPuW2FStWaMmSJZKkV155RZs3b1Z3d/eUxySTSUUiEVtFZbNZzZ4929ZzTUWbw4E2h0Mpbc5kMlq8\neHHe+6YN7HwGBgZ0ww03KBqNSpK++MUv6tVXX53ymGQyWfBFp5NKpRSLxWw911S0ORxocziU0uZi\n2Wlr0nH37t164YUXJElvvPGGrrnmGluFAQBmztYY9vLly7V69Wr95S9/0fnz5/WLX/zC4bIAAJey\nFdhXXXWV/vjHPzpdCwCgCI50BABDENgAYAgCGwAMQWADgCEIbAAwhK0DZ2YimUy6sVkACDxHj3QE\nAJQfQyIAYAgCGwAM4bvAXrt2rdrb29XR0aFDhw55XU5ZrFu3Tu3t7brnnnu0e/dur8spm2w2q5aW\nFv35z3/2upSy2Llzp+6++2595zvf0SuvvOJ1Oa778MMP9aMf/Uj333+/Ojo6tG/fPq9Lcs2bb76p\npUuXasuWLZL+d52A+++/X8uWLdOPf/xjnT9/3pHX8VVgDwwM6O2331Zvb6+eeOIJrVmzxuuSXLd/\n/34dOXJEvb292rhxo9auXet1SWWzYcMG1dbWel1GWZw9e1a///3vtW3bNnV1dWnPnj1el+S6F154\nQQsXLlRPT4+efPJJ/fKXv/S6JFdkMhmtWbNGTU1NE7etX79ey5Yt07Zt23Tttddq+/btjryWrwI7\nkUho6dKlkqTrr79ew8PD+uCDDzyuyl233HLLxCXWotGoRkZGdPHiRY+rct+//vUv/fOf/1Rzc7PX\npZRFIpFQU1OT5s6dq/r6+lB0RubNm6dz585JkoaHhzVv3jyPK3JHTU2NnnnmGdXX10/c9uqrr6ql\npUWS1NLSokQi4chr+Sqwh4aGprypdXV1On36tIcVua+ysnLiQg/xeFx33nmnKisrPa7KfZ2dnVq9\nerXXZZTNf/7zH42NjWnlypVatmyZYx9gP/v617+uEydO6Mtf/rLuu+8+PfLII16X5IqqqqrLLlYw\nMjKimpoaSdLVV1/tWI7ZvkSYGy5dYTg2NqaKigqPqimvPXv2aPv27ZdduSeIduzYocbGRn3qU5/y\nupSyOnnypJ566imdOHFC3/3ud7V3795A798vvviiFixYoE2bNukf//iHHnvsMT3//PNel1UWk99X\nJ1dO+yqwGxoaNDQ0NPH/U6dOaf78+R5WVB779u1TV1eXNm7cqCuvvNLrclzX39+v48ePq7+/X+++\n+65qamr0yU9+UrfddpvXpbmmrq5On//851VVVaXrrrt
OV1xxhc6cOaO6ujqvS3PN3//+d91xxx2S\npM9+9rM6efKkLly4oKoqX8WOK+bMmTNxmbCTJ09OGS4pha+GRG6//Xbt2rVLkvT666+rvr5ec+fO\n9bgqd73//vtat26dnn766dBMwP32t7/V888/r76+PrW2tmr58uWBDmtJuuOOO7R//37lcjmdOXNG\nmUwmsGO64z796U/r4MGDkqTBwUFdccUVoQhrSbrtttsmsmz37t0T18Atla/+ejfffLNuvPFGdXR0\nqKKiQo8//rjXJbnupZde0tmzZ7Vy5cqJ2zo7O7VgwQIPq4LTGhoa9NWvflUPPPCARkZG9LOf/Uyz\nZvmqv+S49vZ2/fSnP9V9992nCxcuBPbKVIcPH1ZnZ6cGBwdVVVWlXbt26de//rVWr16t3t5eLViw\nQN/61rcceS0OTQcAQwT7Kx4AAoTABgBDENgAYAgCGwAMQWADgCEIbAAwBIENAIYgsAHAEP8FHNJC\no0qnIQoAAAAASUVORK5CYII=\n", 41 | "text/plain": [ 42 | "" 43 | ] 44 | }, 45 | "metadata": {}, 46 | "output_type": "display_data" 47 | } 48 | ], 49 | "source": [ 50 | "x = np.linspace(0, 10, 50)\n", 51 | "dy = 0.8\n", 52 | "y = np.sin(x) + dy * np.random.randn(50)\n", 53 | "\n", 54 | "plt.errorbar(x, y, yerr=dy, fmt='.k');" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "- ``fmt`` is line-and-point format code. It uses the same syntax as the shorthand used in ``plt.plot``, outlined in [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) and [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb).\n", 62 | "\n", 63 | "- The ``errorbar`` function has many options to fine-tune the outputs. 
Example (good for crowded plots): to make the errorbars lighter than the points themselves:" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAG0lJREFUeJzt3X9sE+cZB/Cv4ywzKRWkgYRkGlCp\nUmcFrVuzaqKUigpGtS5qt1ZRMkI2rdLYxIYKaGvTpaKI1EywrV3WdqOUsiEwWmbargghkdFBhSYo\nmicNmWYrmzRgkKQJAdHWcUsS748pbn7cnX3n9+7e997vR0IK59h+Ln7vudfv8957oWw2mwUREUmv\nxO8AiIioMEzYRESKYMImIlIEEzYRkSKYsImIFMGETUSkiFK3XjiZTLr10kREgVZfX2+43bWEbfWm\n+fT09CAajQqORm7cZz1wn/VQzD5bdXY5JEJEpAhHPezh4WG0tbXhypUr+Oijj7B27Vrcd999omMj\nIqIJHCXsY8eOYdGiRfjud7+LS5cu4dFHH2XCJiJymaOE/cADD+R+7u3tRXV1tbCAiIjIWFFFx+bm\nZvT19WHHjh2i4iEiIhOhYlfr6+npweOPP46DBw8iFArltieTSZSXlzt6zUwmg0gkUkxYyuE+64H7\nrIdi9jmdToud1pdKpVBZWYmamhpEo1GMjo5iaGgIlZWVk37P6bQWTgPSA/dZD9xne4RP6/vrX/+K\n3bt3AwAGBweRTqdRUVHhKDgiIiqMox52c3Mz2tvbsWrVKmQyGWzatAklJfJN6e7v78/9zMIoEanO\nUcKORCL4xS9+IToW4QYGBnI/M2ETkerk6xYTEZEhJmwiIkUwYRMRKcLV1fqIiIrFyQOfYMImIqlx\n8sAnOCRCRKQIJmwiIkUwYRMRKYJj2EQaYiFPTUzYRBpiIU9NHBIhIlIEEzYRkSKYsImIFMGETUSk\nCC2LjqyQE5GKtEzYrJATkYq0TNhE5B9+w3WOCZuIPMVvuM6x6EhEpAgmbCIiRXBIJIA4RkgUTIHt\nYcfjcaxcuRKf//znsXLlSsTjcb9D8szAwEDuHxEFRyB72PF4HGvWrEE6nQYA9Pb2Ys2aNQCAlpYW\n0+exZ0rkHx5/+QWyh93e3p5L1uPS6TTa29stn8eeKZF/ePzlV1QPe/v27UgmkxgZGcH3vvc9rFy5\nUlRcRblw4YKt7UREKnCcsE+dOoVz586hq6sLV69exTe+8Q1pEvb8+fNx/vx5w+1ERKpyPCRy1113\nobOzEwAwa9YsDA8PY3R0VFhgxYjFYigvL5+0rby8HLFYzKeIiIiK5zhhh8PhXFJMJBK49957EQ6H\nhQVWjJaWFuzcuRM1NTUIhUKoqanBzp07LQuOovX39+f+EZEzOs/2MhLKZrPZYl7g6NGjeOmll7B7\n927cfPPNue3JZHJaL7dQmUwGkUikmLAAYFKPf+LJxO52ke9tRtQ+O3lvv4jcZ1XIss9etpGp+1zo\n8Xfo0CFs2rQJmUwmtz0SiWDLli1oaGhwNeZiFfM5p9Np1NfXGz5WVMI+ceIEOjs7sWvXLsyePXvS\nY8lk0vRN8+np6UE0GnUaVk4qlcr9vGjRIsfbrZhNRbL7WqL22cl7+0XkPqtCln32so1M3edCj7+F\nCxca1qIWLFiA//znP+4EK0gxn7NV7nRcdHz/
/fexfft2/O53v5uWrHXChWyI3MHZXtM5HsM+fPgw\nrl69ivXr16O1tRWtra24fPmyyNgCiWPblI/bbUSVcWGzWV06z/Zy3MNuampCU1OTyFi0wB455eNm\nG3F6FbAfYrHYpFgBzvYK5JWORGTM6VXAfpBhtpdsArmWCBEZU21cuKWlBXfccUfu/zIX0b3AHrZG\nOH5OHBdWGxO2Rri4DvEqYLVpl7BVqZATuYHjwmrTagxbpQo5kVs4LqwurXrYKlXIiYim0iphq1Yh\nJwoaDkkWR6uEzQo5kX/GhyR7e3uRzWZzQ5JM2oXTKmGzQk7knfEppGNjYwA4JCmCVglbhwo5v3KS\nLMankI4vCMohyeJpNUsECHaFnLNgSGa8dV/xtOphWwlCz5RfOUlmHJIsHhM2glMM4VdOkpnVkGQQ\nOkxeUGpIxOzuLsWy6pnKMJRQ6H7zKycVy+4xZvf3jYYkOZRXOKV62G6thSF7z7TQ/eZXTiqW3WNM\nxDHJobzCKZWw3RKU+dk6zIKh4JG9wyQTpYZE3BKkO1sEeRYMBROH8goX6B723Llzc/+siO6ZsoBC\nVDgO5RUu0D1sO4VJUT1TFlDILW4V3f02flz8+Mc/Rl9fH+bNm4ef/exnPF4MBDphuy0ej09rZLLP\nOCF1BfkGzhzKK0wghkT8uPWV2dxto7E4gAUUKgyH08hKIBK2H7e+MutJh8Nhw9+fP38+D0ayFJQL\nuMg9gUjYfjDrMY+OjhoWUB544AEejGSJ85EpHyZsh8ymHC1YsMBwxsnhw4d5MJIlzkc2VuhsLx0U\nlbDfffddrFixAvv27RMVjzKspiK1tLSgu7sbZ86cQXd3N1paWnw/GDkcI7+gXMAlWnV1de6f7hwn\n7HQ6jY6ODixevFhkPMqwO3fbz4ORY6Nq4Hxkysdxwi4rK8PLL7+MqqoqkfEoxagnbcbPg5Fjo2pw\negGXH7OkyB+h7PjtIBx6/vnnUVFRgdWrV0/ankwmpyWoQmUyGUQikWnbR0dHcz9PnI1htt2MqNex\n+1qHDh3Cc889l5u3vWHDBjQ0NAAw32cncU39/bq6Ohh9zKFQCGfPns37em6x2uegKmSf7bZPJ+12\n/LZdAFBS8km/TdSx5CReJ/shq2LadjqdRn19veFjrl44E41GHT2vp6fH8LmpVMrwtc22mxH1OnZf\nKxqN4qtf/Wpu+8SLA8z22UlcU3/faq0Gp5+RCFb7HFSF7LPd9umk3dp9bye/7+d++K2Ytp1MJk0f\n4ywRDXBslMgZ2YabeGm6BrhWA5EzVssB+LG2i+OEnUqlsG3bNly6dAmlpaU4cuQInn/+ecyePVtk\nfFpxswFwrQbyktE6O0HrIPixtovjhL1o0SLs3btXZCzaC/LiPqQPrljpHi3HsHnlFJF7OI3UPVqO\nYbP3SroQNcxm53X8vqo3yLRM2ES6EDXMZud18t3yi99sndNySIQKJ9u0JpKH2fo0+aaRcm0Q55RJ\n2LouXuT3fvux1jj5w05bs1qfRvQ9UukTSgyJ6Fp11nW/RQvqvRBFstvW8t0Kj9NI3aFED1vXqrPT\n/eYsmMn4LSE/u22NhUV/KNHD1rVxON1vP3uR7M2qyW5by1dYJHco0cP2ai1p2XqmKi5oz96smuy2\nNTfWp5Ht+JOREgnbq8WLZKtec9EmGud28dluW3OjsCjb8ScjJYZEdF28SNf9psm8KD47aWssLHpP\niYQNmDeOoC8yw4OC8s3IEIVtTX7KJGwjuk97Y4FPD7oW3Wk6Jcawzeg63W8cC3xqK/QqUhWLz+QO\npRO2zD0PVrwpn0JPuCw+0zilE7bMPQ9WvEkUXupN45Qew47FYpPGsAHvex7sQZMXdC2602RKJ2wZ\npr2xB02iFdoJ0L3oDpgX3u0W5FUp4CudsAH1piKxR075FJowvJruJzOzdbrtrgOuyu35lE/YqpG5\nMZBaZC66
kzuULjoS6UzmonsQ+L0WvREmbI1wqmGwcLqfe6xu0DD+uB/JnEMikjh06BCee+45V4un\nHI4JFhmK7kGV76I8v4q9TNgSiMfj2LRpEzKZDAA9q/0iqFLpF0m1orvfCm0jVvUBP4u9jodEtm7d\niqamJjQ3N+PMmTMiY9JOe3t7LlmP0+kSe1F4qT7lU2gbsaoP+FnsdZSwT58+jfPnz6OrqwvPPPMM\nOjo6RMelFVb7yQ2ixlllLL65zao+4Gex11HCPnnyJFasWAEAuO2223D9+nV88MEHQgPTiazVfh0P\n1KDIVzTz+nVUY7UcgJ/FXkcJe3BwEBUVFbn/V1ZW8mtoEWKxGCKRyKRtflf7dT1Qg0LUSpY6r4jZ\n0tKC7u5unDlzBt3d3bnxaT/Xdglls9ms3Sc99dRTWLZsWa6X/c1vfhM//elPsXDhwtzvJJPJaWeh\nQoyNjWFkZASlpaUoKZl8PhkdHc39HA6H825XyWuvvYYXX3wxV+3fsGEDGhoaAPiz38uXL0dvb++0\n7TU1NXjzzTdNn2cnpkwmM+1EVQyj9546+2bi39UPE/fZ7udnpx3U1dXB6NAOhUI4e/Zswe+R73Ws\n9mFsbAwAMDIygrKyMtP3LIbdY0PU9nyPFdO20+k06uvrDR9zNEukuroag4ODuf+/9957mDNnzrTf\ni0ajtl87lUohHA4jm81Oe34qlTJ8bbPtIrk9A+Hhhx/GQw89lPv/xGq/H/vd19dnut3qvezE1NPT\nIzTuqe8dj8exefPmSdOvNm/ejNraWt9m30zcZ7ufn512YHVXczufX77XKWQfRH/OE9k9NkRtz/dY\nMfucTCZNH3M0JLJkyRIcOXIEAPDOO++gqqoKM2fOdBScKnSbgZBvXL3Qxff9VMjXeRX2wwlR46y8\nOEcujhL2nXfeibq6OjQ3N6OjowNPP/206LjIZ/kOVKMTmGxFykJm3wT1RCxqnJVrccvF8YUzP/rR\nj0TGQZKxexWdjEt9Wn2dd0qli3NEXVTDi3PkwSsdyZSdA1XGpT7duMGFKstwOmV3nRmuS+MtpRK2\nro1Dhf3ON/zgR8+0mLU2VOpJi2R3X3X62xSqv78fY2Nj6O/vF/73USph69o4VNjvfMMPVj1TUXcN\nMeL063zQe9LknvG2MzAwILztcHlVRclW4CtmNoFZ4S+oBUEip5iwFSTjVYicTWCPbCdcUoNSQyL0\nfzIW+ADOJiiUjDNqSA3sYSuIq/upzen6HLxjEDFhK0jW1f2oME5PuNXV1bl/pOewUiAStm49D14u\nrDaecItnVcexm8hVSvyBGMPWrcfBe/mpzY0LenRjNqz02GOPYXh4uOD6gGr1hED0sHVktlYvyY8z\naopnNnx05coVW/UB1db7DkQPm0g1nFFTHLMLtczYrRvIWsBnD5vIZUFdwtVPZnWcyspKw9+3WzeQ\ntZ7AhE3kMrev2NSt6A6YDyt1dnbaKsirVsDnkAgFQjwe17YIq1vRfZzVsJJRW7BqI6q0HSZsUp5q\nlX5yl1Eiz9dGVKkncEiElKdapZ+8F5Q2wh42KU+1Sj95z2kbsaoL+FEzYMIm5blxKzAKFqdtxKo+\n4EftQKohEZUuEXWDn9V+laeeyVDpV/nvpwMZ2ogI0iRsGdd4nhibFycSPxf3UflmAU6vHBT5uar8\n99NBUK4ulWZIRNY1njkDYToZp9DZrfTr8rnqNDc7H5Vmg5iRpocta+EoKNVlUWT+JmSHLp+rqG9t\nOl6cIyNpErasl4jKeiLxytQDNSiJzsnnqnONhWtxy8Fxwj59+jQWL16MY8eOCQlE1qKArCcSr0w9\nUINyArP7uQblmwWpzVHCvnDhAn7729+ivr5eWCCyFgVkPZH4xckJzO+eqdHXeavP1SjeoHyzIHe5\n3dYdJey5c+fihRdewMyZM4UGI+Maz7KeSPxi9wSWr2dq1sBFNnyjr/Nmny
sAw3jNlvJU7ZsFuceL\nb2GOZonMmDFDWAAqCEJ1WZR8i+VMLUpZ9Ux/8IMfYPPmzdNmavzlL3/Bnj17XJ/BYfS5Lly40DDe\ncDiM0dHRaa+hy9AY5efFTLdQNpvNWv1CIpFAIpGYtG3dunVYunQp2tracP/99+O+++6b9rxkMjmt\nJ1aIiQdFOBy2/Xy3uB1XJpNBJBKx9RyRMdl9rUJ/v66uDkZNLBQKYd68eejt7Z32WElJCcbGxqZt\nr6mpwZtvvll0TFbPMYsXACKRCDKZzKT/b9myBQ0NDZbvPfGxGzdu5D5nWdu6aE7adqHM/oaittth\n1dbPnj1b8Ouk02nT4ea8PezGxkY0NjYW/GYTRaNR289JpVJFPd8tbsfV09Nj+3VFxmT3tQr9fatL\ngs2GE4ySNQD09fVZvpeTv8fU55jFu2DBAsRiMdNvFlbvPfGxSCSSe1zWti6ak7ZdKLO/oajtdli1\ndTuvmUwmTR+TZlofBZPVmPe8efMMn2PWw/Fi+MEqXhlrLCQPLyYoOErYx48fR2trK06cOIFnn30W\njz76qLCAKFisirYbNmwwbOBr1qzxbWYOi8zklBdtx1HRcdmyZVi2bJmwICjYzIq2DQ0NqK2tNRxm\nWLJkiW+Xv7PITE653XakWUuE7AvCZcJmDVyHpBmEz4+8xYStMF4mrDZ+fmQXi45ERIpgwiZy0dQr\nNg8dOuR3SKQwDokQucRoze1NmzahtraWs04EMasDBLU+wIRN5BKjS5UzmYzvN+UIErM6gN36gCoJ\nngmbyCVBWYpWB6oUgDmGTeQS3ddSJ/GYsIlcYnSpciQS0XYtdSoeEzb5foMBu/r7+3P/ZGZ0qfKW\nLVs4fk2OcQxbcyrePXxgYCD388SxR78KR1Z3kZ96xWaQl1El9zFha86LRde94kfhSMUTns5UmQ1i\nhkMimuNMhuLwXo9qUf3u70zYmuNMhuLwhEdeYsLWHO8KXxye8MhLTNia44L9xeEJj7wkXdFx7ty5\nGBwcxJw5c/wORRs6rD3tlnx3kScSSbqEXV1djaGhIemKAqpXl8mYiM+VJzzyinQJW1aynUD8FpQT\nGD9XUgkTNjmia6ILyomK1MSETWSDricqkgMTNnmCPVPShZsTJ5iwyRNOeqZ+JnmeYMgpNydOMGGT\n78ySo5/DDxz6IBkxYZPvmByJCuMoYY+MjKC9vR0XL17EyMgIHn/8cXzpS18SHRsREU3gKGG/8cYb\nmDFjBvbv349z587hySefxIEDB0THRkREEzhK2A8++CAaGhoAALfccguuXbsmNCgiIpoulM1ms8W8\nwLPPPouSkhKsX79+0vZkMjltUZxCZTIZRCKRYsJSjt/7PDo6mvvZq7uiON1nP2LNxyqmiY/duHGD\nbVsDxexzOp1GfX294WN5e9iJRAKJRGLStnXr1mHp0qWIx+M4e/YsduzYYfjcaDTqIFygp6fH8XNV\n5fc+p1Kp3M9exeF0n/2INR+rmCY+FolEpInZK363bT8Us8/JZNL0sbwJu7GxEY2NjdO2JxIJ/PnP\nf8avf/1rfOpTn3IUGBERFc7RGPbFixfx+9//Hvv27cOnP/1p0TEREZEBRwk7kUjg2rVruZuNAsAr\nr7yCsrIyYYEREdFkjhL2xo0bsXHjRtGxEOUVj8eVu1nAxCs5h4aGfIyEVMcrHUkZ8Xgca9asyd2l\nvLe3N/ctT+akPfFKTiZsKgbv6UjKaG9vzyXrcel0Gu3t7T5FROQtJmxSxoULF2xtJwoaJmxSxvz5\n821tJwoaJmxSRiwWm3b1bHl5OWKxmE8REXmLRUcCoMaC/eOFRRlniajw9yP1MWETAHXWpG5pacEd\nd9yR+/+iRYt8jOYTqvz9SG0cEiEiUgQTNhGRIpiwiYgUwYRNRKQIJmwiIkUwYRMRKYIJm4hIEUzY\nRESKYMImIlIEEzYRkSKYsImIFMGETU
SkCCZsIiJFMGETESmCCZuISBFM2EREimDCJiJShKM7zly5\ncgVPPPEEPvroI9y4cQNPPvnkpLuAEBGReI562AcPHsRDDz2EvXv3YuPGjejs7BQdFxERTeGoh/2d\n73wn93Nvby/vZ0dE5AHHN+EdGBjA97//fXz44YfYs2ePyJiIiMhAKJvNZq1+IZFIIJFITNq2bt06\nLF26FADw1ltvYc+ePdi9e/ek30kmkygvL3cUVCaTQSQScfRcVXGfCzc2Npb7uaRErbo5P2c9FLPP\n6XQa9fX1ho/lTdhGTp8+jdtvvx2zZs0CAHz5y1/G22+/Pel3ksmk6Zvm09PTg2g06ui5quI+64H7\nrIdi9tkqdzrqnnR3d+P1118HAPzzn/9ETU2No8CIiKhwjsaw165di7a2NvzpT3/Cxx9/jM2bNwsO\ni4iIpnKUsG+55Rbs3LlTdCxERGRBrYoNEZHGmLCJiBTBhE1EpAgmbCIiRTBhExEpwtGFM4VIJpNu\nvCwRUeAJvdKRiIi8xyERIiJFMGETESlCuoS9detWNDU1obm5GWfOnPE7HE9s374dTU1NeOSRR9Dd\n3e13OJ7JZDJYvnw5XnvtNb9D8cTBgwfx4IMP4uGHH8Zbb73ldziu+/DDD/HDH/4Qra2taG5uxokT\nJ/wOyTXvvvsuVqxYgX379gH4/30CWltbsWrVKjz22GP4+OOPhbyPVAn79OnTOH/+PLq6uvDMM8+g\no6PD75Bcd+rUKZw7dw5dXV3YtWsXtm7d6ndInvnNb36D2bNn+x2GJ65evYoXX3wR+/fvx44dO3D0\n6FG/Q3Ld66+/jltvvRV79+5FZ2cnYrGY3yG5Ip1Oo6OjA4sXL85t+9WvfoVVq1Zh//79+MxnPoMD\nBw4IeS+pEvbJkyexYsUKAMBtt92G69ev44MPPvA5KnfddddduVuszZo1C8PDwxgdHfU5Kvf9+9//\nxr/+9S8sW7bM71A8cfLkSSxevBgzZ85EVVWVFp2RiooKXLt2DQBw/fp1VFRU+ByRO8rKyvDyyy+j\nqqoqt+3tt9/G8uXLAQDLly/HyZMnhbyXVAl7cHBw0odaWVmJgYEBHyNyXzgczt3oIZFI4N5770U4\nHPY5Kvdt27YNbW1tfofhmf/+97/IZrNYv349Vq1aJewAltnXvvY1XL58GV/5ylewevVqPPHEE36H\n5IrS0tJpNysYHh5GWVkZAGDu3LnC8pjjW4S5YeoMw2w2i1Ao5FM03jp69CgOHDgw7c49QfTHP/4R\nX/jCF/DZz37W71A81d/fjxdeeAGXL1/Gt771LRw7dizQ7fuNN95AbW0tXnnlFfzjH/9Ae3s7Xn31\nVb/D8sTEz1XkzGmpEnZ1dTUGBwdz/3/vvfcwZ84cHyPyxokTJ7Bjxw7s2rULN998s9/huO748eO4\nePEijh8/jr6+PpSVlWHevHm4++67/Q7NNZWVlfjiF7+I0tJSzJ8/HzfddBOGhoZQWVnpd2iu+dvf\n/oZ77rkHAPC5z30O/f39GBkZQWmpVGnHFTNmzMjdJqy/v3/ScEkxpBoSWbJkCY4cOQIAeOedd1BV\nVYWZM2f6HJW73n//fWzfvh0vvfSSNgW4X/7yl3j11Vfxhz/8AY2NjVi7dm2gkzUA3HPPPTh16hTG\nxsYwNDSEdDod2DHdcQsWLMDf//53AMClS5dw0003aZGsAeDuu+/O5bLu7u7cPXCLJdVf784770Rd\nXR2am5sRCoXw9NNP+x2S6w4fPoyrV69i/fr1uW3btm1DbW2tj1GRaNXV1bj//vvx7W9/G8PDw3jq\nqaeUu4GwXU1NTfjJT36C1atXY2RkJLB3pkqlUti2bRsuXbqE0tJSHDlyBD//+c/R1taGrq4u1NbW\n4utf/7qQ9+Kl6UREigj2KZ6IKECYsImIFMGETUSkCCZsIiJFMGETESmCCZuISBFM2EREimDCJiJS\nxP
8AGkE7uHLup/wAAAAASUVORK5CYII=\n", 74 | "text/plain": [ 75 | "" 76 | ] 77 | }, 78 | "metadata": {}, 79 | "output_type": "display_data" 80 | } 81 | ], 82 | "source": [ 83 | "plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0);" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "- You can also specify horizontal errorbars (``xerr``), one-sided errorbars, and many other variants." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Continuous Errors\n", 98 | "\n", 99 | "- Sometimes you want to show errorbars on continuous quantities. You can combine ``plt.plot`` and ``plt.fill_between`` to do this.\n", 100 | "- Example: a simple *Gaussian process regression* using the Scikit-Learn API. This fits a non-parametric function to the data and yields a continuous measure of the uncertainty.\n", 101 | "\n", 102 | "- Note: the old ``GaussianProcess`` class was deprecated in sklearn v0.18; the cell below uses its replacement, ``GaussianProcessRegressor``." 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 6, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "from sklearn.gaussian_process import GaussianProcessRegressor  # replaces GaussianProcess, deprecated in sklearn v0.18\n", 112 | "from sklearn.gaussian_process.kernels import RBF\n", 113 | "\n", 114 | "# define the model and draw some data\n", 115 | "model = lambda x: x * np.sin(x)\n", 116 | "xdata = np.array([1, 3, 5, 6, 8])\n", 117 | "ydata = model(xdata)\n", 118 | "\n", 119 | "# Compute the Gaussian process fit\n", 120 | "gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), n_restarts_optimizer=10)\n", 121 | "gp.fit(xdata[:, np.newaxis], ydata)\n", 122 | "\n", 123 | "xfit = np.linspace(0, 10, 1000)\n", 124 | "yfit, ystd = gp.predict(xfit[:, np.newaxis], return_std=True)\n", 125 | "dyfit = 2 * ystd  # 2*sigma ~ 95% confidence region" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "- We now have ``xfit``, ``yfit``, and ``dyfit``, which sample the continuous fit to our data.\n", 133 | "- We could pass these to 
the ``plt.errorbar`` function but we don't really want to plot 1,000 points with 1,000 errorbars.\n", 134 | "- Instead, we can use the ``plt.fill_between`` function with a light color to visualize this continuous error:" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 7, 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "data": { 144 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAWwAAAD1CAYAAAB0gc+GAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAEc9JREFUeJzt3V1sFGX/xvFrba21rUKpWsDQlvgS\nGgxBKwdgQY0vJA/GpCRNG5ZYRYUEYqyCUmlV8hCMbTyAYCMgGjWWgCuK5AmCIYLxoJWwGE3DomIM\nxaKV0kKR7QqFeQ76t38L9G12+sz+4PtJCDv37MxeB8219947uxtwHMcRACDhXeV3AADA4FDYAGAE\nhQ0ARlDYAGAEhQ0ARlDYAGBE8nCdOBwOD9epAeCyVlBQcMnxYSvs/h50IJFIRPn5+R6nGT6W8lrK\nKtnKaymrZCuvpaxSfHn7m+yyJAIARlDYAGAEhQ0ARrhewz59+rSWLl2qkydP6uzZs1q0aJGmT5/u\nZTYAwD+4LuxPP/1U48eP1+LFi9XS0qKysjLt2LHDy2wAgH9wvSSSmZmpEydOSJI6OjqUmZnpWSgA\nMKmuTsrL04SJE6W8vO5tDwXi+XrVJ598Uk1NTero6NC6des0efLknn3hcFhpaWmuzhuLxZSamuo2\n1v+cpbyWskq28lrKKtnKayHr9f/5j8a88oquisV6xs6npuq3f/9bHY88MujzRKPRvi+JdlzaunWr\nU1VV5TiO40QiEWf27Nm99u/bt8/tqZ0DBw64PtYPlvJayuo4tvJayuo4tvKayJqb6zjSxf9yc4d0\nmv660/WSyP79+1VYWChJmjBhglpaWtTV1eX2dABgW1PT0MZdcF3Yubm5+u677yRJzc3NSk9PV3Ly\nsH5wEgASV07O0MZdcF3YJSUlam5u1ty5c7V48WItX77cs1AAYM7KldKF79ulpXWPe8T1lDg9PV2r\nV6/2LAgAmBYMdv9fWSmnqUmBnJzusv573AOsYQCAV4JBKRjUwWH6sio+mg4ARlDYAGAEhQ0ARlDY\nAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARlDYAGAE\nhQ0ARlDYAGAEhQ0ARlDYAGAEhQ0ARsRV2Nu2bdOjjz6q2bNn66uvvvIqEwDgElwXdnt7u2pra7Vx\n40atXbtWu3bt8jIXAOACyW4PrK+v19SpU5WRkaGMjAytWLHCy1wAgAu4nmH/+uuvchxH5eXlmjNn\njurr673MBQC4QMBxHMfNgevXr9f+/fv15ptv6ujRo3rssce0e/duBQIBSVI4HFZaWpqrULFYTKmp\nqa6O9YOlvJaySrbyWsoq2cprKasUX95oNKqCgoJL7nO9JJKVlaU777xTycnJysnJUXp6utra2pSV\nldVzn/z8fFfnjkQiro/1g6W8lrJKtvJayirZymspqxRf3nA43Oc+10sihYWFamho0Pnz59XW1qZo\nNKrMzEy3pwMADMD1DDs7O1szZ85UWVmZOjs7VVVVpauu4rJuABgurgtbkkpLS1VaWupVFgBAP5gS\nA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4AR\nFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGEFhA4ARFDYAGBFXYcdiMT3wwAP65JNPvMoD\nAOhDXIX91ltvaeTIkV5lAQD0w3Vh//zzzzp06JDuu+8+D+MAAPriurCrq6tVUVHhZRYAQD+S3Ry0\ndetWTZ48WePGjev3fpFIxFWoWCzm+lg/WMprKatkK6+lrJKtvJaySsOX11Vh79mzR0eOHNGePXv0\n+++/KyUlRaNHj9a0adN63S8/P99VqEgk4vpYP1jKaymrZCuvp
aySrbyWskrx5Q2Hw33uc1XYq1at\n6rm9Zs0a3XzzzReVNQDAW1yHDQBGuJph/9MzzzzjRQ4AwACYYQOAERQ2ABhBYQOAERQ2ABhBYQOA\nERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2ABhBYQOAERQ2\nABhBYQOAERQ2ABhBYQOAERQ2ABiRHM/BNTU1CofD6urq0oIFC/Twww97lQsAcAHXhd3Q0KCffvpJ\nmzdvVnt7u4qKiihsABhGrgt7ypQpmjRpkiRpxIgR6uzs1Llz55SUlORZOADA/3O9hp2UlKS0tDRJ\nUigU0owZMyhrABhGAcdxnHhOsGvXLq1bt07vvvuurrvuup7xcDjcU+hDFYvFlJqaGk+s/ylLeS1l\nlWzltZRVspXXUlYpvrzRaFQFBQWX3BfXm45ff/211q5dqw0bNvQq67/l5+e7Om8kEnF9rB8s5bWU\nVbKV11JWyVZeS1ml+PKGw+E+97ku7FOnTqmmpkbvvfeeRo4c6fY0AIBBcl3Y27dvV3t7u8rLy3vG\nqqurNXbsWE+CAQB6c13YJSUlKikp8TILAKAffNIRAIygsAHACAobAIygsAHACAobAIygsAHACAob\nAIygsAHACAobAIygsAHAiMQq7Lo6KS9PEyZOlPLyurcBAJLi/HpVT9XVSfPnS9GoApJ0+HD3tiQF\ng34mA4CEkDgz7MpKKRrtPRaNdo8DABKosJuahjYOAFeYxCnsnJyhjQPAFSZxCnvlSunC34BMS+se\nBwAkUGEHg9L69VJurpxAQMrN7d7mDUdvcAUOYF7iXCUidZdzMKiDxn5wM+FxBQ5wWUicGTaGD1fg\nAJcFCvtKwBU4wGWBwr4ScAUOcFmgsK8EXIEDXBZcF/Zrr72mkpISlZaW6vvvv/cyE7zGFTjAZcHV\nVSJ79+7V4cOHtXnzZh06dEgvvfSSQqGQ19ngJa7AAcxzNcOur6/Xgw8+KEm69dZb1dHRoT///NPT\nYACA3lwVdmtrqzIzM3u2s7KydOzYMc9CAQAu5mpJxHGci7YDgcBF94tEIq5CxWIx18f6wVJeS1kl\nW3ktZZVs5bWUVRq+vK4KOzs7W62trT3bf/zxh2644YaL7ud2rTRibJ3VUl5LWSVbeS1llWzltZRV\nii9vOBzuc5+rJZF77rlHO3fulCQdOHBAN910kzIyMlyFAwAMjqsZ9l133aWJEyeqtLRUgUBAr776\nqte5AAAXcP3lT0uWLPEyBwBgAHzSEQCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgK\nGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKGwCMoLABwAgKG7jS1NVJeXmaMHGi\nlJfXvQ0TXP+mIwCD6uqk+fOlaFQBSTp8uHtbkoJBP5NhEJhhA1eSykopGu09Fo12jyPhUdjAlaSp\naWjjSCiulkS6urpUWVmpI0eOqKurSy+++KLuvvtur7MB8FpOTvcyyKXGkfBczbA/++wzXXvttdq4\ncaNWrlyp119/3etcAIbDypVSWlrvsbS07nEkPFcz7EcffVSPPPKIJGnUqFE6ceKEp6EADJO/31is\nrJTT1KRATk53WfOGowmuCvvqq6/uuf3+++/3lDcAA4JBKRjUwUhE+fn5fqfBEAQcx3H6u0MoFFIo\nFOo19swzz2j69Omqq6vTl19+qbVr1/YqcUkKh8NKu/Cl1yDFYjGlpqa6OtYPlvJayirZymspq2Qr\nr6WsUnx5o9GoCgoKLr3Tcemjjz5y5s2b58RisUvu37dvn9tTOwcOHHB9rB8s5bWU1XFs5bWU1XFs\n5bWU1XHiy9tfd7paEjly5Ig2bdqkDz/8UNdcc42rZxEAwNC4KuxQKKQTJ05o/t+fkJL0zjvvKCUl\nxbNgAIDeXBX2888/r+eff
97rLACAfvBJRwAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIG\nACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMobAAwgsIGACMo\nbAAwgsIGACPiKuzW1lZNmTJF33zzjVd5AAB9iKuwa2pqNG7cOK+yAAD64bqw6+vrlZ6erttvv93L\nPACAPrgq7DNnzqi2tlbPPfec13kAAH1IHugOoVBIoVCo19iMGTNUXFys66+/vt9jI5GIq1CxWMz1\nsX6wlNdSVslWXktZJVt5LWWVhi9vwHEcZ6gHlZaW6vz585KkpqYmjRo1SqtXr9Ztt93Wc59wOKyC\nggJXoSKRiPLz810d6wdLeS1llWzltZRVspXXUlYpvrz9deeAM+xL2bRpU8/tiooKFRUV9SprAID3\nuA4bAIxwNcP+p9dff92LHACAATDDBgAjKGwAMILCBgAjKGwAMILCBuJVVyfl5WnCxIlSXl73NjAM\n4r5KBLii1dVJ8+dL0agCknT4cPe2JAWDfibDZYgZNhCPykopGu09Fo12jwMeo7CBeDQ1DW0ciAOF\nDcQjJ2do40AcKGwgHitXSmlpvcfS0rrHAY9R2EA8gkFp/XopN1dOICDl5nZv84YjhgFXiQDxCgal\nYFAHjX0FKOxhhg0ARlDYAGAEhQ0ARlDYAGAEhQ0ARrj6Ed7BCIfDw3FaALjs9fUjvMNW2AAAb7Ek\nAgBGUNgAYETCfdLxxx9/1MKFC/X4449r7ty5fscZUE1NjcLhsLq6urRgwQI9/PDDfke6pM7OTlVU\nVOj48eP666+/tHDhQt1///1+x+pXLBbTrFmztGjRIs2ePdvvOH1qbGzUwoULlZubK0m6/fbb9fLL\nL/ucqn/btm3Thg0blJycrGeffVb33nuv35EuKRQKadu2bT3bjY2N+vbbb31M1LfTp09r6dKlOnny\npM6ePatFixZp+vTpnj5GQhV2NBrVihUrNHXqVL+jDEpDQ4N++uknbd68We3t7SoqKkrYwt69e7fu\nuOMOPf3002pubta8efMSvrDfeustjRw50u8YA4pGo5o5c6YqjXwHdnt7u2pra7VlyxZFo1GtWbMm\nYQu7uLhYxcXFkqS9e/fq888/9zlR3z799FONHz9eixcvVktLi8rKyrRjxw5PHyOhCjslJUVvv/22\n3n77bb+jDMqUKVM0adIkSdKIESPU2dmpc+fOKSkpyedkF/vXv/7Vc/u3335Tdna2j2kG9vPPP+vQ\noUO67777/I4yoNOnT/sdYUjq6+s1depUZWRkKCMjQytWrPA70qDU1tbqjTfe8DtGnzIzM/XDDz9I\nkjo6OpSZmen5YyTUGnZycrJSU1P9jjFoSUlJSvu/r9YMhUKaMWNGQpb1P5WWlmrJkiVatmyZ31H6\nVV1drYqKCr9jDEo0GlU4HNZTTz2lYDCohoYGvyP169dff5XjOCovL9ecOXNUX1/vd6QBff/99xoz\nZoxuvPFGv6P0adasWTp69KgeeughzZ07V0uXLvX8MRJqhm3Vrl279PHHH+vdd9/1O8qANm3apEgk\nohdeeEHbtm1TIBDwO9JFtm7dqsmTJ2vcuHF+RxmUCRMmaNGiRXrggQf0yy+/6IknntAXX3yhlJQU\nv6P1qaWlRW+++aaOHj2qxx57TLt3707Iv4W/ffzxxyoqKvI7Rr8+++wzjR07Vu+8844OHjyoyspK\nbdmyxdPHoLDj9PXXX2vt2rXasGGDrrvuOr/j9KmxsVFZWVkaM2aM8vPzde7cObW1tSkrK8vvaBfZ\ns2ePjhw5oj179uj3339XSkqKRo8erWnTpvkd7ZJuueUW3XLLLZKk8ePH64YbblBLS0vCPuFkZWXp\nzjvvVHJysnJycpSenp6wfwt/++abb1RVVeV3jH7t379fhYWFkrqfxFtaWtTV1aXkZO9qNqG
WRKw5\ndeqUampqtG7duoR/c2zfvn09rwBaW1sVjUaHZY3NC6tWrdKWLVv00Ucfqbi4WAsXLkzYspa6Z38f\nfPCBJOnYsWM6fvx4Qr9HUFhYqIaGBp0/f15tbW0J/bcgdb8aSE9PT+hXLJKUm5ur7777TpLU3Nys\n9PR0T8taSrAZdmNjo6qrq9Xc3Kzk5GTt3LlTa9asSdgy3L59u9rb21VeXt4zVl1drbFjx/qY6tJK\nS0tVWVmpOXPmKBaL6ZVXXtFVV/F87YWHHnpIS5Ys0c6dO3XmzBktX748ocslOztbM2fOVFlZmTo7\nO1VVVZXQfwvHjh3TqFGj/I4xoJKSEi1btkxz585VV1eXli9f7vlj8NF0ADAicZ9WAQC9UNgAYASF\nDQBGUNgAYASFDQBGUNgAYASFDQBGUNgAYMR/AaKqu6/TxdMiAAAAAElFTkSuQmCC\n", 168 | "text/plain": [ 169 | "" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "output_type": "display_data" 174 | } 175 | ], 176 | "source": [ 177 | "# Visualize the result\n", 178 | "plt.plot(xdata, ydata, 'or')\n", 179 | "plt.plot(xfit, yfit, '-', color='gray')\n", 180 | "\n", 181 | "plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,\n", 182 | " color='gray', alpha=0.2)\n", 183 | "plt.xlim(0, 10);" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "- This gives an intuitive view into what the Gaussian process regression algorithm is doing: in regions near a measured data point, the model is strongly constrained and this is reflected in the small model errors. In regions far from a measured data point, the model is not strongly constrained, and the model errors increase.\n", 191 | "\n", 192 | "- For more information on the options available in ``plt.fill_between()`` (and the closely related ``plt.fill()`` function), see the function docstring or the Matplotlib documentation." 
193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "anaconda-cloud": {}, 198 | "kernelspec": { 199 | "display_name": "Python 3", 200 | "language": "python", 201 | "name": "python3" 202 | }, 203 | "language_info": { 204 | "codemirror_mode": { 205 | "name": "ipython", 206 | "version": 3 207 | }, 208 | "file_extension": ".py", 209 | "mimetype": "text/x-python", 210 | "name": "python", 211 | "nbconvert_exporter": "python", 212 | "pygments_lexer": "ipython3", 213 | "version": "3.6.3" 214 | } 215 | }, 216 | "nbformat": 4, 217 | "nbformat_minor": 1 218 | } 219 | -------------------------------------------------------------------------------- /Pandas-Performance-Eval-and-Query.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# High-Performance Pandas: eval() and query()" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "- NumPy and Pandas rely on the creation of temporary intermediate objects. They can cause increased computational time and memory use.\n", 15 | "- Pandas v0.13 includes ``eval()`` and ``query()`` functions, which allow you to directly access C-speed operations without the need for intermediate arrays. They rely on the [Numexpr](https://github.com/pydata/numexpr) package." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### How Compound Expressions cause slowdowns\n", 23 | "- Consider adding the elements of two arrays in Numpy vs Python:" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "1.73 ms ± 28.7 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "import numpy as np\n", 41 | "rng = np.random.RandomState(42)\n", 42 | "x = rng.rand(1000000)\n", 43 | "y = rng.rand(1000000)\n", 44 | "%timeit x + y" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "name": "stdout", 54 | "output_type": "stream", 55 | "text": [ 56 | "168 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "- Now consider what happens when computing compound expressions. NumPy evaluates each subexpression, which looks like this:" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": { 75 | "collapsed": true 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "mask = (x > 0.5) & (y < 0.5)\n", 80 | "tmp1 = (x > 0.5)\n", 81 | "tmp2 = (y < 0.5)\n", 82 | "mask = tmp1 & tmp2" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "- *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this is a problem.\n", 90 | "- The Numexpr library allows you to eval this compound expression element by element, without the need to allocate full intermediate arrays. 
The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "outputs": [ 98 | { 99 | "data": { 100 | "text/plain": [ 101 | "True" 102 | ] 103 | }, 104 | "execution_count": 4, 105 | "metadata": {}, 106 | "output_type": "execute_result" 107 | } 108 | ], 109 | "source": [ 110 | "import numexpr\n", 111 | "mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')\n", 112 | "\n", 113 | "# NumPy allclose(): returns True if two arrays are element-wise equal within a tolerance.\n", 114 | "np.allclose(mask, mask_numexpr)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### pandas.eval() for Efficient Operations\n", 122 | "- ``eval()`` uses string expressions to compute operations using ``DataFrame``s." 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 9, 128 | "metadata": { 129 | "collapsed": true 130 | }, 131 | "outputs": [], 132 | "source": [ 133 | "# build 4 DataFrames, each 100,000 rows x 100 columns\n", 134 | "import pandas as pd\n", 135 | "nrows, ncols = 100000, 100\n", 136 | "rng = np.random.RandomState(42)\n", 137 | "df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))\n", 138 | "                      for i in range(4))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "- Using ``pd.eval()`` for this computation is more than twice as fast, and uses much less memory." 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 10, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "92.8 ms ± 958 µs per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "%timeit df1 + df2 + df3 + df4" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 11, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "37.3 ms ± 699 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "%timeit pd.eval('df1 + df2 + df3 + df4')" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 12, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "data": { 189 | "text/plain": [ 190 | "True" 191 | ] 192 | }, 193 | "execution_count": 12, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "### Operations supported by pd.eval()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 13, 212 | "metadata": {}, 213 | "outputs": [ 214 | { 215 | "data": { 216 | "text/html": [ 217 | "
" 274 | ], 275 | "text/plain": [ 276 | " 0 1 2\n", 277 | "0 180 112 748\n", 278 | "1 447 205 487\n", 279 | "2 656 100 98\n", 280 | "3 90 450 613\n", 281 | "4 529 224 530" 282 | ] 283 | }, 284 | "execution_count": 13, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))\n", 291 | " for i in range(5))\n", 292 | "df1.head()" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "#### pd.eval() supports all arithmetic operators:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 14, 305 | "metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "True" 311 | ] 312 | }, 313 | "execution_count": 14, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "result1 = -df1 * df2 / (df3 + df4) - df5\n", 320 | "result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')\n", 321 | "np.allclose(result1, result2)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "#### ``pd.eval()`` supports all comparison operators, including chained expressions:" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 15, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "data": { 338 | "text/plain": [ 339 | "True" 340 | ] 341 | }, 342 | "execution_count": 15, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)\n", 349 | "result2 = pd.eval('df1 < df2 <= df3 != df4')\n", 350 | "np.allclose(result1, result2)" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "#### ``pd.eval()`` supports the ``&`` and ``|`` bitwise operators:" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 16, 363 | 
"metadata": {}, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "True" 369 | ] 370 | }, 371 | "execution_count": 16, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)\n", 378 | "result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')\n", 379 | "np.allclose(result1, result2)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": {}, 385 | "source": [ 386 | "#### ``pd.eval()`` also supports literal ``and`` and ``or`` in Boolean expressions:" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 17, 392 | "metadata": {}, 393 | "outputs": [ 394 | { 395 | "data": { 396 | "text/plain": [ 397 | "True" 398 | ] 399 | }, 400 | "execution_count": 17, 401 | "metadata": {}, 402 | "output_type": "execute_result" 403 | } 404 | ], 405 | "source": [ 406 | "result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')\n", 407 | "np.allclose(result1, result3)" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "#### ``pd.eval()`` supports access to object attributes via ``obj.attr``, and indexes via ``obj[index]``:" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 18, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "data": { 424 | "text/plain": [ 425 | "True" 426 | ] 427 | }, 428 | "execution_count": 18, 429 | "metadata": {}, 430 | "output_type": "execute_result" 431 | } 432 | ], 433 | "source": [ 434 | "result1 = df2.T[0] + df3.iloc[1]\n", 435 | "result2 = pd.eval('df2.T[0] + df3.iloc[1]')\n", 436 | "np.allclose(result1, result2)" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "#### Other operations\n", 444 | "- Function calls, conditional statements, loops, and more complex constructs are __currently *not* implemented__ in ``pd.eval()``."
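For constructs that ``pd.eval()`` does not implement, one workaround (a sketch — the intermediate name ``tmp`` is purely illustrative) is to precompute the unsupported piece with ordinary pandas and then hand the result to ``pd.eval()``, which resolves plain variable names from the calling frame:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df1, df2 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for _ in range(2))

# A method call like df1.abs() cannot appear inside the eval string,
# so compute it with ordinary pandas first...
tmp = df1.abs()

# ...then let pd.eval() handle the remaining arithmetic; it picks up
# `tmp` and `df2` from the surrounding namespace.
result = pd.eval('tmp + df2')

np.allclose(result, df1.abs() + df2)
```

This keeps the memory-saving evaluation for the arithmetic while doing the unsupported call only once, outside the expression string.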
445 | ] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "### ``DataFrame.eval()`` for Column-Wise Operations\n", 452 | "\n", 453 | "- Just as Pandas uses ``pd.eval()``, ``DataFrame``s have a similar ``eval()`` method. The benefit of the ``eval()`` method is that columns can be referred to *by name*." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 19, 459 | "metadata": {}, 460 | "outputs": [ 461 | { 462 | "data": { 463 | "text/html": [ 464 | "
" 521 | ], 522 | "text/plain": [ 523 | " A B C\n", 524 | "0 0.375506 0.406939 0.069938\n", 525 | "1 0.069087 0.235615 0.154374\n", 526 | "2 0.677945 0.433839 0.652324\n", 527 | "3 0.264038 0.808055 0.347197\n", 528 | "4 0.589161 0.252418 0.557789" 529 | ] 530 | }, 531 | "execution_count": 19, 532 | "metadata": {}, 533 | "output_type": "execute_result" 534 | } 535 | ], 536 | "source": [ 537 | "df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])\n", 538 | "df.head()" 539 | ] 540 | }, 541 | { 542 | "cell_type": "markdown", 543 | "metadata": {}, 544 | "source": [ 545 | "- Now we can compute expressions using the column names:" 546 | ] 547 | }, 548 | { 549 | "cell_type": "code", 550 | "execution_count": 20, 551 | "metadata": {}, 552 | "outputs": [ 553 | { 554 | "data": { 555 | "text/plain": [ 556 | "True" 557 | ] 558 | }, 559 | "execution_count": 20, 560 | "metadata": {}, 561 | "output_type": "execute_result" 562 | } 563 | ], 564 | "source": [ 565 | "result1 = (df['A'] + df['B']) / (df['C'] - 1)\n", 566 | "result2 = pd.eval(\"(df.A + df.B) / (df.C - 1)\")\n", 567 | "np.allclose(result1, result2)" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "- ``DataFrame.eval()`` allows more succinct evaluation of expressions with the columns. We can treat *column names as variables*." 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 21, 580 | "metadata": {}, 581 | "outputs": [ 582 | { 583 | "data": { 584 | "text/plain": [ 585 | "True" 586 | ] 587 | }, 588 | "execution_count": 21, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "result3 = df.eval('(A + B) / (C - 1)')\n", 595 | "np.allclose(result1, result3)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "### Assignment in DataFrame.eval()\n", 603 | "- ``DataFrame.eval()`` also allows assignment to any column." 
604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": 22, 609 | "metadata": {}, 610 | "outputs": [ 611 | { 612 | "data": { 613 | "text/html": [ 614 | "
" 671 | ], 672 | "text/plain": [ 673 | " A B C\n", 674 | "0 0.375506 0.406939 0.069938\n", 675 | "1 0.069087 0.235615 0.154374\n", 676 | "2 0.677945 0.433839 0.652324\n", 677 | "3 0.264038 0.808055 0.347197\n", 678 | "4 0.589161 0.252418 0.557789" 679 | ] 680 | }, 681 | "execution_count": 22, 682 | "metadata": {}, 683 | "output_type": "execute_result" 684 | } 685 | ], 686 | "source": [ 687 | "df.head()" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 23, 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "data": { 697 | "text/html": [ 698 | "
" 761 | ], 762 | "text/plain": [ 763 | " A B C D\n", 764 | "0 0.375506 0.406939 0.069938 11.187620\n", 765 | "1 0.069087 0.235615 0.154374 1.973796\n", 766 | "2 0.677945 0.433839 0.652324 1.704344\n", 767 | "3 0.264038 0.808055 0.347197 3.087857\n", 768 | "4 0.589161 0.252418 0.557789 1.508776" 769 | ] 770 | }, 771 | "execution_count": 23, 772 | "metadata": {}, 773 | "output_type": "execute_result" 774 | } 775 | ], 776 | "source": [ 777 | "df.eval('D = (A + B) / C', inplace=True)\n", 778 | "df.head()" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "- Any existing column can be modified:" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 24, 791 | "metadata": {}, 792 | "outputs": [ 793 | { 794 | "data": { 795 | "text/html": [ 796 | "
" 859 | ], 860 | "text/plain": [ 861 | " A B C D\n", 862 | "0 0.375506 0.406939 0.069938 -0.449425\n", 863 | "1 0.069087 0.235615 0.154374 -1.078728\n", 864 | "2 0.677945 0.433839 0.652324 0.374209\n", 865 | "3 0.264038 0.808055 0.347197 -1.566886\n", 866 | "4 0.589161 0.252418 0.557789 0.603708" 867 | ] 868 | }, 869 | "execution_count": 24, 870 | "metadata": {}, 871 | "output_type": "execute_result" 872 | } 873 | ], 874 | "source": [ 875 | "df.eval('D = (A - B) / C', inplace=True)\n", 876 | "df.head()" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "### Local variables in DataFrame.eval()\n", 884 | "- ``DataFrame.eval()`` supports local Python variables via the ``@`` syntax:" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": 25, 890 | "metadata": {}, 891 | "outputs": [ 892 | { 893 | "data": { 894 | "text/plain": [ 895 | "True" 896 | ] 897 | }, 898 | "execution_count": 25, 899 | "metadata": {}, 900 | "output_type": "execute_result" 901 | } 902 | ], 903 | "source": [ 904 | "column_mean = df.mean(1)  # local variable\n", 905 | "\n", 906 | "result1 = df['A'] + column_mean\n", 907 | "result2 = df.eval('A + @column_mean')\n", 908 | "np.allclose(result1, result2)" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "- The ``@`` character marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving __two \"namespaces\"__: the namespace of columns, and the namespace of Python objects.\n", 916 | "\n", 917 | "- Note: this ``@`` character is only supported by ``DataFrame.eval()``, not by ``pandas.eval()`` - ``pandas.eval()`` only has access to one (Python) namespace."
918 | ] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "metadata": {}, 923 | "source": [ 924 | "## DataFrame.query() Method\n", 925 | "\n", 926 | "Consider the following:" 927 | ] 928 | }, 929 | { 930 | "cell_type": "code", 931 | "execution_count": 26, 932 | "metadata": {}, 933 | "outputs": [ 934 | { 935 | "data": { 936 | "text/plain": [ 937 | "True" 938 | ] 939 | }, 940 | "execution_count": 26, 941 | "metadata": {}, 942 | "output_type": "execute_result" 943 | } 944 | ], 945 | "source": [ 946 | "result1 = df[(df.A < 0.5) & (df.B < 0.5)]\n", 947 | "result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')\n", 948 | "np.allclose(result1, result2)" 949 | ] 950 | }, 951 | { 952 | "cell_type": "markdown", 953 | "metadata": {}, 954 | "source": [ 955 | "- This expression involves columns of a ``DataFrame``, but it cannot be evaluated using ``DataFrame.eval()``... instead, for this type of filtering operation, you can use the ``query()`` method:" 956 | ] 957 | }, 958 | { 959 | "cell_type": "code", 960 | "execution_count": 27, 961 | "metadata": {}, 962 | "outputs": [ 963 | { 964 | "data": { 965 | "text/plain": [ 966 | "True" 967 | ] 968 | }, 969 | "execution_count": 27, 970 | "metadata": {}, 971 | "output_type": "execute_result" 972 | } 973 | ], 974 | "source": [ 975 | "result2 = df.query('A < 0.5 and B < 0.5')\n", 976 | "np.allclose(result1, result2)" 977 | ] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": {}, 982 | "source": [ 983 | "- This is much more efficient than the masking expression above, plus it's much easier to read. 
Note that ``query()`` also accepts the ``@`` flag to mark local variables:" 984 | ] 985 | }, 986 | { 987 | "cell_type": "code", 988 | "execution_count": 28, 989 | "metadata": {}, 990 | "outputs": [ 991 | { 992 | "data": { 993 | "text/plain": [ 994 | "True" 995 | ] 996 | }, 997 | "execution_count": 28, 998 | "metadata": {}, 999 | "output_type": "execute_result" 1000 | } 1001 | ], 1002 | "source": [ 1003 | "Cmean = df['C'].mean()\n", 1004 | "result1 = df[(df.A < Cmean) & (df.B < Cmean)]\n", 1005 | "result2 = df.query('A < @Cmean and B < @Cmean')\n", 1006 | "np.allclose(result1, result2)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "metadata": {}, 1012 | "source": [ 1013 | "### Performance: When to Use These Functions\n", 1014 | "\n", 1015 | "- Memory use is the most predictable aspect. Every compound expression involving NumPy arrays or Pandas ``DataFrame``s will result in implicit creation of temporary arrays:" 1016 | ] 1017 | }, 1018 | { 1019 | "cell_type": "code", 1020 | "execution_count": 29, 1021 | "metadata": { 1022 | "collapsed": true 1023 | }, 1024 | "outputs": [], 1025 | "source": [ 1026 | "x = df[(df.A < 0.5) & (df.B < 0.5)]" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "markdown", 1031 | "metadata": {}, 1032 | "source": [ 1033 | "- Is roughly equivalent to:" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "code", 1038 | "execution_count": 30, 1039 | "metadata": { 1040 | "collapsed": true 1041 | }, 1042 | "outputs": [], 1043 | "source": [ 1044 | "tmp1 = df.A < 0.5\n", 1045 | "tmp2 = df.B < 0.5\n", 1046 | "tmp3 = tmp1 & tmp2\n", 1047 | "x = df[tmp3]" 1048 | ] 1049 | }, 1050 | { 1051 | "cell_type": "markdown", 1052 | "metadata": {}, 1053 | "source": [ 1054 | "- If the size of the temporary ``DataFrame``s is significant compared to available system memory (typically several GB) then consider using ``eval()`` or ``query()``. 
You can check the approximate size of your array in bytes using this:" 1055 | ] 1056 | }, 1057 | { 1058 | "cell_type": "code", 1059 | "execution_count": 31, 1060 | "metadata": {}, 1061 | "outputs": [ 1062 | { 1063 | "data": { 1064 | "text/plain": [ 1065 | "32000" 1066 | ] 1067 | }, 1068 | "execution_count": 31, 1069 | "metadata": {}, 1070 | "output_type": "execute_result" 1071 | } 1072 | ], 1073 | "source": [ 1074 | "df.values.nbytes" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "markdown", 1079 | "metadata": {}, 1080 | "source": [ 1081 | "- ``eval()`` can be faster even when you are not maxing-out your system memory.\n", 1082 | "- The issue is how your temporary ``DataFrame``s compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes); if they are much bigger, then ``eval()`` can avoid some potentially slow movement of values between the different memory caches.\n", 1083 | "- See [\"Enhancing Performance\" section](http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html) for more detail." 
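The cache-size rule of thumb above can be sketched as a tiny helper. Everything here is an illustrative assumption — the 4 MB threshold and the ``prefer_eval`` name are not a pandas API, just one way to encode the heuristic:

```python
import numpy as np
import pandas as pd

# Rough heuristic, not an official pandas rule: once the operands are much
# larger than a typical CPU cache (a few MB), eval()/query() tend to pay off.
CACHE_SIZE_BYTES = 4 * 2**20  # assumed ~4 MB cache

def prefer_eval(*frames):
    """Return True if the combined operand size suggests using eval()/query()."""
    total = sum(df.values.nbytes for df in frames)
    return total > CACHE_SIZE_BYTES

df = pd.DataFrame(np.random.rand(1000, 3))  # 1000 * 3 * 8 bytes = 24 kB
prefer_eval(df)                             # small frame: plain pandas is fine
```

For small frames like this one the helper returns False; only frames well beyond cache size would tip the decision toward ``eval()``/``query()``.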
1084 | ] 1085 | } 1086 | ], 1087 | "metadata": { 1088 | "anaconda-cloud": {}, 1089 | "kernelspec": { 1090 | "display_name": "Python 3", 1091 | "language": "python", 1092 | "name": "python3" 1093 | }, 1094 | "language_info": { 1095 | "codemirror_mode": { 1096 | "name": "ipython", 1097 | "version": 3 1098 | }, 1099 | "file_extension": ".py", 1100 | "mimetype": "text/x-python", 1101 | "name": "python", 1102 | "nbconvert_exporter": "python", 1103 | "pygments_lexer": "ipython3", 1104 | "version": "3.6.3" 1105 | } 1106 | }, 1107 | "nbformat": 4, 1108 | "nbformat_minor": 1 1109 | } 1110 | -------------------------------------------------------------------------------- /Pandas-Missing-Values.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Handling Missing Data" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Trade-Offs\n", 15 | "\n", 16 | "- There are multiple schemes to indicate missing data in a table or DataFrame. They use one of two strategies: using a *mask* that globally indicates missing values, or choosing a *sentinel value* that indicates a missing entry.\n", 17 | "\n", 18 | "- A mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.\n", 19 | "\n", 20 | "- A sentinel value could be a data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).\n", 21 | "\n", 22 | "- Using a separate mask array requires allocation of an additional Boolean array; a sentinel value reduces the range of valid values that can be represented and may require extra (often non-optimized) logic in CPU and GPU arithmetic. 
Common special values like NaN are not available for all data types." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### Missing Data in Pandas\n", 30 | "\n", 31 | "- Pandas is constrained by its reliance on NumPy, which does not have a built-in notion of NA values for non-floating-point data types.\n", 32 | "\n", 33 | "- Pandas could have specified bit patterns for each individual data type to indicate nullness, but NumPy supports (for example) *fourteen* basic integer types once you account for available precisions, signedness, and endianness of the encoding.\n", 34 | "\n", 35 | "- NumPy supports masked arrays – that is, arrays that have a separate Boolean mask array attached for marking data as \"good\" or \"bad.\" Pandas could have used this approach, but it used too much overhead.\n", 36 | "\n", 37 | "- Therefore Pandas chose to use sentinels for missing data and to use two already-existing Python null values: the special floating-point ``NaN`` value, and the Python ``None`` object." 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "### ``None``: Pythonic missing data\n", 45 | "\n", 46 | "- ``None`` is a singleton object that is often used for missing data in Python code. 
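The two strategies can be illustrated with plain NumPy. This is a sketch; the ``-9999`` sentinel is just an example convention, not a standard value:

```python
import numpy as np

data = np.array([1.0, -9999.0, 3.0])  # sentinel convention: -9999 marks a missing entry

# Sentinel strategy: filter the special value out before computing.
valid = data != -9999
data[valid].mean()                     # mean over the two valid entries

# Mask strategy: carry a separate Boolean array marking "bad" entries.
masked = np.ma.masked_array(data, mask=(data == -9999))
masked.mean()                          # same result, mask handled automatically

# NaN as a sentinel: only available for floating-point dtypes.
nan_data = np.array([1.0, np.nan, 3.0])
np.nansum(nan_data)                    # NaN-aware aggregation
```

Note how the sentinel version burns one representable value (-9999 can never be real data), while the masked version pays for an extra Boolean array — exactly the trade-off described above.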
Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 1, 52 | "metadata": { 53 | "collapsed": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "import numpy as np\n", 58 | "import pandas as pd" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 2, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "array([1, None, 3, 4], dtype=object)" 70 | ] 71 | }, 72 | "execution_count": 2, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "vals1 = np.array([1, None, 3, 4])\n", 79 | "vals1" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "- While an object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead." 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "dtype = object\n", 99 | "48.4 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", 100 | "\n", 101 | "dtype = int\n", 102 | "1.45 ms ± 16.3 µs per loop (mean ± std. dev. 
of 7 runs, 1000 loops each)\n", 103 | "\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "for dtype in ['object', 'int']:\n", 109 | " print(\"dtype =\", dtype)\n", 110 | " %timeit np.arange(1E6, dtype=dtype).sum()\n", 111 | " print()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "- The use of Python objects in an array also means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error:" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "ename": "TypeError", 128 | "evalue": "unsupported operand type(s) for +: 'int' and 'NoneType'", 129 | "output_type": "error", 130 | "traceback": [ 131 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 132 | "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", 133 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mvals1\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 134 | "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py\u001b[0m in \u001b[0;36m_sum\u001b[0;34m(a, axis, dtype, out, keepdims)\u001b[0m\n\u001b[1;32m 30\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 31\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 32\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mumr_sum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 33\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 34\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_prod\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mout\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkeepdims\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 135 | "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'NoneType'" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "vals1.sum()" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "### ``NaN``: Missing numerical data\n", 148 | "\n", 149 | "- ``NaN`` (acronym for *Not a Number*), is different; it is a special floating-point value recognized by all systems that use the standard IEEE format." 
150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 5, 155 | "metadata": {}, 156 | "outputs": [ 157 | { 158 | "data": { 159 | "text/plain": [ 160 | "dtype('float64')" 161 | ] 162 | }, 163 | "execution_count": 5, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "vals2 = np.array([1, np.nan, 3, 4]) \n", 170 | "vals2.dtype" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "- Be aware that ``NaN`` is a bit like a data virus–it infects any other object it touches. Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 6, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "nan" 189 | ] 190 | }, 191 | "execution_count": 6, 192 | "metadata": {}, 193 | "output_type": "execute_result" 194 | } 195 | ], 196 | "source": [ 197 | "1 + np.nan" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 7, 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "data": { 207 | "text/plain": [ 208 | "nan" 209 | ] 210 | }, 211 | "execution_count": 7, 212 | "metadata": {}, 213 | "output_type": "execute_result" 214 | } 215 | ], 216 | "source": [ 217 | "0 * np.nan" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 8, 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "data": { 227 | "text/plain": [ 228 | "(nan, nan, nan)" 229 | ] 230 | }, 231 | "execution_count": 8, 232 | "metadata": {}, 233 | "output_type": "execute_result" 234 | } 235 | ], 236 | "source": [ 237 | "vals2.sum(), vals2.min(), vals2.max()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "- NumPy provides some special aggregations that will ignore these missing values:" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | 
"execution_count": 9, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "(8.0, 1.0, 4.0)" 256 | ] 257 | }, 258 | "execution_count": 9, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "### NaN and None in Pandas\n", 272 | "\n", 273 | "- Pandas is built to handle both ``NaN`` and ``None`` interchangeably, converting between them where appropriate." 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 10, 279 | "metadata": {}, 280 | "outputs": [ 281 | { 282 | "data": { 283 | "text/plain": [ 284 | "0 1.0\n", 285 | "1 NaN\n", 286 | "2 2.0\n", 287 | "3 NaN\n", 288 | "dtype: float64" 289 | ] 290 | }, 291 | "execution_count": 10, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "pd.Series([1, np.nan, 2, None])" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "- For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to ``np.nan``, it will automatically be upcast to a floating-point type to accommodate the NA." 
305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 11, 310 | "metadata": {}, 311 | "outputs": [ 312 | { 313 | "data": { 314 | "text/plain": [ 315 | "0 0\n", 316 | "1 1\n", 317 | "dtype: int64" 318 | ] 319 | }, 320 | "execution_count": 11, 321 | "metadata": {}, 322 | "output_type": "execute_result" 323 | } 324 | ], 325 | "source": [ 326 | "x = pd.Series(range(2), dtype=int)\n", 327 | "x" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 12, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "0 NaN\n", 339 | "1 1.0\n", 340 | "dtype: float64" 341 | ] 342 | }, 343 | "execution_count": 12, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "x[0] = None\n", 350 | "x" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "- Upcasting conventions in Pandas when NA values are introduced:\n", 358 | "\n", 359 | "|Typeclass | Conversion When Storing NAs | NA Sentinel Value |\n", 360 | "|--------------|-----------------------------|------------------------|\n", 361 | "| ``floating`` | No change | ``np.nan`` |\n", 362 | "| ``object`` | No change | ``None`` or ``np.nan`` |\n", 363 | "| ``integer`` | Cast to ``float64`` | ``np.nan`` |\n", 364 | "| ``boolean`` | Cast to ``object`` | ``None`` or ``np.nan`` |\n", 365 | "\n", 366 | "Keep in mind that in Pandas, string data is always stored with an ``object`` dtype." 
367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "metadata": {}, 372 | "source": [ 373 | "### Operating on Null Values\n", 374 | "\n", 375 | "- There are several methods for detecting, removing, and replacing null values in Pandas data structures:\n", 376 | "\n", 377 | "- ``isnull()``: Generate a boolean mask indicating missing values\n", 378 | "- ``notnull()``: Opposite of ``isnull()``\n", 379 | "- ``dropna()``: Return a filtered version of the data\n", 380 | "- ``fillna()``: Return a copy of the data with missing values filled or imputed" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### Detecting null values\n", 388 | "- Pandas data structures have two methods for detecting null data: ``isnull()`` and ``notnull()``. Either one will return a Boolean mask over the data." 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 13, 394 | "metadata": { 395 | "collapsed": true 396 | }, 397 | "outputs": [], 398 | "source": [ 399 | "data = pd.Series([1, np.nan, 'hello', None])" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 14, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "0 False\n", 411 | "1 True\n", 412 | "2 False\n", 413 | "3 True\n", 414 | "dtype: bool" 415 | ] 416 | }, 417 | "execution_count": 14, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "data.isnull()" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "- Boolean masks can be used directly as a ``Series`` or ``DataFrame`` index:" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 15, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/plain": [ 441 | "0 1\n", 442 | "2 hello\n", 443 | "dtype: object" 444 | ] 445 | }, 446 | "execution_count": 15, 447 | "metadata": {}, 448 | "output_type": 
"execute_result" 449 | } 450 | ], 451 | "source": [ 452 | "data[data.notnull()]" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "- ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s." 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "### Dropping null values\n", 467 | "\n", 468 | "- More convenience methods are available:\n", 469 | " - ``dropna()`` (which removes NA values)\n", 470 | " - ``fillna()`` (which fills in NA values)\n", 471 | "- For a ``Series``, the result is straightforward." 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 16, 477 | "metadata": {}, 478 | "outputs": [ 479 | { 480 | "data": { 481 | "text/plain": [ 482 | "0 1\n", 483 | "2 hello\n", 484 | "dtype: object" 485 | ] 486 | }, 487 | "execution_count": 16, 488 | "metadata": {}, 489 | "output_type": "execute_result" 490 | } 491 | ], 492 | "source": [ 493 | "data.dropna()" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "-``DataFrame``s have more options.\n", 501 | "\n", 502 | "- We cannot drop single values from a ``DataFrame`` - only full rows or full columns. Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options.\n", 503 | "\n", 504 | "- By default, ``dropna()`` will drop all rows in which *any* null value is present:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 18, 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "data": { 514 | "text/html": [ 515 | "
\n", 516 | "\n", 529 | "\n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | "
012
01.0NaN2
12.03.05
2NaN4.06
\n", 559 | "
" 560 | ], 561 | "text/plain": [ 562 | " 0 1 2\n", 563 | "0 1.0 NaN 2\n", 564 | "1 2.0 3.0 5\n", 565 | "2 NaN 4.0 6" 566 | ] 567 | }, 568 | "execution_count": 18, 569 | "metadata": {}, 570 | "output_type": "execute_result" 571 | } 572 | ], 573 | "source": [ 574 | "df = pd.DataFrame([[1, np.nan, 2],\n", 575 | " [2, 3, 5],\n", 576 | " [np.nan, 4, 6]])\n", 577 | "df" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 19, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/html": [ 588 | "
\n", 589 | "\n", 602 | "\n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | "
012
12.03.05
\n", 620 | "
" 621 | ], 622 | "text/plain": [ 623 | " 0 1 2\n", 624 | "1 2.0 3.0 5" 625 | ] 626 | }, 627 | "execution_count": 19, 628 | "metadata": {}, 629 | "output_type": "execute_result" 630 | } 631 | ], 632 | "source": [ 633 | "df.dropna()" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "- Or you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value." 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 20, 646 | "metadata": {}, 647 | "outputs": [ 648 | { 649 | "data": { 650 | "text/html": [ 651 | "
\n", 652 | "\n", 665 | "\n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | "
2
02
15
26
\n", 687 | "
" 688 | ], 689 | "text/plain": [ 690 | " 2\n", 691 | "0 2\n", 692 | "1 5\n", 693 | "2 6" 694 | ] 695 | }, 696 | "execution_count": 20, 697 | "metadata": {}, 698 | "output_type": "execute_result" 699 | } 700 | ], 701 | "source": [ 702 | "df.dropna(axis='columns')" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "- This drops good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values. This can be specified through the ``how`` or ``thresh`` parameters.\n", 710 | "\n", 711 | "- The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.\n", 712 | "- You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 21, 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "data": { 722 | "text/html": [ 723 | "
\n", 724 | "\n", 737 | "\n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | "
0123
01.0NaN2NaN
12.03.05NaN
2NaN4.06NaN
\n", 771 | "
" 772 | ], 773 | "text/plain": [ 774 | " 0 1 2 3\n", 775 | "0 1.0 NaN 2 NaN\n", 776 | "1 2.0 3.0 5 NaN\n", 777 | "2 NaN 4.0 6 NaN" 778 | ] 779 | }, 780 | "execution_count": 21, 781 | "metadata": {}, 782 | "output_type": "execute_result" 783 | } 784 | ], 785 | "source": [ 786 | "df[3] = np.nan\n", 787 | "df" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 22, 793 | "metadata": {}, 794 | "outputs": [ 795 | { 796 | "data": { 797 | "text/html": [ 798 | "
\n", 799 | "\n", 812 | "\n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | "
012
01.0NaN2
12.03.05
2NaN4.06
\n", 842 | "
" 843 | ], 844 | "text/plain": [ 845 | " 0 1 2\n", 846 | "0 1.0 NaN 2\n", 847 | "1 2.0 3.0 5\n", 848 | "2 NaN 4.0 6" 849 | ] 850 | }, 851 | "execution_count": 22, 852 | "metadata": {}, 853 | "output_type": "execute_result" 854 | } 855 | ], 856 | "source": [ 857 | "df.dropna(axis='columns', how='all')" 858 | ] 859 | }, 860 | { 861 | "cell_type": "markdown", 862 | "metadata": {}, 863 | "source": [ 864 | "- The ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:" 865 | ] 866 | }, 867 | { 868 | "cell_type": "code", 869 | "execution_count": 23, 870 | "metadata": {}, 871 | "outputs": [ 872 | { 873 | "data": { 874 | "text/html": [ 875 | "
\n", 876 | "\n", 889 | "\n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | "
0123
12.03.05NaN
\n", 909 | "
" 910 | ], 911 | "text/plain": [ 912 | " 0 1 2 3\n", 913 | "1 2.0 3.0 5 NaN" 914 | ] 915 | }, 916 | "execution_count": 23, 917 | "metadata": {}, 918 | "output_type": "execute_result" 919 | } 920 | ], 921 | "source": [ 922 | "df.dropna(axis='rows', thresh=3)" 923 | ] 924 | }, 925 | { 926 | "cell_type": "markdown", 927 | "metadata": {}, 928 | "source": [ 929 | "- The first and last rows have been dropped because they each contain only two non-null values." 930 | ] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "metadata": {}, 935 | "source": [ 936 | "### Filling null values\n", 937 | "\n", 938 | "- Sometimes you'd rather replace NA values with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.\n", 939 | "\n", 940 | "- You could do this in place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced." 
941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 24, 946 | "metadata": {}, 947 | "outputs": [ 948 | { 949 | "data": { 950 | "text/plain": [ 951 | "a 1.0\n", 952 | "b NaN\n", 953 | "c 2.0\n", 954 | "d NaN\n", 955 | "e 3.0\n", 956 | "dtype: float64" 957 | ] 958 | }, 959 | "execution_count": 24, 960 | "metadata": {}, 961 | "output_type": "execute_result" 962 | } 963 | ], 964 | "source": [ 965 | "data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))\n", 966 | "data" 967 | ] 968 | }, 969 | { 970 | "cell_type": "markdown", 971 | "metadata": {}, 972 | "source": [ 973 | "- Fill NA entries with a single value, such as zero:" 974 | ] 975 | }, 976 | { 977 | "cell_type": "code", 978 | "execution_count": 25, 979 | "metadata": {}, 980 | "outputs": [ 981 | { 982 | "data": { 983 | "text/plain": [ 984 | "a 1.0\n", 985 | "b 0.0\n", 986 | "c 2.0\n", 987 | "d 0.0\n", 988 | "e 3.0\n", 989 | "dtype: float64" 990 | ] 991 | }, 992 | "execution_count": 25, 993 | "metadata": {}, 994 | "output_type": "execute_result" 995 | } 996 | ], 997 | "source": [ 998 | "data.fillna(0)" 999 | ] 1000 | }, 1001 | { 1002 | "cell_type": "markdown", 1003 | "metadata": {}, 1004 | "source": [ 1005 | "- Forward-fill to propagate the previous value forward:" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": 26, 1011 | "metadata": {}, 1012 | "outputs": [ 1013 | { 1014 | "data": { 1015 | "text/plain": [ 1016 | "a 1.0\n", 1017 | "b 1.0\n", 1018 | "c 2.0\n", 1019 | "d 2.0\n", 1020 | "e 3.0\n", 1021 | "dtype: float64" 1022 | ] 1023 | }, 1024 | "execution_count": 26, 1025 | "metadata": {}, 1026 | "output_type": "execute_result" 1027 | } 1028 | ], 1029 | "source": [ 1030 | "# forward-fill\n", 1031 | "data.fillna(method='ffill')" 1032 | ] 1033 | }, 1034 | { 1035 | "cell_type": "markdown", 1036 | "metadata": {}, 1037 | "source": [ 1038 | "- Back-fill to propagate the next values backward:" 1039 | ] 1040 | }, 1041 | { 1042 | "cell_type": "code", 1043 | 
"execution_count": 27, 1044 | "metadata": {}, 1045 | "outputs": [ 1046 | { 1047 | "data": { 1048 | "text/plain": [ 1049 | "a 1.0\n", 1050 | "b 2.0\n", 1051 | "c 2.0\n", 1052 | "d 3.0\n", 1053 | "e 3.0\n", 1054 | "dtype: float64" 1055 | ] 1056 | }, 1057 | "execution_count": 27, 1058 | "metadata": {}, 1059 | "output_type": "execute_result" 1060 | } 1061 | ], 1062 | "source": [ 1063 | "# back-fill\n", 1064 | "data.fillna(method='bfill')" 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": { 1070 | "collapsed": true 1071 | }, 1072 | "source": [ 1073 | "- We can also specify an ``axis`` along which the fills take place in a ``DataFrame``." 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": 28, 1079 | "metadata": {}, 1080 | "outputs": [ 1081 | { 1082 | "data": { 1083 | "text/html": [ 1084 | "
\n", 1085 | "\n", 1098 | "\n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | "
0123
01.0NaN2NaN
12.03.05NaN
2NaN4.06NaN
\n", 1132 | "
" 1133 | ], 1134 | "text/plain": [ 1135 | " 0 1 2 3\n", 1136 | "0 1.0 NaN 2 NaN\n", 1137 | "1 2.0 3.0 5 NaN\n", 1138 | "2 NaN 4.0 6 NaN" 1139 | ] 1140 | }, 1141 | "execution_count": 28, 1142 | "metadata": {}, 1143 | "output_type": "execute_result" 1144 | } 1145 | ], 1146 | "source": [ 1147 | "df" 1148 | ] 1149 | }, 1150 | { 1151 | "cell_type": "code", 1152 | "execution_count": 29, 1153 | "metadata": {}, 1154 | "outputs": [ 1155 | { 1156 | "data": { 1157 | "text/html": [ 1158 | "
\n", 1159 | "\n", 1172 | "\n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | " \n", 1186 | " \n", 1187 | " \n", 1188 | " \n", 1189 | " \n", 1190 | " \n", 1191 | " \n", 1192 | " \n", 1193 | " \n", 1194 | " \n", 1195 | " \n", 1196 | " \n", 1197 | " \n", 1198 | " \n", 1199 | " \n", 1200 | " \n", 1201 | " \n", 1202 | " \n", 1203 | " \n", 1204 | " \n", 1205 | "
0123
01.01.02.02.0
12.03.05.05.0
2NaN4.06.06.0
\n", 1206 | "
" 1207 | ], 1208 | "text/plain": [ 1209 | " 0 1 2 3\n", 1210 | "0 1.0 1.0 2.0 2.0\n", 1211 | "1 2.0 3.0 5.0 5.0\n", 1212 | "2 NaN 4.0 6.0 6.0" 1213 | ] 1214 | }, 1215 | "execution_count": 29, 1216 | "metadata": {}, 1217 | "output_type": "execute_result" 1218 | } 1219 | ], 1220 | "source": [ 1221 | "df.fillna(method='ffill', axis=1)" 1222 | ] 1223 | } 1224 | ], 1225 | "metadata": { 1226 | "anaconda-cloud": {}, 1227 | "kernelspec": { 1228 | "display_name": "Python 3", 1229 | "language": "python", 1230 | "name": "python3" 1231 | }, 1232 | "language_info": { 1233 | "codemirror_mode": { 1234 | "name": "ipython", 1235 | "version": 3 1236 | }, 1237 | "file_extension": ".py", 1238 | "mimetype": "text/x-python", 1239 | "name": "python", 1240 | "nbconvert_exporter": "python", 1241 | "pygments_lexer": "ipython3", 1242 | "version": "3.6.3" 1243 | } 1244 | }, 1245 | "nbformat": 4, 1246 | "nbformat_minor": 1 1247 | } 1248 | -------------------------------------------------------------------------------- /Pandas-Vectorized-String-Ops.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Vectorized String Operations" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 4, 20 | "metadata": {}, 21 | "outputs": [ 22 | { 23 | "data": { 24 | "text/plain": [ 25 | "array([ 4, 6, 10, 14, 22, 26])" 26 | ] 27 | }, 28 | "execution_count": 4, 29 | "metadata": {}, 30 | "output_type": "execute_result" 31 | } 32 | ], 33 | "source": [ 34 | "import numpy as np\n", 35 | "x = np.array([2, 3, 5, 7, 11, 13])\n", 36 | "x * 2" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "- NumPy does not provide vectorized operations for string arrays, so you're stuck with an explicit loop:" 44 | ] 45 | }, 46 | { 47 | 
"cell_type": "code", 48 | "execution_count": 5, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "data": { 53 | "text/plain": [ 54 | "['Peter', 'Paul', 'Mary', 'Guido']" 55 | ] 56 | }, 57 | "execution_count": 5, 58 | "metadata": {}, 59 | "output_type": "execute_result" 60 | } 61 | ], 62 | "source": [ 63 | "data = ['peter', 'Paul', 'MARY', 'gUIDO']\n", 64 | "[s.capitalize() for s in data]" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "- This will break if there are any missing values." 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 6, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "ename": "AttributeError", 81 | "evalue": "'NoneType' object has no attribute 'capitalize'", 82 | "output_type": "error", 83 | "traceback": [ 84 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 85 | "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", 86 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 87 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'peter'\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m'Paul'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'MARY'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gUIDO'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;34m[\u001b[0m\u001b[0ms\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcapitalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 88 | "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'capitalize'" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "data = ['peter', 'Paul', None, 'MARY', 'gUIDO']\n", 94 | "[s.capitalize() for s in data]" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "- Pandas solves this problem via the ``str`` attribute of Pandas Series and Index objects containing strings:" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "0 peter\n", 113 | "1 Paul\n", 114 | "2 None\n", 115 | "3 MARY\n", 116 | "4 gUIDO\n", 117 | "dtype: object" 118 | ] 119 | }, 120 | "execution_count": 7, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "import pandas as pd\n", 127 | "names = pd.Series(data)\n", 128 | "names" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 8, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "0 Peter\n", 140 | "1 Paul\n", 141 | "2 None\n", 142 | "3 Mary\n", 143 | "4 Guido\n", 144 | "dtype: object" 145 | ] 146 | }, 147 | "execution_count": 8, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "names.str.capitalize()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 
158 | "metadata": {}, 159 | "source": [ 160 | "- Reminder: using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas." 161 | ] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "### Tables of Pandas String Methods" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 9, 173 | "metadata": { 174 | "collapsed": true 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',\n", 179 | " 'Eric Idle', 'Terry Jones', 'Michael Palin'])" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "### Methods similar to Python string methods\n", 187 | "Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:\n", 188 | "\n", 189 | "| | | | |\n", 190 | "|-------------|------------------|------------------|------------------|\n", 191 | "|``len()`` | ``lower()`` | ``translate()`` | ``islower()`` | \n", 192 | "|``ljust()`` | ``upper()`` | ``startswith()`` | ``isupper()`` | \n", 193 | "|``rjust()`` | ``find()`` | ``endswith()`` | ``isnumeric()`` | \n", 194 | "|``center()`` | ``rfind()`` | ``isalnum()`` | ``isdecimal()`` | \n", 195 | "|``zfill()`` | ``index()`` | ``isalpha()`` | ``split()`` | \n", 196 | "|``strip()`` | ``rindex()`` | ``isdigit()`` | ``rsplit()`` | \n", 197 | "|``rstrip()`` | ``capitalize()`` | ``isspace()`` | ``partition()`` | \n", 198 | "|``lstrip()`` | ``swapcase()`` | ``istitle()`` | ``rpartition()`` |\n", 199 | "\n", 200 | "- Notice that these have various return values. 
Some, like ``lower()``, return a series of strings:" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 10, 206 | "metadata": {}, 207 | "outputs": [ 208 | { 209 | "data": { 210 | "text/plain": [ 211 | "0 graham chapman\n", 212 | "1 john cleese\n", 213 | "2 terry gilliam\n", 214 | "3 eric idle\n", 215 | "4 terry jones\n", 216 | "5 michael palin\n", 217 | "dtype: object" 218 | ] 219 | }, 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "monte.str.lower()" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "- Some return numbers:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 11, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "data": { 243 | "text/plain": [ 244 | "0 14\n", 245 | "1 11\n", 246 | "2 13\n", 247 | "3 9\n", 248 | "4 11\n", 249 | "5 13\n", 250 | "dtype: int64" 251 | ] 252 | }, 253 | "execution_count": 11, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "monte.str.len()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "- Or Boolean values:" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 12, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "data": { 276 | "text/plain": [ 277 | "0 False\n", 278 | "1 False\n", 279 | "2 True\n", 280 | "3 False\n", 281 | "4 True\n", 282 | "5 False\n", 283 | "dtype: bool" 284 | ] 285 | }, 286 | "execution_count": 12, 287 | "metadata": {}, 288 | "output_type": "execute_result" 289 | } 290 | ], 291 | "source": [ 292 | "monte.str.startswith('T')" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "- Others return lists or other compound values for each element:" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 13, 305 | 
"metadata": {}, 306 | "outputs": [ 307 | { 308 | "data": { 309 | "text/plain": [ 310 | "0 [Graham, Chapman]\n", 311 | "1 [John, Cleese]\n", 312 | "2 [Terry, Gilliam]\n", 313 | "3 [Eric, Idle]\n", 314 | "4 [Terry, Jones]\n", 315 | "5 [Michael, Palin]\n", 316 | "dtype: object" 317 | ] 318 | }, 319 | "execution_count": 13, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "monte.str.split()" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "### Methods using regular expressions\n", 333 | "\n", 334 | "- Several methods accept regular expressions. They follow the API conventions of Python's built-in ``re`` module:\n", 335 | "\n", 336 | "| Method | Description |\n", 337 | "|--------|-------------|\n", 338 | "| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |\n", 339 | "| ``extract()`` | Call ``re.match()`` on each element, returning matched groups as strings.|\n", 340 | "| ``findall()`` | Call ``re.findall()`` on each element |\n", 341 | "| ``replace()`` | Replace occurrences of pattern with some other string|\n", 342 | "| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |\n", 343 | "| ``count()`` | Count occurrences of pattern|\n", 344 | "| ``split()`` | Equivalent to ``str.split()``, but accepts regexps |\n", 345 | "| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "- For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element." 
353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 14, 358 | "metadata": {}, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "0 Graham\n", 364 | "1 John\n", 365 | "2 Terry\n", 366 | "3 Eric\n", 367 | "4 Terry\n", 368 | "5 Michael\n", 369 | "dtype: object" 370 | ] 371 | }, 372 | "execution_count": 14, 373 | "metadata": {}, 374 | "output_type": "execute_result" 375 | } 376 | ], 377 | "source": [ 378 | "monte.str.extract('([A-Za-z]+)', expand=False)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "- Find all names that start and end with a consonant, using the start-of-string (``^``) and end-of-string (``$``) regular expression characters:" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 15, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "data": { 395 | "text/plain": [ 396 | "0 [Graham Chapman]\n", 397 | "1 []\n", 398 | "2 [Terry Gilliam]\n", 399 | "3 []\n", 400 | "4 [Terry Jones]\n", 401 | "5 [Michael Palin]\n", 402 | "dtype: object" 403 | ] 404 | }, 405 | "execution_count": 15, 406 | "metadata": {}, 407 | "output_type": "execute_result" 408 | } 409 | ], 410 | "source": [ 411 | "monte.str.findall(r'^[^AEIOU].*[^aeiou]$')" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "### Miscellaneous methods\n", 419 | "\n", 420 | "| Method | Description |\n", 421 | "|--------|-------------|\n", 422 | "| ``get()`` | Index each element |\n", 423 | "| ``slice()`` | Slice each element|\n", 424 | "| ``slice_replace()`` | Replace slice in each element with passed value|\n", 425 | "| ``cat()`` | Concatenate strings|\n", 426 | "| ``repeat()`` | Repeat values |\n", 427 | "| ``normalize()`` | Return Unicode form of string |\n", 428 | "| ``pad()`` | Add whitespace to left, right, or both sides of strings|\n", 429 | "| ``wrap()`` | Split long strings into lines with length less than a given 
width|\n", 430 | "| ``join()`` | Join strings in each element of the Series with passed separator|\n", 431 | "| ``get_dummies()`` | extract dummy variables as a dataframe |" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "metadata": {}, 437 | "source": [ 438 | "#### Vectorized item access and slicing\n", 439 | "\n", 440 | "- ``get()`` and ``slice()`` provide vectorized element access from each string array. For example, we can get the first three characters of each array using ``str.slice(0, 3)``. This behavior is also available through Python's normal indexing syntax–for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 16, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/plain": [ 451 | "0 Gra\n", 452 | "1 Joh\n", 453 | "2 Ter\n", 454 | "3 Eri\n", 455 | "4 Ter\n", 456 | "5 Mic\n", 457 | "dtype: object" 458 | ] 459 | }, 460 | "execution_count": 16, 461 | "metadata": {}, 462 | "output_type": "execute_result" 463 | } 464 | ], 465 | "source": [ 466 | "monte.str[0:3]" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "- Indexing via ``df.str.get(i)`` and ``df.str[i]`` works similarly. \n", 474 | "\n", 475 | "- ``get()`` and ``slice()`` also let you access elements of arrays returned by ``split()``. 
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 17, 481 | "metadata": {}, 482 | "outputs": [ 483 | { 484 | "data": { 485 | "text/plain": [ 486 | "0 Chapman\n", 487 | "1 Cleese\n", 488 | "2 Gilliam\n", 489 | "3 Idle\n", 490 | "4 Jones\n", 491 | "5 Palin\n", 492 | "dtype: object" 493 | ] 494 | }, 495 | "execution_count": 17, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "monte.str.split().str.get(-1)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": {}, 507 | "source": [ 508 | "#### Indicator variables\n", 509 | "\n", 510 | "- ``get_dummies()`` is useful when your data has a column containing a coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A=\"born in America,\" B=\"born in the United Kingdom,\" C=\"likes cheese,\" D=\"likes spam\":" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": 18, 516 | "metadata": {}, 517 | "outputs": [ 518 | { 519 | "data": { 520 | "text/html": [ 521 | "
\n", 522 | "\n", 535 | "\n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | "
infoname
0B|C|DGraham Chapman
1B|DJohn Cleese
2A|CTerry Gilliam
3B|DEric Idle
4B|CTerry Jones
5B|C|DMichael Palin
\n", 576 | "
" 577 | ], 578 | "text/plain": [ 579 | " info name\n", 580 | "0 B|C|D Graham Chapman\n", 581 | "1 B|D John Cleese\n", 582 | "2 A|C Terry Gilliam\n", 583 | "3 B|D Eric Idle\n", 584 | "4 B|C Terry Jones\n", 585 | "5 B|C|D Michael Palin" 586 | ] 587 | }, 588 | "execution_count": 18, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "full_monte = pd.DataFrame({'name': monte,\n", 595 | " 'info': ['B|C|D', 'B|D', 'A|C',\n", 596 | " 'B|D', 'B|C', 'B|C|D']})\n", 597 | "full_monte" 598 | ] 599 | }, 600 | { 601 | "cell_type": "markdown", 602 | "metadata": {}, 603 | "source": [ 604 | "- ``get_dummies()`` lets you split these indicator variables into a ``DataFrame``:" 605 | ] 606 | }, 607 | { 608 | "cell_type": "code", 609 | "execution_count": 19, 610 | "metadata": {}, 611 | "outputs": [ 612 | { 613 | "data": { 614 | "text/html": [ 615 | "
\n", 616 | "\n", 629 | "\n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | "
ABCD
00111
10101
21010
30101
40110
50111
\n", 684 | "
" 685 | ], 686 | "text/plain": [ 687 | " A B C D\n", 688 | "0 0 1 1 1\n", 689 | "1 0 1 0 1\n", 690 | "2 1 0 1 0\n", 691 | "3 0 1 0 1\n", 692 | "4 0 1 1 0\n", 693 | "5 0 1 1 1" 694 | ] 695 | }, 696 | "execution_count": 19, 697 | "metadata": {}, 698 | "output_type": "execute_result" 699 | } 700 | ], 701 | "source": [ 702 | "full_monte['info'].str.get_dummies('|')" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "### Example: Recipe Database\n", 710 | "\n", 711 | "- Vectorized string operations become most useful when cleaning up messy, real-world data. Let's parse some recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.\n", 712 | "\n", 713 | "- The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.\n", 714 | "\n", 715 | "- See [issue# 62](https://github.com/jakevdp/PythonDataScienceHandbook/issues/62) - updated dataset is [here](https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz)." 
716 | ] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": 20, 721 | "metadata": {}, 722 | "outputs": [], 723 | "source": [ 724 | "#!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz\n", 725 | "#!gunzip recipeitems-latest.json.gz" 726 | ] 727 | }, 728 | { 729 | "cell_type": "markdown", 730 | "metadata": { 731 | "collapsed": true 732 | }, 733 | "source": [ 734 | "- The database is in JSON format - try ``pd.read_json`` to read it:" 735 | ] 736 | }, 737 | { 738 | "cell_type": "code", 739 | "execution_count": 21, 740 | "metadata": {}, 741 | "outputs": [ 742 | { 743 | "name": "stdout", 744 | "output_type": "stream", 745 | "text": [ 746 | "ValueError: Trailing data\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "try:\n", 752 | " #recipes = pd.read_json('recipeitems-latest.json')\n", 753 | " recipes = pd.read_json('20170107-061401-recipeitems.json')\n", 754 | "except ValueError as e:\n", 755 | " print(\"ValueError:\", e)" 756 | ] 757 | }, 758 | { 759 | "cell_type": "markdown", 760 | "metadata": {}, 761 | "source": [ 762 | "- Oops! We get a ``ValueError`` mentioning that there is \"trailing data.\"\n", 763 | "Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not. 
Let's check if this interpretation is true:" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 22, 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "data": { 773 | "text/plain": [ 774 | "(2, 12)" 775 | ] 776 | }, 777 | "execution_count": 22, 778 | "metadata": {}, 779 | "output_type": "execute_result" 780 | } 781 | ], 782 | "source": [ 783 | "with open('20170107-061401-recipeitems.json') as f:\n", 784 | " line = f.readline()\n", 785 | "pd.read_json(line).shape" 786 | ] 787 | }, 788 | { 789 | "cell_type": "markdown", 790 | "metadata": {}, 791 | "source": [ 792 | "- Yes, apparently each line is valid JSON, so we'll need to string them together. (Newer pandas versions can read such line-delimited JSON directly with ``pd.read_json(filename, lines=True)``.) One way we can do this manually is to construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:" 793 | ] 794 | }, 795 | { 796 | "cell_type": "code", 797 | "execution_count": 23, 798 | "metadata": { 799 | "collapsed": true 800 | }, 801 | "outputs": [], 802 | "source": [ 803 | "# combine every line of the file into a single JSON array string\n", 804 | "with open('20170107-061401-recipeitems.json', 'r') as f:\n", 805 | " # Extract each line\n", 806 | " data = (line.strip() for line in f)\n", 807 | " # Reformat so each line is the element of a list\n", 808 | " data_json = \"[{0}]\".format(','.join(data))\n", 809 | "# read the result as JSON\n", 810 | "recipes = pd.read_json(data_json)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": 24, 816 | "metadata": {}, 817 | "outputs": [ 818 | { 819 | "data": { 820 | "text/plain": [ 821 | "(173278, 17)" 822 | ] 823 | }, 824 | "execution_count": 24, 825 | "metadata": {}, 826 | "output_type": "execute_result" 827 | } 828 | ], 829 | "source": [ 830 | "recipes.shape" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "- Examine one row to see what we have:" 838 | ] 839 | }, 840 | { 841 | "cell_type": "code", 842 | "execution_count": 25, 843 
| "metadata": {}, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "_id {'$oid': '5160756b96cc62079cc2db15'}\n", 849 | "cookTime PT30M\n", 850 | "creator NaN\n", 851 | "dateModified NaN\n", 852 | "datePublished 2013-03-11\n", 853 | "description Late Saturday afternoon, after Marlboro Man ha...\n", 854 | "image http://static.thepioneerwoman.com/cooking/file...\n", 855 | "ingredients Biscuits\\n3 cups All-purpose Flour\\n2 Tablespo...\n", 856 | "name Drop Biscuits and Sausage Gravy\n", 857 | "prepTime PT10M\n", 858 | "recipeCategory NaN\n", 859 | "recipeInstructions NaN\n", 860 | "recipeYield 12\n", 861 | "source thepioneerwoman\n", 862 | "totalTime NaN\n", 863 | "ts {'$date': 1365276011104}\n", 864 | "url http://thepioneerwoman.com/cooking/2013/03/dro...\n", 865 | "Name: 0, dtype: object" 866 | ] 867 | }, 868 | "execution_count": 25, 869 | "metadata": {}, 870 | "output_type": "execute_result" 871 | } 872 | ], 873 | "source": [ 874 | "recipes.iloc[0]" 875 | ] 876 | }, 877 | { 878 | "cell_type": "markdown", 879 | "metadata": {}, 880 | "source": [ 881 | "- There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. The ingredient list is in string format; we're going to have to carefully extract the information we're interested in. 
Describe the length of the ingredient lists:" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": 26, 887 | "metadata": {}, 888 | "outputs": [ 889 | { 890 | "data": { 891 | "text/plain": [ 892 | "count 173278.000000\n", 893 | "mean 244.617926\n", 894 | "std 146.705285\n", 895 | "min 0.000000\n", 896 | "25% 147.000000\n", 897 | "50% 221.000000\n", 898 | "75% 314.000000\n", 899 | "max 9067.000000\n", 900 | "Name: ingredients, dtype: float64" 901 | ] 902 | }, 903 | "execution_count": 26, 904 | "metadata": {}, 905 | "output_type": "execute_result" 906 | } 907 | ], 908 | "source": [ 909 | "recipes.ingredients.str.len().describe()" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "metadata": {}, 915 | "source": [ 916 | "- The ingredient list averages ~244 characters. Which recipe has the longest ingredient list?" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": 27, 922 | "metadata": {}, 923 | "outputs": [ 924 | { 925 | "data": { 926 | "text/plain": [ 927 | "'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'" 928 | ] 929 | }, 930 | "execution_count": 27, 931 | "metadata": {}, 932 | "output_type": "execute_result" 933 | } 934 | ], 935 | "source": [ 936 | "import numpy as np\nrecipes.name[np.argmax(recipes.ingredients.str.len())]" 937 | ] 938 | }, 939 | { 940 | "cell_type": "markdown", 941 | "metadata": {}, 942 | "source": [ 943 | "- How many of the recipes are for breakfast?" 
944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": 28, 949 | "metadata": {}, 950 | "outputs": [ 951 | { 952 | "data": { 953 | "text/plain": [ 954 | "3524" 955 | ] 956 | }, 957 | "execution_count": 28, 958 | "metadata": {}, 959 | "output_type": "execute_result" 960 | } 961 | ], 962 | "source": [ 963 | "recipes.description.str.contains('[Bb]reakfast').sum()" 964 | ] 965 | }, 966 | { 967 | "cell_type": "markdown", 968 | "metadata": {}, 969 | "source": [ 970 | "- How many of the recipes list cinnamon as an ingredient?" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": 29, 976 | "metadata": {}, 977 | "outputs": [ 978 | { 979 | "data": { 980 | "text/plain": [ 981 | "10526" 982 | ] 983 | }, 984 | "execution_count": 29, 985 | "metadata": {}, 986 | "output_type": "execute_result" 987 | } 988 | ], 989 | "source": [ 990 | "recipes.ingredients.str.contains('[Cc]innamon').sum()" 991 | ] 992 | }, 993 | { 994 | "cell_type": "markdown", 995 | "metadata": {}, 996 | "source": [ 997 | "- Do any recipes misspell the ingredient as \"cinamon\"?" 
998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": 30, 1003 | "metadata": {}, 1004 | "outputs": [ 1005 | { 1006 | "data": { 1007 | "text/plain": [ 1008 | "11" 1009 | ] 1010 | }, 1011 | "execution_count": 30, 1012 | "metadata": {}, 1013 | "output_type": "execute_result" 1014 | } 1015 | ], 1016 | "source": [ 1017 | "recipes.ingredients.str.contains('[Cc]inamon').sum()" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": {}, 1023 | "source": [ 1024 | "### A simple recipe recommender\n", 1025 | "\n", 1026 | "- Design a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.\n", 1027 | "- While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.\n", 1028 | "- So we will cheat by starting with a list of common ingredients and search to see whether they are in each recipe's ingredient list.\n", 1029 | "- For simplicity, start with herbs and spices." 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": 31, 1035 | "metadata": { 1036 | "collapsed": true 1037 | }, 1038 | "outputs": [], 1039 | "source": [ 1040 | "spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',\n", 1041 | " 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "- Build a Boolean ``DataFrame`` consisting of True and False values, indicating whether this ingredient appears in the list." 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": 32, 1054 | "metadata": {}, 1055 | "outputs": [ 1056 | { 1057 | "data": { 1058 | "text/html": [ 1059 | "
\n", 1060 | "\n", 1073 | "\n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | " \n", 1092 | " \n", 1093 | " \n", 1094 | " \n", 1095 | " \n", 1096 | " \n", 1097 | " \n", 1098 | " \n", 1099 | " \n", 1100 | " \n", 1101 | " \n", 1102 | " \n", 1103 | " \n", 1104 | " \n", 1105 | " \n", 1106 | " \n", 1107 | " \n", 1108 | " \n", 1109 | " \n", 1110 | " \n", 1111 | " \n", 1112 | " \n", 1113 | " \n", 1114 | " \n", 1115 | " \n", 1116 | " \n", 1117 | " \n", 1118 | " \n", 1119 | " \n", 1120 | " \n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | "
cuminoreganopaprikaparsleypepperrosemarysagesalttarragonthyme
0FalseFalseFalseFalseFalseFalseTrueFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2TrueFalseFalseFalseTrueFalseFalseTrueFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", 1157 | "
" 1158 | ], 1159 | "text/plain": [ 1160 | " cumin oregano paprika parsley pepper rosemary sage salt tarragon \\\n", 1161 | "0 False False False False False False True False False \n", 1162 | "1 False False False False False False False False False \n", 1163 | "2 True False False False True False False True False \n", 1164 | "3 False False False False False False False False False \n", 1165 | "4 False False False False False False False False False \n", 1166 | "\n", 1167 | " thyme \n", 1168 | "0 False \n", 1169 | "1 False \n", 1170 | "2 False \n", 1171 | "3 False \n", 1172 | "4 False " 1173 | ] 1174 | }, 1175 | "execution_count": 32, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "import re\n", 1182 | "spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))\n", 1183 | " for spice in spice_list))\n", 1184 | "spice_df.head()" 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "markdown", 1189 | "metadata": {}, 1190 | "source": [ 1191 | "- Let's say we'd want a recipe that uses parsley, paprika, and tarragon. Use the ``query()`` method of ``DataFrame``s." 1192 | ] 1193 | }, 1194 | { 1195 | "cell_type": "code", 1196 | "execution_count": 33, 1197 | "metadata": {}, 1198 | "outputs": [ 1199 | { 1200 | "data": { 1201 | "text/plain": [ 1202 | "10" 1203 | ] 1204 | }, 1205 | "execution_count": 33, 1206 | "metadata": {}, 1207 | "output_type": "execute_result" 1208 | } 1209 | ], 1210 | "source": [ 1211 | "selection = spice_df.query('parsley & paprika & tarragon')\n", 1212 | "len(selection)" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": {}, 1218 | "source": [ 1219 | "- Use the index returned by this selection to discover the names of the 10 recipes that have this combination." 
1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 34, 1225 | "metadata": {}, 1226 | "outputs": [ 1227 | { 1228 | "data": { 1229 | "text/plain": [ 1230 | "2069 All cremat with a Little Gem, dandelion and wa...\n", 1231 | "74964 Lobster with Thermidor butter\n", 1232 | "93768 Burton's Southern Fried Chicken with White Gravy\n", 1233 | "113926 Mijo's Slow Cooker Shredded Beef\n", 1234 | "137686 Asparagus Soup with Poached Eggs\n", 1235 | "140530 Fried Oyster Po’boys\n", 1236 | "158475 Lamb shank tagine with herb tabbouleh\n", 1237 | "158486 Southern fried chicken in buttermilk\n", 1238 | "163175 Fried Chicken Sliders with Pickles + Slaw\n", 1239 | "165243 Bar Tartine Cauliflower Salad\n", 1240 | "Name: name, dtype: object" 1241 | ] 1242 | }, 1243 | "execution_count": 34, 1244 | "metadata": {}, 1245 | "output_type": "execute_result" 1246 | } 1247 | ], 1248 | "source": [ 1249 | "recipes.name[selection.index]" 1250 | ] 1251 | } 1252 | ], 1253 | "metadata": { 1254 | "anaconda-cloud": {}, 1255 | "kernelspec": { 1256 | "display_name": "Python 3", 1257 | "language": "python", 1258 | "name": "python3" 1259 | }, 1260 | "language_info": { 1261 | "codemirror_mode": { 1262 | "name": "ipython", 1263 | "version": 3 1264 | }, 1265 | "file_extension": ".py", 1266 | "mimetype": "text/x-python", 1267 | "name": "python", 1268 | "nbconvert_exporter": "python", 1269 | "pygments_lexer": "ipython3", 1270 | "version": "3.6.3" 1271 | } 1272 | }, 1273 | "nbformat": 4, 1274 | "nbformat_minor": 1 1275 | } 1276 | --------------------------------------------------------------------------------